RFE: Automatically analyze gathered bootstrap logs? #2569
Comments
I like the idea. Do we have an idea regarding a) how we want to expose this to users?
Log entries, like in #2567.
systemd unit failures, like in #2567. Also, maybe "you had insufficient creds to pull the release image"? Or "your control plane machines never formed an etcd cluster", or "I can't even SSH into your control plane machines". I don't think we need an exhaustive set of things; we can just grow this incrementally as we run into issues in CI or the wild (a rough sketch of such checks follows this comment).
Anytime someone hits a gather and says "I dunno, installer folks, but here's your gathered tarball", I think we should think about whether there is something we could either be fixing so it doesn't happen again or, when that's not possible (e.g. the user provides an insufficient pull secret), logging a more approachable summary of the underlying issue. The tarball alone is usually going to be sufficient for debugging, but I think we'll have fewer installer-targeted tickets if we provide summaries where we can.
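As a rough illustration of the kind of checks discussed above (not the installer's actual code), a table of known failure signatures could be matched against the gathered log lines; the patterns, summary messages, and names below are assumptions chosen for this sketch:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// failureSignature pairs a log pattern with a human-oriented summary.
type failureSignature struct {
	pattern *regexp.Regexp
	summary string
}

// Illustrative signatures only; real patterns would come from cases seen in CI.
var signatures = []failureSignature{
	{regexp.MustCompile(`unauthorized: authentication required`),
		"insufficient credentials to pull the release image"},
	{regexp.MustCompile(`etcdserver: request timed out`),
		"the control plane machines never formed a healthy etcd cluster"},
}

// summarize scans one gathered log file and prints the first known cause it finds.
func summarize(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		for _, sig := range signatures {
			if sig.pattern.MatchString(line) {
				fmt.Printf("Possible cause: %s\n  matching line: %s\n", sig.summary, line)
				return nil
			}
		}
	}
	return scanner.Err()
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: summarize <log-file>")
		os.Exit(2)
	}
	if err := summarize(os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Growing the set incrementally, as suggested above, would then just be a matter of appending entries to the signature table.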
@abhinavdahiya recently dove into some CI logs and found what seems to be a bootloop-related issue. It sounds like it required looking at multiple log entries to determine that, though.
When the installer fails and the failure has to do with something external to the cluster, the installer should report what went wrong and what the user needs to do to resolve it. Typical things may be resource problems in the cloud such as quota exceeded, not enough hosts available, a token that is bad/expired/missing, and such. Internal failings are OpenShift bugs that need to be fixed, not reported by the installer.
I like having them in a subcommand so we can run them automatically on behalf of the user (who may have gotten the installer binary without our associated hack dir, which is the case for non-UPI CI jobs). But if folks feel that is too much of a risk, I'd be ok landing them outside the installer binary as a stopgap, ideally in a separate Go command, to make it easier to compile them into the core installer binary if we decide to go that way in the future.
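A minimal sketch of that layout, assuming the checks sit behind a plain exported function (the package placement and names here are hypothetical): the standalone command is just a thin wrapper, so the same function could later be wired into an installer subcommand without restructuring.

```go
// Hypothetical standalone analyzer command. In a real layout the Analyze
// function would live in its own package (e.g. pkg/gather/analyze) so both
// this command and the installer binary could import it.
package main

import (
	"fmt"
	"os"
)

// Analyze inspects an unpacked gather directory and returns human-oriented
// findings. The individual checks are elided in this sketch.
func Analyze(gatherDir string) ([]string, error) {
	if _, err := os.Stat(gatherDir); err != nil {
		return nil, fmt.Errorf("cannot read gather directory: %w", err)
	}
	var findings []string
	// ...append results from individual checks here...
	return findings, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: analyze-gather <gather-dir>")
		os.Exit(2)
	}
	findings, err := Analyze(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, finding := range findings {
		fmt.Println(finding)
	}
}
```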
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
If the only thing holding this up is a desire to one-up systemd on summarizing failures, I'm happy to add that to #2567. Is that really all that we need to unblock this? There's also some contention here between "don't spew unsummarized failures at users" and "sometimes the installer will not have a compact summary". So in order of increased utility:
I don't think getting to 4 should block us from moving to 3, but I'm happy to take #2567's failing-unit output to 4 before it lands if you have ideas about how you'd like those failing units summarized.
What's needed to get #2567 rebased and landed? I think it's valuable and would be a good foundation for us to address some baremetal-specific things (see openshift/enhancements#328). I agree 4 can come later, but if we wanted one neat little bow, we might print something useful if we notice the release-image service failed.
/lifecycle frozen
Currently the cluster-install-failure summary does (3), i.e. it shows all the conditions of the cluster operators. Looking at all the BZs assigned to the installer and the people asking in Slack why the installer failed shows that (3) has added no major value in directing users to the correct operators. So your point that (4) shouldn't block us from doing (3) is, imo, not very useful. I think we need to target (4).
Want to pick a first thing? Like "Failed to fetch the release image"?
When bootstrapping fails but we got far enough in to be able to SSH into the bootstrap machine, the installer automatically gathers a log tarball. But the tarball is probably fairly intimidating to users who aren't on the install team or aren't in a position to be frequent debuggers. And in some cases, the results are sufficiently structured that we can point out a specific problem (e.g. CRI-O failed, #2567) or "you had insufficient creds to pull the release image" (#901). Do we want to teach the installer how to find and highlight some of those cases? We have a fair number of them in the last 24 hours of CI:
But there's currently no way to break those down by underlying cause.
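To make the idea concrete: if the first check were the release-image failure suggested above, a sketch like the following could scan the gathered tar.gz directly. The bundle layout, file names, and matched string are assumptions for illustration, not the installer's actual gather format.

```go
package main

import (
	"archive/tar"
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"strings"
)

func main() {
	f, err := os.Open(os.Args[1]) // e.g. the gathered log-bundle tar.gz
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	tr := tar.NewReader(gz)

	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		// Only look at regular files whose names mention the release-image unit.
		if hdr.Typeflag != tar.TypeReg || !strings.Contains(hdr.Name, "release-image") {
			continue
		}
		data, err := io.ReadAll(tr)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		if strings.Contains(string(data), "error: unable to read image") {
			fmt.Println("Possible cause: the bootstrap machine could not fetch the release image; check your pull secret and network access.")
		}
	}
}
```

A check like this could run at the end of the existing gather step and print its findings alongside the tarball path, rather than leaving users to open the tarball themselves.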