Bug 1822763
| Summary: | Master nodes wait in loop forever while booting up the cluster |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Bare Metal Hardware Provisioning |
| Sub component: | ironic |
| Reporter: | Jean-Francois Saucier <jsaucier> |
| Assignee: | Dmitry Tantsur <dtantsur> |
| QA Contact: | Raviv Bar-Tal <rbartal> |
| Docs Contact: | |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| CC: | beth.white, bfournie, bschmaus, dhellmann, dsafford, dtantsur, openshift-bugs-escalate, rbartal, stbenjam, tschaibl |
| Version: | 4.3.z |
| Keywords: | Triaged |
| Target Milestone: | --- |
| Target Release: | 4.5.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | Retries for PXE boot have been introduced to cope with environments with networking issues. Additionally, the maximum number of retries has been increased for communication between the bare metal provisioner and the nodes being provisioned. |
| Story Points: | --- |
| Clone Of: | |
| : | 1827932 (view as bug list) |
| Environment: | |
| Last Closed: | 2020-07-13 17:26:42 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 1827721 |
| Bug Blocks: | 1827932 |
Description
Jean-Francois Saucier 2020-04-09 19:14:48 UTC
Could we get a screenshot of the console on the master-2 machine to see what it's doing? That would give us clues about where things are getting stuck. Could you also provide the ironic conductor logs from the bootstrap host?

sudo podman logs ironic-conductor

> Not sure if the node gets stuck in clean-wait and then switches to maintenance mode, or if the node switches to maintenance mode, which then results in a failure of the clean-wait.

Failing cleaning ends up in maintenance (it's considered a risky situation). Then I suspect BMO may be retrying cleaning without unsetting maintenance (which is doomed to hang). Stephen, could you check whether we're always unsetting maintenance when retrying cleaning? Or doing anything with nodes that fail it?

> it does appear to be an issue with Ironic (or the BMC)

If I'm not confusing anything, my last conclusion was networking issues. I can take another look, but let's figure out maintenance first.

It seems that we need to finish https://github.com/metal3-io/baremetal-operator/pull/289, otherwise cleaning failures are not handled correctly.

What I could figure out from the logs/videos yesterday:

1) Cleaning actually finished successfully in those two runs.
2) On deployment, ironic stopped being able to reach one node and failed after 3 retries.
3) Raising the retry count to 30 helped ironic reach the node, but then the ramdisk got stuck while downloading the image.

This made me suspect transient networking problems between the node and ironic.

On SSH access to the node: you need to update ironic.conf, adding your key AND selinux=0 to [pxe]pxe_append_params as described in http://tripleo.org/install/troubleshooting/troubleshooting-nodes.html#accessing-the-ramdisk. Note that I've never tried it with RHEL 8, but I don't see why it wouldn't work.

> It seems that we need to finish https://github.com/metal3-io/baremetal-operator/pull/289 otherwise cleaning failures are not handled correctly.
Indeed, there are failure cases that baremetal-operator doesn't handle correctly by retrying, but even if it did retry, it's not clear to me that it would succeed. Do we know why the cleaning is failing? To get it to retry, I think you should be able to delete the BareMetalHost CR and recreate it.
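For reference, re-creating a host after deleting it means re-applying its BareMetalHost manifest. A minimal hypothetical example is sketched below; the MAC address, BMC address, and secret name are purely illustrative, not values from this deployment:

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: master-2                      # illustrative host name
  namespace: openshift-machine-api
spec:
  online: true
  bootMACAddress: 52:54:00:00:00:02   # illustrative MAC
  bmc:
    address: ipmi://192.168.111.20    # illustrative BMC endpoint
    credentialsName: master-2-bmc-secret
```

Deleting the CR triggers deprovisioning; re-applying a manifest like this starts inspection and provisioning from scratch.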
Sorry for the confusion - deleting the BareMetalHost won't help for the masters, since you're not even getting that far.

Does cleaning fail on every provisioning attempt?

Okay, at least in my environment the SSH key setting fails because the quotation mark gets escaped, like sshkey="ssh-rsa... Good news is, it's supposed to be fixed by https://review.opendev.org/#/c/716963/, which merged a week ago and might not have made it into the containers yet. I'm re-trying with a fix now.

> Does cleaning fail on every provisioning attempt?

The attempts I've been researching have finished cleaning successfully, and during deployment one of two things happens:

1) Ironic fails to talk to the ramdisk (all of a sudden; the same conversation works during cleaning).
2) The ramdisk gets stuck downloading the image.

The latter may be mitigated if we add retries and timeouts to https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/extensions/standby.py#L73. I was going to propose a patch, but I'm running out of time and brains dealing with the SSH problem.

An update on SSH: I can confirm that it works for me after applying the changes from https://review.opendev.org/#/c/716963/1/ironic/common/utils.py to my ironic-conductor container and configuring pxe_append_params like this:

[pxe]
pxe_append_params = nofb nomodeset vga=normal sshkey="ssh-rsa AAAAB ...." selinux=0

If that's acceptable, I'd recommend trying it and seeing what's going on with networking from inside the machine. Bob (bfournier) may be able to help you, he's in the US time zone.

An update on downloading the image: I've proposed an upstream patch, https://review.opendev.org/#/c/722409/, that adds a timeout and retries when connecting to an image server. If you're up for repacking the ramdisk (https://docs.openstack.org/ironic/latest/admin/troubleshooting.html#patching-the-deploy-ramdisk), you can give it a try (warning: I haven't had time to test it).
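The retries-and-timeouts idea for the image download can be sketched as follows. This is an illustration only, not the actual ironic-python-agent patch: the function name, attempt count, and linear backoff are assumptions made for the example.

```python
# Hedged sketch of retrying an image download with a connection timeout,
# in the spirit of the upstream patch discussed above. Not the real
# ironic-python-agent code; names and defaults are assumptions.
import time
import urllib.request


def download_image(url, dest_path, attempts=3, timeout=60):
    """Stream an image to dest_path, retrying transient connection errors."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp, \
                    open(dest_path, 'wb') as out:
                while True:
                    chunk = resp.read(1024 * 1024)
                    if not chunk:
                        break
                    out.write(chunk)
            return
        except OSError as exc:  # URLError and socket timeouts subclass OSError
            last_error = exc
            if attempt < attempts:
                time.sleep(attempt)  # simple linear backoff between attempts
    raise RuntimeError('download of %s failed after %d attempts: %s'
                       % (url, attempts, last_error))
```

Note this only bounds the connection and per-read time; as pointed out below, it would not help with a download that starts but proceeds exceptionally slowly.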
Note that it won't help if the download does start but is exceptionally slow.

Action items we've determined so far:

1) The immediate priority is to raise [agent]max_command_attempts to something much higher than 3. It will mitigate ironic's inability to reach the ramdisk in case of long networking glitches. It can be done in ironic-image.
2) Enable PXE boot retries in ironic. The feature is already in the packages, we just need to enable it in ironic-image per https://docs.openstack.org/ironic/train/install/configure-pxe.html#pxe-timeouts-tuning. Can be done in the same patch.
3) Introduce retries and timeouts when downloading the image from the ramdisk. The upstream patch https://review.opendev.org/722675 has just been approved; we'll need to get it into the ramdisk image we use. Needs a new bugzilla to track the process.

We'll also need to figure out why logging into the ramdisk stopped working. Needs a new bugzilla as well.

Long-term, we should consider pre-configuring an SSH key for logging into ramdisks.

(In reply to Dmitry Tantsur from comment #23)
> Action items we've determined so far:
>
> 1) The immediate priority is to raise [agent]max_command_attempts to
> something much higher than 3. It will mitigate ironic's inability to reach
> the ramdisk in case of long networking glitches. It can be done in
> ironic-image.

Given that the context for this is metal3, and operators need to keep trying to reconcile their operands, is there a way to tell ironic to just keep trying forever?

We prefer to avoid retrying forever, since it gives a poor user experience (the process just hangs without any insight into what is going on).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:2409

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
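The ironic.conf tuning from action items 1) and 2) above can be sketched as follows. The option names are real ironic configuration options; the values shown are illustrative, not the ones actually shipped in ironic-image:

```
# ironic.conf - illustrative values only
[agent]
# Default is 3; raise it so that long networking glitches between the
# conductor and the deploy ramdisk don't fail the deployment outright.
max_command_attempts = 30

[pxe]
# Retry the PXE boot if a node has not called back within this many
# seconds (the feature is disabled when this option is unset).
boot_retry_timeout = 900
```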