Bug 1822763
| Summary: | Master nodes wait in loop forever while booting up the cluster |
|---|---|
| Product: | OpenShift Container Platform |
| Component: | Bare Metal Hardware Provisioning |
| Sub component: | ironic |
| Reporter: | Jean-Francois Saucier <jsaucier> |
| Assignee: | Dmitry Tantsur <dtantsur> |
| QA Contact: | Raviv Bar-Tal <rbartal> |
| Docs Contact: | |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| CC: | beth.white, bfournie, bschmaus, dhellmann, dsafford, dtantsur, openshift-bugs-escalate, rbartal, stbenjam, tschaibl |
| Version: | 4.3.z |
| Keywords: | Triaged |
| Target Milestone: | --- |
| Target Release: | 4.5.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Whiteboard: | |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | Retries for PXE boot have been introduced to cope with environments with networking issues. Additionally, the maximum number of retries has been increased for communication between the bare metal provisioner and the nodes being provisioned. |
| Story Points: | --- |
| Clone Of: | |
| : | 1827932 (view as bug list) |
| Environment: | |
| Last Closed: | 2020-07-13 17:26:42 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 1827721 |
| Bug Blocks: | 1827932 |
Description
Jean-Francois Saucier 2020-04-09 19:14:48 UTC
Could we get a screenshot of the console on the master-2 machine to see what it's doing? That would give us clues about where things are getting stuck. Could you also provide the ironic conductor logs from the bootstrap host?

sudo podman logs ironic-conductor

> Not sure if the node gets stuck in clean-wait and then switches to maintenance mode, or if the node switches to maintenance mode, which then results in a failure of the clean-wait.

Failing cleaning ends up in maintenance (it's considered a risky situation). Then I suspect BMO may be retrying cleaning without unsetting maintenance (which is doomed to hang). Stephen, could you check whether we're always unsetting maintenance when retrying cleaning? Or doing anything with nodes that fail it?

> it does appear to be an issue with Ironic (or the BMC)

If I'm not confusing anything, my last conclusion was networking issues. I can take another look, but let's figure out maintenance first.

It seems that we need to finish https://github.com/metal3-io/baremetal-operator/pull/289, otherwise cleaning failures are not handled correctly.

What I could figure out from the logs/videos yesterday:

1) Cleaning actually finished successfully in those two runs.
2) On deployment, ironic stopped being able to reach one node and failed after 3 retries.
3) Raising the retry count to 30 helped ironic reach the node, but then the ramdisk got stuck while downloading the image.

This made me suspect transient networking problems between the node and ironic.

On SSH access to the node: you need to update ironic.conf, adding your key AND selinux=0 to [pxe]pxe_append_params as described in http://tripleo.org/install/troubleshooting/troubleshooting-nodes.html#accessing-the-ramdisk. Note that I've never tried it with RHEL 8, but I don't see why it wouldn't work.

> It seems that we need to finish https://github.com/metal3-io/baremetal-operator/pull/289 otherwise cleaning failures are not handled correctly.
Indeed, there are failure cases that baremetal-operator doesn't handle correctly by retrying, but even if it did retry, it's not clear to me that it would succeed. Do we know why the cleaning is failing? To get it to retry, I think you should be able to delete the BareMetalHost CR and recreate it.
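For reference, re-creating a host after deleting it means re-applying its BareMetalHost manifest. A minimal hypothetical example is sketched below; the MAC address, BMC address, and secret name are purely illustrative, not values from this deployment:

```yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: master-2                      # illustrative host name
  namespace: openshift-machine-api
spec:
  online: true
  bootMACAddress: 52:54:00:00:00:02   # illustrative MAC
  bmc:
    address: ipmi://192.168.111.20    # illustrative BMC endpoint
    credentialsName: master-2-bmc-secret
```

Deleting the CR triggers deprovisioning; re-applying a manifest like this starts inspection and provisioning from scratch.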
Sorry for the confusion - deleting the BareMetalHost won't help for the masters, since you're not even getting that far.

Does cleaning fail on every provisioning attempt?

Okay, at least in my environment the SSH key setting fails because the quotation mark gets escaped, like sshkey="ssh-rsa... Good news is, it's supposed to be fixed by https://review.opendev.org/#/c/716963/, which merged a week ago and might not have made it into the containers yet. I'm re-trying with a fix now.

> Does cleaning fail on every provisioning attempt?

The attempts I've been researching have finished cleaning successfully, and during deployment one of two things happens:

1) Ironic fails to talk to the ramdisk (all of a sudden; the same conversation works during cleaning).
2) The ramdisk gets stuck downloading the image.

The latter may be mitigated if we add retries and timeouts to https://opendev.org/openstack/ironic-python-agent/src/branch/master/ironic_python_agent/extensions/standby.py#L73. I was going to propose a patch, but I'm running out of time and brains dealing with the SSH problem.

An update on SSH: I can confirm that it works for me after applying the changes from https://review.opendev.org/#/c/716963/1/ironic/common/utils.py to my ironic-conductor container and configuring pxe_append_params like this:

[pxe]
pxe_append_params = nofb nomodeset vga=normal sshkey="ssh-rsa AAAAB ...." selinux=0

If that's acceptable, I'd recommend trying it and seeing what's going on with networking from inside the machine. Bob (bfournier) may be able to help you, he's in the US time zone.

An update on downloading the image: I've proposed an upstream patch, https://review.opendev.org/#/c/722409/, that adds a timeout and retries when connecting to an image server. If you're up for repacking the ramdisk (https://docs.openstack.org/ironic/latest/admin/troubleshooting.html#patching-the-deploy-ramdisk), you can give it a try (warning: I haven't had time to test it).
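The retries-and-timeouts idea for the image download can be sketched as follows. This is an illustration only, not the actual ironic-python-agent patch: the function name, attempt count, and linear backoff are assumptions made for the example.

```python
# Hedged sketch of retrying an image download with a connection timeout,
# in the spirit of the upstream patch discussed above. Not the real
# ironic-python-agent code; names and defaults are assumptions.
import time
import urllib.request


def download_image(url, dest_path, attempts=3, timeout=60):
    """Stream an image to dest_path, retrying transient connection errors."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp, \
                    open(dest_path, 'wb') as out:
                while True:
                    chunk = resp.read(1024 * 1024)
                    if not chunk:
                        break
                    out.write(chunk)
            return
        except OSError as exc:  # URLError and socket timeouts subclass OSError
            last_error = exc
            if attempt < attempts:
                time.sleep(attempt)  # simple linear backoff between attempts
    raise RuntimeError('download of %s failed after %d attempts: %s'
                       % (url, attempts, last_error))
```

Note this only bounds the connection and per-read time; as pointed out below, it would not help with a download that starts but proceeds exceptionally slowly.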
Note that it won't help if the download does start but is exceptionally slow.

Action items we've determined so far:

1) The immediate priority is to raise [agent]max_command_attempts to something much higher than 3. It will mitigate ironic's inability to reach the ramdisk in case of long networking glitches. It can be done in ironic-image.
2) Enable PXE boot retries in ironic. The feature is already in the packages, we just need to enable it in ironic-image per https://docs.openstack.org/ironic/train/install/configure-pxe.html#pxe-timeouts-tuning. Can be done in the same patch.
3) Introduce retries and timeouts when downloading the image from the ramdisk. The upstream patch https://review.opendev.org/722675 has just been approved; we'll need to get it into the ramdisk image we use. Needs a new bugzilla to track the process.

We'll also need to figure out why logging into the ramdisk stopped working. Needs a new bugzilla as well.

Long-term, we should consider pre-configuring an SSH key for logging into ramdisks.

(In reply to Dmitry Tantsur from comment #23)
> Action items we've determined so far:
>
> 1) The immediate priority is to raise [agent]max_command_attempts to
> something much higher than 3. It will mitigate ironic's inability to reach
> the ramdisk in case of long networking glitches. It can be done in
> ironic-image.

Given that the context for this is metal3, and operators need to keep trying to reconcile their operands, is there a way to tell ironic to just keep trying forever?

We prefer to avoid retrying forever, since it gives a poor user experience (the process just hangs without any insight into what is going on).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:2409

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
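The ironic.conf tuning from action items 1) and 2) above can be sketched as follows. The option names are real ironic configuration options; the values shown are illustrative, not the ones actually shipped in ironic-image:

```
# ironic.conf - illustrative values only
[agent]
# Default is 3; raise it so that long networking glitches between the
# conductor and the deploy ramdisk don't fail the deployment outright.
max_command_attempts = 30

[pxe]
# Retry the PXE boot if a node has not called back within this many
# seconds (the feature is disabled when this option is unset).
boot_retry_timeout = 900
```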