1900378 – Infinite loop on provisioning error when scaling up machineset with error in yaml config

Keywords:
Status:	CLOSED WORKSFORME
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Bare Metal Hardware Provisioning
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	low
Severity:	low
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Steven Hardy
QA Contact:	Lubov
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1886327

Reported:	2020-11-22 16:39 UTC by Lubov
Modified:	2020-12-14 06:51 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-12-14 06:51:15 UTC
Target Upstream Version:
Embargoed:

Attachments
metal3-ironic-conductor log (9.21 MB, application/gzip) 2020-11-22 16:39 UTC, Lubov
example of configuration yaml (581 bytes, text/plain) 2020-11-22 16:42 UTC, Lubov

Description Lubov 2020-11-22 16:39:31 UTC

Created attachment 1732104
metal3-ironic-conductor log

Version:
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.7.0-0.nightly-2020-11-21-205026
built from commit 68282c185253d4831514b20623b1717535c5e6f2
release image registry.svc.ci.openshift.org/ocp/release@sha256:5f6e3655e91f66583bcef6e7d980316295ee6f0cf94e66b1ffb2fd9d089629e7

Platform:
IPI baremetal

What happened?
While verifying https://bugzilla.redhat.com/show_bug.cgi?id=1886327:
When scaling up the machineset to add a machine whose BMH has a wrong rootDeviceHints value, the deployment process entered an infinite loop: error provisioning -> deprovisioning -> ready -> provisioning
The corresponding machine is reported as Provisioned:
NAME                                      PHASE         TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-276ht-worker-0-5n2rn   Provisioned                          3h30m

What did you expect to happen?
After a few attempts, the process should stop with an error.

How to reproduce it (as minimally and precisely as possible)?
1. Deploy a cluster with 3 masters and 2 workers.
2. Create a configuration YAML file for adding a new BMH. Set rootDeviceHints: deviceName to a non-existent device (see the attached example).
3. Add the BMH for the worker using the created YAML file.
$ oc create -f new-node2.yaml -n openshift-machine-api
4. Wait until the BMH becomes ready.
5. Scale up the machineset to add the new machine.
$ oc scale machineset MACHINESETNAME -n openshift-machine-api --replicas=3
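
The attached YAML is not reproduced in this report. As a rough sketch only, a BareMetalHost manifest with a bad root device hint (step 2) could be generated as below. Every name, address, and the device path are hypothetical placeholders, not the values from the attachment, and `online: True` is an assumption; the field names follow the metal3.io/v1alpha1 BareMetalHost schema.

```python
import json


def build_bmh_manifest(name, boot_mac, bmc_address, credentials_name, device_name):
    """Return a BareMetalHost manifest as a plain dict."""
    return {
        "apiVersion": "metal3.io/v1alpha1",
        "kind": "BareMetalHost",
        "metadata": {"name": name, "namespace": "openshift-machine-api"},
        "spec": {
            "online": True,  # assumption; the attachment may use a different value
            "bootMACAddress": boot_mac,
            "bmc": {"address": bmc_address, "credentialsName": credentials_name},
            # The hint that triggers the provisioning error: no such device exists.
            "rootDeviceHints": {"deviceName": device_name},
        },
    }


# All arguments below are placeholders for illustration.
manifest = build_bmh_manifest(
    name="example-worker-2",
    boot_mac="52:54:00:00:00:01",
    bmc_address="ipmi://192.0.2.10:623",
    credentials_name="example-worker-2-bmc-secret",
    device_name="/dev/does-not-exist",
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Saved to a file, the JSON output can be passed to `oc create -f` in the same way as the YAML file in step 3. The BMC credentials Secret referenced by `credentialsName` would have to exist as well.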

Anything else we need to know?
Attaching metal3-ironic-conductor.log

Comment 1 Lubov 2020-11-22 16:42:01 UTC

Created attachment 1732105
example of configuration yaml

Comment 2 Lubov 2020-11-24 17:31:27 UTC

There is a workaround (WA) for the problem: scale down the machineset, delete the BMH, fix the YAML configuration, and recreate the BMH.

Comment 3 Zane Bitter 2020-11-24 17:56:41 UTC

Repeatedly retrying the reprovisioning is expected. We don't currently make any distinction between configuration errors (this will never work) and transient errors - mainly because Ironic cannot be relied on to give us granular enough information about the cause.

What you should see is an increasing error count, and increasing time between retries.
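
To illustrate what "increasing time between retries" can look like, here is a minimal sketch of a capped exponential backoff driven by an error count. This is a generic illustration, not the baremetal-operator's actual formula; the function name, the 60-second base, and the cap are all assumptions.

```python
def retry_delay_seconds(error_count, base_seconds=60.0, max_exponent=8):
    """Delay before the next retry: doubles with each recorded error, up to a cap."""
    if error_count < 0:
        raise ValueError("error_count must be non-negative")
    # Capping the exponent keeps the delay bounded however long the error persists.
    return base_seconds * 2 ** min(error_count, max_exponent)


# Delays after the 1st to 5th consecutive provisioning errors.
delays = [retry_delay_seconds(n) for n in range(1, 6)]
print(delays)
```

With a scheme like this the retries never stop on their own, which matches the behaviour described here: the loop continues, only more slowly, until the spec is corrected.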

It shouldn't be necessary to delete the BMH to work around this; simply updating with the correct spec should be enough. If it were not, that would be a bug in the baremetal-operator; however at first glance the code appears correct (and this was the subject of several previous bugs, so it should have been fairly thoroughly verified.) Did you attempt to update the BMH in place?

Comment 4 Lubov 2020-11-25 11:19:40 UTC

(In reply to Zane Bitter from comment #3)
> It shouldn't be necessary to delete the BMH to work around this; simply
> updating with the correct spec should be enough. If it were not, that would
> be a bug in the baremetal-operator; however at first glance the code appears
> correct (and this was the subject of several previous bugs, so it should
> have been fairly thoroughly verified.) Did you attempt to update the BMH in
> place?

My bad. You are right: after I fixed the device hint in the existing BMH configuration with
$ oc edit bmh BMHNAME -n openshift-machine-api
the next attempt to provision the machine succeeded.

> Repeatedly retrying the reprovisioning is expected. We don't currently make
> any distinction between configuration errors (this will never work) and
> transient errors - mainly because Ironic cannot be relied on to give us
> granular enough information about the cause.

Should this bz be closed as NOTABUG or WONTFIX/CANTFIX?