Bug 1900378

Summary: Infinite loop on provisioning error when scaling up machineset with error in yaml config
Product: OpenShift Container Platform
Reporter: Lubov <lshilin>
Component: Bare Metal Hardware Provisioning
Sub component: baremetal-operator
Assignee: Steven Hardy <shardy>
QA Contact: Lubov <lshilin>
Status: CLOSED WORKSFORME
Severity: low
Priority: low
CC: afasano, zbitter
Version: 4.7
Keywords: Triaged
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-12-14 06:51:15 UTC
Type: Bug
Bug Blocks: 1886327    
Attachments:
Description                      Flags
metal3-ironic-conductor log      none
example of configuration yaml    none

Description Lubov 2020-11-22 16:39:31 UTC
Created attachment 1732104
metal3-ironic-conductor log

Version:
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.7.0-0.nightly-2020-11-21-205026
built from commit 68282c185253d4831514b20623b1717535c5e6f2
release image registry.svc.ci.openshift.org/ocp/release@sha256:5f6e3655e91f66583bcef6e7d980316295ee6f0cf94e66b1ffb2fd9d089629e7

Platform:
IPI baremetal

What happened?
While verifying https://bugzilla.redhat.com/show_bug.cgi?id=1886327:
When scaling up the machineset to add a machine with a wrong rootDeviceHints value, the deployment process entered an infinite loop: provisioning error -> deprovisioning -> ready -> provisioning
The corresponding machine was reported as Provisioned:
NAME                                      PHASE         TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-276ht-worker-0-5n2rn   Provisioned                          3h30m

What did you expect to happen?
After a few attempts, the process should stop with an error.

How to reproduce it (as minimally and precisely as possible)?
1. Deploy a cluster with 3 masters and 2 workers.
2. Create a configuration yaml file for adding a new BMH. Set rootDeviceHints: deviceName to a nonexistent device (see the attached example).
3. Add the BMH for the worker using the created yaml file:
$ oc create -f new-node2.yaml -n openshift-machine-api
4. Wait until the BMH becomes ready.
5. Scale up the machineset to add the new machine:
$ oc scale machineset MACHINESETNAME -n openshift-machine-api --replicas=3

Anything else we need to know?
Attaching metal3-ironic-conductor.log

Comment 1 Lubov 2020-11-22 16:42:01 UTC
Created attachment 1732105
example of configuration yaml

Comment 2 Lubov 2020-11-24 17:31:27 UTC
There is a workaround for the problem: scale down the machineset, delete the BMH, fix the yaml configuration, and recreate the BMH.

Comment 3 Zane Bitter 2020-11-24 17:56:41 UTC
Repeatedly retrying the reprovisioning is expected. We don't currently make any distinction between configuration errors (this will never work) and transient errors - mainly because Ironic cannot be relied on to give us granular enough information about the cause.

What you should see is an increasing error count, and increasing time between retries.
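
To illustrate what "increasing error count, and increasing time between retries" could look like, here is a minimal sketch of a capped exponential backoff keyed on an error count. It is not the baremetal-operator's actual code: the function name retryDelay, the one-minute base, the two-hour cap, and the doubling rule are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay returns how long to wait before the next provisioning attempt.
// The delay starts at base after the first error, doubles with each further
// error, and never exceeds maxDelay.
// Hypothetical helper: the real operator may use different values and rules.
func retryDelay(errorCount int, base, maxDelay time.Duration) time.Duration {
	if errorCount <= 0 {
		return 0 // no errors recorded yet, retry immediately
	}
	delay := base
	for i := 1; i < errorCount; i++ {
		delay *= 2
		if delay >= maxDelay {
			// Returning early also keeps the value from overflowing
			// when errorCount is large.
			return maxDelay
		}
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	base := time.Minute      // assumed base delay
	maxDelay := 2 * time.Hour // assumed upper bound

	// Print the wait that would follow each consecutive failure.
	for errorCount := 1; errorCount <= 10; errorCount++ {
		fmt.Printf("errorCount=%d next retry in %v\n",
			errorCount, retryDelay(errorCount, base, maxDelay))
	}
}
```

With a scheme like this the host keeps cycling through provisioning attempts indefinitely, but the attempts become less frequent as the error count grows, which matches the behaviour described above rather than a hard stop after a fixed number of failures.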

It shouldn't be necessary to delete the BMH to work around this; simply updating with the correct spec should be enough. If it were not, that would be a bug in the baremetal-operator; however at first glance the code appears correct (and this was the subject of several previous bugs, so it should have been fairly thoroughly verified.) Did you attempt to update the BMH in place?

Comment 4 Lubov 2020-11-25 11:19:40 UTC
(In reply to Zane Bitter from comment #3)
> It shouldn't be necessary to delete the BMH to work around this; simply
> updating with the correct spec should be enough. If it were not, that would
> be a bug in the baremetal-operator; however at first glance the code appears
> correct (and this was the subject of several previous bugs, so it should
> have been fairly thoroughly verified.) Did you attempt to update the BMH in
> place?

My bad. You are right: after I fixed the device hint in the existing BMH configuration with
$ oc edit bmh BMHNAME -n openshift-machine-api
the next attempt to provision the machine succeeded.

> Repeatedly retrying the reprovisioning is expected. We don't currently make
> any distinction between configuration errors (this will never work) and
> transient errors - mainly because Ironic cannot be relied on to give us
> granular enough information about the cause.

Should this bz be closed as NOTABUG or WONTFIX/CANTFIX?