Created attachment 1732104: metal3-ironic-conductor log

Version:
$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.7.0-0.nightly-2020-11-21-205026
built from commit 68282c185253d4831514b20623b1717535c5e6f2
release image registry.svc.ci.openshift.org/ocp/release@sha256:5f6e3655e91f66583bcef6e7d980316295ee6f0cf94e66b1ffb2fd9d089629e7

Platform: IPI baremetal

What happened?
While verifying https://bugzilla.redhat.com/show_bug.cgi?id=1886327: when scaling up a machineset to add a machine using a wrong rootDeviceHint, the deployment process entered an infinite loop:
error provisioning -> deprovisioning -> ready -> provisioning

The corresponding machine is reported as Provisioned:
NAME                                      PHASE         TYPE   REGION   ZONE   AGE
ocp-edge-cluster-0-276ht-worker-0-5n2rn   Provisioned                          3h30m

What did you expect to happen?
After a few attempts the process should stop with an error.

How to reproduce it (as minimally and precisely as possible)?
1. Deploy a cluster with 3 masters and 2 workers.
2. Create a configuration yaml file for adding a new bmh. Set rootDeviceHints: deviceName to a nonexistent device (see the attachment).
3. Add the bmh for the worker using the created yaml file:
$ oc create -f new-node2.yaml -n openshift-machine-api
4. Wait until the bmh becomes ready.
5. Scale up the machineset to add the new machine:
$ oc scale machineset MACHINESETNAME -n openshift-machine-api --replicas=3

Anything else we need to know?
Attaching metal3-ironic-conductor.log
Created attachment 1732105: example of configuration yaml
There is a workaround for the problem: scale down the machineset, delete the BMH, fix the yaml configuration, and recreate the BMH.
Repeatedly retrying the reprovisioning is expected. We don't currently make any distinction between configuration errors (this will never work) and transient errors - mainly because Ironic cannot be relied on to give us granular enough information about the cause. What you should see is an increasing error count, and increasing time between retries. It shouldn't be necessary to delete the BMH to work around this; simply updating with the correct spec should be enough. If it were not, that would be a bug in the baremetal-operator; however at first glance the code appears correct (and this was the subject of several previous bugs, so it should have been fairly thoroughly verified.) Did you attempt to update the BMH in place?
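To illustrate the retry behaviour described above (an increasing error count with an increasing time between retries), here is a minimal Python sketch. It is not the baremetal-operator code: the function names, the 1-second base delay and the 1-hour cap are hypothetical values chosen only for illustration, and the real operator may compute its delays differently.

```python
# Illustrative sketch only: a capped exponential backoff keyed on an error count.
# retry_delay, simulate, base_seconds and cap_seconds are hypothetical names and
# values, not taken from the baremetal-operator.

def retry_delay(error_count, base_seconds=1.0, cap_seconds=3600.0):
    """Return the wait before the next retry, doubling with each recorded error."""
    if error_count < 0:
        raise ValueError("error_count must be non-negative")
    # Clamp the exponent so a very large error count cannot produce a huge number.
    exponent = min(error_count, 32)
    return min(base_seconds * 2 ** exponent, cap_seconds)


def simulate(failures):
    """Return the delay that follows each of `failures` consecutive errors."""
    return [retry_delay(count) for count in range(1, failures + 1)]


if __name__ == "__main__":
    for count, delay in enumerate(simulate(8), start=1):
        print(f"error count {count}: next retry in {delay:.0f}s")
```

With a scheme like this a permanently broken spec keeps being retried, but less and less often, and the counter gives a visible signal that the host is stuck on the same failure.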
(In reply to Zane Bitter from comment #3)
> It shouldn't be necessary to delete the BMH to work around this; simply
> updating with the correct spec should be enough. If it were not, that would
> be a bug in the baremetal-operator; however at first glance the code appears
> correct (and this was the subject of several previous bugs, so it should
> have been fairly thoroughly verified.) Did you attempt to update the BMH in
> place?

My bad. You are right: after I fixed the device hint in the existing BMH configuration with
$ oc edit bmh BMHNAME -n openshift-machine-api
the next attempt to provision the machine succeeded.

> Repeatedly retrying the reprovisioning is expected. We don't currently make
> any distinction between configuration errors (this will never work) and
> transient errors - mainly because Ironic cannot be relied on to give us
> granular enough information about the cause.

Should this bz be closed as NOTABUG or WONTFIX/CANTFIX?