TRT has noticed an uptick in failures on certain metal jobs. One of the failure cases seems to be a worker not getting provisioned. See:

https://search.ci.openshift.org/?search=6+machines.*node&maxAge=168h&context=1&type=bug%2Bjunit&name=.*4.1.*metal.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

In this particular job:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-ovn-ipv6/1500945414994530304

a worker Machine is stuck "provisioning" because there is no available bare metal host:

  status:
    conditions:
    - lastTransitionTime: "2022-03-07T22:21:31Z"
      message: Instance has not been created
      reason: InstanceNotCreated
      severity: Warning
      status: "False"
      type: InstanceExists
    errorMessage: No available BareMetalHost found

When looking at the BMHs, you can see ostest-worker-1 and ostest-worker-2 are consumed by the same worker:

$ oc get bmh -A
NAMESPACE               NAME              STATE                    CONSUMER                      ONLINE   ERROR   AGE
openshift-machine-api   ostest-worker-2   provisioned              ostest-rbt8c-worker-0-cpkx7   true             17h
openshift-machine-api   ostest-worker-1   provisioned              ostest-rbt8c-worker-0-cpkx7   true             17h
openshift-machine-api   ostest-master-1   externally provisioned   ostest-rbt8c-master-1         true             17h
openshift-machine-api   ostest-master-2   externally provisioned   ostest-rbt8c-master-2         true             17h
openshift-machine-api   ostest-master-0   externally provisioned   ostest-rbt8c-master-0         true             17h
openshift-machine-api   ostest-worker-0   provisioned              ostest-rbt8c-worker-0-8l4tx   true             17h

I am not sure what logic cluster-api-provider-baremetal uses to assign a BMH to a worker, but there seems to be some kind of race where two BMHs can be consumed by the same Machine.
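To illustrate where such a race could come from, here is a minimal, self-contained Go sketch (simplified stand-in types, not the actual cluster-api-provider-baremetal code) of a hostSelector-based selection loop that reads BareMetalHosts out of a controller cache. The types, the chooseHost helper, and the random pick among candidates are assumptions for illustration; the point is only that a host whose consumer reference has not yet reached the cache still looks free, so two reconciles of the same Machine that run close together can each claim a different host.

// Hypothetical sketch of hostSelector-based host selection against a stale cache.
package main

import (
	"fmt"
	"math/rand"
)

type host struct {
	name        string
	labels      map[string]string
	consumerRef string // empty means unclaimed, as seen in the cache
}

// chooseHost collects every cached host that matches the selector and has no
// consumer, reports how many are available, and picks one of them at random.
func chooseHost(cached []host, selector map[string]string, machine string) *host {
	var candidates []*host
	for i := range cached {
		h := &cached[i]
		if h.consumerRef != "" {
			continue
		}
		matches := true
		for k, v := range selector {
			if h.labels[k] != v {
				matches = false
				break
			}
		}
		if matches {
			fmt.Printf("Host '%s' matched hostSelector for Machine '%s'\n", h.name, machine)
			candidates = append(candidates, h)
		}
	}
	if len(candidates) == 0 {
		return nil // surfaces as "No available BareMetalHost found"
	}
	fmt.Printf("%d hosts available while choosing host for machine '%s'\n", len(candidates), machine)
	return candidates[rand.Intn(len(candidates))]
}

func main() {
	// The cache as seen by two reconciles that both run before either write lands in it.
	cache := []host{{name: "ostest-worker-0"}, {name: "ostest-worker-1"}, {name: "ostest-worker-2"}}
	machine := "ostest-rbt8c-worker-0-cpkx7"

	first := chooseHost(cache, nil, machine)
	fmt.Printf("Associating machine %s with host %s\n", machine, first.name)
	// The consumer reference is written to the API server here, but the cached
	// slice above is not updated, so an immediate requeue repeats the selection
	// from scratch and may land on a different host.
	second := chooseHost(cache, nil, machine)
	fmt.Printf("Associating machine %s with host %s\n", machine, second.name)
}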
2022-03-07T22:28:10.310571787Z 2022/03/07 22:28:10 Creating machine ostest-rbt8c-worker-0-cpkx7
2022-03-07T22:28:10.310862634Z 2022/03/07 22:28:10 Host 'ostest-worker-0' matched hostSelector for Machine 'ostest-rbt8c-worker-0-cpkx7'
2022-03-07T22:28:10.310862634Z 2022/03/07 22:28:10 Host 'ostest-worker-1' matched hostSelector for Machine 'ostest-rbt8c-worker-0-cpkx7'
2022-03-07T22:28:10.310862634Z 2022/03/07 22:28:10 Host 'ostest-worker-2' matched hostSelector for Machine 'ostest-rbt8c-worker-0-cpkx7'
2022-03-07T22:28:10.310862634Z 2022/03/07 22:28:10 3 hosts available while choosing host for machine 'ostest-rbt8c-worker-0-cpkx7'
2022-03-07T22:28:10.310862634Z 2022/03/07 22:28:10 Associating machine ostest-rbt8c-worker-0-cpkx7 with host ostest-worker-2
2022-03-07T22:28:10.328955287Z 2022/03/07 22:28:10 setting annotation for ostest-rbt8c-worker-0-cpkx7 to metal3.io/BareMetalHost="openshift-machine-api/ostest-worker-2"
2022-03-07T22:28:10.328955287Z 2022/03/07 22:28:10 setting annotation for ostest-rbt8c-worker-0-cpkx7 to machine.openshift.io/instance-state="available"
2022-03-07T22:28:10.338793000Z W0307 22:28:10.338744 1 controller.go:366] ostest-rbt8c-worker-0-cpkx7: failed to create machine: requeue in: 0s
2022-03-07T22:28:10.338793000Z I0307 22:28:10.338772 1 controller.go:460] Actuator returned requeue-after error: requeue in: 0s
2022-03-07T22:28:10.697696323Z 2022/03/07 22:28:10 Creating machine ostest-rbt8c-worker-0-cpkx7
2022-03-07T22:28:10.698005837Z 2022/03/07 22:28:10 Host 'ostest-worker-2' matched hostSelector for Machine 'ostest-rbt8c-worker-0-cpkx7'
2022-03-07T22:28:10.698005837Z 2022/03/07 22:28:10 Host 'ostest-worker-0' matched hostSelector for Machine 'ostest-rbt8c-worker-0-cpkx7'
2022-03-07T22:28:10.698005837Z 2022/03/07 22:28:10 Host 'ostest-worker-1' matched hostSelector for Machine 'ostest-rbt8c-worker-0-cpkx7'
2022-03-07T22:28:10.698005837Z 2022/03/07 22:28:10 3 hosts available while choosing host for machine 'ostest-rbt8c-worker-0-cpkx7'
2022-03-07T22:28:10.698021319Z 2022/03/07 22:28:10 Associating machine ostest-rbt8c-worker-0-cpkx7 with host ostest-worker-1
2022-03-07T22:28:10.720600833Z 2022/03/07 22:28:10 setting annotation for ostest-rbt8c-worker-0-cpkx7 to metal3.io/BareMetalHost="openshift-machine-api/ostest-worker-1"
2022-03-07T22:28:10.720600833Z 2022/03/07 22:28:10 setting annotation for ostest-rbt8c-worker-0-cpkx7 to machine.openshift.io/instance-state="available"
2022-03-07T22:28:10.725309867Z W0307 22:28:10.725253 1 controller.go:366] ostest-rbt8c-worker-0-cpkx7: failed to create machine: failed to update machine annotation: Operation cannot be fulfilled on machines.machine.openshift.io "ostest-rbt8c-worker-0-cpkx7": the object has been modified; please apply your changes to the latest version and try again
2022-03-07T22:28:10.725339030Z E0307 22:28:10.725305 1 controller.go:304] controller-runtime/manager/controller/machine_controller "msg"="Reconciler error" "error"="failed to update machine annotation: Operation cannot be fulfilled on machines.machine.openshift.io \"ostest-rbt8c-worker-0-cpkx7\": the object has been modified; please apply your changes to the latest version and try again" "name"="ostest-rbt8c-worker-0-cpkx7" "namespace"="openshift-machine-api"

It looks like we're requeuing before the cache has been updated with either the consumer ref on the BMH ostest-worker-2 or the annotation on the Machine, so the second reconcile selects another host and claims that one as well. We then fail to update the annotation, but by that point it is too late.
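For reference, the "Operation cannot be fulfilled ... the object has been modified" failure at the end is the standard optimistic-concurrency conflict you get when writing back an object read from a stale cache. The usual client-go pattern for that is retry.RetryOnConflict around a fresh read-modify-write, sketched below with a controller-runtime client and an unstructured Machine. This is a hypothetical helper for illustration, not the provider's code, and as noted in the comments, retrying the annotation write would not release the second BareMetalHost that has already been claimed, which is why the real fix has to target the premature requeue.

// Hedged sketch: conflict-safe annotation update, shown only to explain the error above.
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setHostAnnotation re-reads the Machine on every attempt so the update carries
// the latest resourceVersion rather than one from a stale informer cache. Even if
// this succeeds, the extra BMH claimed by the second reconcile stays claimed.
func setHostAnnotation(ctx context.Context, c client.Client, key client.ObjectKey, hostRef string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		machine := &unstructured.Unstructured{}
		machine.SetGroupVersionKind(schema.GroupVersionKind{
			Group: "machine.openshift.io", Version: "v1beta1", Kind: "Machine",
		})
		if err := c.Get(ctx, key, machine); err != nil {
			return err
		}
		annotations := machine.GetAnnotations()
		if annotations == nil {
			annotations = map[string]string{}
		}
		annotations["metal3.io/BareMetalHost"] = hostRef
		machine.SetAnnotations(annotations)
		return c.Update(ctx, machine)
	})
}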
It's not clear to me what could have changed recently to cause this. The easiest fix is probably to add a delay at https://github.com/openshift/cluster-api-provider-baremetal/blob/master/pkg/cloud/baremetal/actuators/machine/actuator.go#L672

We don't actually want to requeue at all - we should just let the reconcile be triggered when the change to the annotations shows up in the cache - but we have to return RequeueAfterError{} to prevent the controller from deciding we have completed the Create phase prematurely.
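For illustration only, a minimal, self-contained sketch of that idea, using a stand-in RequeueAfterError type (the real type lives in the machine controller packages the actuator already uses) and an illustrative delay value that is not necessarily what the PR picks:

// Hypothetical sketch of returning a nonzero requeue delay from the Create path.
package main

import (
	"fmt"
	"time"
)

// Stand-in for the requeue-after error type the machine controller understands;
// the log above shows the current behaviour as "requeue in: 0s".
type RequeueAfterError struct {
	RequeueAfter time.Duration
}

func (e *RequeueAfterError) Error() string {
	return fmt.Sprintf("requeue in: %s", e.RequeueAfter)
}

// createCompleted models the tail of Actuator.Create, after the host has been
// associated and the Machine annotations written.
func createCompleted() error {
	// Returning a requeue-after error (rather than nil) keeps the controller from
	// treating Create as complete; the nonzero delay gives the informer cache time
	// to observe the consumer ref and annotation writes before the next reconcile.
	return &RequeueAfterError{RequeueAfter: 30 * time.Second}
}

func main() {
	fmt.Println(createCompleted())
}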
The PR has been sitting for a few weeks and needs an approver: https://github.com/openshift/cluster-api-provider-baremetal/pull/167
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069