Description of problem:
While testing a small-scale environment that provisions 60 SNO clusters as libvirt VMs, hosted on hypervisors separate from a bare-metal hub cluster, we have found several times that after applying all of the manifests to ZTP-provision a SNO, the bmh objects are stuck in the "ready" state. This is an ipv6/disconnected environment.

Version-Release number of selected component (if applicable):
OCP Hub cluster 4.8.0-rc.1
SNO clusters 4.8.0-rc.1
ACM - 2.3.0-DOWNSTREAM-2021-07-01-08-39-58 / v2.3.0-148

How reproducible:
At a 1s stagger "rate" (1s sleep between the oc apply of all manifests for a single SNO), 27/60 clusters were provisioned and 33 were stuck in the ready state.
At a 10s stagger "rate" (10s sleep between oc apply), 59/60 clusters were provisioned and 1 was stuck in the ready state.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
The bmh object should not get stuck in the ready state, and the SNO cluster should provision regardless of the rate at which the manifests are applied.

Additional info:
If we delete and recreate the manifests, the cluster retries and provisions.
On our most recent test, one bmh object was stuck in the "ready" state. The cluster itself appears to have "retried" after ~65m:

oc describe bmh -n sno00056
...
Events:
  Type    Reason                 Age   From                         Message
  ----    ------                 ----  ----                         -------
  Normal  Registered             105m  metal3-baremetal-controller  Registered new host
  Normal  BMCAccessValidated     105m  metal3-baremetal-controller  Verified access to BMC
  Normal  InspectionSkipped      105m  metal3-baremetal-controller  disabled by annotation
  Normal  ProfileSet             105m  metal3-baremetal-controller  Hardware profile set: unknown
  Normal  ProvisioningStarted    40m   metal3-baremetal-controller  Image provisioning started for https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/3a5726e2-5fc3-47ee-891a-ea7082990a3d/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiM2E1NzI2ZTItNWZjMy00N2VlLTg5MWEtZWE3MDgyOTkwYTNkIn0.GKYHDNAwWXfLUp24Q5vSVahOT2NTuKJuScvHI2kfC_lO3a2BxlqD3XNZMZKrpSviWy5NWVLt69V-xziw3Y0oVg
  Normal  ProvisioningComplete   40m   metal3-baremetal-controller  Image provisioning completed for https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/3a5726e2-5fc3-47ee-891a-ea7082990a3d/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiM2E1NzI2ZTItNWZjMy00N2VlLTg5MWEtZWE3MDgyOTkwYTNkIn0.GKYHDNAwWXfLUp24Q5vSVahOT2NTuKJuScvHI2kfC_lO3a2BxlqD3XNZMZKrpSviWy5NWVLt69V-xziw3Y0oVg
  Normal  UpdateHardwareDetails  38m   metal3-baremetal-controller  Set HardwareDetails from annotation

It is unclear why there was such a large delay before cluster sno00056 started provisioning.
The bmh stuck in the ready state is also happening in our 1K environment using the same AI version (ACM d/s 2.3.0-DOWNSTREAM-2021-07-01-08-39-58). We had as many as 980 stuck; they started provisioning again after almost 4 hours.

[root@f36-h21-000-r640 clusters]# oc get agentclusterinstall -A --no-headers -o custom-columns=READY:'.status.conditions[?(@.type=="Completed")].reason',msg:'.status.conditions[?(@.type=="Completed")].message' | sort | uniq -c
     97 InstallationCompleted The installation has completed: Cluster is installed
      1 InstallationFailed The installation has failed: cluster has hosts in error
      2 InstallationFailed The installation has failed: Timeout while waiting for cluster version to be available: timed out
      1 InstallationInProgress The installation is in progress: Installation in progress
    980 InstallationNotStarted The installation has not yet started

Snippet of oc get bmh -A:

sno01065   sno01065   provisioned   true
sno01066   sno01066   ready         false
sno01067   sno01067   ready         false
sno01068   sno01068   ready         false
sno01069   sno01069   ready         false
sno01070   sno01070   ready         false
sno01071   sno01071   ready         false
Note that there is a limit of 20 hosts being provisioned simultaneously. However, provisioning with virtualmedia is fast, so this should not introduce much of a delay.

There are also known throughput limitations when very large numbers of hosts are created simultaneously. This was improved in 4.8 (see https://github.com/metal3-io/baremetal-operator/issues/905); however, running Metal³ on a host with a small number of CPUs (<8) will limit performance.

Looking through the logs for sno00056 specifically, neither of these problems is in evidence. The host reaches the ready state at 15:18:31, and the baremetal-operator dutifully checks it once per minute until 16:24:05, when it is finally instructed to begin provisioning. This suggests the issue is with the assisted-service.

This is confirmed by the assisted-service logs, where we can see it setting the Image URL on the BareMetalHost at 16:24:

time="2021-07-02T16:24:07Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=8e2c3f9e-91fd-4917-8834-d306cc94013c
time="2021-07-02T16:24:07Z" level=info msg="Image URL has been set in the BareMetalHost sno00056/sno00056" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileBMH" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:629"
time="2021-07-02T16:24:07Z" level=info msg="BareMetalHost Reconcile ended" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile.func1" file="/remote-source/assi

The previous attempt at reconciling the BareMetalHost in the BMH agent controller was at 15:18 and had this result:

time="2021-07-02T15:18:33Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d
time="2021-07-02T15:18:33Z" level=info msg="Stopping reconcileBMH: InfraEnv corresponding to the BMH has no image URL available." func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileBMH" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:592" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d

The image seems to have become available shortly after:

time="2021-07-02T15:18:37Z" level=info msg="ISODownloadURL changed from to https://assisted-service-open-cluster-management..." func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).updateEnsureISOSuccess" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:328" go-id=1105 infra_env=sno00056 infra_env_namespace=sno00056 request_id=cfa008d5-24b7-438a-86c3-81705e2ed497

It appears that after this, no requeue was requested and there was no watch on the InfraEnv that could trigger a re-run of the BMH agent controller's reconcile loop when the image became available.
(Alternatively, it is possible that the image really did take this long to become available on the InfraEnv.)
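If the missing watch is indeed the cause (rather than the image genuinely taking that long to become available), one option would be for the BMH agent controller to watch InfraEnv objects and enqueue the corresponding BareMetalHost whenever one changes. The following is a rough, hypothetical sketch only, assuming a controller-runtime v0.8-era builder API and the import paths shown; the InfraEnv-to-BMH mapping simply reuses the namespace/name (as in the sno00056 example) and is not the actual assisted-service logic:

package controllers

import (
	bmh_v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	aiv1beta1 "github.com/openshift/assisted-service/api/v1beta1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// Illustrative wiring: re-reconcile a BareMetalHost when its InfraEnv changes
// (for example, when ISODownloadURL is populated), instead of waiting for an
// unrelated BMH event or a periodic resync.
func (r *BMACReconciler) SetupWithManager(mgr ctrl.Manager) error {
	mapInfraEnvToBMH := func(obj client.Object) []reconcile.Request {
		// Assumption (based on the sno00056 example): the BMH and its
		// InfraEnv share a namespace and name. A real mapping would look
		// up the BMHs that actually reference this InfraEnv.
		return []reconcile.Request{{
			NamespacedName: types.NamespacedName{
				Namespace: obj.GetNamespace(),
				Name:      obj.GetName(),
			},
		}}
	}

	return ctrl.NewControllerManagedBy(mgr).
		For(&bmh_v1alpha1.BareMetalHost{}).
		Watches(&source.Kind{Type: &aiv1beta1.InfraEnv{}},
			handler.EnqueueRequestsFromMapFunc(mapInfraEnvToBMH)).
		Complete(r)
}

Either a watch of this kind or a periodic requeue (see the suggestion below) would avoid depending on an unrelated event to re-trigger the reconcile, which in this case took more than an hour.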
In the logs I see that the last "BareMetalHost Reconcile started" before the Image was generated was:

time="2021-07-02T15:18:33Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d
time="2021-07-02T15:18:33Z" level=info msg="Stopping reconcileBMH: InfraEnv corresponding to the BMH has no image URL available." func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileBMH" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:592" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d

Then:

time="2021-07-02T15:18:37Z" level=info msg="ISODownloadURL changed from to https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/3a5726e2-5fc3-47ee-891a-ea7082990a3d/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiM2E1NzI2ZTItNWZjMy00N2VlLTg5MWEtZWE3MDgyOTkwYTNkIn0.GKYHDNAwWXfLUp24Q5vSVahOT2NTuKJuScvHI2kfC_lO3a2BxlqD3XNZMZKrpSviWy5NWVLt69V-xziw3Y0oVg" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).updateEnsureISOSuccess" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:328" go-id=1105 infra_env=sno00056 infra_env_namespace=sno00056 request_id=cfa008d5-24b7-438a-86c3-81705e2ed497

Then, surprisingly, there is no BMH reconcile until:

time="2021-07-02T16:24:07Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=8e2c3f9e-91fd-4917-8834-d306cc94013c

That is more than an hour later. I wonder whether this check:

    // This is a separate check because an existing
    // InfraEnv with an empty ISODownloadURL means the
    // global `Reconcile` function should also return
    // as there is nothing left to do for it.
    if infraEnv.Status.ISODownloadURL == "" {
        return false, true, 0, "InfraEnv corresponding to the BMH has no image URL available."
    }

should not in fact requeue after a minute, just as it does when the image is too recent.
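A minimal sketch of that change, assuming (as the snippet implies but the bug does not confirm) that the third return value is a requeue interval honoured by the caller:

    // Sketch only: mirrors the snippet above rather than the actual code.
    // Returning a non-zero interval asks the caller to re-reconcile the BMH
    // after a minute, the same way the "image too recent" case is handled.
    // (Requires the "time" package to be imported.)
    if infraEnv.Status.ISODownloadURL == "" {
        return false, true, time.Minute, "InfraEnv corresponding to the BMH has no image URL available."
    }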
Test images:

quay.io/mkowalski/assisted-service-operator-bundle:bz1978739-2
quay.io/mkowalski/assisted-service:bz1978739-2
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759