Bug 1978739
| Summary: | [master] Provisioning SNOs bmh is stuck in ready state | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Alex Krzos <akrzos> |
| Component: | assisted-installer | Assignee: | Mat Kowalski <mko> |
| assisted-installer sub component: | assisted-service | QA Contact: | Yuri Obshansky <yobshans> |
| Status: | CLOSED ERRATA | Severity: | high |
| Priority: | urgent | Keywords: | Triaged |
| Version: | 4.8 | Target Release: | 4.9.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | AI-Team-Platform | Type: | Bug |
| Fixed In Version: | OCP-Metal-v1.0.23.1 | Doc Type: | No Doc Update |
| Last Closed: | 2021-10-18 17:38:01 UTC | Bug Blocks: | 1979906 |
| CC: | aos-bugs, asegurap, bjacot, cchun, mfilanov, mko, rpittau, zbitter | | |
Description (Alex Krzos, 2021-07-02 15:44:20 UTC)
On our most recent test, one bmh object was stuck in the "ready" state. We witnessed that the cluster itself seems to have "retried" after ~65m:

```
oc describe bmh -n sno00056
...
Events:
  Type    Reason                 Age   From                         Message
  ----    ------                 ----  ----                         -------
  Normal  Registered             105m  metal3-baremetal-controller  Registered new host
  Normal  BMCAccessValidated     105m  metal3-baremetal-controller  Verified access to BMC
  Normal  InspectionSkipped      105m  metal3-baremetal-controller  disabled by annotation
  Normal  ProfileSet             105m  metal3-baremetal-controller  Hardware profile set: unknown
  Normal  ProvisioningStarted    40m   metal3-baremetal-controller  Image provisioning started for https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/3a5726e2-5fc3-47ee-891a-ea7082990a3d/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiM2E1NzI2ZTItNWZjMy00N2VlLTg5MWEtZWE3MDgyOTkwYTNkIn0.GKYHDNAwWXfLUp24Q5vSVahOT2NTuKJuScvHI2kfC_lO3a2BxlqD3XNZMZKrpSviWy5NWVLt69V-xziw3Y0oVg
  Normal  ProvisioningComplete   40m   metal3-baremetal-controller  Image provisioning completed for https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/3a5726e2-5fc3-47ee-891a-ea7082990a3d/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiM2E1NzI2ZTItNWZjMy00N2VlLTg5MWEtZWE3MDgyOTkwYTNkIn0.GKYHDNAwWXfLUp24Q5vSVahOT2NTuKJuScvHI2kfC_lO3a2BxlqD3XNZMZKrpSviWy5NWVLt69V-xziw3Y0oVg
  Normal  UpdateHardwareDetails  38m   metal3-baremetal-controller  Set HardwareDetails from annotation
```

It is unclear why there was such a large delay before cluster sno00056 started its provision. The bmh stuck in the ready state is also happening in our 1K environment using the same AI version (in ACM d/s 2.3.0-DOWNSTREAM-2021-07-01-08-39-58).
We had as many as 980 stuck. They started provisioning again after almost 4 hours:

```
[root@f36-h21-000-r640 clusters]# oc get agentclusterinstall -A --no-headers -o custom-columns=READY:'.status.conditions[?(@.type=="Completed")].reason',msg:'.status.conditions[?(@.type=="Completed")].message' | sort | uniq -c
     97 InstallationCompleted The installation has completed: Cluster is installed
      1 InstallationFailed The installation has failed: cluster has hosts in error
      2 InstallationFailed The installation has failed: Timeout while waiting for cluster version to be available: timed out
      1 InstallationInProgress The installation is in progress: Installation in progress
    980 InstallationNotStarted The installation has not yet started
```
Snippet of `oc get bmh -A`:

```
sno01065   sno01065   provisioned   true
sno01066   sno01066   ready         false
sno01067   sno01067   ready         false
sno01068   sno01068   ready         false
sno01069   sno01069   ready         false
sno01070   sno01070   ready         false
sno01071   sno01071   ready         false
```
Note that there is a limit of 20 hosts being provisioned simultaneously. However, provisioning with virtualmedia is fast, so this should not introduce too much of a delay.

There are also known throughput limitations when very large numbers of hosts are created simultaneously. This was improved in 4.8 (see https://github.com/metal3-io/baremetal-operator/issues/905); however, running Metal³ on a host with a small number of CPUs (<8) will still limit performance.

Looking through the logs for sno00056 specifically, neither of these problems is in evidence. It reaches the ready state at 15:18:31, and the baremetal-operator dutifully checks it once per minute until 16:24:05, when it is finally instructed to begin provisioning. This suggests the issue is with the assisted-service.

This is confirmed by looking at the assisted-service logs. We can see it setting the Image in the BMH at 16:24:

```
time="2021-07-02T16:24:07Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=8e2c3f9e-91fd-4917-8834-d306cc94013c
time="2021-07-02T16:24:07Z" level=info msg="Image URL has been set in the BareMetalHost sno00056/sno00056" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileBMH" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:629"
time="2021-07-02T16:24:07Z" level=info msg="BareMetalHost Reconcile ended" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile.func1" file="/remote-source/assi
```

The previous attempt at reconciling the BareMetalHost in the BMHAgent controller was at 15:18 and had this result:

```
time="2021-07-02T15:18:33Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d
time="2021-07-02T15:18:33Z" level=info msg="Stopping reconcileBMH: InfraEnv corresponding to the BMH has no image URL available." func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileBMH" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:592" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d
```

The image seems to have become available shortly after:

```
time="2021-07-02T15:18:37Z" level=info msg="ISODownloadURL changed from to https://assisted-service-open-cluster-management..." func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).updateEnsureISOSuccess" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:328" go-id=1105 infra_env=sno00056 infra_env_namespace=sno00056 request_id=cfa008d5-24b7-438a-86c3-81705e2ed497
```

It appears that after this there was no requeue requested, and no watch on the InfraEnv that could trigger a re-run of the BMHAgent controller reconcile loop when the image became available.
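For context, one event-driven way to close such a gap would be for the BMAC controller to watch InfraEnv objects and enqueue the affected BareMetalHosts when the ISODownloadURL appears. Below is a minimal sketch, assuming 2021-era controller-runtime (v0.9) APIs; the namespace-wide BMH lookup is an illustrative assumption, not the actual assisted-service wiring:

```go
package controllers

import (
	"context"

	bmhv1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	aiv1beta1 "github.com/openshift/assisted-service/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// BMACReconciler is reduced here to the one field this sketch needs.
type BMACReconciler struct {
	client.Client
}

func (r *BMACReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Map every InfraEnv event to reconcile requests for the BareMetalHosts
	// in its namespace, so a late-arriving ISODownloadURL wakes them up
	// instead of leaving them waiting on an unrelated reconcile.
	mapInfraEnvToBMH := handler.EnqueueRequestsFromMapFunc(
		func(obj client.Object) []reconcile.Request {
			bmhList := &bmhv1alpha1.BareMetalHostList{}
			if err := r.List(context.Background(), bmhList,
				client.InNamespace(obj.GetNamespace())); err != nil {
				return nil
			}
			reqs := make([]reconcile.Request, 0, len(bmhList.Items))
			for _, host := range bmhList.Items {
				reqs = append(reqs, reconcile.Request{
					NamespacedName: client.ObjectKeyFromObject(&host),
				})
			}
			return reqs
		})

	return ctrl.NewControllerManagedBy(mgr).
		For(&bmhv1alpha1.BareMetalHost{}).
		Watches(&source.Kind{Type: &aiv1beta1.InfraEnv{}}, mapInfraEnvToBMH).
		Complete(r)
}
```

Either a watch along these lines or a periodic requeue (as discussed in the next comment) would remove the dependence on some coincidental later event triggering the reconcile.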
(Alternatively, it is possible that the image really did take this long to become available on the InfraEnv.)

In the logs I see that the last "BareMetalHost Reconcile started" before the Image was generated was:

```
time="2021-07-02T15:18:33Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d
time="2021-07-02T15:18:33Z" level=info msg="Stopping reconcileBMH: InfraEnv corresponding to the BMH has no image URL available." func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileBMH" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:592" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d
```

Then:

```
time="2021-07-02T15:18:37Z" level=info msg="ISODownloadURL changed from to https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/3a5726e2-5fc3-47ee-891a-ea7082990a3d/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiM2E1NzI2ZTItNWZjMy00N2VlLTg5MWEtZWE3MDgyOTkwYTNkIn0.GKYHDNAwWXfLUp24Q5vSVahOT2NTuKJuScvHI2kfC_lO3a2BxlqD3XNZMZKrpSviWy5NWVLt69V-xziw3Y0oVg" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).updateEnsureISOSuccess" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:328" go-id=1105 infra_env=sno00056 infra_env_namespace=sno00056 request_id=cfa008d5-24b7-438a-86c3-81705e2ed497
```

Then, surprisingly, there's no BMH reconcile until:

```
time="2021-07-02T16:24:07Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=8e2c3f9e-91fd-4917-8834-d306cc94013c
```

That's more than an hour later. I wonder whether this check:

```go
// This is a separate check because an existing
// InfraEnv with an empty ISODownloadURL means the
// global `Reconcile` function should also return
// as there is nothing left to do for it.
if infraEnv.Status.ISODownloadURL == "" {
	return false, true, 0, "InfraEnv corresponding to the BMH has no image URL available."
}
```

should not actually requeue after a minute, just as it does when the image is too recent.

Test images:

- quay.io/mkowalski/assisted-service-operator-bundle:bz1978739-2
- quay.io/mkowalski/assisted-service:bz1978739-2

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
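For illustration, the requeue change suggested in the comments above might look like the following minimal sketch. It assumes the third value in the quoted return tuple is a time.Duration used as the requeue delay (an assumption based on the `0` in the original snippet); the actual fix shipped in OCP-Metal-v1.0.23.1 may differ in detail:

```go
// Hypothetical variant of the quoted check: instead of stopping with no
// requeue when the InfraEnv has no image URL yet, ask to be reconciled
// again after a minute, mirroring the "image too recent" branch.
// Assumes the third return value is a time.Duration requeue delay.
if infraEnv.Status.ISODownloadURL == "" {
	return false, true, time.Minute, "InfraEnv corresponding to the BMH has no image URL available."
}
```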