Bug 1978739 - [master] Provisioning SNOs bmh is stuck in ready state
Summary: [master] Provisioning SNOs bmh is stuck in ready state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Mat Kowalski
QA Contact: Yuri Obshansky
URL:
Whiteboard: AI-Team-Platform
Depends On:
Blocks: 1979906
 
Reported: 2021-07-02 15:44 UTC by Alex Krzos
Modified: 2021-10-18 17:38 UTC
CC List: 8 users

Fixed In Version: OCP-Metal-v1.0.23.1
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1979906
Environment:
Last Closed: 2021-10-18 17:38:01 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift assisted-service pull 2173 0 None closed Bug 1978739: Fix requeue logic inside mapInfraEnvToBMH 2021-07-07 14:45:55 UTC
Red Hat Product Errata RHSA-2021:3759 0 None None None 2021-10-18 17:38:22 UTC

Internal Links: 1979906

Description Alex Krzos 2021-07-02 15:44:20 UTC
Description of problem:
While testing a small-scale environment that provisions 60 SNO clusters as libvirt VMs, hosted on hypervisors separate from a bare-metal hub cluster, we have found several times that after applying all the manifests to ZTP-provision a SNO, the bmh objects are stuck in the ready state. This is an IPv6/disconnected environment.

Version-Release number of selected component (if applicable):
OCP Hub cluster 4.8.0-rc.1
SNO clusters 4.8.0-rc.1
ACM - 2.3.0-DOWNSTREAM-2021-07-01-08-39-58 / v2.3.0-148


How reproducible:
At a 1s stagger "rate" (a 1s sleep between the oc apply of all manifests for a single SNO), we saw 27/60 clusters get provisioned, with 33 stuck in the ready state.
At a 10s stagger "rate" (a 10s sleep between oc apply calls), we saw 59/60 clusters get provisioned, with 1 stuck in the ready state.


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
The bmh objects should not get stuck in the ready state, and the SNO clusters should provision regardless of the rate at which we apply the manifests.



Additional info:
We can delete and recreate the manifests and the cluster will retry and provision.

Comment 1 Alex Krzos 2021-07-02 17:05:34 UTC
On our most recent test, where one bmh object was stuck in the "ready" state, we witnessed that the cluster itself seems to have "retried" after ~65m:

oc describe bmh -n sno00056
...
Events:
  Type    Reason                 Age   From                         Message
  ----    ------                 ----  ----                         -------
  Normal  Registered             105m  metal3-baremetal-controller  Registered new host
  Normal  BMCAccessValidated     105m  metal3-baremetal-controller  Verified access to BMC
  Normal  InspectionSkipped      105m  metal3-baremetal-controller  disabled by annotation
  Normal  ProfileSet             105m  metal3-baremetal-controller  Hardware profile set: unknown
  Normal  ProvisioningStarted    40m   metal3-baremetal-controller  Image provisioning started for https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/3a5726e2-5fc3-47ee-891a-ea7082990a3d/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiM2E1NzI2ZTItNWZjMy00N2VlLTg5MWEtZWE3MDgyOTkwYTNkIn0.GKYHDNAwWXfLUp24Q5vSVahOT2NTuKJuScvHI2kfC_lO3a2BxlqD3XNZMZKrpSviWy5NWVLt69V-xziw3Y0oVg
  Normal  ProvisioningComplete   40m   metal3-baremetal-controller  Image provisioning completed for https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/3a5726e2-5fc3-47ee-891a-ea7082990a3d/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiM2E1NzI2ZTItNWZjMy00N2VlLTg5MWEtZWE3MDgyOTkwYTNkIn0.GKYHDNAwWXfLUp24Q5vSVahOT2NTuKJuScvHI2kfC_lO3a2BxlqD3XNZMZKrpSviWy5NWVLt69V-xziw3Y0oVg
  Normal  UpdateHardwareDetails  38m   metal3-baremetal-controller  Set HardwareDetails from annotation


It is unclear why there was such a large delay before this cluster, sno00056, started provisioning.

Comment 4 Crystal Chun 2021-07-02 22:15:26 UTC
Bmh objects stuck in the ready state are also happening in our 1K environment using the same AI version (ACM downstream 2.3.0-DOWNSTREAM-2021-07-01-08-39-58).

We had as many as 980 stuck. They started provisioning again after almost 4 hours.

[root@f36-h21-000-r640 clusters]# oc get agentclusterinstall -A --no-headers -o custom-columns=READY:'.status.conditions[?(@.type=="Completed")].reason',msg:'.status.conditions[?(@.type=="Completed")].message'  | sort | uniq -c
     97 InstallationCompleted    The installation has completed: Cluster is installed
      1 InstallationFailed       The installation has failed: cluster has hosts in error
      2 InstallationFailed       The installation has failed: Timeout while waiting for cluster version to be available: timed out
      1 InstallationInProgress   The installation is in progress: Installation in progress
    980 InstallationNotStarted   The installation has not yet started


Snippet of oc get bmh -A
sno01065                sno01065             provisioned                             true
sno01066                sno01066             ready                                   false
sno01067                sno01067             ready                                   false
sno01068                sno01068             ready                                   false
sno01069                sno01069             ready                                   false
sno01070                sno01070             ready                                   false
sno01071                sno01071             ready                                   false

Comment 5 Zane Bitter 2021-07-06 18:57:57 UTC
Note that there is a limit of 20 hosts being provisioned simultaneously. However, provisioning with virtualmedia is fast, so this should not introduce too much of a delay.
There are also known throughput limitations when very large numbers of hosts are created simultaneously. This was improved in 4.8 (see https://github.com/metal3-io/baremetal-operator/issues/905); however, running Metal³ on a host with a small number of CPUs (<8) will limit performance.


However, looking through the logs for sno00056 specifically, neither of these problems is in evidence. It reaches the ready state at 15:18:31, and the baremetal-operator dutifully checks it once per minute until 16:24:05, when it is finally instructed to begin provisioning.

This suggests the issue is with the assisted-service, and the assisted-service logs confirm it. We can see it setting the image URL on the BareMetalHost at 16:24:

time="2021-07-02T16:24:07Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=8e2c3f9e-91fd-4917-8834-d306cc94013c
time="2021-07-02T16:24:07Z" level=info msg="Image URL has been set in the BareMetalHost  sno00056/sno00056" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileBMH" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:629"
time="2021-07-02T16:24:07Z" level=info msg="BareMetalHost Reconcile ended" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile.func1" file="/remote-source/assi

The previous attempt at reconciling the BareMetalHost in the BMHAgent controller was at 15:18 and had this result:

time="2021-07-02T15:18:33Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted
-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d
time="2021-07-02T15:18:33Z" level=info msg="Stopping reconcileBMH: InfraEnv corresponding to the BMH has no image URL available." func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMA
CReconciler).reconcileBMH" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:592" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id
=6b64fc51-55fa-47f4-8713-28dab7ecf67d

The image seems to have become available shortly after:

time="2021-07-02T15:18:37Z" level=info msg="ISODownloadURL changed from  to https://assisted-service-open-cluster-management..." func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).updateEnsureISOSuccess" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:328" go-id=1105 infra_env=sno00056 infra_env_namespace=sno00056 request_id=cfa008d5-24b7-438a-86c3-81705e2ed497

It appears that after this, no requeue was requested and there was no watch on the InfraEnv that could trigger a re-run of the BMHAgent controller's reconcile loop when the image became available. (Alternatively, it is possible that the image really did take this long to become available on the InfraEnv.)
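
For illustration only, here is a minimal controller-runtime sketch of the kind of InfraEnv-to-BMH watch described above: when an InfraEnv changes (for example, its ISODownloadURL is populated), reconcile requests are enqueued for the BareMetalHosts that reference it. The name mapInfraEnvToBMH comes from the linked PR title; the label key, the pared-down reconciler struct, and the controller-runtime API shape (a MapFunc taking a client.Object) are assumptions for the sketch, not the actual assisted-service code.

package controllers

import (
	"context"

	bmh_v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	aiv1beta1 "github.com/openshift/assisted-service/api/v1beta1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
	"sigs.k8s.io/controller-runtime/pkg/source"
)

// BMACReconciler is a pared-down stand-in for the real reconciler; only the
// client is needed for this sketch.
type BMACReconciler struct {
	client.Client
}

// Reconcile is stubbed out; in the real controller it runs the BMAC logic
// (e.g. setting the image URL on the BareMetalHost).
func (r *BMACReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	return ctrl.Result{}, nil
}

// mapInfraEnvToBMH enqueues a reconcile for every BareMetalHost in the same
// namespace that references the changed InfraEnv, so an InfraEnv whose
// ISODownloadURL becomes available re-triggers the BMH reconcile instead of
// the event being dropped. The label key below is an assumed selector.
func (r *BMACReconciler) mapInfraEnvToBMH(obj client.Object) []reconcile.Request {
	infraEnv, ok := obj.(*aiv1beta1.InfraEnv)
	if !ok {
		return nil
	}
	bmhList := &bmh_v1alpha1.BareMetalHostList{}
	if err := r.List(context.Background(), bmhList,
		client.InNamespace(infraEnv.Namespace),
		client.MatchingLabels{"infraenvs.agent-install.openshift.io": infraEnv.Name}); err != nil {
		return nil
	}
	requests := make([]reconcile.Request, 0, len(bmhList.Items))
	for _, bmh := range bmhList.Items {
		requests = append(requests, reconcile.Request{
			NamespacedName: types.NamespacedName{Namespace: bmh.Namespace, Name: bmh.Name},
		})
	}
	return requests
}

// SetupWithManager wires the watch so InfraEnv updates re-trigger BMH reconciles.
func (r *BMACReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&bmh_v1alpha1.BareMetalHost{}).
		Watches(&source.Kind{Type: &aiv1beta1.InfraEnv{}},
			handler.EnqueueRequestsFromMapFunc(r.mapInfraEnvToBMH)).
		Complete(r)
}

Either an explicit watch like this or a periodic requeue (discussed in the next comment) would avoid relying on an unrelated event to wake the reconciler an hour later.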

Comment 6 Antoni Segura Puimedon 2021-07-06 21:04:18 UTC
In the logs I see that the last BareMetalHost Reconcile that started before the Image was generated was:

time="2021-07-02T15:18:33Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d
time="2021-07-02T15:18:33Z" level=info msg="Stopping reconcileBMH: InfraEnv corresponding to the BMH has no image URL available." func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).reconcileBMH" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:592" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=6b64fc51-55fa-47f4-8713-28dab7ecf67d

Then:

time="2021-07-02T15:18:37Z" level=info msg="ISODownloadURL changed from  to https://assisted-service-open-cluster-management.apps.bm.rdu2.scalelab.redhat.com/api/assisted-install/v1/clusters/3a5726e2-5fc3-47ee-891a-ea7082990a3d/downloads/image?api_key=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiM2E1NzI2ZTItNWZjMy00N2VlLTg5MWEtZWE3MDgyOTkwYTNkIn0.GKYHDNAwWXfLUp24Q5vSVahOT2NTuKJuScvHI2kfC_lO3a2BxlqD3XNZMZKrpSviWy5NWVLt69V-xziw3Y0oVg" func="github.com/openshift/assisted-service/internal/controller/controllers.(*InfraEnvReconciler).updateEnsureISOSuccess" file="/remote-source/assisted-service/app/internal/controller/controllers/infraenv_controller.go:328" go-id=1105 infra_env=sno00056 infra_env_namespace=sno00056 request_id=cfa008d5-24b7-438a-86c3-81705e2ed497

Then, surprisingly, there's no BMH reconcile until:

time="2021-07-02T16:24:07Z" level=info msg="BareMetalHost Reconcile started" func="github.com/openshift/assisted-service/internal/controller/controllers.(*BMACReconciler).Reconcile" file="/remote-source/assisted-service/app/internal/controller/controllers/bmh_agent_controller.go:164" bare_metal_host=sno00056 bare_metal_host_namespace=sno00056 go-id=1101 request_id=8e2c3f9e-91fd-4917-8834-d306cc94013c

That's more than an hour later. I wonder if:

      // This is a separate check because an existing
      // InfraEnv with an empty ISODownloadURL means the
      // global `Reconcile` function should also return
      // as there is nothing left to do for it.
      if infraEnv.Status.ISODownloadURL == "" {
          return false, true, 0, "InfraEnv corresponding to the BMH has no image URL available."
      }

should not actually requeue after a minute, just as it does when the image is too recent.
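
For illustration, a minimal sketch of that suggestion, keeping the four-value return shape of the snippet quoted above (the value names, the helper name, and the one-minute constant are guesses, not the actual assisted-service code):

package controllers

import "time"

// defaultRequeueAfterNoISO is an assumed constant: a short, fixed retry
// interval used when the InfraEnv's image is not available yet.
const defaultRequeueAfterNoISO = time.Minute

// checkInfraEnvISO is a hypothetical helper mirroring the return convention of
// the quoted snippet. The only change illustrated is returning a one-minute
// requeue delay instead of 0 when ISODownloadURL is still empty, so the BMH
// reconcile retries on its own rather than waiting for an external event.
func checkInfraEnvISO(isoDownloadURL string) (dirty bool, stop bool, delay time.Duration, reason string) {
	if isoDownloadURL == "" {
		return false, true, defaultRequeueAfterNoISO,
			"InfraEnv corresponding to the BMH has no image URL available."
	}
	return false, false, 0, ""
}

The idea is simply that an empty ISODownloadURL is treated as a transient condition retried on a timer, rather than a terminal one that waits for some other event to re-trigger the reconcile.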

Comment 7 Mat Kowalski 2021-07-07 11:53:30 UTC
Test images:

quay.io/mkowalski/assisted-service-operator-bundle:bz1978739-2
quay.io/mkowalski/assisted-service:bz1978739-2

Comment 12 errata-xmlrpc 2021-10-18 17:38:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

