Description of the problem:

Unable to deploy a spoke cluster via ZTP on an OCP 4.11 hub. Spoke BMHs (on the hub) are stuck in "State: inspecting", and the BMO container on the hub shows:

{"level":"info","ts":1652792463.1536887,"logger":"provisioner.ironic","msg":"current provision state","host":"chub1-0~chub1-master-0-0-bmh","lastError":"","current":"manageable","target":""}
{"level":"info","ts":1652792463.1537554,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"chub1-0/chub1-master-0-0-bmh","provisioningState":"inspecting","requeue":true,"after":300}

Release version:
- Latest 2.5 ACM or 2.0 MCE downstream snapshots
- OCP 4.11 on hub (4.11.0-0.nightly-2022-05-10-045003)
- OCP 4.10 or 4.11 spokes

Steps to reproduce:
1. Deploy an OCP 4.11 hub with the latest ACM 2.5 or MCE 2.0 and the Infrastructure Operator
2. Try to deploy a spoke using the ZTP kube-api CRD flow with the Infrastructure Operator

Actual results:
- BMH stuck in "inspecting"
- Spoke bare-metal nodes never get powered on

Expected results:
ZTP flow works as expected.

Additional info:
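As an aside, the stuck state is easy to spot by filtering the BMO log for the provisioningState field. The snippet below is a small sketch that runs the filter against the second log line quoted above; the `oc logs` pipeline mentioned in the comment is an assumption about how the log would be captured on a live hub, not something taken from the report.

```shell
# Sketch: extract the provisioning state from a BMO log line.
# The sample line is copied from the report; on a live cluster the log would
# come from something like (deployment/container names are assumptions):
#   oc logs -n openshift-machine-api deploy/metal3 -c metal3-baremetal-operator
line='{"level":"info","ts":1652792463.1537554,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"chub1-0/chub1-master-0-0-bmh","provisioningState":"inspecting","requeue":true,"after":300}'

# Pull out just the provisioningState key/value pair
echo "$line" | grep -o '"provisioningState":"[^"]*"'
# -> "provisioningState":"inspecting"
```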
This is actually already a known issue and should be fixed with https://issues.redhat.com/browse/MGMT-10004 (MGMT-10004 [In Progress]: Add PreprovisioningImage controller to assisted-service). Opened this BZ just for tracking; will link it to the Epic for the above task.
While https://issues.redhat.com/browse/MGMT-10004 should fix the issue in ACM 2.6, I think the ZTP flow should work on 4.11 with ACM 2.5 as well. We should fix whatever broke metal3 compatibility with the ZTP flow in ACM 2.5.
Triaging notes: starting with 4.11, the PreprovisioningImage CR controller no longer reconciles images that have an InfraEnv attached. This is to enable an integrated ZTP flow via a new controller, tracked in https://issues.redhat.com/browse/MGMT-10004. The problem is that BMO expects the image to be reconciled before it moves the BareMetalHost forward, even when inspection and cleaning are disabled and a live ISO deployment is requested. We need to fix that.
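For context, the CR that stops being reconciled looks roughly like the sketch below. This is a best-effort reconstruction from memory of the metal3.io/v1alpha1 API, not something copied from a live cluster, so the field names and values are assumptions. On an affected 4.11 hub, the Ready condition on the image for the stuck host is what never turns True, which is why BMO keeps requeueing the BareMetalHost.

```yaml
# Hypothetical PreprovisioningImage as metal3 would create it for a BMH;
# names and fields are assumptions, not taken from the report.
apiVersion: metal3.io/v1alpha1
kind: PreprovisioningImage
metadata:
  name: ostest-extraworker-0        # conventionally matches the BareMetalHost name
  namespace: openshift-machine-api
spec:
  acceptFormats:
    - iso
status:
  # On a working hub a controller fills in the image URL and sets Ready=True;
  # the bug is that nothing reconciles this CR, so BMO waits indefinitely.
  conditions:
    - type: Ready
      status: "False"
```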
I tested OCP version registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-06-21-193241 on the hub, with MCE 2.0.1 (ACM 2.5.1) and MCE 2.1 (ACM 2.6), and experienced the same issue on both. The BMH gets stuck in "provisioning" and the virtual media is never attached to the spoke node.
To reproduce on dev-scripts without any assisted components:

1) Build dev-scripts with

   export NUM_EXTRA_WORKERS=1  # or more

2) Copy any ISO as test.iso to /opt/dev-scripts/ironic/html/images/

3) Apply this manifest, replacing the credentials and the System UUID with ones from dev-scripts/ocp/ostest/extra_baremetalhosts.json:

---
apiVersion: v1
kind: Secret
metadata:
  name: ostest-extraworker-0-bmc-secret
  namespace: openshift-machine-api
type: Opaque
data:
  username: YWRtaW4=
  password: ...
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: ostest-extraworker-0
  namespace: openshift-machine-api
  annotations:
    inspect.metal3.io: disabled
  labels:
    infraenvs.agent-install.openshift.io: test
spec:
  online: true
  automatedCleaningMode: disabled
  bootMACAddress: 00:d9:5c:22:74:3f
  bmc:
    address: "redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/b549a45d-ee8a-418a-bbe8-fd434a5b2658"
    credentialsName: ostest-extraworker-0-bmc-secret
  image:
    url: http://172.22.0.1/images/test.iso
    diskFormat: live-iso

Without any fixes, the BMH gets stuck (previously in "inspecting", now in "provisioning"). Check with Ironic:

$ baremetal node show openshift-machine-api~ostest-extraworker-0 --fields provision_state power_state instance_info
+-----------------+----------------------+
| Field           | Value                |
+-----------------+----------------------+
| instance_info   | {'capabilities': {}} |
| power_state     | power off            |
| provision_state | manageable           |
+-----------------+----------------------+
Correction to the manifest above:

-    diskFormat: live-iso
+    format: live-iso
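With that correction applied, the image stanza of the BareMetalHost reads as follows; this is just the corrected fragment restated for clarity, since the BMH API's live-ISO field is spec.image.format, not diskFormat.

```yaml
# Corrected image stanza for the BareMetalHost manifest above
image:
  url: http://172.22.0.1/images/test.iso
  format: live-iso
```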
Not finished yet; the Ironic patch is still pending.
Once the upstream Ironic patch merges, we still need to backport it to the right branch so it gets picked up in a downstream build, and we will tag it for ironic-image.
For 4.11 we need https://review.opendev.org/c/openstack/ironic/+/847657/. If CI cooperates, we will get this merged ASAP.
Adding a Depends On for bug 2101511, since we need a 4.12 tracker based on a Slack conversation.
https://github.com/openshift/ironic-image/pull/281 contains the RPMs for Ironic tagged for 4.11, built with https://review.opendev.org/c/openstack/ironic/+/847657/.
No longer depends on the 4.12 BZ. QE were able to test with cluster-bot; now we just need the staff engineer to add labels to the PR.
Verified on registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-01-065600

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-01-065600   True        False         4h25m   Cluster version is 4.11.0-0.nightly-2022-07-01-065600

$ oc get bmh
NAME                   STATE         CONSUMER   ONLINE   ERROR   AGE
spoke-master-0-0-bmh   provisioned              true             101m
spoke-master-0-1-bmh   provisioned              true             101m
spoke-master-0-2-bmh   provisioned              true             101m
spoke-worker-0-0-bmh   provisioned              true             101m
spoke-worker-0-1-bmh   provisioned              true             101m

$ oc get clusterdeployment
NAME      INFRAID                                PLATFORM          REGION   VERSION   CLUSTERTYPE   PROVISIONSTATUS   POWERSTATE   AGE
spoke-0   dbd36a41-8766-49da-bf3c-430e77e8f964   agent-baremetal            4.11.0                  Provisioned                    102m

$ oc get aci
NAME      CLUSTER   STATE
spoke-0   spoke-0   adding-hosts
*** Bug 2100904 has been marked as a duplicate of this bug. ***
*** Bug 2051533 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069