Bug 2087213
| Summary: | Spoke BMH stuck "inspecting" when deployed via ZTP in 4.11 OCP hub | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Chad Crum <ccrum> | |
| Component: | Bare Metal Hardware Provisioning | Assignee: | Dmitry Tantsur <dtantsur> | |
| Bare Metal Hardware Provisioning sub component: | baremetal-operator | QA Contact: | Chad Crum <ccrum> | |
| Status: | CLOSED ERRATA | Docs Contact: | ||
| Severity: | urgent | |||
| Priority: | high | CC: | calfonso, ccrum, ercohen, imelofer, mcornea, nshidlin, oourfali, sasha, smiron, trwest, tsedovic, yfirst | |
| Version: | 4.11 | Keywords: | TestBlocker, Triaged | |
| Target Milestone: | --- | Flags: | calfonso: needinfo- | |
| Target Release: | 4.11.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2101511 2109125 (view as bug list) | Environment: | ||
| Last Closed: | 2022-08-10 11:12:53 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2109125 | |||
This is actually already a known issue and should be fixed with: https://issues.redhat.com/browse/MGMT-10004 (MGMT-10004 [In Progress]: Add PreprovisioningImage controller to assisted-service). Opened BZ just for tracking and will link to the Epic for the above task.

While https://issues.redhat.com/browse/MGMT-10004 should fix the issue in ACM 2.6, I think the ZTP flow should work on 4.11 with ACM 2.5 as well. I think we should fix whatever broke the metal3 compatibility with the ZTP flow in ACM 2.5.

Triaging notes: starting with 4.11, the PreprovisioningImage CR controller no longer reconciles images with the InfraEnv attached. This is to enable an integrated ZTP flow via a new controller tracked in https://issues.redhat.com/browse/MGMT-10004. The problem is that BMO expects the image to be reconciled to move the BareMetalHost forward, even if inspection and cleaning are disabled, and live ISO deployment is requested. We need to fix that.

I tested OCP version registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-06-21-193241 on the hub, with MCE 2.0.1 (ACM 2.5.1) and MCE 2.1 (ACM 2.6) and experienced the same issue on both. The BMH gets stuck "provisioning" and the virtualmedia is never attached to the spoke node.

To reproduce on dev-scripts without any assisted components:
1) Build dev-scripts with
export NUM_EXTRA_WORKERS=1 # or more
2) Copy any ISO as test.iso to /opt/dev-scripts/ironic/html/images/
3) Apply this manifest, replacing credentials and the System UUID with ones from dev-scripts/ocp/ostest/extra_baremetalhosts.json:
---
apiVersion: v1
kind: Secret
metadata:
  name: ostest-extraworker-0-bmc-secret
  namespace: openshift-machine-api
type: Opaque
data:
  username: YWRtaW4=
  password: ...
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: ostest-extraworker-0
  namespace: openshift-machine-api
  annotations:
    inspect.metal3.io: disabled
  labels:
    infraenvs.agent-install.openshift.io: test
spec:
  online: true
  automatedCleaningMode: disabled
  bootMACAddress: 00:d9:5c:22:74:3f
  bmc:
    address: "redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/b549a45d-ee8a-418a-bbe8-fd434a5b2658"
    credentialsName: ostest-extraworker-0-bmc-secret
  image:
    url: http://172.22.0.1/images/test.iso
    diskFormat: live-iso
Without any fixes, the BMH used to get stuck in "inspecting"; now it gets stuck in "provisioning". Check with Ironic:
$ baremetal node show openshift-machine-api~ostest-extraworker-0 --fields provision_state power_state instance_info
+-----------------+----------------------+
| Field | Value |
+-----------------+----------------------+
| instance_info | {'capabilities': {}} |
| power_state | power off |
| provision_state | manageable |
+-----------------+----------------------+
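
The triaging notes describe a specific host configuration: inspection disabled, automated cleaning disabled, and a live ISO image requested. As a rough aid for checking a manifest against that description before applying it, here is a minimal Python sketch. The function name `live_iso_problems` and its messages are hypothetical, and this is not how BMO itself validates hosts. It assumes the manifest has already been loaded into a plain dict, and it treats `image.format` as the correct key, in line with the correction posted later in this thread.

```python
def live_iso_problems(bmh: dict) -> list:
    """Return a list of reasons why a BareMetalHost dict does not match the
    'live ISO, no inspection, no cleaning' setup described in this bug.

    Hypothetical helper for illustration only; not part of BMO or metal3.
    """
    problems = []
    metadata = bmh.get("metadata") or {}
    spec = bmh.get("spec") or {}
    annotations = metadata.get("annotations") or {}
    image = spec.get("image") or {}

    if annotations.get("inspect.metal3.io") != "disabled":
        problems.append("inspection is not disabled")
    if spec.get("automatedCleaningMode") != "disabled":
        problems.append("automated cleaning is not disabled")
    # The thread later corrects 'diskFormat' to 'format' for the image.
    if "diskFormat" in image:
        problems.append("image.diskFormat should be image.format")
    if image.get("format") != "live-iso":
        problems.append("image.format is not live-iso")
    if not image.get("url"):
        problems.append("image.url is missing")
    return problems


# The BareMetalHost from the reproduction steps, reduced to the relevant fields.
posted = {
    "metadata": {"annotations": {"inspect.metal3.io": "disabled"}},
    "spec": {
        "automatedCleaningMode": "disabled",
        "image": {
            "url": "http://172.22.0.1/images/test.iso",
            "diskFormat": "live-iso",
        },
    },
}

# The same host with the image key corrected to 'format'.
corrected = {
    "metadata": posted["metadata"],
    "spec": {
        "automatedCleaningMode": "disabled",
        "image": {
            "url": "http://172.22.0.1/images/test.iso",
            "format": "live-iso",
        },
    },
}

print(live_iso_problems(posted))
print(live_iso_problems(corrected))
```

With the manifest as originally posted, the sketch reports the `diskFormat` key; with the corrected key it reports nothing.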
Correction:
- diskFormat: live-iso
+ format: live-iso

Not finished yet, the Ironic patch is still pending.

Once the upstream Ironic patch merges, we still need to backport it to the right branch so that it is picked up in a downstream build, and we will tag it for ironic-image.

For 4.11 we need https://review.opendev.org/c/openstack/ironic/+/847657/
If CI cooperates we will get this merged ASAP.

Adding Depends On for 2101511 since we need a 4.12 tracker, based on a Slack conversation.

https://github.com/openshift/ironic-image/pull/281 contains the RPMs for Ironic tagged for 4.11 with https://review.opendev.org/c/openstack/ironic/+/847657/

No longer depends on 4.12 BZ.

QE were able to test with cluster bot; now we just need the staff eng to add labels in the PR.

Verified on registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-01-065600

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-01-065600   True        False         4h25m   Cluster version is 4.11.0-0.nightly-2022-07-01-065600

$ oc get bmh
NAME                   STATE         CONSUMER   ONLINE   ERROR   AGE
spoke-master-0-0-bmh   provisioned              true             101m
spoke-master-0-1-bmh   provisioned              true             101m
spoke-master-0-2-bmh   provisioned              true             101m
spoke-worker-0-0-bmh   provisioned              true             101m
spoke-worker-0-1-bmh   provisioned              true             101m

$ oc get clusterdeployment
NAME      INFRAID                                PLATFORM          REGION   VERSION   CLUSTERTYPE   PROVISIONSTATUS   POWERSTATE   AGE
spoke-0   dbd36a41-8766-49da-bf3c-430e77e8f964   agent-baremetal            4.11.0                  Provisioned                    102m

$ oc get aci
NAME      CLUSTER   STATE
spoke-0   spoke-0   adding-hosts

*** Bug 2100904 has been marked as a duplicate of this bug. ***

*** Bug 2051533 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069
Description of the problem:
Unable to deploy a spoke cluster via ZTP on an OCP 4.11 hub. Spoke BMHs (on the hub) are stuck in "State: inspecting" and the BMO container on the hub shows:
{"level":"info","ts":1652792463.1536887,"logger":"provisioner.ironic","msg":"current provision state","host":"chub1-0~chub1-master-0-0-bmh","lastError":"","current":"manageable","target":""}
{"level":"info","ts":1652792463.1537554,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"chub1-0/chub1-master-0-0-bmh","provisioningState":"inspecting","requeue":true,"after":300}

Release version:
- Latest 2.5 ACM or 2.0 MCE downstream snapshots
- OCP 4.11 on hub (4.11.0-0.nightly-2022-05-10-045003)
- OCP 4.10 or 4.11 spokes

Steps to reproduce:
1. Deploy OCP 4.11 hub with latest ACM 2.5 or MCE 2.0 and Infrastructure Operator
2. Try to deploy spoke using ZTP kubeapi CRD flow with Infrastructure Operator

Actual results:
- BMH stuck "inspecting"
- spoke BM nodes never get powered on

Expected results:
ZTP flow works as expected

Additional info:
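
The symptom in the description is visible as a pair of BMO log entries: the Ironic provisioner reports the node as "manageable" while the BareMetalHost controller keeps requeuing the host in "inspecting". A minimal Python sketch of scanning BMO logs for that pairing follows. The function name `find_stalled_hosts` is hypothetical, the field names are taken from the two log lines quoted in the description, and the pairing rule itself is only an assumed heuristic for this bug, not an official diagnostic.

```python
import json


def find_stalled_hosts(log_lines):
    """Return hosts whose BMH state is 'inspecting' while Ironic reports
    'manageable', based on JSON log lines from the BMO container.

    Hypothetical heuristic for illustration; field names follow the log
    lines quoted in this bug report.
    """
    ironic_state = {}  # "namespace/name" -> last Ironic provision state seen
    bmh_state = {}     # "namespace/name" -> last BMH provisioning state seen

    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if not isinstance(entry, dict):
            continue

        logger = entry.get("logger")
        if logger == "provisioner.ironic" and "current" in entry:
            host = entry.get("host")
            if host:
                # The provisioner logs hosts as "namespace~name".
                ironic_state[host.replace("~", "/", 1)] = entry["current"]
        elif logger == "controllers.BareMetalHost" and "provisioningState" in entry:
            host = entry.get("baremetalhost")
            if host:
                bmh_state[host] = entry["provisioningState"]

    return sorted(
        host
        for host, state in bmh_state.items()
        if state == "inspecting" and ironic_state.get(host) == "manageable"
    )


# The two log lines quoted in the description.
sample_log = [
    '{"level":"info","ts":1652792463.1536887,"logger":"provisioner.ironic",'
    '"msg":"current provision state","host":"chub1-0~chub1-master-0-0-bmh",'
    '"lastError":"","current":"manageable","target":""}',
    '{"level":"info","ts":1652792463.1537554,"logger":"controllers.BareMetalHost",'
    '"msg":"done","baremetalhost":"chub1-0/chub1-master-0-0-bmh",'
    '"provisioningState":"inspecting","requeue":true,"after":300}',
]

print(find_stalled_hosts(sample_log))
```

A host can legitimately pass through this combination for a short time, so a match is only a hint that the host is worth a closer look.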