Bug 2087213 - Spoke BMH stuck "inspecting" when deployed via ZTP in 4.11 OCP hub
Summary: Spoke BMH stuck "inspecting" when deployed via ZTP in 4.11 OCP hub
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.11
Hardware: x86_64
OS: Linux
Importance: high urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Dmitry Tantsur
QA Contact: Chad Crum
URL:
Whiteboard:
Duplicates: 2100904
Depends On:
Blocks: 2109125
 
Reported: 2022-05-17 15:20 UTC by Chad Crum
Modified: 2022-08-10 11:13 UTC (History)
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2101511 2109125
Environment:
Last Closed: 2022-08-10 11:12:53 UTC
Target Upstream Version:
Embargoed:
Flags: calfonso: needinfo-


Attachments: none


Links
System ID Private Priority Status Summary Last Updated
Github openshift baremetal-operator pull 229 0 None Merged Bug 2087213: Don't require PreprovisioningImages for older ZTP 2022-06-24 07:54:31 UTC
Github openshift baremetal-operator pull 231 0 None Merged Bug 2087213: Don't require pre-provisioning image for live ISO provisioning 2022-06-24 12:10:29 UTC
Github openshift ironic-image pull 281 0 None Merged Bug 2087213: Ironic Fix for ZTP with 4.11 2022-07-01 06:26:50 UTC
Github stolostron backlog issues 22600 0 None None None 2022-05-17 20:41:04 UTC
OpenStack gerrit 847388 0 None MERGED No deploy_kernel/ramdisk with the ramdisk deploy and no cleaning 2022-06-27 09:12:51 UTC
OpenStack gerrit 847657 0 None MERGED No deploy_kernel/ramdisk with the ramdisk deploy and no cleaning 2022-07-01 06:26:46 UTC
Red Hat Issue Tracker MGMT-9979 0 None None None 2022-05-17 15:30:27 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:13:11 UTC

Description Chad Crum 2022-05-17 15:20:02 UTC
Description of the problem:
Unable to deploy a spoke cluster via ZTP on an OCP 4.11 hub - the spoke BMHs (on the hub) are stuck in:

"State:  inspecting"

and BMO container on hub shows:
  {"level":"info","ts":1652792463.1536887,"logger":"provisioner.ironic","msg":"current provision state","host":"chub1-0~chub1-master-0-0-bmh","lastError":"","current":"manageable","target":""}
  {"level":"info","ts":1652792463.1537554,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"chub1-0/chub1-master-0-0-bmh","provisioningState":"inspecting","requeue":true,"after":300}


Release version:

- Latest 2.5 ACM or 2.0 MCE downstream snapshots
- OCP 4.11 on hub (4.11.0-0.nightly-2022-05-10-045003)
- OCP 4.10 or 4.11 spokes

Steps to reproduce:
1. Deploy OCP 4.11 hub with latest ACM 2.5 or MCE 2.0 and Infrastructure Operator
2. Try to deploy spoke using ztp kubeapi crd flow with infrastructure operator

Actual results:

- BMH stuck "inspecting"
- spoke BM nodes never get powered on

Expected results:
ZTP flow works as expected

Additional info:

Comment 1 Chad Crum 2022-05-17 15:21:32 UTC
This is actually already a known issue and should be fixed by https://issues.redhat.com/browse/MGMT-10004 (MGMT-10004 [In Progress]: Add PreprovisioningImage controller to assisted-service).

Opened this BZ just for tracking and will link it to the Epic for the above task.

Comment 3 Eran Cohen 2022-05-26 11:38:55 UTC
While https://issues.redhat.com/browse/MGMT-10004 should fix the issue in ACM 2.6 I think the ZTP flow should work on 4.11 with ACM 2.5 as well.
I think we should fix whatever broke the metal3 compatibility with the ZTP flow in ACM 2.5.

Comment 4 Dmitry Tantsur 2022-05-31 11:13:39 UTC
Triaging notes: starting with 4.11, the PreprovisioningImage CR controller no longer reconciles images with the InfraEnv attached. This is to enable an integrated ZTP flow via a new controller tracked in https://issues.redhat.com/browse/MGMT-10004.

The problem is that BMO expects the image to be reconciled to move the BareMetalHost forward, even if inspection and cleaning are disabled, and live ISO deployment is requested. We need to fix that.
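An illustrative sketch of the gating condition the fix needs (this is not the actual BMO code, just a model of the reasoning above): a PreprovisioningImage should only be required when the host will actually boot the IPA ramdisk, which is not the case when inspection and cleaning are disabled and a live ISO deployment is requested.

```python
# Illustrative only: model of when BMO should wait for a reconciled
# PreprovisioningImage before moving the BareMetalHost forward.
def needs_preprovisioning_image(inspection_disabled, cleaning_disabled, live_iso):
    # The ramdisk image is only needed if inspection or cleaning will run,
    # or if the deployment itself is not a live ISO.
    return not (inspection_disabled and cleaning_disabled and live_iso)

# The reported ZTP case: inspection and cleaning disabled, live ISO requested.
print(needs_preprovisioning_image(True, True, True))   # False - BMH can proceed
print(needs_preprovisioning_image(False, True, True))  # True - inspection needs the ramdisk
```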

Comment 7 Chad Crum 2022-06-21 20:25:57 UTC
I tested OCP version registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-06-21-193241 on the hub, with MCE 2.0.1 (ACM 2.5.1) and MCE 2.1 (ACM 2.6) and experienced the same issue on both.

The BMH gets stuck "provisioning" and the virtual media is never attached to the spoke node.

Comment 12 Dmitry Tantsur 2022-06-23 16:54:50 UTC
To reproduce on dev-scripts without any assisted components:

1) Build dev-scripts with

 export NUM_EXTRA_WORKERS=1  # or more

2) Copy any ISO as test.iso to /opt/dev-scripts/ironic/html/images/

3) Apply this manifest, replacing credentials and the System UUID with ones from dev-scripts/ocp/ostest/extra_baremetalhosts.json:

    ---                                                                            
    apiVersion: v1                                                                  
    kind: Secret                                                                    
    metadata:                                                                      
      name: ostest-extraworker-0-bmc-secret                                        
      namespace: openshift-machine-api                                              
    type: Opaque                                                                    
    data:                                                                          
      username: YWRtaW4=                                                            
      password: ...                                                        
    ---                                                                            
    apiVersion: metal3.io/v1alpha1                                                  
    kind: BareMetalHost                                                            
    metadata:                                                                      
      name: ostest-extraworker-0                                                    
      namespace: openshift-machine-api                                              
      annotations:                                                                
        inspect.metal3.io: disabled                                                
      labels:                                                                      
        infraenvs.agent-install.openshift.io: test                                  
    spec:                                                                          
      online: true                                                                  
      automatedCleaningMode: disabled                                              
      bootMACAddress: 00:d9:5c:22:74:3f                                            
      bmc:                                                                        
        address: "redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/b549a45d-ee8a-418a-bbe8-fd434a5b2658"
        credentialsName: ostest-extraworker-0-bmc-secret                            
      image:                                                                      
        url: http://172.22.0.1/images/test.iso                                      
        diskFormat: live-iso
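Note that the Secret's `data` values are base64-encoded: the `username` above decodes to `admin` (the `password` is elided here). To produce values for your own credentials:

```python
import base64

# Kubernetes Secret "data" fields carry base64-encoded bytes.
print(base64.b64encode(b"admin").decode())  # YWRtaW4=
```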


Without any fixes, the BMH used to get stuck in "inspecting"; with the BMO fixes, it is now stuck in "provisioning". Check with Ironic:

$ baremetal node show openshift-machine-api~ostest-extraworker-0 --fields provision_state power_state instance_info
+-----------------+----------------------+
| Field           | Value                |
+-----------------+----------------------+
| instance_info   | {'capabilities': {}} |
| power_state     | power off            |
| provision_state | manageable           |
+-----------------+----------------------+

Comment 13 Dmitry Tantsur 2022-06-23 17:05:48 UTC
Correction:

-        diskFormat: live-iso
+        format: live-iso

Comment 15 Dmitry Tantsur 2022-06-24 12:10:53 UTC
Not finished yet, the Ironic patch is still pending.

Comment 17 Iury Gregory Melo Ferreira 2022-06-27 01:49:38 UTC
Once the upstream ironic patch merges, we still need to backport it to the right branch so it can be picked up in a downstream build, and then we will tag it for ironic-image.

Comment 18 Iury Gregory Melo Ferreira 2022-06-27 01:58:13 UTC
For 4.11 we need https://review.opendev.org/c/openstack/ironic/+/847657/

If CI cooperates we will get this merged asap

Comment 19 Iury Gregory Melo Ferreira 2022-06-27 16:44:17 UTC
Adding a Depends On for bug 2101511, since we need a 4.12 tracker based on a Slack conversation.

Comment 20 Iury Gregory Melo Ferreira 2022-06-28 18:34:27 UTC
https://github.com/openshift/ironic-image/pull/281 contains the rpms for ironic tagged for 4.11 with https://review.opendev.org/c/openstack/ironic/+/847657/

Comment 23 Iury Gregory Melo Ferreira 2022-06-30 23:27:02 UTC
No longer depends on the 4.12 BZ.
QE was able to test with cluster-bot; now we just need the staff engineers to add labels to the PR.

Comment 26 Trey West 2022-07-01 18:00:02 UTC
Verified on registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-01-065600

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-01-065600   True        False         4h25m   Cluster version is 4.11.0-0.nightly-2022-07-01-065600
$ oc get bmh
NAME                   STATE         CONSUMER   ONLINE   ERROR   AGE
spoke-master-0-0-bmh   provisioned              true             101m
spoke-master-0-1-bmh   provisioned              true             101m
spoke-master-0-2-bmh   provisioned              true             101m
spoke-worker-0-0-bmh   provisioned              true             101m
spoke-worker-0-1-bmh   provisioned              true             101m
$ oc get clusterdeployment
NAME      INFRAID                                PLATFORM          REGION   VERSION   CLUSTERTYPE   PROVISIONSTATUS   POWERSTATE   AGE
spoke-0   dbd36a41-8766-49da-bf3c-430e77e8f964   agent-baremetal            4.11.0                  Provisioned                    102m
$ oc get aci
NAME      CLUSTER   STATE
spoke-0   spoke-0   adding-hosts

Comment 28 Dmitry Tantsur 2022-07-05 14:49:45 UTC
*** Bug 2100904 has been marked as a duplicate of this bug. ***

Comment 29 Flavio Percoco 2022-07-11 14:12:24 UTC
*** Bug 2051533 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2022-08-10 11:12:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

