Bug 2087213

Summary: Spoke BMH stuck "inspecting" when deployed via ZTP in 4.11 OCP hub
Product: OpenShift Container Platform Reporter: Chad Crum <ccrum>
Component: Bare Metal Hardware Provisioning Assignee: Dmitry Tantsur <dtantsur>
Bare Metal Hardware Provisioning sub component: baremetal-operator QA Contact: Chad Crum <ccrum>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: high CC: calfonso, ccrum, ercohen, imelofer, mcornea, nshidlin, oourfali, sasha, smiron, trwest, tsedovic, yfirst
Version: 4.11 Keywords: TestBlocker, Triaged
Target Milestone: --- Flags: calfonso: needinfo-
Target Release: 4.11.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 2101511 2109125 Environment:
Last Closed: 2022-08-10 11:12:53 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2109125    

Description Chad Crum 2022-05-17 15:20:02 UTC
Description of the problem:
Unable to deploy a spoke cluster via ZTP on an OCP 4.11 hub. The spoke BMHs (on the hub) are stuck at:

"State:  inspecting"

and the BMO container on the hub shows:
  {"level":"info","ts":1652792463.1536887,"logger":"provisioner.ironic","msg":"current provision state","host":"chub1-0~chub1-master-0-0-bmh","lastError":"","current":"manageable","target":""}
  {"level":"info","ts":1652792463.1537554,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"chub1-0/chub1-master-0-0-bmh","provisioningState":"inspecting","requeue":true,"after":300}
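
The pattern in these two log lines (Ironic reports "manageable" with no target while BMO keeps requeueing the host in "inspecting") can be spotted mechanically. Below is a minimal sketch, assuming the BMO log is available as one JSON object per line with the field names seen above; the function name find_stalled_hosts is hypothetical and not part of BMO or any tooling.

```python
import json

# The two BMO log lines quoted above, one JSON object per line.
SAMPLE_LOG = [
    '{"level":"info","ts":1652792463.1536887,"logger":"provisioner.ironic",'
    '"msg":"current provision state","host":"chub1-0~chub1-master-0-0-bmh",'
    '"lastError":"","current":"manageable","target":""}',
    '{"level":"info","ts":1652792463.1537554,"logger":"controllers.BareMetalHost",'
    '"msg":"done","baremetalhost":"chub1-0/chub1-master-0-0-bmh",'
    '"provisioningState":"inspecting","requeue":true,"after":300}',
]


def find_stalled_hosts(log_lines):
    """Return hosts that BMO requeues in 'inspecting' while Ironic says 'manageable'.

    Hypothetical helper: the field names are taken from the log lines above.
    """
    ironic_state = {}  # "namespace/name" -> last Ironic provision state seen
    stalled = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as stack traces
        logger = entry.get("logger")
        if logger == "provisioner.ironic" and "current" in entry:
            # Provisioner lines identify the host as "namespace~name".
            host = entry.get("host", "").replace("~", "/", 1)
            ironic_state[host] = entry["current"]
        elif logger == "controllers.BareMetalHost" and entry.get("msg") == "done":
            host = entry.get("baremetalhost", "")
            if (
                entry.get("provisioningState") == "inspecting"
                and entry.get("requeue")
                and ironic_state.get(host) == "manageable"
                and host not in stalled
            ):
                stalled.append(host)
    return stalled


if __name__ == "__main__":
    for host in find_stalled_hosts(SAMPLE_LOG):
        print("possibly stalled:", host)
```

A single match is only a hint; a host is worth investigating when the same pair repeats across several 300-second requeues.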


Release version:

- Latest 2.5 ACM or 2.0 MCE downstream snapshots
- OCP 4.11 on hub (4.11.0-0.nightly-2022-05-10-045003)
- OCP 4.10 or 4.11 spokes

Steps to reproduce:
1. Deploy an OCP 4.11 hub with the latest ACM 2.5 or MCE 2.0 and the Infrastructure Operator
2. Try to deploy a spoke using the ZTP kubeapi CRD flow with the Infrastructure Operator

Actual results:

- BMH stuck "inspecting"
- spoke BM nodes never get powered on

Expected results:
ZTP flow works as expected

Additional info:

Comment 1 Chad Crum 2022-05-17 15:21:32 UTC
This is actually already a known issue and should be fixed with: https://issues.redhat.com/browse/MGMT-10004 (MGMT-10004 [In Progress]: Add PreprovisioningImage controller to assisted-service)

Opened BZ just for tracking and will link to the Epic for the above task

Comment 3 Eran Cohen 2022-05-26 11:38:55 UTC
While https://issues.redhat.com/browse/MGMT-10004 should fix the issue in ACM 2.6, I think the ZTP flow should work on 4.11 with ACM 2.5 as well.
I think we should fix whatever broke the metal3 compatibility with the ZTP flow in ACM 2.5.

Comment 4 Dmitry Tantsur 2022-05-31 11:13:39 UTC
Triaging notes: starting with 4.11, the PreprovisioningImage CR controller no longer reconciles images with the InfraEnv attached. This is to enable an integrated ZTP flow via a new controller tracked in https://issues.redhat.com/browse/MGMT-10004.

The problem is that BMO expects the image to be reconciled to move the BareMetalHost forward, even if inspection and cleaning are disabled, and live ISO deployment is requested. We need to fix that.

Comment 7 Chad Crum 2022-06-21 20:25:57 UTC
I tested OCP version registry.ci.openshift.org/ocp/release:4.11.0-0.ci-2022-06-21-193241 on the hub, with MCE 2.0.1 (ACM 2.5.1) and MCE 2.1 (ACM 2.6) and experienced the same issue on both.

The BMH gets stuck in "provisioning" and the virtual media is never attached to the spoke node.

Comment 12 Dmitry Tantsur 2022-06-23 16:54:50 UTC
To reproduce on dev-scripts without any assisted components:

1) Build dev-scripts with

 export NUM_EXTRA_WORKERS=1  # or more

2) Copy any ISO as test.iso to /opt/dev-scripts/ironic/html/images/

3) Apply this manifest, replacing credentials and the System UUID with ones from dev-scripts/ocp/ostest/extra_baremetalhosts.json:

    ---                                                                            
    apiVersion: v1                                                                  
    kind: Secret                                                                    
    metadata:                                                                      
      name: ostest-extraworker-0-bmc-secret                                        
      namespace: openshift-machine-api                                              
    type: Opaque                                                                    
    data:                                                                          
      username: YWRtaW4=                                                            
      password: ...                                                        
    ---                                                                            
    apiVersion: metal3.io/v1alpha1                                                  
    kind: BareMetalHost                                                            
    metadata:                                                                      
      name: ostest-extraworker-0                                                    
      namespace: openshift-machine-api                                              
      annotations:                                                                
        inspect.metal3.io: disabled                                                
      labels:                                                                      
        infraenvs.agent-install.openshift.io: test                                  
    spec:                                                                          
      online: true                                                                  
      automatedCleaningMode: disabled                                              
      bootMACAddress: 00:d9:5c:22:74:3f                                            
      bmc:                                                                        
        address: "redfish-virtualmedia+http://192.168.111.1:8000/redfish/v1/Systems/b549a45d-ee8a-418a-bbe8-fd434a5b2658"
        credentialsName: ostest-extraworker-0-bmc-secret                            
      image:                                                                      
        url: http://172.22.0.1/images/test.iso                                      
        diskFormat: live-iso


Without any fixes, the BMH used to get stuck in "inspecting"; it now gets stuck in "provisioning". Check with Ironic:

$ baremetal node show openshift-machine-api~ostest-extraworker-0 --fields provision_state power_state instance_info
+-----------------+----------------------+
| Field           | Value                |
+-----------------+----------------------+
| instance_info   | {'capabilities': {}} |
| power_state     | power off            |
| provision_state | manageable           |
+-----------------+----------------------+

Comment 13 Dmitry Tantsur 2022-06-23 17:05:48 UTC
Correction:

-        diskFormat: live-iso
+        format: live-iso
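
Since the reproducer depends on three settings together (inspection disabled, automated cleaning disabled, and a live-ISO image), a quick pre-flight check of the manifest can catch slips like the field name corrected above. This is a minimal sketch operating on a plain dict that mirrors the BareMetalHost manifest from the reproducer; the function name live_iso_preconditions is hypothetical, and the checks only reflect what is stated in this bug, not the full BareMetalHost schema.

```python
# BareMetalHost from the reproducer, with the corrected image field name.
EXAMPLE_BMH = {
    "apiVersion": "metal3.io/v1alpha1",
    "kind": "BareMetalHost",
    "metadata": {
        "name": "ostest-extraworker-0",
        "namespace": "openshift-machine-api",
        "annotations": {"inspect.metal3.io": "disabled"},
        "labels": {"infraenvs.agent-install.openshift.io": "test"},
    },
    "spec": {
        "online": True,
        "automatedCleaningMode": "disabled",
        "image": {"url": "http://172.22.0.1/images/test.iso", "format": "live-iso"},
    },
}


def live_iso_preconditions(bmh):
    """Return reasons why the BMH dict does not match the reproducer's shape.

    Hypothetical helper; an empty list means all three settings are present.
    """
    problems = []
    annotations = bmh.get("metadata", {}).get("annotations") or {}
    spec = bmh.get("spec") or {}
    image = spec.get("image") or {}

    if annotations.get("inspect.metal3.io") != "disabled":
        problems.append("inspection is not disabled via the inspect.metal3.io annotation")
    if spec.get("automatedCleaningMode") != "disabled":
        problems.append("spec.automatedCleaningMode is not 'disabled'")
    if "diskFormat" in image:
        problems.append("image uses 'diskFormat'; the field is 'format'")
    if image.get("format") != "live-iso":
        problems.append("spec.image.format is not 'live-iso'")
    return problems


if __name__ == "__main__":
    issues = live_iso_preconditions(EXAMPLE_BMH)
    print("OK" if not issues else "\n".join(issues))
```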

Comment 15 Dmitry Tantsur 2022-06-24 12:10:53 UTC
Not finished yet, the Ironic patch is still pending.

Comment 17 Iury Gregory Melo Ferreira 2022-06-27 01:49:38 UTC
Once the upstream Ironic patch merges, we still need to backport it to the right branch so that it is picked up in a downstream build, and then we will tag it for ironic-image.

Comment 18 Iury Gregory Melo Ferreira 2022-06-27 01:58:13 UTC
For 4.11 we need https://review.opendev.org/c/openstack/ironic/+/847657/

If CI cooperates we will get this merged asap

Comment 19 Iury Gregory Melo Ferreira 2022-06-27 16:44:17 UTC
Adding Depends On for 2101511 since we need a 4.12 tracker based on slack conversation

Comment 20 Iury Gregory Melo Ferreira 2022-06-28 18:34:27 UTC
https://github.com/openshift/ironic-image/pull/281 contains the rpms for ironic tagged for 4.11 with https://review.opendev.org/c/openstack/ironic/+/847657/

Comment 23 Iury Gregory Melo Ferreira 2022-06-30 23:27:02 UTC
No longer depends on the 4.12 BZ.
QE was able to test with cluster bot; now we just need the staff engineers to add labels to the PR.

Comment 26 Trey West 2022-07-01 18:00:02 UTC
Verified on registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-07-01-065600

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-07-01-065600   True        False         4h25m   Cluster version is 4.11.0-0.nightly-2022-07-01-065600
$ oc get bmh
NAME                   STATE         CONSUMER   ONLINE   ERROR   AGE
spoke-master-0-0-bmh   provisioned              true             101m
spoke-master-0-1-bmh   provisioned              true             101m
spoke-master-0-2-bmh   provisioned              true             101m
spoke-worker-0-0-bmh   provisioned              true             101m
spoke-worker-0-1-bmh   provisioned              true             101m
$ oc get clusterdeployment
NAME      INFRAID                                PLATFORM          REGION   VERSION   CLUSTERTYPE   PROVISIONSTATUS   POWERSTATE   AGE
spoke-0   dbd36a41-8766-49da-bf3c-430e77e8f964   agent-baremetal            4.11.0                  Provisioned                    102m
$ oc get aci
NAME      CLUSTER   STATE
spoke-0   spoke-0   adding-hosts

Comment 28 Dmitry Tantsur 2022-07-05 14:49:45 UTC
*** Bug 2100904 has been marked as a duplicate of this bug. ***

Comment 29 Flavio Percoco 2022-07-11 14:12:24 UTC
*** Bug 2051533 has been marked as a duplicate of this bug. ***

Comment 30 errata-xmlrpc 2022-08-10 11:12:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069