Bug 2096944 - BMH stuck in registering on an OCP HUB cluster deployed on openstack with platform: openstack. 'Node' object has no attribute 'verify_step'
Summary: BMH stuck in registering on an OCP HUB cluster deployed on openstack with pla...
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.10
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Jacob Anders
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-06-14 15:27 UTC by Alexander Chuzhoy
Modified: 2022-10-05 02:57 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-10-05 02:56:49 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack Storyboard 2010091 0 None None None 2022-06-21 03:23:49 UTC
OpenStack gerrit 846859 0 None MERGED Prevent clear_job_queue and reset_idrac failures on older iDRACs 2022-07-13 11:00:43 UTC
OpenStack gerrit 851950 0 None MERGED Modify do_node_verify to avoid state machine stuck 2022-10-04 23:36:14 UTC

Description Alexander Chuzhoy 2022-06-14 15:27:18 UTC
OCP version: 4.10.17
multicluster-engine.v2.0.0


The Hub cluster (3 masters and 2 workers) was deployed on Openstack.

The install-config for the cluster is below:

apiVersion: v1
baseDomain: dno.ccitredhat.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    openstack:
      type: ci.memory.medium
  replicas: 2
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    openstack:
      type: ci.memory.medium
  replicas: 3
metadata:
  creationTimestamp: null
  name: ai
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.169.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  openstack:
    apiFloatingIP: 10.0.188.56
    apiVIP: 192.169.0.5
    cloud: openstack
    computeFlavor: m1.large
    defaultMachinePlatform:
      type: m1.large
    externalDNS: null
    externalNetwork: shared_net_5
    ingressFloatingIP: 10.0.188.13
    ingressVIP: 192.169.0.7
publish: External
pullSecret: ""
sshKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCytALUAofclqRVw+snWTC/lWk97XfMaozHsOQV37LpbDjgs87sVt3WmcioUjjh9G4AEKVdEOfSKxNJ/NizzeZtZ+Egq3nxjviewZWwnd94aT1dW806etEYPahcad/u8fuJ/XZLKaD4QX3gl8TGiSZVr7lXtWCP9at8DpDdmGJLbP5ObY9T3q9N2wX2VFCn5hPzlhfzRDvWevcQ5i+uOFEJsKMQ4E7juMN75TmZF7ys6x8NjDcW7MVV9PZHfHEaOz18EUptBVpE9IolSVPpqZn+6EfEyWYP+fsDL8H5/tid2bkwBJGTVnUBuojC7Oe9jhPB9zf2KVbLpxk2dJVCfDQWHu3jRON9frlPoYJoKiF5Zdj45+OLZof8MYGKrMhaRbUNmPvYfGYq9G9k0qOpkyoc5+PRufy4LozRc/b4AcOhWrQTKBb8ksFWmwOg+RiWdb3cXvANhUOYBfGrZD6HFspMk3cA94nYfKUzvXbl1x9VDqQfjNVyA5F9siISDQK0euDLnt6u244FfPrvhkuAYiKwm9IKpfJ5H9EPIzPXKeRaq/iop1g/IlcuVrIZwlgrfDr3kKpzBotYqln5otQ5AlbLHC5z+1MieXmvTFWQ0xmQQlzrkswLN/c7wskDkzzc46rHDhyq+IjoiAIL3NmNiUJhjL5XmPsHF6726Zu+md6mTw==
  cardno:000606111718



Attempted to deploy a spoke cluster.




The BMH creation gets stuck in registering and then, after a very long time, shows "registration error":
oc get bmh sealusa34.mobius.lab.eng.rdu2.redhat.com
NAME                                       STATE         CONSUMER   ONLINE   ERROR                AGE
sealusa34.mobius.lab.eng.rdu2.redhat.com   registering              true     registration error   21h



oc get bmh sealusa34.mobius.lab.eng.rdu2.redhat.com -o yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    bmac.agent-install.openshift.io/hostname: sealusa34.mobius.lab.eng.rdu2.redhat.com
    bmac.agent-install.openshift.io/role: master
    inspect.metal3.io: disabled
  creationTimestamp: "2022-06-13T17:58:35Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 1
  labels:
    infraenvs.agent-install.openshift.io: beaker1
  name: sealusa34.mobius.lab.eng.rdu2.redhat.com
  namespace: beaker1
  resourceVersion: "8009082"
  uid: 3842c3d2-fccc-47eb-9776-de8d1752f5fe
spec:
  automatedCleaningMode: disabled
  bmc:
    address: idrac-virtualmedia+https://10.9.78.48/redfish/v1/Systems/System.Embedded.1
    credentialsName: bmc-secret3
    disableCertificateVerification: true
  bootMACAddress: f8:f2:1e:31:66:29
  online: true
  rootDeviceHints:
    deviceName: /dev/sda
status:
  errorCount: 2
  errorMessage: 'Async execution of do_node_verify failed with error: ''Node'' object
    has no attribute ''verify_step'''
  errorType: registration error
  goodCredentials: {}
  hardwareProfile: ""
  lastUpdated: "2022-06-13T18:53:03Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: null
      start: null
    provision:
      end: null
      start: null
    register:
      end: null
      start: "2022-06-13T17:58:35Z"
  operationalStatus: error
  poweredOn: false
  provisioning:
    ID: 2abcffd7-e21b-4bc9-a23a-aba3e46ca08c
    bootMode: UEFI
    image:
      url: ""
    state: registering
  triedCredentials:
    credentials:
      name: bmc-secret3
      namespace: beaker1
    credentialsVersion: "7952011"



oc describe bmh sealusa34.mobius.lab.eng.rdu2.redhat.com 
Name:         sealusa34.mobius.lab.eng.rdu2.redhat.com
Namespace:    beaker1
Labels:       infraenvs.agent-install.openshift.io=beaker1
Annotations:  bmac.agent-install.openshift.io/hostname: sealusa34.mobius.lab.eng.rdu2.redhat.com
              bmac.agent-install.openshift.io/role: master
              inspect.metal3.io: disabled
API Version:  metal3.io/v1alpha1
Kind:         BareMetalHost
Metadata:
  Creation Timestamp:  2022-06-13T17:58:35Z
  Finalizers:
    baremetalhost.metal3.io
  Generation:  1
  Managed Fields:
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"baremetalhost.metal3.io":
    Manager:      baremetal-operator
    Operation:    Update
    Time:         2022-06-13T17:58:35Z
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:bmac.agent-install.openshift.io/hostname:
          f:bmac.agent-install.openshift.io/role:
          f:inspect.metal3.io:
        f:labels:
          .:
          f:infraenvs.agent-install.openshift.io:
      f:spec:
        .:
        f:automatedCleaningMode:
        f:bmc:
          .:
          f:address:
          f:credentialsName:
          f:disableCertificateVerification:
        f:bootMACAddress:
        f:online:
        f:rootDeviceHints:
          .:
          f:deviceName:
    Manager:      kubectl-create
    Operation:    Update
    Time:         2022-06-13T17:58:35Z
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:errorCount:
        f:errorMessage:
        f:errorType:
        f:goodCredentials:
        f:hardwareProfile:
        f:lastUpdated:
        f:operationHistory:
          .:
          f:deprovision:
            .:
            f:end:
            f:start:
          f:inspect:
            .:
            f:end:
            f:start:
          f:provision:
            .:
            f:end:
            f:start:
          f:register:
            .:
            f:end:
            f:start:
        f:operationalStatus:
        f:poweredOn:
        f:provisioning:
          .:
          f:ID:
          f:bootMode:
          f:image:
            .:
            f:url:
          f:state:
        f:triedCredentials:
          .:
          f:credentials:
            .:
            f:name:
            f:namespace:
          f:credentialsVersion:
    Manager:         baremetal-operator
    Operation:       Update
    Subresource:     status
    Time:            2022-06-13T18:53:03Z
  Resource Version:  8009082
  UID:               3842c3d2-fccc-47eb-9776-de8d1752f5fe
Spec:
  Automated Cleaning Mode:  disabled
  Bmc:
    Address:                           idrac-virtualmedia+https://10.9.78.48/redfish/v1/Systems/System.Embedded.1
    Credentials Name:                  bmc-secret3
    Disable Certificate Verification:  true
  Boot MAC Address:                    f8:f2:1e:31:66:29
  Online:                              true
  Root Device Hints:
    Device Name:  /dev/sda
Status:
  Error Count:    2
  Error Message:  Async execution of do_node_verify failed with error: 'Node' object has no attribute 'verify_step'
  Error Type:     registration error
  Good Credentials:
  Hardware Profile:  
  Last Updated:      2022-06-13T18:53:03Z
  Operation History:
    Deprovision:
      End:    <nil>
      Start:  <nil>
    Inspect:
      End:    <nil>
      Start:  <nil>
    Provision:
      End:    <nil>
      Start:  <nil>
    Register:
      End:             <nil>
      Start:           2022-06-13T17:58:35Z
  Operational Status:  error
  Powered On:          false
  Provisioning:
    ID:         2abcffd7-e21b-4bc9-a23a-aba3e46ca08c
    Boot Mode:  UEFI
    Image:
      URL:  
    State:  registering
  Tried Credentials:
    Credentials:
      Name:               bmc-secret3
      Namespace:          beaker1
    Credentials Version:  7952011
Events:                   <none>
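
For triage, the status fields shown above can also be checked programmatically. The following is a minimal sketch, assuming the object was exported with `oc get bmh <name> -o json` and parsed into a dict; the helper name `registration_problem` is hypothetical and not part of any Metal3 or OpenShift tooling.

```python
from typing import Optional


def registration_problem(bmh: dict) -> Optional[str]:
    """Return the error message if the host failed while registering, else None."""
    status = bmh.get("status", {})
    state = status.get("provisioning", {}).get("state")
    if state != "registering":
        return None
    if status.get("errorType") != "registration error":
        return None
    return status.get("errorMessage", "")


# Trimmed copy of the status block reported above.
sample = {
    "status": {
        "errorCount": 2,
        "errorMessage": "Async execution of do_node_verify failed with error: "
                        "'Node' object has no attribute 'verify_step'",
        "errorType": "registration error",
        "operationalStatus": "error",
        "provisioning": {"state": "registering"},
    }
}

print(registration_problem(sample))
```

A host that is still in `registering` but has not yet reported an error type returns `None` here, so the check only flags hosts that have already reported the registration error.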

Comment 1 Alexander Chuzhoy 2022-06-14 15:49:33 UTC
This was attempted against several machines:

1. PowerEdge R730
BIOS Version	 2.8.0
Firmware Version 2.50.50.50



2. PowerEdge R730
BIOS Version	2.7.1
Firmware Version 2.52.52.52

Comment 3 Derek Higgins 2022-06-14 16:52:04 UTC
I see this in the conductor logs; it's a new one for me...

2022-06-13 18:52:52.199 1 ERROR concurrent.futures ironic.common.exception.RedfishError: Redfish exception occurred. Error: In system 4c4c4544-0032-4710-804d-b3c04f435032 for node 2abcffd7-e21b-4bc9-a23a-aba3e46ca08c all managers failed: clear job queue. Errors: ['Manager 3250434f-c0b3-4d80-4710-00324c4c4544: The attribute Links/Oem/Dell/DellJobService is missing from the resource /redfish/v1/Managers/iDRAC.Embedded.1']
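
The error above says the manager resource lacks `Links/Oem/Dell/DellJobService`. As a rough illustration of that check, here is a sketch that walks a parsed manager resource (a dict) and reports whether the link is present. The helper name and both sample payloads are assumptions made up for illustration; they were not captured from the affected machines and this is not Ironic or sushy code.

```python
def has_dell_job_service(manager: dict) -> bool:
    """True if the manager resource advertises Links/Oem/Dell/DellJobService."""
    node = manager
    for key in ("Links", "Oem", "Dell", "DellJobService"):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True


# Illustrative payloads only; the "@odata.id" value is a placeholder.
with_job_service = {
    "Id": "iDRAC.Embedded.1",
    "Links": {"Oem": {"Dell": {"DellJobService": {"@odata.id": "(illustrative)"}}}},
}
without_job_service = {"Id": "iDRAC.Embedded.1", "Links": {}}

print(has_dell_job_service(with_job_service))
print(has_dell_job_service(without_job_service))
```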

Comment 4 Iury Gregory Melo Ferreira 2022-06-14 22:15:35 UTC
I've asked the Dell folks upstream whether they are aware of this issue on this specific HW.

Comment 5 Iury Gregory Melo Ferreira 2022-06-14 22:28:54 UTC
According to Richard Pioso from Dell, we should try to upgrade the machines.
The latest BIOS available is 2.13.0, while we have 2.7.1/2.8.0 (from 2018), and the latest iDRAC is 2.83.83.83, while we are using 2.50.50.50/2.52.52.52 (from 2017/2018).
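
When comparing the installed versions against the latest ones, the dotted strings need to be compared numerically, component by component. A small sketch using the versions quoted above (the helper names are hypothetical):

```python
def parse_version(text: str) -> tuple:
    """Turn a dotted version such as '2.50.50.50' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_older(installed: str, latest: str) -> bool:
    """True if the installed version is strictly older than the latest one."""
    return parse_version(installed) < parse_version(latest)


# iDRAC firmware versions from this report against the latest available.
for installed in ("2.50.50.50", "2.52.52.52"):
    print(installed, is_older(installed, "2.83.83.83"))

# BIOS versions from this report against the latest available.
for installed in ("2.7.1", "2.8.0"):
    print(installed, is_older(installed, "2.13.0"))
```

Plain string comparison would get the BIOS case wrong, since `"2.8.0" < "2.13.0"` is false when compared character by character.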

Comment 6 Iury Gregory Melo Ferreira 2022-06-14 22:29:56 UTC
Setting blocker flag - (since I think the FW upgrade would solve this issue)

Comment 7 Jacob Anders 2022-06-15 00:31:59 UTC
Additional triage notes:

I moved it to Ironic as BMO is just passing this on. It's 99.9% a firmware issue. I've encountered this while testing verify-steps during development. In my opinion the only thing we can do here on the Ironic side is perhaps handle this in a more elegant way.

@Iury I am happy to take this one if you'd like me to - this is related to my upstream verify-steps work. Will look into it a bit during my day.

Comment 8 Alexander Chuzhoy 2022-06-15 13:50:18 UTC
Updated the bios/firmware on a machine:
BIOS Version	2.13.0
Firmware Version	2.80.80.80


Still stuck:
oc get bmh -A
NAMESPACE   NAME                                       STATE         CONSUMER   ONLINE   ERROR   AGE
beaker1     sealusa34.mobius.lab.eng.rdu2.redhat.com   registering              true             11h


Updated must-gather:

http://file.rdu.redhat.com/~achuzhoy/bugs/2096944/must-gather2.tgz

Comment 9 Alexander Chuzhoy 2022-06-15 16:06:48 UTC
Same issue reproduces when the hub setup is "platform: none", so not related to platform: openstack

Comment 10 Alexander Chuzhoy 2022-06-15 19:29:20 UTC
The issue seems to be limited to idrac8 machines.
With idrac9 nodes - works as expected.

Comment 11 Iury Gregory Melo Ferreira 2022-06-16 12:00:36 UTC
Hey Sasha,

Here is the reply from the Dell contributors upstream:

"""
I checked - in 13G (R730) there is no support for these management functions, in 14G (R740) there is. Generally, I wouldn't use Redfish with 13G, try WS-Man.
while in 13G there is some Redfish support, it does not receive new features and sometimes bugs are not backported too. 13G is still WS-Man world.
e.g., Redfish RAID interface is not working in 13G too.
"""

Virtual Media is a requirement for you, right? I've asked whether WS-Man supports virtual media, to be sure about it before recommending that you switch to it.
The possible workaround I see would be to disable running clear_job_queue in ironic.conf for OCP 4.10 (but this can cause a regression).
Upstream, we can work to make it conditional and then work on backports.
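
To illustrate what "making it conditional" could look like in general terms, here is a rough sketch in which a verify step that hits a missing OEM attribute is logged and skipped instead of failing the whole verification. All names (`MissingAttributeError`, `run_verify_step`) are hypothetical stand-ins; this is not the Ironic code or the change that was eventually merged.

```python
import logging

LOG = logging.getLogger(__name__)


class MissingAttributeError(Exception):
    """Hypothetical stand-in for the error raised when a Redfish attribute is absent."""


def run_verify_step(step_name, step, node_id):
    """Run one verify step; treat a missing OEM attribute as unsupported, not fatal.

    Returns True if the step ran, False if it was skipped.
    """
    try:
        step()
    except MissingAttributeError as exc:
        LOG.warning("Skipping %s on node %s: %s", step_name, node_id, exc)
        return False
    return True


def clear_job_queue_unsupported():
    # Simulates a BMC that does not expose the job service.
    raise MissingAttributeError("Links/Oem/Dell/DellJobService is missing")


print(run_verify_step("clear_job_queue", clear_job_queue_unsupported, "example-node"))
print(run_verify_step("noop", lambda: None, "example-node"))
```

Any other exception type still propagates in this sketch, so only the "feature not present on this hardware" case is downgraded to a warning.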

Comment 12 Jacob Anders 2022-06-17 11:54:22 UTC
Taking over this BZ as I reproduced this on Ironic standalone and looked into what's happening in a fair bit of detail. Hope to have a patch up early next week.

Comment 17 Jacob Anders 2022-07-04 05:46:19 UTC
I revisited the patch, should be merge-ready or close.

Comment 18 Jacob Anders 2022-07-13 11:02:49 UTC
https://review.opendev.org/c/openstack/ironic/+/846859 has merged.

I will discuss with the Team whether we want to bump the version in ironic-image just for this fix, or whether it is better to bundle it with other fixes.

Comment 19 Jacob Anders 2022-07-13 12:24:27 UTC
After discussion with the Team, we decided to set this back to ASSIGNED as we're unable to raise a PR to update ironic-images as we are between releases.

The fix has merged upstream but we need to wait for a bit longer to start bringing it downstream.

Also - changing target to 4.12.

Comment 20 Dmitry Tantsur 2022-09-14 15:05:01 UTC
Any progress here? We're getting more reports.

Comment 21 Jacob Anders 2022-09-29 13:10:01 UTC
(In reply to Dmitry Tantsur from comment #20)
> Any progress here? We're getting more reports.

Apologies, I missed this!

I will check what versions have this code available and what backports may be needed and will chase it up in the next few days.

Comment 22 Jacob Anders 2022-10-03 22:11:52 UTC
(In reply to Dmitry Tantsur from comment #20)
> Any progress here? We're getting more reports.

While I fixed this in master, the fix has not been backported to the bugfix branches hence it's not yet available in OCP 4.10 and 4.11. We'll address this now (thank you for creating the upstream backports Dmitry) and hopefully will have the fix out in the next couple weeks.

Comment 23 Jacob Anders 2022-10-05 02:56:49 UTC
Patches resolving this issue 

https://review.opendev.org/c/openstack/ironic/+/846859
https://review.opendev.org/c/openstack/ironic/+/851950

are merged upstream in the master branch.

I am now making sure these patches are backported to all the relevant releases and included in future 4.11 and 4.10 z-streams.

There is a newer bug opened against the same issue in JIRA OCPBUGS project:

https://issues.redhat.com/browse/OCPBUGS-1740

As we are migrating from Bugzilla to JIRA I will close this issue and further work on the 4.10 fix will continue in OCPBUGS-1740.

I also created a separate bug ( https://issues.redhat.com/browse/OCPBUGS-2011 ) to track the 4.11 fix.

Comment 24 Jacob Anders 2022-10-05 02:57:48 UTC
IMPORTANT NOTE: BZ won't let me close this bug properly as a duplicate as the new bug is in JIRA, ideally this should be CLOSED with DUPLICATE of https://issues.redhat.com/browse/OCPBUGS-1740. The current status does not accurately reflect the state of this issue.

