OCP version: 4.10.17
multicluster-engine.v2.0.0

The Hub cluster (3 masters and 2 workers) was deployed on Openstack. The install-config for the cluster is below:

apiVersion: v1
baseDomain: dno.ccitredhat.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    openstack:
      type: ci.memory.medium
  replicas: 2
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    openstack:
      type: ci.memory.medium
  replicas: 3
metadata:
  creationTimestamp: null
  name: ai
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 192.169.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  openstack:
    apiFloatingIP: 10.0.188.56
    apiVIP: 192.169.0.5
    cloud: openstack
    computeFlavor: m1.large
    defaultMachinePlatform:
      type: m1.large
    externalDNS: null
    externalNetwork: shared_net_5
    ingressFloatingIP: 10.0.188.13
    ingressVIP: 192.169.0.7
publish: External
pullSecret: ""
sshKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQCytALUAofclqRVw+snWTC/lWk97XfMaozHsOQV37LpbDjgs87sVt3WmcioUjjh9G4AEKVdEOfSKxNJ/NizzeZtZ+Egq3nxjviewZWwnd94aT1dW806etEYPahcad/u8fuJ/XZLKaD4QX3gl8TGiSZVr7lXtWCP9at8DpDdmGJLbP5ObY9T3q9N2wX2VFCn5hPzlhfzRDvWevcQ5i+uOFEJsKMQ4E7juMN75TmZF7ys6x8NjDcW7MVV9PZHfHEaOz18EUptBVpE9IolSVPpqZn+6EfEyWYP+fsDL8H5/tid2bkwBJGTVnUBuojC7Oe9jhPB9zf2KVbLpxk2dJVCfDQWHu3jRON9frlPoYJoKiF5Zdj45+OLZof8MYGKrMhaRbUNmPvYfGYq9G9k0qOpkyoc5+PRufy4LozRc/b4AcOhWrQTKBb8ksFWmwOg+RiWdb3cXvANhUOYBfGrZD6HFspMk3cA94nYfKUzvXbl1x9VDqQfjNVyA5F9siISDQK0euDLnt6u244FfPrvhkuAYiKwm9IKpfJ5H9EPIzPXKeRaq/iop1g/IlcuVrIZwlgrfDr3kKpzBotYqln5otQ5AlbLHC5z+1MieXmvTFWQ0xmQQlzrkswLN/c7wskDkzzc46rHDhyq+IjoiAIL3NmNiUJhjL5XmPsHF6726Zu+md6mTw== cardno:000606111718

Attempted to deploy a spoke cluster.
The BMH creation gets stuck in registering and then after a very long time shows "registration error":

oc get bmh sealusa34.mobius.lab.eng.rdu2.redhat.com
NAME                                       STATE         CONSUMER   ONLINE   ERROR                AGE
sealusa34.mobius.lab.eng.rdu2.redhat.com   registering              true     registration error   21h

oc get bmh sealusa34.mobius.lab.eng.rdu2.redhat.com -o yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  annotations:
    bmac.agent-install.openshift.io/hostname: sealusa34.mobius.lab.eng.rdu2.redhat.com
    bmac.agent-install.openshift.io/role: master
    inspect.metal3.io: disabled
  creationTimestamp: "2022-06-13T17:58:35Z"
  finalizers:
  - baremetalhost.metal3.io
  generation: 1
  labels:
    infraenvs.agent-install.openshift.io: beaker1
  name: sealusa34.mobius.lab.eng.rdu2.redhat.com
  namespace: beaker1
  resourceVersion: "8009082"
  uid: 3842c3d2-fccc-47eb-9776-de8d1752f5fe
spec:
  automatedCleaningMode: disabled
  bmc:
    address: idrac-virtualmedia+https://10.9.78.48/redfish/v1/Systems/System.Embedded.1
    credentialsName: bmc-secret3
    disableCertificateVerification: true
  bootMACAddress: f8:f2:1e:31:66:29
  online: true
  rootDeviceHints:
    deviceName: /dev/sda
status:
  errorCount: 2
  errorMessage: 'Async execution of do_node_verify failed with error: ''Node'' object has no attribute ''verify_step'''
  errorType: registration error
  goodCredentials: {}
  hardwareProfile: ""
  lastUpdated: "2022-06-13T18:53:03Z"
  operationHistory:
    deprovision:
      end: null
      start: null
    inspect:
      end: null
      start: null
    provision:
      end: null
      start: null
    register:
      end: null
      start: "2022-06-13T17:58:35Z"
  operationalStatus: error
  poweredOn: false
  provisioning:
    ID: 2abcffd7-e21b-4bc9-a23a-aba3e46ca08c
    bootMode: UEFI
    image:
      url: ""
    state: registering
  triedCredentials:
    credentials:
      name: bmc-secret3
      namespace: beaker1
    credentialsVersion: "7952011"

oc describe bmh sealusa34.mobius.lab.eng.rdu2.redhat.com
Name:         sealusa34.mobius.lab.eng.rdu2.redhat.com
Namespace:    beaker1
Labels:       infraenvs.agent-install.openshift.io=beaker1
Annotations:  bmac.agent-install.openshift.io/hostname: sealusa34.mobius.lab.eng.rdu2.redhat.com
              bmac.agent-install.openshift.io/role: master
              inspect.metal3.io: disabled
API Version:  metal3.io/v1alpha1
Kind:         BareMetalHost
Metadata:
  Creation Timestamp:  2022-06-13T17:58:35Z
  Finalizers:
    baremetalhost.metal3.io
  Generation:  1
  Managed Fields:
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
          .:
          v:"baremetalhost.metal3.io":
    Manager:      baremetal-operator
    Operation:    Update
    Time:         2022-06-13T17:58:35Z
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:bmac.agent-install.openshift.io/hostname:
          f:bmac.agent-install.openshift.io/role:
          f:inspect.metal3.io:
        f:labels:
          .:
          f:infraenvs.agent-install.openshift.io:
      f:spec:
        .:
        f:automatedCleaningMode:
        f:bmc:
          .:
          f:address:
          f:credentialsName:
          f:disableCertificateVerification:
        f:bootMACAddress:
        f:online:
        f:rootDeviceHints:
          .:
          f:deviceName:
    Manager:      kubectl-create
    Operation:    Update
    Time:         2022-06-13T17:58:35Z
    API Version:  metal3.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:errorCount:
        f:errorMessage:
        f:errorType:
        f:goodCredentials:
        f:hardwareProfile:
        f:lastUpdated:
        f:operationHistory:
          .:
          f:deprovision:
            .:
            f:end:
            f:start:
          f:inspect:
            .:
            f:end:
            f:start:
          f:provision:
            .:
            f:end:
            f:start:
          f:register:
            .:
            f:end:
            f:start:
        f:operationalStatus:
        f:poweredOn:
        f:provisioning:
          .:
          f:ID:
          f:bootMode:
          f:image:
            .:
            f:url:
          f:state:
        f:triedCredentials:
          .:
          f:credentials:
            .:
            f:name:
            f:namespace:
          f:credentialsVersion:
    Manager:         baremetal-operator
    Operation:       Update
    Subresource:     status
    Time:            2022-06-13T18:53:03Z
  Resource Version:  8009082
  UID:               3842c3d2-fccc-47eb-9776-de8d1752f5fe
Spec:
  Automated Cleaning Mode:  disabled
  Bmc:
    Address:                          idrac-virtualmedia+https://10.9.78.48/redfish/v1/Systems/System.Embedded.1
    Credentials Name:                 bmc-secret3
    Disable Certificate Verification:  true
  Boot MAC Address:  f8:f2:1e:31:66:29
  Online:            true
  Root Device Hints:
    Device Name:  /dev/sda
Status:
  Error Count:    2
  Error Message:  Async execution of do_node_verify failed with error: 'Node' object has no attribute 'verify_step'
  Error Type:     registration error
  Good Credentials:
  Hardware Profile:
  Last Updated:  2022-06-13T18:53:03Z
  Operation History:
    Deprovision:
      End:    <nil>
      Start:  <nil>
    Inspect:
      End:    <nil>
      Start:  <nil>
    Provision:
      End:    <nil>
      Start:  <nil>
    Register:
      End:             <nil>
      Start:           2022-06-13T17:58:35Z
  Operational Status:  error
  Powered On:          false
  Provisioning:
    ID:         2abcffd7-e21b-4bc9-a23a-aba3e46ca08c
    Boot Mode:  UEFI
    Image:
      URL:
    State:      registering
  Tried Credentials:
    Credentials:
      Name:               bmc-secret3
      Namespace:          beaker1
    Credentials Version:  7952011
Events:                   <none>
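For triage across several hosts, the status fields shown above can be pulled out programmatically. The following is a minimal sketch, assuming the BareMetalHost object has already been obtained as JSON (for example via `oc get bmh <name> -n <namespace> -o json`) and parsed into a dict. The helper names `summarize_bmh` and `is_stuck_registering` are hypothetical, and the sample dict only repeats values from the output above.

```python
def summarize_bmh(bmh: dict) -> dict:
    """Extract the fields most useful for triage from a BareMetalHost dict."""
    status = bmh.get("status", {})
    provisioning = status.get("provisioning", {})
    return {
        "name": bmh.get("metadata", {}).get("name"),
        "state": provisioning.get("state"),
        "error_type": status.get("errorType") or None,
        "error_message": status.get("errorMessage") or None,
        "error_count": status.get("errorCount", 0),
        # provisioning.ID is the Ironic node UUID seen in the conductor logs.
        "ironic_node_id": provisioning.get("ID"),
    }


def is_stuck_registering(bmh: dict) -> bool:
    """True when the host is in 'registering' with a registration error."""
    summary = summarize_bmh(bmh)
    return (
        summary["state"] == "registering"
        and summary["error_type"] == "registration error"
    )


# Sample built only from values in the output above (most fields omitted).
SAMPLE_BMH = {
    "metadata": {"name": "sealusa34.mobius.lab.eng.rdu2.redhat.com"},
    "status": {
        "errorCount": 2,
        "errorMessage": "Async execution of do_node_verify failed with error: "
                        "'Node' object has no attribute 'verify_step'",
        "errorType": "registration error",
        "provisioning": {
            "ID": "2abcffd7-e21b-4bc9-a23a-aba3e46ca08c",
            "state": "registering",
        },
    },
}

if __name__ == "__main__":
    print(summarize_bmh(SAMPLE_BMH))
    print("stuck:", is_stuck_registering(SAMPLE_BMH))
```

The `ironic_node_id` value is the one to search for in the Ironic conductor logs.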
This was attempted against several machines:
1. PowerEdge R730, BIOS Version 2.8.0, Firmware Version 2.50.50.50
2. PowerEdge R730, BIOS Version 2.7.1, Firmware Version 2.52.52.52
I see this in the conductor logs; it's a new one for me:

2022-06-13 18:52:52.199 1 ERROR concurrent.futures ironic.common.exception.RedfishError: Redfish exception occurred. Error: In system 4c4c4544-0032-4710-804d-b3c04f435032 for node 2abcffd7-e21b-4bc9-a23a-aba3e46ca08c all managers failed: clear job queue. Errors: ['Manager 3250434f-c0b3-4d80-4710-00324c4c4544: The attribute Links/Oem/Dell/DellJobService is missing from the resource /redfish/v1/Managers/iDRAC.Embedded.1']
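The error says the attribute path Links/Oem/Dell/DellJobService is absent from the manager resource. A minimal sketch of that check follows, assuming the JSON body of /redfish/v1/Managers/iDRAC.Embedded.1 has already been fetched and parsed into a dict. The helper names and both sample payloads are hypothetical illustrations, not captures from real hardware, and this is not how Ironic or sushy implement the lookup.

```python
# Attribute path taken verbatim from the conductor log message.
DELL_JOB_SERVICE_PATH = "Links/Oem/Dell/DellJobService"


def get_nested(resource: dict, path: str):
    """Walk a slash-separated attribute path; return None if any hop is missing."""
    current = resource
    for key in path.split("/"):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def has_dell_job_service(manager_resource: dict) -> bool:
    """True if the manager resource exposes the DellJobService link."""
    return get_nested(manager_resource, DELL_JOB_SERVICE_PATH) is not None


# Hypothetical payloads for illustration only.
MANAGER_WITH_LINK = {
    "Id": "iDRAC.Embedded.1",
    "Links": {"Oem": {"Dell": {"DellJobService": {"@odata.id": "<job-service-uri>"}}}},
}
MANAGER_WITHOUT_LINK = {
    "Id": "iDRAC.Embedded.1",
    "Links": {"Oem": {"Dell": {}}},
}

if __name__ == "__main__":
    print(has_dell_job_service(MANAGER_WITH_LINK))
    print(has_dell_job_service(MANAGER_WITHOUT_LINK))
```

Running the same check against the real manager resource of an affected host would show whether the link is missing on that firmware.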
I've asked the Dell folks upstream whether they are aware of this issue on this specific hardware.
According to Richard Pioso from Dell, we should try upgrading the machines. The latest BIOS available is 2.13.0, while we have 2.7.1/2.8.0 (from 2018), and the latest iDRAC firmware is 2.83.83.83, while we are using 2.50.50.50/2.52.52.52 (from 2017/2018).
Setting blocker flag - (since I think the FW upgrade would solve this issue)
Additional triage notes: I moved it to Ironic as BMO is just passing this on. It's 99.9% a firmware issue. I've encountered this while testing verify-steps during development. In my opinion the only thing we can do here on the Ironic side is perhaps handle this in a more elegant way. @Iury I am happy to take this one if you'd like me to - this is related to my upstream verify-steps work. Will look into it a bit during my day.
Updated the BIOS/firmware on a machine:
BIOS Version 2.13.0
Firmware Version 2.80.80.80

Still stuck:

oc get bmh -A
NAMESPACE   NAME                                       STATE         CONSUMER   ONLINE   ERROR   AGE
beaker1     sealusa34.mobius.lab.eng.rdu2.redhat.com   registering              true             11h

Updated must-gather: http://file.rdu.redhat.com/~achuzhoy/bugs/2096944/must-gather2.tgz
Same issue reproduces when the hub setup is "platform: none", so not related to platform: openstack
The issue seems to be limited to iDRAC8 machines. With iDRAC9 nodes, it works as expected.
Hey Sasha, here is the reply from the Dell contributors upstream:

"""
I checked - in 13G (R730) there is no support for these management functions, in 14G (R740) there is. Generally, I wouldn't use Redfish with 13G, try WS-Man. while in 13G there is some Redfish support, it does not receive new features and sometimes bugs are not backported too. 13G is still WS-Man world. e.g., Redfish RAID interface is not working in 13G too.
"""

Virtual Media is a requirement for you, right? I've asked whether WS-Man supports virtual media, to be sure about it before recommending that you switch to it.

The possible workaround I see would be to disable running clear_job_queue in ironic.conf for OCP 4.10 (but this can cause a regression). Upstream we can work on making it conditional and then on backports.
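To illustrate what making the step conditional could look like, here is a minimal sketch of the general idea: run each verify step, and when the BMC reports that a required attribute is missing, log a warning and continue instead of failing registration. This is not the upstream patch and not Ironic code. `MissingAttributeError`, `run_verify_steps` and the step callables passed to it are hypothetical names made up for this example.

```python
import logging

LOG = logging.getLogger(__name__)


class MissingAttributeError(Exception):
    """Hypothetical stand-in for 'attribute is missing from the resource'."""


def run_verify_steps(steps, node_name):
    """Run (name, callable) steps in order, skipping ones the BMC cannot support.

    Returns a tuple (executed, skipped) of step-name lists. Any other
    exception still propagates, so genuine failures are not hidden.
    """
    executed, skipped = [], []
    for name, step in steps:
        try:
            step()
        except MissingAttributeError as exc:
            # Unsupported on this hardware: warn and carry on with the rest.
            LOG.warning("Skipping verify step %s on node %s: %s",
                        name, node_name, exc)
            skipped.append(name)
        else:
            executed.append(name)
    return executed, skipped


if __name__ == "__main__":
    def clear_job_queue():
        # Simulates a 13G BMC that lacks the DellJobService link.
        raise MissingAttributeError(
            "The attribute Links/Oem/Dell/DellJobService is missing")

    def other_step():
        pass

    print(run_verify_steps(
        [("clear_job_queue", clear_job_queue), ("other_step", other_step)],
        node_name="sealusa34",
    ))
```

Catching only the "missing attribute" case keeps other Redfish failures visible, which is the trade-off compared with disabling the step outright in the configuration.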
Taking over this BZ as I reproduced this on Ironic standalone and looked into what's happening in a fair bit of detail. Hope to have a patch up early next week.
I revisited the patch, should be merge-ready or close.
https://review.opendev.org/c/openstack/ironic/+/846859 has merged. Will discuss with the Team whether we want to bump version in ironic-image just for this fix, or is it better to bundle it with other fixes.
After discussion with the Team, we decided to set this back to ASSIGNED as we're unable to raise a PR to update ironic-images as we are between releases. The fix has merged upstream but we need to wait for a bit longer to start bringing it downstream. Also - changing target to 4.12.
Any progress here? We're getting more reports.
(In reply to Dmitry Tantsur from comment #20)
> Any progress here? We're getting more reports.

Apologies, I missed this! I will check what versions have this code available and what backports may be needed and will chase it up in the next few days.
(In reply to Dmitry Tantsur from comment #20)
> Any progress here? We're getting more reports.

While I fixed this in master, the fix has not been backported to the bugfix branches, hence it's not yet available in OCP 4.10 and 4.11. We'll address this now (thank you for creating the upstream backports, Dmitry) and hopefully will have the fix out in the next couple of weeks.
The patches resolving this issue are merged upstream in the master branch:
https://review.opendev.org/c/openstack/ironic/+/846859
https://review.opendev.org/c/openstack/ironic/+/851950

I am now making sure these patches are backported to all the relevant releases and included in future 4.11 and 4.10 z-streams.

There is a newer bug opened against the same issue in the JIRA OCPBUGS project: https://issues.redhat.com/browse/OCPBUGS-1740

As we are migrating from Bugzilla to JIRA, I will close this issue, and further work on the 4.10 fix will continue in OCPBUGS-1740. I also created a separate bug ( https://issues.redhat.com/browse/OCPBUGS-2011 ) to track the 4.11 fix.
IMPORTANT NOTE: BZ won't let me close this bug properly as a duplicate as the new bug is in JIRA, ideally this should be CLOSED with DUPLICATE of https://issues.redhat.com/browse/OCPBUGS-1740. The current status does not accurately reflect the state of this issue.