Bug 2096944
Summary: | BMH stuck in registering on an OCP HUB cluster deployed on openstack with platform: openstack. 'Node' object has no attribute 'verify_step' | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Alexander Chuzhoy <sasha> |
Component: | Bare Metal Hardware Provisioning | Assignee: | Jacob Anders <janders> |
Bare Metal Hardware Provisioning sub component: | ironic | QA Contact: | Amit Ugol <augol> |
Status: | CLOSED CANTFIX | Docs Contact: | |
Severity: | high | ||
Priority: | medium | CC: | ccrum, derekh, imelofer, janders, yprokule |
Version: | 4.10 | Keywords: | Triaged |
Target Milestone: | --- | ||
Target Release: | 4.12.0 | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2022-10-05 02:56:49 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Alexander Chuzhoy
2022-06-14 15:27:18 UTC
This was attempted against several machines:
1. PowerEdge R730, BIOS Version 2.8.0, Firmware Version 2.50.50.50
2. PowerEdge R730, BIOS Version 2.7.1, Firmware Version 2.52.52.52

I see this in the conductor logs, it's a new one for me...

2022-06-13 18:52:52.199 1 ERROR concurrent.futures ironic.common.exception.RedfishError: Redfish exception occurred. Error: In system 4c4c4544-0032-4710-804d-b3c04f435032 for node 2abcffd7-e21b-4bc9-a23a-aba3e46ca08c all managers failed: clear job queue. Errors: ['Manager 3250434f-c0b3-4d80-4710-00324c4c4544: The attribute Links/Oem/Dell/DellJobService is missing from the resource /redfish/v1/Managers/iDRAC.Embedded.1']

I've asked the Dell folks upstream if they are aware of this on this specific HW.

According to Richard Pioso from Dell, we should try to upgrade the machines. The latest BIOS available is 2.13.0 while we have 2.7.1/2.8.0 (from 2018), and the latest iDRAC is 2.83.83.83 while we are using 2.50.50.50/2.52.52.52 (from 2017/2018).

Setting blocker flag - (since I think the FW upgrade would solve this issue)

Additional triage notes:

I moved it to Ironic as BMO is just passing this on. It's 99.9% a firmware issue. I've encountered this while testing verify-steps during development. In my opinion the only thing we can do here on the Ironic side is perhaps handle this in a more elegant way.

@Iury I am happy to take this one if you'd like me to - this is related to my upstream verify-steps work.

Will look into it a bit during my day.

Updated the BIOS/firmware on a machine: BIOS Version 2.13.0, Firmware Version 2.80.80.80. Still stuck:

oc get bmh -A
NAMESPACE NAME STATE CONSUMER ONLINE ERROR AGE
beaker1 sealusa34.mobius.lab.eng.rdu2.redhat.com registering true 11h

Updated must-gather: http://file.rdu.redhat.com/~achuzhoy/bugs/2096944/must-gather2.tgz

The same issue reproduces when the hub setup is "platform: none", so it is not related to platform: openstack.

The issue seems to be limited to idrac8 machines. With idrac9 nodes it works as expected.
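For anyone triaging a similar report, the missing-attribute error quoted from the conductor log can be checked against a manager resource that has already been fetched from /redfish/v1/Managers/iDRAC.Embedded.1. The following is only a minimal sketch, not Ironic's or sushy's actual code: the helper name `has_dell_job_service` and both sample payloads are hypothetical and made up for illustration. Only the attribute path Links/Oem/Dell/DellJobService comes from the error message.

```python
from typing import Any, Mapping

# Attribute path taken from the error message in the conductor log.
JOB_SERVICE_PATH = ("Links", "Oem", "Dell", "DellJobService")


def has_dell_job_service(manager: Mapping[str, Any]) -> bool:
    """Return True if the manager JSON contains Links/Oem/Dell/DellJobService.

    Hypothetical helper for illustration; it walks the nested dict one key
    at a time and stops as soon as a level is missing.
    """
    node: Any = manager
    for key in JOB_SERVICE_PATH:
        if not isinstance(node, Mapping) or key not in node:
            return False
        node = node[key]
    return True


# Made-up payloads: one without the OEM link (as the failing nodes report),
# one with it. The "@odata.id" value is a placeholder, not a real endpoint.
manager_without_link = {"Id": "iDRAC.Embedded.1", "Links": {"Oem": {}}}
manager_with_link = {
    "Id": "iDRAC.Embedded.1",
    "Links": {"Oem": {"Dell": {"DellJobService": {"@odata.id": "placeholder"}}}},
}

if __name__ == "__main__":
    for name, payload in (
        ("without link", manager_without_link),
        ("with link", manager_with_link),
    ):
        print(name, "->", has_dell_job_service(payload))
```

A check like this only tells you whether the resource advertises the job service link; it does not say anything about which firmware versions do or do not provide it.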
Hey Sasha, so from the Dell contributors upstream:

"""
I checked - in 13G (R730) there is no support for these management functions, in 14G (R740) there is. Generally, I wouldn't use Redfish with 13G, try WS-Man. While in 13G there is some Redfish support, it does not receive new features and sometimes bugs are not backported too. 13G is still WS-Man world. E.g., Redfish RAID interface is not working in 13G too.
"""

Virtual Media is a requirement for you, right? I've asked if WS-Man supports virtual media to be sure about it (before recommending that you switch to it).

The possible workaround I see would be to disable running clear_job_queue in ironic.conf for OCP 4.10 (but this can cause a regression). Upstream we can work to make it conditional and work on backports.

Taking over this BZ as I reproduced this on Ironic standalone and looked into what's happening in a fair bit of detail. Hope to have a patch up early next week.

I revisited the patch, should be merge-ready or close.

https://review.opendev.org/c/openstack/ironic/+/846859 has merged. Will discuss with the Team whether we want to bump the version in ironic-image just for this fix, or whether it is better to bundle it with other fixes.

After discussion with the Team, we decided to set this back to ASSIGNED as we're unable to raise a PR to update ironic-images as we are between releases. The fix has merged upstream but we need to wait a bit longer to start bringing it downstream. Also - changing target to 4.12.

Any progress here? We're getting more reports.

(In reply to Dmitry Tantsur from comment #20)
> Any progress here? We're getting more reports.

Apologies, I missed this! I will check what versions have this code available and what backports may be needed and will chase it up in the next few days.

(In reply to Dmitry Tantsur from comment #20)
> Any progress here? We're getting more reports.
While I fixed this in master, the fix has not been backported to the bugfix branches, hence it's not yet available in OCP 4.10 and 4.11. We'll address this now (thank you for creating the upstream backports, Dmitry) and hopefully will have the fix out in the next couple of weeks.

Patches resolving this issue
https://review.opendev.org/c/openstack/ironic/+/846859
https://review.opendev.org/c/openstack/ironic/+/851950
are merged upstream in the master branch. Now I am making sure these patches are backported to all the relevant releases and included in future 4.11 and 4.10 z-streams.

There is a newer bug opened against the same issue in the JIRA OCPBUGS project: https://issues.redhat.com/browse/OCPBUGS-1740

As we are migrating from Bugzilla to JIRA, I will close this issue and further work on the 4.10 fix will continue in OCPBUGS-1740. I also created a separate bug ( https://issues.redhat.com/browse/OCPBUGS-2011 ) to track the 4.11 fix.

IMPORTANT NOTE: BZ won't let me close this bug properly as a duplicate as the new bug is in JIRA. Ideally this should be CLOSED with DUPLICATE of https://issues.redhat.com/browse/OCPBUGS-1740. The current status does not accurately reflect the state of this issue.