Created attachment 1878612
must gather

Description of problem:
Installing 4.10.13 OCP on HPE nodes is not successful and hits the following inspection error:

Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Redfish set boot device failed for node 97db9d14-b4c9-47af-bdac-04e0cc4e78db. Error: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none

Version-Release number of selected component (if applicable):
4.10.13

How reproducible:
always

Steps to Reproduce:
1. On BM cluster with HPE worker nodes install 4.10.13

Actual results:
[root@registry ~]# oc get bmh -A
NAMESPACE               NAME       STATE                    CONSUMER                ONLINE   ERROR              AGE
openshift-machine-api   master-0   externally provisioned   hlxcl6-9t87h-master-0   true                        20h
openshift-machine-api   master-1   externally provisioned   hlxcl6-9t87h-master-1   true                        20h
openshift-machine-api   master-2   externally provisioned   hlxcl6-9t87h-master-2   true                        20h
openshift-machine-api   worker-0   inspecting                                       true     inspection error   20h
openshift-machine-api   worker-1   inspecting                                       true     inspection error   20h

and inspecting the failing workers' events:

Events:
  Type    Reason             Age   From                         Message
  ----    ------             ----  ----                         -------
  Normal  InspectionStarted  20m   metal3-baremetal-controller  Hardware inspection started
  Normal  InspectionError    19m   metal3-baremetal-controller  Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Redfish set boot device failed for node 97db9d14-b4c9-47af-bdac-04e0cc4e78db. Error: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none
  Normal  InspectionStarted  14m   metal3-baremetal-controller  Hardware inspection started
  Normal  InspectionError    13m   metal3-baremetal-controller  Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Redfish set boot device failed for node 97db9d14-b4c9-47af-bdac-04e0cc4e78db. Error: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none

Expected results:
Nodes should be provisioned with the OCP version without any complications.

Additional info:
The must-gather and cluster-info are attached. Note that the installation of 4.9.31 on the same HW was successful.
Looking at your ironic-conductor logs, this is the relevant section:

2022-05-10T13:23:22.203835488Z 2022-05-10 13:23:22.203 1 DEBUG sushy.connector [req-b739ab36-bbc7-47bb-a0c1-7d95337786ce ironic-user - - - -] HTTP request: PATCH https://10.46.61.157:443/redfish/v1/Systems/1; headers: {'If-Match': 'W/"9984DECD","9984DECD"', 'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Cd', 'BootSourceOverrideEnabled': 'Once'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:111
2022-05-10T13:23:22.232058169Z 2022-05-10 13:23:22.230 1 WARNING sushy.exceptions [req-4d193bfc-abc4-4e41-aad7-bafe4efb59f6 ironic-user - - - -] Error response from PATCH https://10.46.61.156:443/redfish/v1/Systems/1 with status code 412 has no JSON body: simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2022-05-10T13:23:22.232824164Z 2022-05-10 13:23:22.232 1 DEBUG sushy.exceptions [req-4d193bfc-abc4-4e41-aad7-bafe4efb59f6 ironic-user - - - -] HTTP response for PATCH https://10.46.61.156:443/redfish/v1/Systems/1: status code: 412, error: unknown error, extended: none __init__ /usr/lib/python3.6/site-packages/sushy/exceptions.py:122
2022-05-10T13:23:22.233815406Z 2022-05-10 13:23:22.232 1 ERROR ironic.drivers.modules.redfish.management [req-4d193bfc-abc4-4e41-aad7-bafe4efb59f6 ironic-user - - - -] Redfish set boot device failed for node 97db9d14-b4c9-47af-bdac-04e0cc4e78db. Error: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none: sushy.exceptions.HTTPError: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none

I believe this may have something to do with the If-Match header, as HTTP 412 (Precondition Failed) is the response to an ETag mismatch and the format looks strange:

'If-Match': 'W/"9984DECD","9984DECD"'

I'll find out if we recently started adding this header.

What version of iLO / BIOS is used on this host?
> I'll find out if we recently started adding this header

Yes. This is how Redfish works. In theory, weak ETags should not be sent, but Redfish relies on them.
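
To illustrate why a strict server could answer 412 here, below is a minimal sketch of the If-Match evaluation as RFC 7232 describes it: If-Match uses the strong comparison, under which a weak tag never matches. The function names are hypothetical, and nothing here claims to reproduce what the iLO firmware actually does; it only shows the standard's rule applied to the header from the log.

```python
import re

# Matches one entity-tag: an optional W/ prefix followed by a quoted opaque string.
_ENTITY_TAG = re.compile(r'(W/)?"([^"]*)"')


def parse_entity_tags(value):
    """Return a list of (is_weak, opaque) pairs found in a header value."""
    return [(bool(weak), opaque) for weak, opaque in _ENTITY_TAG.findall(value)]


def strong_match(a, b):
    """RFC 7232 strong comparison: both tags strong and character-identical."""
    return not a[0] and not b[0] and a[1] == b[1]


def weak_match(a, b):
    """RFC 7232 weak comparison: opaque strings identical, weakness ignored."""
    return a[1] == b[1]


def if_match_passes(if_match_value, current_etag, compare=strong_match):
    """Evaluate an If-Match precondition against the resource's current ETag."""
    if if_match_value.strip() == "*":
        return True
    current = parse_entity_tags(current_etag)[0]
    return any(compare(tag, current) for tag in parse_entity_tags(if_match_value))


header = 'W/"9984DECD","9984DECD"'
# Resource currently has a weak ETag: a strict server rejects the request (412).
print(if_match_passes(header, 'W/"9984DECD"'))
# A lenient server that compares weakly would accept the same request.
print(if_match_passes(header, 'W/"9984DECD"', compare=weak_match))
```

Under the strong comparison the request only passes if the BMC reports a strong ETag, which is why behavior can differ between firmware versions that otherwise return the same opaque value.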
(In reply to Derek Higgins from comment #2)
> What version of iLo / Bios is used on this host?

Also, would it be possible to upgrade both to potentially rule out a problem with the firmware?
(In reply to Derek Higgins from comment #4)
> (In reply to Derek Higgins from comment #2)
> > What version of iLo / Bios is used on this host?

The host has iLO 5 version 2.18 (Jun 22 2020).

> Also would it be possible to upgrade both to potentially rule
> out a problem with the firmware?

Since the installation of 4.9.31 was successful on the same cluster and hardware, the firmware on its own seems to be ruled out as the cause. It could still be a 4.10.13 software issue that only shows up when deploying on that firmware, so we can try upgrading the firmware.
A similar BZ was opened recently: https://bugzilla.redhat.com/show_bug.cgi?id=2088196#c3
Note that it has the latest f/w version. IIUC the cause is the python-sushy version that started shipping with 4.10.13.
Can you confirm this was fine in 4.10.12? I believe the sushy version was bumped in 4.10.13, which could have been the source of the regression.
After upgrading the f/w to iLO 5 2.63, the deployment of 4.10.13 was successful and no errors were found in the bmh events:

Events:
  Type    Reason                Age   From                         Message
  ----    ------                ----  ----                         -------
  Normal  Registered            73m   metal3-baremetal-controller  Registered new host
  Normal  InspectionStarted     73m   metal3-baremetal-controller  Hardware inspection started
  Normal  BMCAccessValidated    73m   metal3-baremetal-controller  Verified access to BMC
  Normal  InspectionComplete    64m   metal3-baremetal-controller  Hardware inspection completed
  Normal  ProfileSet            64m   metal3-baremetal-controller  Hardware profile set: unknown
  Normal  ProvisioningStarted   63m   metal3-baremetal-controller  Image provisioning started for
  Normal  ProvisioningComplete  59m   metal3-baremetal-controller  Image provisioning completed for

@Derek Higgins, should this be mentioned anywhere in the docs?
(In reply to Shereen Haj Makhoul from comment #10)
> @Derek Higgins, should this be mentioned anywhere in the docs?

Yes, we should turn this into a doc bug to document iLO 5 2.63 as a minimum version for 4.10 (and above).
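
As a rough illustration of how such a minimum could be checked before an install, here is a small sketch that compares a reported iLO 5 firmware string against 2.63. The helper names and the assumption that the version always appears as `major.minor` with an integer minor are mine, not something the firmware or any OpenShift tooling guarantees.

```python
import re

# Minimum iLO 5 firmware proposed in this bug for 4.10 and above.
MIN_ILO5_FIRMWARE = (2, 63)


def parse_firmware_version(text):
    """Extract the first 'major.minor' pair from a firmware string.

    Assumption: the minor part is compared as an integer, so '2.7'
    would be read as (2, 7) and sort below (2, 63).
    """
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(text, minimum=MIN_ILO5_FIRMWARE):
    """Return True if the firmware string is at or above the minimum."""
    return parse_firmware_version(text) >= minimum


for reported in ("iLO 5 version 2.18 Jun 22 2020", "iLO 5 2.63"):
    print(reported, "->", "ok" if meets_minimum(reported) else "upgrade needed")
```

The two strings in the loop are the versions mentioned in this bug: 2.18 is the one that failed and 2.63 the one that worked.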
I don't think this issue is limited to HPE/iLO. I'm also seeing it on Lenovo (latest firmware on a brand new ThinkEdge SE450) with OpenShift 4.10.18:

Image provisioning failed: Deploy step deploy.deploy failed with HTTPError: HTTP PATCH https://192.168.1.1:443/redfish/v1/Managers/1/VirtualMedia/EXT1 returned code 412. Base.1.8.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'@odata.type': '#Message.v1_1_0.Message', 'Resolution': 'Try the operation again using the appropriate ETag.', 'MessageArgs': [], 'MessageSeverity': 'Critical', 'MessageId': 'Base.1.8.PreconditionFailed', 'Message': 'The ETag supplied did not match the ETag required to change this resource.'}].
Also tried via simple "redfish://" rather than "redfish-virtualmedia://" and got the following:

Error: Image provisioning failed: Deploy step deploy.deploy failed: Redfish exception occurred. Error: Redfish set boot device failed for node 8bcc25e9-d002-44f1-8917-9d450eb2b7c7. Error: HTTP PATCH https://192.168.3.110:443/redfish/v1/Systems/1/Pending returned code 412. Base.1.8.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'@odata.type': '#Message.v1_1_0.Message', 'Resolution': 'Try the operation again using the appropriate ETag.', 'MessageArgs': [], 'MessageSeverity': 'Critical', 'MessageId': 'Base.1.8.PreconditionFailed', 'Message': 'The ETag supplied did not match the ETag required to change this resource.'}].
It looks like for some reason the conductor is sending both the strong and the weak ETag:

2022-07-05 09:10:48.348 1 DEBUG sushy.connector [req-f175182f-c7c4-4b8b-82db-9676a123940b ironic-user - - - -] HTTP request: PATCH https://192.168.3.110:443/redfish/v1/Managers/1/VirtualMedia/EXT1; headers: {'If-Match': 'W/"553b2ac3d03428a98ef","553b2ac3d03428a98ef"', 'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Image': 'https://192.168.3.23:6183/redfish/boot-679e256a-7d20-42dd-922d-927bede887de.iso', 'Inserted': True, 'WriteProtected': True}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:111

The BMC does not like that and returns an HTTP 412 Precondition Failed error.
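
The shape of the header in that log line can be reproduced with a few lines of Python. This is only a sketch of how a weak ETag might be expanded into a "weak and strong" If-Match value; the helper names are hypothetical and it is not sushy's actual code. The second helper lists the header forms one could try by hand against a BMC to see which one it accepts.

```python
def combined_if_match(etag):
    """Expand a weak ETag into 'weak,strong'; leave a strong ETag unchanged."""
    if etag.startswith("W/"):
        return etag + "," + etag[2:]
    return etag


def if_match_variants(etag):
    """Return candidate If-Match values for manual testing against a BMC."""
    strong = etag[2:] if etag.startswith("W/") else etag
    return {
        "as_returned": etag,                      # exactly what the BMC sent in ETag
        "strong_only": strong,                    # W/ prefix stripped
        "weak_and_strong": combined_if_match(etag),
    }


etag_from_bmc = 'W/"553b2ac3d03428a98ef"'
for name, value in if_match_variants(etag_from_bmc).items():
    print(f"{name}: If-Match: {value}")
```

With the ETag from the Lenovo log, the `weak_and_strong` variant is the exact value the conductor sent and the BMC rejected.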
Our current hypothesis regarding this particular case (we've been discussing a few similar-looking but separate problems in this BZ) is that the upstream patch https://review.opendev.org/c/openstack/sushy/+/818110 (TL;DR: it improves weak ETag handling, allowing Ironic to work with less Redfish-compliant hardware) fixed some machines but unfortunately broke others that are non-compliant with the Redfish standard in a different way. This includes HPs on older firmware.

This patch was merged in the 4.10 development cycle and has also been backported to 4.9 and 4.8. So I believe that the latest z-stream releases of 4.8, 4.9 and 4.10 will all be affected on the specific hardware/firmware combinations (mostly HP machines) that are susceptible to this issue. I am fairly certain, say 90-95%, that the latest 4.9.z builds are affected. For 100% certainty we would need someone to test the latest 4.9.z in the lab on hardware that is known to have this problem.

Stating the requirement to run recent firmware should prevent the issue on a significant proportion of the susceptible machines. Running recent firmware is good practice anyway; this issue just makes it more important. This is a logical first step and is definitely worth doing in my opinion.

As the second step we will aim to provide a fix for machines that are still affected despite running the latest firmware. This includes HP iLO 4 servers provisioned via Redfish as well as late-model Lenovo servers. We aim to work on this once we're past the upcoming Feature Freeze date for the 4.12 development cycle.
Verified on HP with iLO FW 2.71.