Bug 2084059 - [IPI] OCP 4.10.13 is not deploy-able on HPE machines
Summary: [IPI] OCP 4.10.13 is not deploy-able on HPE machines
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Documentation
Version: 4.10
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Alexandra Molnar
QA Contact: Jad Haj Yahya
Docs Contact: Tomas 'Sheldon' Radej
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-11 09:56 UTC by Shereen Haj Makhoul
Modified: 2022-09-13 15:21 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-13 15:21:25 UTC
Target Upstream Version:
Embargoed:


Attachments
must gather (14.59 MB, application/gzip)
2022-05-11 09:56 UTC, Shereen Haj Makhoul

Description Shereen Haj Makhoul 2022-05-11 09:56:47 UTC
Created attachment 1878612
must gather

Description of problem:
Installing OCP 4.10.13 on HPE nodes fails with the following inspection error:
Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Redfish set boot device failed for node 97db9d14-b4c9-47af-bdac-04e0cc4e78db. Error: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none

Version-Release number of selected component (if applicable):
4.10.13

How reproducible:
always

Steps to Reproduce:
1. On a bare-metal (BM) cluster with HPE worker nodes, install OCP 4.10.13.

Actual results:

[root@registry ~]# oc get bmh -A
NAMESPACE               NAME       STATE                    CONSUMER                ONLINE   ERROR              AGE
openshift-machine-api   master-0   externally provisioned   hlxcl6-9t87h-master-0   true                        20h
openshift-machine-api   master-1   externally provisioned   hlxcl6-9t87h-master-1   true                        20h
openshift-machine-api   master-2   externally provisioned   hlxcl6-9t87h-master-2   true                        20h
openshift-machine-api   worker-0   inspecting                                       true     inspection error   20h
openshift-machine-api   worker-1   inspecting                                       true     inspection error   20h

Inspecting the failing workers' events shows:

Events:
  Type    Reason             Age   From                         Message
  ----    ------             ----  ----                         -------
  Normal  InspectionStarted  20m   metal3-baremetal-controller  Hardware inspection started
  Normal  InspectionError    19m   metal3-baremetal-controller  Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Redfish set boot device failed for node 97db9d14-b4c9-47af-bdac-04e0cc4e78db. Error: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none
  Normal  InspectionStarted  14m   metal3-baremetal-controller  Hardware inspection started
  Normal  InspectionError    13m   metal3-baremetal-controller  Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Redfish set boot device failed for node 97db9d14-b4c9-47af-bdac-04e0cc4e78db. Error: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none
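
To pick out the failing hosts without reading the table by eye, the same listing can be filtered from JSON. This is a minimal sketch, assuming the output of `oc get bmh -A -o json` has been loaded into a dict, and assuming the BareMetalHost status exposes `errorMessage` and `provisioning.state` fields; the function name and the sample data are hypothetical.

```python
def hosts_with_errors(bmh_list):
    """Return (namespace, name, state, error) for every host reporting an error.

    Assumes the shape of `oc get bmh -A -o json`: a dict with an "items" list,
    where each item carries "metadata" and "status" (field names are assumptions).
    """
    failing = []
    for item in bmh_list.get("items", []):
        meta = item.get("metadata", {})
        status = item.get("status", {})
        message = status.get("errorMessage", "")
        if message:
            state = status.get("provisioning", {}).get("state", "")
            failing.append(
                (meta.get("namespace", ""), meta.get("name", ""), state, message)
            )
    return failing


# Hypothetical sample, shaped like the table above (not real cluster output).
sample = {
    "items": [
        {
            "metadata": {"namespace": "openshift-machine-api", "name": "master-0"},
            "status": {
                "errorMessage": "",
                "provisioning": {"state": "externally provisioned"},
            },
        },
        {
            "metadata": {"namespace": "openshift-machine-api", "name": "worker-0"},
            "status": {
                "errorMessage": "Failed to inspect hardware.",
                "provisioning": {"state": "inspecting"},
            },
        },
    ]
}

for row in hosts_with_errors(sample):
    print(row)
```

With real data, the dict would come from something like `json.load()` on the saved command output.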


Expected results:
Nodes should be provisioned with the requested OCP version without errors.

Additional info:
The must-gather and cluster-info are attached.
Note that the installation of 4.9.31 on the same HW was successful.

Comment 2 Derek Higgins 2022-05-11 12:17:33 UTC
Looking at your ironic-conductor logs, this is the relevant section:

2022-05-10T13:23:22.203835488Z 2022-05-10 13:23:22.203 1 DEBUG sushy.connector [req-b739ab36-bbc7-47bb-a0c1-7d95337786ce ironic-user - - - -] HTTP request: PATCH https://10.46.61.157:443/redfish/v1/Systems/1; headers: {'If-Match': 'W/"9984DECD","9984DECD"', 'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Cd', 'BootSourceOverrideEnabled': 'Once'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:111
2022-05-10T13:23:22.232058169Z 2022-05-10 13:23:22.230 1 WARNING sushy.exceptions [req-4d193bfc-abc4-4e41-aad7-bafe4efb59f6 ironic-user - - - -] Error response from PATCH https://10.46.61.156:443/redfish/v1/Systems/1 with status code 412 has no JSON body: simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2022-05-10T13:23:22.232824164Z 2022-05-10 13:23:22.232 1 DEBUG sushy.exceptions [req-4d193bfc-abc4-4e41-aad7-bafe4efb59f6 ironic-user - - - -] HTTP response for PATCH https://10.46.61.156:443/redfish/v1/Systems/1: status code: 412, error: unknown error, extended: none __init__ /usr/lib/python3.6/site-packages/sushy/exceptions.py:122
2022-05-10T13:23:22.233815406Z 2022-05-10 13:23:22.232 1 ERROR ironic.drivers.modules.redfish.management [req-4d193bfc-abc4-4e41-aad7-bafe4efb59f6 ironic-user - - - -] Redfish set boot device failed for node 97db9d14-b4c9-47af-bdac-04e0cc4e78db. Error: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none: sushy.exceptions.HTTPError: HTTP PATCH https://10.46.61.156:443/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none


I believe this may have something to do with the If-Match header, as HTTP 412 (Precondition Failed) is the response to an ETag mismatch, and the format looks strange:

'If-Match': 'W/"9984DECD","9984DECD"'

I'll find out if we recently started adding this header 
What version of iLO / BIOS is used on this host?
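
For reference, the header value from the log can be split into its entity tags and compared the way RFC 7232 describes. This is only a sketch: the helper names are hypothetical, and how a given BMC actually evaluates If-Match is not established here. RFC 7232 specifies the strong comparison function for If-Match, under which a weak tag (`W/` prefix) never matches.

```python
import re

# One entity tag, optionally weak, followed by a comma or the end of the value.
_ENTITY_TAG = re.compile(r'\s*(W/)?"([^"]*)"\s*(?:,|$)')


def parse_entity_tags(header_value):
    """Split an If-Match style value into (is_weak, opaque_tag) pairs."""
    tags = []
    pos = 0
    while pos < len(header_value):
        match = _ENTITY_TAG.match(header_value, pos)
        if match is None:
            raise ValueError("malformed entity tag list: %r" % header_value)
        tags.append((match.group(1) is not None, match.group(2)))
        pos = match.end()
    return tags


def strong_equal(a, b):
    """RFC 7232 strong comparison: neither tag is weak and the values match."""
    return not a[0] and not b[0] and a[1] == b[1]


def weak_equal(a, b):
    """RFC 7232 weak comparison: the opaque values match, weakness ignored."""
    return a[1] == b[1]


sent = parse_entity_tags('W/"9984DECD","9984DECD"')
print(sent)
# Compared against a resource whose current ETag is the weak W/"9984DECD":
current = (True, "9984DECD")
print(any(strong_equal(tag, current) for tag in sent))
print(any(weak_equal(tag, current) for tag in sent))
```

Under strict strong comparison, a resource that only advertises a weak ETag cannot satisfy If-Match at all, so servers that accept such requests are already deviating from the RFC in some way.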

Comment 3 Dmitry Tantsur 2022-05-11 12:20:51 UTC
> I'll find out if we recently started adding this header 

Yes. This is how Redfish works. In theory, weak etags should not be sent, but Redfish relies on them.

Comment 4 Derek Higgins 2022-05-11 12:54:33 UTC
(In reply to Derek Higgins from comment #2)
> What version of iLO / BIOS is used on this host?

Also would it be possible to upgrade both to potentially rule
out a problem with the firmware?

Comment 5 Shereen Haj Makhoul 2022-05-11 15:18:13 UTC
(In reply to Derek Higgins from comment #4)
> (In reply to Derek Higgins from comment #2)
> > What version of iLO / BIOS is used on this host?

The host has iLO 5 version 2.18 (Jun 22 2020).

> Also would it be possible to upgrade both to potentially rule
> out a problem with the firmware?

Since the installation of 4.9.31 succeeded on the same cluster and hardware, that seemed to rule out the firmware as the cause, unless 4.10.13 has a software issue that only shows up on that firmware. To check that, we can try upgrading the firmware.

Comment 7 Shereen Haj Makhoul 2022-05-19 11:03:02 UTC
A similar BZ was opened recently:
https://bugzilla.redhat.com/show_bug.cgi?id=2088196#c3

Note that the system in that BZ has the latest firmware version. IIUC, the cause is the python-sushy version that started shipping with 4.10.13.

Comment 8 Derek Higgins 2022-05-19 11:42:26 UTC
Can you confirm this was fine in 4.10.12?
I believe the sushy version was bumped in 4.10.13, which could have been the source of the regression.

Comment 10 Shereen Haj Makhoul 2022-05-19 13:28:39 UTC
After upgrading the firmware to iLO 5 2.63, the deployment of 4.10.13 was successful and no errors were found in the BMH events:
Events:
  Type    Reason                Age   From                         Message
  ----    ------                ----  ----                         -------
  Normal  Registered            73m   metal3-baremetal-controller  Registered new host
  Normal  InspectionStarted     73m   metal3-baremetal-controller  Hardware inspection started
  Normal  BMCAccessValidated    73m   metal3-baremetal-controller  Verified access to BMC
  Normal  InspectionComplete    64m   metal3-baremetal-controller  Hardware inspection completed
  Normal  ProfileSet            64m   metal3-baremetal-controller  Hardware profile set: unknown
  Normal  ProvisioningStarted   63m   metal3-baremetal-controller  Image provisioning started for
  Normal  ProvisioningComplete  59m   metal3-baremetal-controller  Image provisioning completed for

@Derek Higgins, should this be mentioned anywhere in the docs?

Comment 11 Derek Higgins 2022-06-01 08:38:24 UTC
(In reply to Shereen Haj Makhoul from comment #10)
> @Derek Higgins, should this be mentioned anywhere in the docs?

Yes, we should turn this into a doc bug to document iLO 5 2.63 as the minimum firmware version for 4.10 (and above).
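
A pre-install check against that minimum could look like the sketch below. It assumes the firmware version is already available as a dotted string such as "2.18" (how it is obtained from the BMC is out of scope here), and the function and constant names are hypothetical.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.63' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# Minimum iLO 5 firmware proposed in this comment for OCP 4.10 and above.
MIN_ILO5_FIRMWARE = parse_version("2.63")


def meets_minimum(installed, minimum=MIN_ILO5_FIRMWARE):
    """Return True if the installed firmware is at least the minimum.

    Components are compared numerically, so this assumes the two-digit
    minor numbering seen in this bug (2.18, 2.63, 2.71).
    """
    return parse_version(installed) >= minimum


for version in ("2.18", "2.63", "2.71"):
    print(version, meets_minimum(version))
```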

Comment 12 Rhys Oxenham 2022-07-04 19:23:46 UTC
I don't think this issue is limited to HPE/iLO. I'm also seeing it on Lenovo (latest firmware on a brand new ThinkEdge SE450) with OpenShift 4.10.18:

Image provisioning failed: Deploy step deploy.deploy failed with HTTPError: HTTP PATCH https://192.168.1.1:443/redfish/v1/Managers/1/VirtualMedia/EXT1 returned code 412. Base.1.8.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'@odata.type': '#Message.v1_1_0.Message', 'Resolution': 'Try the operation again using the appropriate ETag.', 'MessageArgs': [], 'MessageSeverity': 'Critical', 'MessageId': 'Base.1.8.PreconditionFailed', 'Message': 'The ETag supplied did not match the ETag required to change this resource.'}].

Comment 13 Rhys Oxenham 2022-07-04 19:30:21 UTC
I also tried plain "redfish://" rather than "redfish-virtualmedia://" and got the following:

Error: Image provisioning failed: Deploy step deploy.deploy failed: Redfish exception occurred. Error: Redfish set boot device failed for node 8bcc25e9-d002-44f1-8917-9d450eb2b7c7. Error: HTTP PATCH https://192.168.3.110:443/redfish/v1/Systems/1/Pending returned code 412. Base.1.8.GeneralError: A general error has occurred. See ExtendedInfo for more information. Extended information: [{'@odata.type': '#Message.v1_1_0.Message', 'Resolution': 'Try the operation again using the appropriate ETag.', 'MessageArgs': [], 'MessageSeverity': 'Critical', 'MessageId': 'Base.1.8.PreconditionFailed', 'Message': 'The ETag supplied did not match the ETag required to change this resource.'}].

Comment 15 Jacob Anders 2022-07-05 11:34:35 UTC
It looks like, for some reason, the conductor is sending both a weak and a strong ETag:

2022-07-05 09:10:48.348 1 DEBUG sushy.connector [req-f175182f-c7c4-4b8b-82db-9676a123940b ironic-user - - - -] HTTP request: PATCH https://192.168.3.110:443/redfish/v1/Managers/1/VirtualMedia/EXT1; headers: {'If-Match': 'W/"553b2ac3d03428a98ef","553b2ac3d03428a98ef"', 'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Image': 'https://192.168.3.23:6183/redfish/boot-679e256a-7d20-42dd-922d-927bede887de.iso', 'Inserted': True, 'WriteProtected': True}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:111

The BMC does not accept that and returns an HTTP 412 (Precondition Failed) error.
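
The header shape seen in the log can be reproduced with a few lines. This is an illustrative sketch only, not sushy's actual code: the function name is hypothetical, and the assumption is that the client takes the weak ETag returned by the BMC and appends the same value in strong form.

```python
import re

# A weak entity tag: W/ followed by a quoted opaque value.
_WEAK_ETAG = re.compile(r'^W/("[^"]*")$')


def if_match_value(etag):
    """Build an If-Match value that offers a weak ETag in both forms.

    A weak tag such as W/"abc" becomes W/"abc","abc"; a strong tag is
    returned unchanged.
    """
    match = _WEAK_ETAG.match(etag)
    if match:
        return etag + "," + match.group(1)
    return etag


print(if_match_value('W/"553b2ac3d03428a98ef"'))
print(if_match_value('"553b2ac3d03428a98ef"'))
```

A BMC that only accepts the exact tag it handed out would reject the two-tag list, which is consistent with the 412 responses reported in this bug.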

Comment 20 Jacob Anders 2022-08-24 02:00:37 UTC
Our current hypothesis regarding this particular case (we've been discussing a few similar-looking but separate problems in this BZ) is that upstream patch https://review.opendev.org/c/openstack/sushy/+/818110 fixed some machines but unfortunately broke others that are non-compliant with the Redfish standard in a different way. (TL;DR: the patch improves weak ETag handling, allowing Ironic to work with less Redfish-compliant hardware.) The machines it broke include HPs on older firmware.

This patch merged in the 4.10 development cycle and has also been backported to 4.9 and 4.8. So I believe the latest z-stream releases of 4.8, 4.9 and 4.10 will all be affected on the specific hardware/firmware combinations (mostly HP machines) that are susceptible to this issue.

Because of this, I am fairly certain (90-95%) that the latest 4.9.z builds will be affected. For 100% certainty, someone would need to test the latest 4.9.z in the lab on hardware that is known to have this problem.

Stating the requirement to run recent firmware should prevent the issue on a significant proportion of the susceptible machines. Running recent firmware is good practice anyway; this issue just makes it more important. This is a logical first step and is definitely worth doing in my opinion.

As the second step, we will aim to provide a fix for machines that are still affected despite running the latest firmware. This includes HP iLO4 servers provisioned via Redfish as well as late-model Lenovo servers. We aim to work on this once we're past the upcoming Feature Freeze date for the 4.12 development cycle.

Comment 21 Jad Haj Yahya 2022-09-07 16:08:17 UTC
Verified on HP with iLO firmware 2.71.

