Bug 2088196 - Redfish set boot device failed for node in OCP 4.8 latest RC
Summary: Redfish set boot device failed for node in OCP 4.8 latest RC
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.8.z
Assignee: Jacob Anders
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On: 2088319
Blocks:
 
Reported: 2022-05-19 01:27 UTC by Manuel Rodriguez
Modified: 2022-05-25 21:48 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Since eTag handling was added to Ironic (implemented during the upstream Yoga cycle; patches: https://review.opendev.org/c/openstack/sushy/+/818114 and https://review.opendev.org/c/openstack/sushy/+/818110), issues with eTag handling on old firmware versions have been observed increasingly often, in particular on HPE machines. For example, on the DL360 Gen10, iLO 5 2.63 or later is required; otherwise, eTag handling issues in the firmware may prevent Ironic from successfully provisioning the server. Running the latest firmware is always recommended, but in the case of eTag issues it is mandatory to upgrade to the latest firmware before taking any further troubleshooting steps.
Clone Of:
Cloned To: 2088319 2088716 2088717
Environment:
Last Closed: 2022-05-25 21:48:25 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links:
- GitHub openshift ironic-image pull 276 (Merged): Bug 2088196: Backport weak eTag handling fix to OpenShift 4.8 (last updated 2022-06-01 15:52:38 UTC)
- OpenStack gerrit 842462 (Merged): Handle weak Etags (last updated 2022-05-20 09:17:36 UTC)
- OpenStack gerrit 842544 (Merged): Release sushy 3.7.5 for Wallaby (last updated 2022-05-20 20:42:44 UTC)
- Red Hat Product Errata RHSA-2022:2272 (last updated 2022-05-25 21:48:35 UTC)

Description Manuel Rodriguez 2022-05-19 01:27:47 UTC
Description of problem:

During installation of OCP with IPI, the bootstrap fails and all three master nodes show the following error:

2022-05-17 19:04:08.101 1 ERROR ironic.drivers.modules.redfish.management [req-18a38ecb-f3b7-4c91-b258-a77f352af553 bootstrap-user - - - -] Redfish set boot device failed for node 5bb35a95-2f24-4693-a0ac-796c79c6af7d. Error: HTTP PATCH https://<IP-Address>/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none: sushy.exceptions.HTTPError: HTTP PATCH https://<IP-Address>/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none
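
For illustration, here is a minimal sketch (not Ironic or sushy code) of the kind of Redfish call that is failing, written with the Python requests library; the BMC address and credentials are placeholders, and the PATCH body mirrors the one visible in the sushy debug log in comment 7:

import requests  # assumed available; not part of the cluster tooling

BMC_SYSTEM = "https://<IP-Address>/redfish/v1/Systems/1"  # placeholder, as in the log above
AUTH = ("<bmc-user>", "<bmc-password>")                   # placeholder credentials

# Read the System resource to obtain the eTag the BMC currently advertises.
resp = requests.get(BMC_SYSTEM, auth=AUTH, verify=False)
etag = resp.headers.get("ETag")

# Ironic sets the boot device with a PATCH like this. If the If-Match value
# does not satisfy the BMC's precondition check (for example a weak vs. strong
# eTag mismatch), the firmware answers HTTP 412, which is the failure reported here.
patch = requests.patch(
    BMC_SYSTEM,
    json={"Boot": {"BootSourceOverrideTarget": "Pxe",
                   "BootSourceOverrideEnabled": "Once"}},
    headers={"If-Match": etag} if etag else {},
    auth=AUTH,
    verify=False,
)
print(patch.status_code)  # 412 means the BMC rejected the If-Match precondition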

Version-Release number of selected component (if applicable):
Affected:
- OpenShift 4.8.40 RC 2022-05-16
- OpenShift 4.9.34 RC 2022-05-17
Previous releases work:
- OpenShift 4.8.39
- OpenShift 4.9.33 RC 2022-05-11

How reproducible:


Steps to Reproduce:
1. Deploy bare metal OCP with IPI.
2. Use redfish in the install-config.yaml:
bmc:
  address: redfish://<IP-ADDRESS>/redfish/v1/Systems/1

3. Wait for the bootstrap VM to provision and check the install log (.openshift_install.log); eventually the error shows up for all master nodes (a quick way to inspect the BMC's eTag is sketched after the log excerpt below):

time="2022-05-18T11:38:53-04:00" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed' , last error was 'Failed to inspect hardware. Reason: unable to st
art inspection: Redfish exception occurred. Error: Redfish set boot device failed for node 4debd5c8-92c6-41e3-bc69-9e19caa6b8c2. Error: HTTP PATCH https://<IP-Address>/redfish/v1/Systems/1 returned co
de 412. unknown error Extended information: none'"
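
As referenced in step 3, here is a minimal, hedged sketch (endpoint and credentials are placeholders; this is not part of the installer) for inspecting the eTag the BMC returns for the System resource named in install-config.yaml; a value prefixed with W/ is a weak eTag, which is what the handling discussed in this bug is about:

import requests  # assumed available on the provisioning host

SYSTEM_URL = "https://<IP-ADDRESS>/redfish/v1/Systems/1"  # same path as the bmc address above

# Fetch the System resource and print the ETag header returned by the BMC.
resp = requests.get(SYSTEM_URL, auth=("<bmc-user>", "<bmc-password>"), verify=False)
print(resp.status_code, resp.headers.get("ETag"))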


Actual results:
- initial bootstrap on master nodes fails, OCP cluster is not installed

Expected results:
- bootstrap should succeed, the redfish call must not return an error

Additional info:

- We are using HPE ProLiant DL360 Gen10, but we suspect the problem is independent of the platform. 

Full Logs of ironic-conductor can be found here:
OpenShift 4.8.40 RC 2022-05-16 - https://www.distributed-ci.io/files/599c4982-f2f7-4c2a-b6a8-e2d21015b0eb
OpenShift 4.9.34 RC 2022-05-17 - https://www.distributed-ci.io/files/61dcc3f3-cfbe-4fa0-8993-cb0f24f71e2f

Comment 2 Jacob Anders 2022-05-19 01:45:40 UTC
Setting high priority/urgency because this may cause a regression and break updates on this particular hardware.

I suspect backporting this change is related to the issue:

https://opendev.org/openstack/sushy/commit/e4e24f1414c222ea5eeb97c01e4e216bd7f5a285

https://github.com/openshift/ironic-image/pull/275

In particular "reverting" this change (which would mean untagging packages in the prod repo) could be the fastest resolution.

https://github.com/openshift/ironic-image/pull/275/files

I will raise this with the ART team, and once that is done I will try to narrow down why exactly it is causing problems, and only in 4.8 and 4.9 (4.10 is fine). Perhaps there is another change that should have been backported together with the one above.

Comment 3 Manuel Rodriguez 2022-05-19 02:19:43 UTC
Also, I forgot to mention that we are using the latest iLO firmware ("Feb 23 2022 - iLO 5 v2.65") on the HPE ProLiant DL360 Gen10 [1], so we ruled out a firmware issue; we tried iLO resets too. We also validated that OpenShift 4.10.15 RC 2022-05-16 and OpenShift 4.11.0 2022-05-11 are working fine.

[1] https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_d373bd8a5c9948c6aae7bca5dc&tab=revisionHistory

Comment 4 Jacob Anders 2022-05-19 06:10:27 UTC
I think the issue was caused by the fact that we backported https://review.opendev.org/c/openstack/sushy/+/840652 but not https://review.opendev.org/c/openstack/sushy/+/818110.

I created the necessary backports; they are now in review/CI.

Comment 6 Jacob Anders 2022-05-19 11:56:16 UTC
A brief update on the current state of this BZ at the end of the day:
* suspected-missing backports are up for review upstream (links are posted above)
* the ART team has un-tagged the problem packages for 4.8/4.9, so they shouldn't be included in the upcoming z-stream release
* once we are able to test the new packages including the missing backports, we will validate whether this is a sufficient fix

Comment 7 Jacob Anders 2022-05-20 02:39:16 UTC
Today I tested the proposed fix together with Manuel in his environment. Here is what we did:
* deployed OCP 4.8.37 (to ensure the deployment completes successfully to allow testing)
* replaced ironic-image with the one from the release affected by this problem (4.8.40)
* reproduced the problem
* replaced ironic-image again with a custom-built 4.8.40 plus the missing backport (https://review.opendev.org/c/openstack/sushy/+/818110)
* verified that the issue is resolved; we were able to deploy a BMH:

2022-05-20 01:10:36.342 1 DEBUG sushy.connector [req-0ade6685-61b7-4d2d-a58d-5bb3a16a58d8 ironic-user - - - -] HTTP request: PATCH https://[BMC-IP]/redfish/v1/Systems/1; headers: {'If-Match': 'W/"[eTAG]","[eTag]"', 'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Pxe', 'BootSourceOverrideEnabled': 'Once'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:114
/usr/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
2022-05-20 01:10:37.075 1 DEBUG sushy.connector [req-0ade6685-61b7-4d2d-a58d-5bb3a16a58d8 ironic-user - - - -] HTTP response for PATCH https://[BMC-IP]/redfish/v1/Systems/1: status code: 200 _op /usr/lib/python3.6/site-packages/sushy/connector.py:202

IP address and eTag value removed for security reasons. Please note the HTTP 200 response to the request (previously this is where the code was failing) as well as the altered eTag format in the request header.
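
For context, here is a minimal sketch of what handling a weak eTag can look like, based only on the If-Match header visible in the request above (the weak W/"..." form plus the unprefixed form); this is an illustration, not the actual sushy implementation, which is in the gerrit changes linked in this bug:

def if_match_value(etag):
    """Build an If-Match value offering both the weak and the strong form.

    The BMC may advertise a weak eTag (prefixed with W/) but only accept the
    strong form, or vice versa, so both variants are offered, matching the
    header seen in the debug log above: If-Match: W/"[eTAG]","[eTag]"
    """
    weak = etag if etag.startswith('W/') else 'W/' + etag
    strong = etag[2:] if etag.startswith('W/') else etag
    return ",".join([weak, strong])

print(if_match_value('W/"abc123"'))  # -> W/"abc123","abc123"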

Nodes upon successful deployment of a worker (note freshly-added worker-3):

kni@[HOSTNAME] [USER]]$ oc get nodes
NAME       STATUS                     ROLES                           AGE     VERSION
master-0   Ready                      master                          10h     v1.21.8+ee73ea2
master-1   Ready                      master                          10h     v1.21.8+ee73ea2
master-2   Ready                      master                          10h     v1.21.8+ee73ea2
worker-0   Ready                      loadbalancer,worker,worker-hp   9h      v1.21.8+ee73ea2
worker-1   Ready,SchedulingDisabled   worker                          9h      v1.21.8+ee73ea2
worker-2   Ready                      loadbalancer,worker,worker-hp   9h      v1.21.8+ee73ea2
worker-3   Ready                      worker                          2m33s   v1.21.8+ee73ea2

This test validates that the fix proposed in this BZ successfully resolved the issue.

Comment 8 Jacob Anders 2022-05-20 02:43:43 UTC
Adding the pending sushy release (which is a prerequisite for raising the OCP PR) for tracking; it is currently under review.

Comment 9 Jacob Anders 2022-05-20 09:27:13 UTC
Current status: we have a tested fix merged upstream (see https://bugzilla.redhat.com/show_bug.cgi?id=2088196#c7). We are waiting for the release of the library with the fix so that we can raise a downstream PR to include the fix in the Ironic image corresponding to this OCP version.

Comment 10 Jacob Anders 2022-05-23 05:24:46 UTC
OCP 4.8 PR is now raised.

Comment 11 Manuel Rodriguez 2022-05-23 14:41:52 UTC
I confirmed the proposed changes fix the issue. 

To validate, we removed a worker node from a running OCP 4.8 cluster, replaced the ironic image with the one containing the fix, and provisioned a new worker node. Previously the servers were not even powered on because Ironic failed to set the boot device; this time that step completed and the normal deployment steps completed as well.

Comment 12 Manuel Rodriguez 2022-05-24 00:06:55 UTC
Also, just to let you know, we haven't observed this issue in OCP 4.10 or 4.11; we run daily CI jobs on the same clusters and the following versions do not seem to be affected:

- 4.10.15 RC - 2022-05-16
- 4.11.0 2022-05-11
- 4.11.0 2022-05-20

Thanks,

Comment 13 Jacob Anders 2022-05-24 08:29:32 UTC
This has been fixed by https://review.opendev.org/c/openstack/sushy/+/842461/ and verified as per https://bugzilla.redhat.com/show_bug.cgi?id=2088196#c11. Setting status to VERIFIED.

Comment 16 errata-xmlrpc 2022-05-25 21:48:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.41 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:2272

