Description of problem:

During the installation of OCP with IPI, the bootstrap fails with all three master nodes showing the following error:

2022-05-17 19:04:08.101 1 ERROR ironic.drivers.modules.redfish.management [req-18a38ecb-f3b7-4c91-b258-a77f352af553 bootstrap-user - - - -] Redfish set boot device failed for node 5bb35a95-2f24-4693-a0ac-796c79c6af7d. Error: HTTP PATCH https://<IP-Address>/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none: sushy.exceptions.HTTPError: HTTP PATCH https://<IP-Address>/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none

Version-Release number of selected component (if applicable):
OpenShift 4.8.40 RC 2022-05-16
OpenShift 4.9.34 RC 2022-05-17

It works if we use previous releases:
OpenShift 4.8.39
OpenShift 4.9.33 RC 2022-05-11

How reproducible:

Steps to Reproduce:
1. Deploy baremetal OCP with IPI.
2. Use redfish in the install-config.yaml:
   bmc:
     address: redfish://<IP-ADDRESS>/redfish/v1/Systems/1
3. Wait for the bootstrap VM to provision and check the install log (.openshift_install.log); eventually the error will show up for all master nodes:

time="2022-05-18T11:38:53-04:00" level=error msg="Error: could not inspect: could not inspect node, node is currently 'inspect failed', last error was 'Failed to inspect hardware. Reason: unable to start inspection: Redfish exception occurred. Error: Redfish set boot device failed for node 4debd5c8-92c6-41e3-bc69-9e19caa6b8c2. Error: HTTP PATCH https://<IP-Address>/redfish/v1/Systems/1 returned code 412. unknown error Extended information: none'"

Actual results:
- The initial bootstrap on the master nodes fails; the OCP cluster is not installed.

Expected results:
- The bootstrap should succeed; the redfish call must not return an error.

Additional info:
- We are using HPE ProLiant DL360 Gen10, but we suspect the problem is independent of the platform.
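For context, HTTP 412 is "Precondition Failed": the BMC is rejecting the If-Match eTag that sushy sends along with the boot-device PATCH. A minimal illustrative sketch of the request shape (a hypothetical helper, not Ironic's actual code; it only builds the request, it does not send it):

```python
import json

def build_boot_override_patch(etag=None):
    """Sketch of the Redfish PATCH used to set a one-time PXE boot
    override (field names follow the Redfish Systems schema).
    Illustrative only; builds the request without sending it."""
    headers = {
        "Content-Type": "application/json",
        "OData-Version": "4.0",
    }
    if etag is not None:
        # HTTP 412 (Precondition Failed) means the BMC rejected this
        # value: the If-Match eTag did not match the resource's
        # current eTag representation.
        headers["If-Match"] = etag
    body = {
        "Boot": {
            "BootSourceOverrideTarget": "Pxe",
            "BootSourceOverrideEnabled": "Once",
        }
    }
    return headers, json.dumps(body)

headers, body = build_boot_override_patch('W/"abc123"')
print(headers["If-Match"])  # W/"abc123"
```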
Full logs of ironic-conductor can be found here:
- OpenShift 4.8.40 RC 2022-05-16: https://www.distributed-ci.io/files/599c4982-f2f7-4c2a-b6a8-e2d21015b0eb
- OpenShift 4.9.34 RC 2022-05-17: https://www.distributed-ci.io/files/61dcc3f3-cfbe-4fa0-8993-cb0f24f71e2f
Setting high priority/urgency because this may cause a regression and break updates on this particular hardware. I suspect backporting this change is related to the issue:
https://opendev.org/openstack/sushy/commit/e4e24f1414c222ea5eeb97c01e4e216bd7f5a285
https://github.com/openshift/ironic-image/pull/275

In particular, "reverting" this change (which would mean untagging packages in the prod repo) could be the fastest resolution:
https://github.com/openshift/ironic-image/pull/275/files

I will raise this with the ART team, and once I've done that I will try to narrow down why exactly it is causing problems, and only in 4.8 and 4.9 (4.10 is fine). Perhaps there is another change that should have been backported together with the one above.
Also, I forgot to mention that we are using the latest iLO firmware ("Feb 23 2022 - iLO 5 v2.65") for the HPE ProLiant DL360 Gen10 [1], so we ruled out a firmware issue; we tried iLO resets too. We also validated that OpenShift 4.10.15 RC 2022-05-16 and OpenShift 4.11.0 2022-05-11 are working fine.

[1] https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_d373bd8a5c9948c6aae7bca5dc&tab=revisionHistory
I think the issue was caused by the fact that we backported https://review.opendev.org/c/openstack/sushy/+/840652 but not https://review.opendev.org/c/openstack/sushy/+/818110. I created the necessary backports, they are now in review / CI.
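From what I can tell, the general pattern the sushy fixes implement is eTag fallback: if the BMC rejects the PATCH with 412 for one eTag representation, retry with a different one (or with no precondition at all). A toy sketch of that approach with a stubbed BMC (`send_patch` is a hypothetical callable, not sushy's API, and this is not the library's exact logic):

```python
def patch_with_etag_fallback(send_patch, etag):
    """Toy illustration of eTag fallback on HTTP 412: try the PATCH
    with the eTag as given, then retry without the If-Match
    precondition. A sketch of the general approach, not sushy's
    actual implementation. `send_patch` is a hypothetical callable
    taking a headers dict and returning an HTTP status code."""
    status = send_patch({"If-Match": etag})
    if status == 412:
        # Some BMC firmwares reject the eTag representation outright;
        # retrying without the precondition lets the PATCH go through.
        status = send_patch({})
    return status

def picky_bmc(headers):
    """Stub BMC that rejects any request carrying an If-Match header."""
    return 412 if "If-Match" in headers else 200

print(patch_with_etag_fallback(picky_bmc, 'W/"abc"'))  # 200
```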
A brief update on the current state of this BZ at the end of the day:
* the suspected-missing backports are up for review upstream (links are posted above)
* the ART team has untagged the problem packages for 4.8/4.9, so they should not be included in the upcoming z-stream releases
* once we are able to test the new packages including the missing backports, we will validate whether this is a sufficient fix
Today I tested the proposed fix together with Manuel in his environment. Here is what we did:
* deployed OCP 4.8.37 (to ensure the deployment completes successfully to allow testing)
* replaced ironic-image with the one from the release affected by this problem (4.8.40)
* reproduced the problem
* replaced ironic-image again with a custom-built 4.8.40 + the missing backport (https://review.opendev.org/c/openstack/sushy/+/818110)
* verified that the issue is resolved; we were able to deploy a BMH:

2022-05-20 01:10:36.342 1 DEBUG sushy.connector [req-0ade6685-61b7-4d2d-a58d-5bb3a16a58d8 ironic-user - - - -] HTTP request: PATCH https://[BMC-IP]/redfish/v1/Systems/1; headers: {'If-Match': 'W/"[eTAG]","[eTag]"', 'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Pxe', 'BootSourceOverrideEnabled': 'Once'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:114
/usr/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings InsecureRequestWarning)
2022-05-20 01:10:37.075 1 DEBUG sushy.connector [req-0ade6685-61b7-4d2d-a58d-5bb3a16a58d8 ironic-user - - - -] HTTP response for PATCH https://[BMC-IP]/redfish/v1/Systems/1: status code: 200 _op /usr/lib/python3.6/site-packages/sushy/connector.py:202

The IP address and eTag value have been removed for security reasons. Please note the HTTP 200 response to the request (previously this is where the code was failing), as well as the altered eTag format in the request header.
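The 'W/"[eTAG]","[eTag]"' value in the If-Match header above suggests the fixed code now offers both the weak (W/"...") and the strong ("...") form of the eTag, so a BMC that only accepts one of the two representations can still match. A small sketch of building such a header (my reading of the log, not sushy's actual code):

```python
def if_match_variants(etag):
    """Build an If-Match value carrying both the weak (W/"...") and
    the strong ("...") form of a Redfish eTag. Sketch based on the
    header observed in the verified log; not sushy's implementation."""
    etag = etag.strip()
    strong = etag[2:] if etag.startswith("W/") else etag
    weak = "W/" + strong
    # RFC 7232 allows a comma-separated list in If-Match, so the BMC
    # can match either representation.
    return ",".join([weak, strong])

print(if_match_variants('W/"abc"'))  # W/"abc","abc"
print(if_match_variants('"abc"'))   # W/"abc","abc"
```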
Nodes upon successful deployment of a worker (note the freshly-added worker-3):

kni@[HOSTNAME] [USER]$ oc get nodes
NAME       STATUS                     ROLES                           AGE     VERSION
master-0   Ready                      master                          10h     v1.21.8+ee73ea2
master-1   Ready                      master                          10h     v1.21.8+ee73ea2
master-2   Ready                      master                          10h     v1.21.8+ee73ea2
worker-0   Ready                      loadbalancer,worker,worker-hp   9h      v1.21.8+ee73ea2
worker-1   Ready,SchedulingDisabled   worker                          9h      v1.21.8+ee73ea2
worker-2   Ready                      loadbalancer,worker,worker-hp   9h      v1.21.8+ee73ea2
worker-3   Ready                      worker                          2m33s   v1.21.8+ee73ea2

This test validates that the fix proposed in this BZ successfully resolves the issue.
Adding the pending sushy release (a prerequisite for raising the OCP PR) for tracking; it is currently under review.
Current status: we have a tested fix merged upstream ( see https://bugzilla.redhat.com/show_bug.cgi?id=2088196#c7 ). We are waiting for the release of the library with the fix so that we can raise a downstream PR to include the fix in the Ironic image corresponding to this OCP version.
OCP 4.8 PR is now raised.
I confirmed the proposed changes fix the issue. To validate, we removed a worker node from a running OCP 4.8 cluster, replaced the ironic image with the one containing the fix, and provisioned a new worker node. Before the fix, the servers were not even powered on because Ironic failed to set the boot device; this time that step completed, and the normal deployment steps completed too.
Also, just to let you know, we haven't observed this issue in OCP 4.10 or 4.11. We run daily CI jobs on the same clusters, and the following versions do not seem to be affected:
- 4.10.15 RC 2022-05-16
- 4.11.0 2022-05-11
- 4.11.0 2022-05-20

Thanks,
This has been fixed by https://review.opendev.org/c/openstack/sushy/+/842461/ and verified as per https://bugzilla.redhat.com/show_bug.cgi?id=2088196#c11. Setting status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.41 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:2272