Bug 2088196
Summary: | Redfish set boot device failed for node in OCP 4.8 latest RC | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Manuel Rodriguez <manrodri> | |
Component: | Bare Metal Hardware Provisioning | Assignee: | Jacob Anders <janders> | |
Bare Metal Hardware Provisioning sub component: | ironic | QA Contact: | Amit Ugol <augol> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | urgent | |||
Priority: | urgent | CC: | dtantsur, imelofer, kurathod, mcornea, openshift-bugs-escalate, racedoro, rpittau, shajmakh, tsedovic | |
Version: | 4.8 | Keywords: | OtherQA | |
Target Milestone: | --- | |||
Target Release: | 4.8.z | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: |
Since eTag handling was added to Ironic (implemented during the upstream Yoga cycle; patches: https://review.opendev.org/c/openstack/sushy/+/818114 and https://review.opendev.org/c/openstack/sushy/+/818110 ), issues with eTag handling on old firmware versions have been increasingly observed, in particular on HP machines. For example, on a DL360 Gen10, iLO 5 2.63 or later is required; otherwise, eTag handling issues in the firmware may prevent Ironic from successfully provisioning the server. Running the latest firmware is always recommended; when eTag issues occur, upgrading to the latest firmware is mandatory before taking any further troubleshooting steps.
|
Story Points: | --- | |
Clone Of: | ||||
: | 2088319 2088716 2088717 (view as bug list) | Environment: | ||
Last Closed: | 2022-05-25 21:48:25 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 2088319 | |||
Bug Blocks: |
Description
Manuel Rodriguez
2022-05-19 01:27:47 UTC
Setting high priority / urgency because this may cause a regression and break updates on this particular hardware.
I suspect that backporting this change is related to the issue:
https://opendev.org/openstack/sushy/commit/e4e24f1414c222ea5eeb97c01e4e216bd7f5a285
https://github.com/openshift/ironic-image/pull/275
In particular, "reverting" this change (which would mean untagging packages in the prod repo) could be the fastest resolution.
https://github.com/openshift/ironic-image/pull/275/files
I will raise this with the ART team, and once I've done that I will try to narrow down exactly why it is causing problems, and why only in 4.8 and 4.9 (4.10 is fine). Perhaps there is another change that should have been backported together with the one above.
Also, I forgot to mention that we are using the latest iLO firmware, "Feb 23 2022 - iLO 5 v2.65", for HPE ProLiant DL360 Gen10 [1], so we ruled out the firmware option; we tried iLO resets too. We also validated that OpenShift 4.10.15 RC 2022-05-16 and OpenShift 4.11.0 2022-05-11 are working fine.
[1] https://support.hpe.com/connect/s/softwaredetails?language=en_US&softwareId=MTX_d373bd8a5c9948c6aae7bca5dc&tab=revisionHistory
I think the issue was caused by the fact that we backported https://review.opendev.org/c/openstack/sushy/+/840652 but not https://review.opendev.org/c/openstack/sushy/+/818110. I created the necessary backports; they are now in review / CI.
A brief update on the current state of this BZ at the end of the day:
* suspected-missing backports are up for review upstream (links are posted above)
* ART Team have un-tagged the problem packages for 4.8/4.9, so they shouldn't be included in the upcoming z-stream release
* once we are able to test the new packages including the missing backports, we will validate whether this is a sufficient fix
Today I tested the proposed fix together with Manuel in his environment. Here is what we did:
* deployed OCP 4.8.37 (to ensure the deployment completes successfully to allow testing)
* replaced ironic-image with the one from the release affected by this problem ( 4.8.40 )
* reproduced the problem
* replaced ironic-image again with a custom built 4.8.40 + the missing backport ( https://review.opendev.org/c/openstack/sushy/+/818110 )
* verified that the issue is resolved - we were able to deploy a BMH:
2022-05-20 01:10:36.342 1 DEBUG sushy.connector [req-0ade6685-61b7-4d2d-a58d-5bb3a16a58d8 ironic-user - - - -] HTTP request: PATCH https://[BMC-IP]/redfish/v1/Systems/1; headers: {'If-Match': 'W/"[eTAG]","[eTag]"', 'Content-Type': 'application/json', 'OData-Version': '4.0'}; body: {'Boot': {'BootSourceOverrideTarget': 'Pxe', 'BootSourceOverrideEnabled': 'Once'}}; blocking: False; timeout: 60; session arguments: {}; _op /usr/lib/python3.6/site-packages/sushy/connector.py:114
/usr/lib/python3.6/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
2022-05-20 01:10:37.075 1 DEBUG sushy.connector [req-0ade6685-61b7-4d2d-a58d-5bb3a16a58d8 ironic-user - - - -] HTTP response for PATCH https://[BMC-IP]/redfish/v1/Systems/1: status code: 200 _op /usr/lib/python3.6/site-packages/sushy/connector.py:202
IP address and eTag value removed for security reasons.
Please note the HTTP 200 response to the request (previously this is where the code was failing) as well as the altered eTag format in the request header.
Nodes upon successful deployment of a worker (note freshly-added worker-3):
kni@[HOSTNAME] [USER]]$ oc get nodes
NAME       STATUS                     ROLES                           AGE     VERSION
master-0   Ready                      master                          10h     v1.21.8+ee73ea2
master-1   Ready                      master                          10h     v1.21.8+ee73ea2
master-2   Ready                      master                          10h     v1.21.8+ee73ea2
worker-0   Ready                      loadbalancer,worker,worker-hp   9h      v1.21.8+ee73ea2
worker-1   Ready,SchedulingDisabled   worker                          9h      v1.21.8+ee73ea2
worker-2   Ready                      loadbalancer,worker,worker-hp   9h      v1.21.8+ee73ea2
worker-3   Ready                      worker                          2m33s   v1.21.8+ee73ea2
This test validates that the fix proposed in this BZ successfully resolves the issue.
Adding the pending sushy release (which is a prerequisite for raising the OCP PR) for tracking; it is currently under review.
Current status: we have a tested fix merged upstream ( see https://bugzilla.redhat.com/show_bug.cgi?id=2088196#c7 ). We are waiting for the release of the library with the fix so that we can raise a downstream PR to include the fix in the Ironic image corresponding to this OCP version.
OCP 4.8 PR is now raised.
I confirmed that the proposed changes fix the issue. To validate, we removed a worker node from a running OCP 4.8 cluster, replaced the ironic image with the one containing the fix, and provisioned a new worker node. Before the fix, servers were not even powered on because ironic failed to set the boot device; this time that step completed, and the normal deployment steps completed too.
Also, just to let you know, we haven't observed this issue in OCP 4.10 or 4.11. We run daily CI jobs in the same clusters, and the following versions do not seem to be affected:
- 4.10.15 RC - 2022-05-16
- 4.11.0 2022-05-11
- 4.11.0 2022-05-20
Thanks,
This has been fixed by https://review.opendev.org/c/openstack/sushy/+/842461/ and verified as per https://bugzilla.redhat.com/show_bug.cgi?id=2088196#c11. Setting status to VERIFIED.
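For readers unfamiliar with the eTag format seen in the logged PATCH request ('If-Match': 'W/"[eTAG]","[eTag]"'), the following is a minimal Python sketch of how such a header value could be assembled: when the BMC returns a weak eTag (prefixed with W/), the request offers both the weak form and its strong equivalent. This is an illustration inferred from the logged header only, not sushy's actual implementation. The function names build_if_match and boot_override_request, and the sample eTag value, are hypothetical.

```python
import re

# Matches a weak entity tag such as W/"abc123" and captures the quoted part.
_WEAK_ETAG = re.compile(r'^W/("[^"]*")$')


def build_if_match(etag):
    """Return an If-Match value for the given ETag (hypothetical helper).

    For a weak ETag, offer both the weak form and its strong equivalent,
    mirroring the header format seen in the log in this report.
    A strong ETag is returned unchanged.
    """
    etag = etag.strip()
    match = _WEAK_ETAG.match(etag)
    if match is None:
        return etag
    strong = match.group(1)
    return f"{etag},{strong}"


def boot_override_request(etag, target="Pxe", enabled="Once"):
    """Build headers and body for a Redfish boot-override PATCH (sketch only).

    The header names and body keys are copied from the logged request;
    nothing is sent over the network here.
    """
    headers = {
        "If-Match": build_if_match(etag),
        "Content-Type": "application/json",
        "OData-Version": "4.0",
    }
    body = {
        "Boot": {
            "BootSourceOverrideTarget": target,
            "BootSourceOverrideEnabled": enabled,
        }
    }
    return headers, body


if __name__ == "__main__":
    # The ETag value below is made up for illustration.
    headers, body = boot_override_request('W/"3d7b8313"')
    print(headers["If-Match"])
    print(body)
```

The sketch only builds the request parts, so it can be checked without a BMC. Whether a given firmware accepts the combined header still depends on the iLO version, as noted in the Doc Text.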
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.41 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:2272