
Bug 2027544

Summary: Ironic node power status reverts to None after transitory network errors while communicating to servers
Product: Red Hat OpenStack
Reporter: Mark Jones <marjones>
Component: openstack-ironic
Assignee: Julia Kreger <jkreger>
Status: CLOSED ERRATA
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: 16.1 (Train)
CC: jelynch, jkreger, jparoly, sbaker
Target Milestone: z9
Keywords: OtherQA, Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-sushy-2.0.3-1.20220315023816.0241cd9.el8ost openstack-ironic-13.0.7-1.20220330213501.3d77e61.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, if there were repeated transient connectivity issues between the ironic-conductor service and a remote Baseboard Management Controller (BMC) using the Redfish hardware type when session authentication was used, the intermittent loss of connectivity could collide with a point where authentication was retried due to the in-memory credentials expiring. If this collision occurred, there was a loss of overall connectivity, which persisted due to the internal session cache built into the openstack-ironic-conductor service. With this update, support to detect and renegotiate in cases of this error was added to the Python DMTF Redfish library, sushy, and the openstack-ironic service. Intermittent connectivity failures colliding with session credential re-authentication no longer result in a complete loss of ability to communicate with the BMC until the openstack-ironic-conductor service is restarted.
Story Points: ---
Clone Of:
: 2064017 2064019
Environment:
Last Closed: 2022-12-07 20:25:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2064017, 2064019

Description Mark Jones 2021-11-30 01:04:59 UTC
Description of problem:

A customer is finding that the power status within Ironic of some of their baremetal nodes is being set to None after a period of time (typically 24 hours or more). The customer is using Redfish to manage Dell servers via iDRACs.

After enabling debug within Ironic, what is seen from the logs when this issue appears to occur is a sequence where:

1. Ironic with Redfish attempts to perform a GET against https://x.x.x.x/redfish/v1/Systems/System.Embedded.1 which fails as the Authentication credentials are missing/invalid (likely auth expired)

2. Performs a GET against https://x.x.x.x/redfish/v1/SessionService but during the processing of this request a transitory network issue occurs and the GET connection fails. The error message in Ironic debug logs includes the text "Error <....> while attempting to establish a session. Falling back to basic authentication" which maps to _do_authenticate(..) in Sushy's auth.py.

3. All subsequent Ironic / Redfish operations to that specific iDRAC now fail with invalid credentials, suggesting that the iDRAC likely does not support Redfish with Basic Authentication.

Looking earlier in the logs shows that Session renewal works fine when there are no networking interruptions. Also, restarting Ironic brings all the Ironic nodes that were marked as power state None back to their correct power status, presumably because Sushy has returned to authenticating using the REST API with SessionService rather than basic authentication.

This is a significant issue for the customer as they are losing Ironic's node power status and manageability in production.

The behaviour suggests that Sushy, by falling back to Basic Auth, is assuming that the target Redfish device supports that mode, which is not necessarily the case.
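The sequence described above can be modelled with a small, self-contained simulation. This is only a sketch of the suspected failure mode and does not use sushy's actual code or API: `FakeBMC`, `FallbackClient`, `RetryingClient` and `observe` are hypothetical names, and the fake BMC is assumed to accept session tokens only, as the iDRAC appears to do in the logs.

```python
class AccessError(Exception):
    """The BMC rejected the credentials (for example HTTP 401)."""


class FakeBMC:
    """Hypothetical BMC that accepts session tokens only, never basic auth."""

    def __init__(self):
        self.valid_tokens = set()
        self.drop_next_session_request = False
        self._counter = 0

    def create_session(self, username, password):
        if self.drop_next_session_request:
            self.drop_next_session_request = False
            raise ConnectionError("transient network failure")
        self._counter += 1
        token = f"token-{self._counter}"
        self.valid_tokens.add(token)
        return token

    def get_power_state(self, token=None, basic=None):
        # Basic credentials are ignored: this BMC honours session tokens only.
        if token not in self.valid_tokens:
            raise AccessError("missing or invalid credentials")
        return "On"


class FallbackClient:
    """Falls back to basic auth when session creation hits a network error."""

    def __init__(self, bmc, username, password):
        self.bmc, self.username, self.password = bmc, username, password
        self.token = None
        self.mode = "session"

    def authenticate(self):
        try:
            self.token = self.bmc.create_session(self.username, self.password)
            self.mode = "session"
        except ConnectionError:
            # A transient failure is treated as "sessions unavailable".
            self.token = None
            self.mode = "basic"

    def power_state(self):
        if self.mode == "basic":
            # Stuck here: session authentication is never attempted again.
            return self.bmc.get_power_state(basic=(self.username, self.password))
        try:
            return self.bmc.get_power_state(token=self.token)
        except AccessError:
            self.authenticate()
            if self.mode == "basic":
                return self.bmc.get_power_state(basic=(self.username, self.password))
            return self.bmc.get_power_state(token=self.token)


class RetryingClient(FallbackClient):
    """Variant that stays in session mode and lets the network error propagate."""

    def authenticate(self):
        self.token = self.bmc.create_session(self.username, self.password)


def observe(client_cls, polls=3):
    bmc = FakeBMC()
    client = client_cls(bmc, "root", "secret")
    client.authenticate()
    bmc.valid_tokens.clear()              # in-memory session expires
    bmc.drop_next_session_request = True  # network blip during re-authentication
    states = []
    for _ in range(polls):
        try:
            states.append(client.power_state())
        except (AccessError, ConnectionError):
            states.append(None)           # power state cannot be determined
    return states


print(observe(FallbackClient))  # [None, None, None]
print(observe(RetryingClient))  # [None, 'On', 'On']
```

In this model the fallback client never recovers after a single dropped request, which matches the observation that only a service restart restores the power state, while the variant that retries session authentication loses one poll and then recovers.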

Version-Release number of selected component (if applicable):
OSP 16.1

How reproducible:
Occurring in multiple production and lab OpenStack regions at the customer. Requires transitory network issues to observe.

Steps to Reproduce:
1. Deploy OSP 16.1 using Redfish with Ironic for management
2. Induce transitory network issues that cause connection failures during Redfish operations
3. Observe Ironic node power state marked as None over time

Actual results:
Ironic node power states are None

Expected results:
Ironic node power state should reflect the actual power state of the node

Additional info:

Customer's Ironic debug logs are available from the associated support case.

Comment 1 Steve Baker 2021-11-30 21:02:42 UTC
This change[1] *might* fix this issue, but we need to model this exact scenario in a unit test to validate that. Also, the change is in stable/train but has not been backported to any 16.x release.

[1] https://review.opendev.org/c/openstack/sushy/+/795558

Comment 2 Julia Kreger 2021-12-02 21:00:59 UTC
This appears to be a mix of the patch Steve noted plus two other distinct issues. One is that Ironic was only disqualifying cached sessions internally when a ConnectionError occurred. The other is a place in the sushy library where session location information lookups could quietly fail, albeit with lots of warnings being generated. This would continue until the service is restarted, and would not be caught by Ironic to disqualify the session from the cache. Ultimately, all of this is caused by invalid access data in the cache. I'm going to link the two patches working their way through code review upstream. The hardware vendor in question has also been notified and they intend to test these patches as well.
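To illustrate the cache part of this, here is a minimal sketch of a per-BMC session cache that evicts an entry only for a configurable set of exceptions. It is not Ironic's actual cache implementation: `SessionCache`, `make_connector`, `get_power_state` and `attempt` are hypothetical names, and the first connection is deliberately created with bad access data to stand in for the poisoned cache entry.

```python
class AccessError(Exception):
    """The BMC rejected the credentials held by a cached session."""


class SessionCache:
    """Hypothetical cache of one connection per BMC address."""

    def __init__(self, connect, invalidate_on):
        self._connect = connect
        self._invalidate_on = invalidate_on  # tuple of exception types
        self._sessions = {}

    def call(self, address, operation):
        conn = self._sessions.get(address)
        if conn is None:
            conn = self._sessions[address] = self._connect(address)
        try:
            return operation(conn)
        except self._invalidate_on:
            # Drop the cached entry so the next call builds a fresh session.
            self._sessions.pop(address, None)
            raise


def make_connector():
    created = []

    def connect(address):
        # The first connection carries invalid access data; later ones are fine.
        healthy = len(created) > 0
        created.append(address)
        return {"address": address, "healthy": healthy}

    return connect


def get_power_state(conn):
    if not conn["healthy"]:
        raise AccessError("invalid credentials in cached session")
    return "On"


def attempt(cache, address, tries=3):
    results = []
    for _ in range(tries):
        try:
            results.append(cache.call(address, get_power_state))
        except AccessError:
            results.append(None)
    return results


# Evicting only on ConnectionError keeps the bad session forever.
narrow = SessionCache(make_connector(), (ConnectionError,))
print(attempt(narrow, "bmc-1"))  # [None, None, None]

# Also evicting on AccessError lets the next call rebuild the session.
broad = SessionCache(make_connector(), (ConnectionError, AccessError))
print(attempt(broad, "bmc-1"))   # [None, 'On', 'On']
```

The design point is that a cache which only reacts to connection failures cannot recover from an entry whose credentials have become invalid, which is why the problem persisted until the service was restarted.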

Comment 5 Julia Kreger 2022-01-10 15:09:21 UTC
I've issued a downstream backport for the minimal sushy library change required, which should help prevent part of this issue, at least the more obvious cases the customer is hitting, *except* the downgrade of the sushy client to basic auth. Discussion and engagement is still going back and forth because of the complexity of the failure case. Ultimately this case revealed four different behavioral issues in the client library.

Comment 6 Julia Kreger 2022-01-18 14:59:41 UTC
A quick follow-up.

We believe we've reached consensus on fixing the sushy session initiation issues, which boil down to four distinct issues, and we've started the process of backporting session invalidation checks to re-launch the client library. Combined, these should eliminate this issue entirely and make connection handling far more robust. Unfortunately we're in the early stages of backporting.

Comment 7 Steve Baker 2022-01-25 21:13:22 UTC
Adding the sushy version for that part of the fix, but this bug won't go to MODIFIED until the ironic (and the other sushy) change is also available

Comment 24 errata-xmlrpc 2022-12-07 20:25:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795