Bug 2064767

Summary: Ironic node power status reverts to None after transitory network errors while communicating to servers
Product: Red Hat OpenStack
Reporter: Julia Kreger <jkreger>
Component: openstack-ironic
Assignee: Julia Kreger <jkreger>
Status: CLOSED ERRATA
QA Contact:
Severity: medium
Docs Contact:
Priority: medium
Version: 16.2 (Train)
CC: gregraka, pweeks, sbaker
Target Milestone: z3
Keywords: Triaged
Target Release: 16.2 (Train on RHEL 8.4)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-sushy-2.0.6-2.20220316034932.f354049.el8ost openstack-ironic-13.0.8-2.20220317224925.36f3105.el8ost
Doc Type: Bug Fix
Doc Text:
Before this update, the RHOSP Bare Metal service (ironic) could lose its connection to the remote Redfish baseboard management controller (BMC), causing the bare metal node to enter a maintenance state and its power status to change to `None`. Depending on environmental factors for a site, some or all of the bare metal nodes could remain in this unwanted maintenance state for an extended period of time.

Transient network connectivity issues, such as high packet loss to the BMC, caused connection caching issues when using Redfish. In cases where a session token needed to be renegotiated, the cached session object was never invalidated and connectivity to the BMC was lost.

With this update, the Bare Metal service initializes an entirely new cached session with a remote Redfish BMC when connectivity or authentication issues are detected. Additionally, this enables you to use updated credentials if the BMC passwords for the nodes are changed in the future.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-06-22 16:06:16 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Julia Kreger 2022-03-16 14:15:09 UTC
This bug was initially created as a copy of Bug #2027544

Description of problem:

A customer is finding that the power status within Ironic of some of their baremetal nodes is being set to None after a period of time (typically 24 hours or more). The customer is using Redfish to manage Dell servers via iDRACs.

After enabling debug logging within Ironic, the logs show the following sequence when the issue occurs:

1. Ironic with Redfish attempts to perform a GET against https://x.x.x.x/redfish/v1/Systems/System.Embedded.1, which fails because the authentication credentials are missing or invalid (the session has likely expired).

2. Ironic then performs a GET against https://x.x.x.x/redfish/v1/SessionService, but a transitory network issue occurs while this request is being processed and the connection fails. The error message in the Ironic debug logs includes the text "Error <....> while attempting to establish a session. Falling back to basic authentication", which maps to _do_authenticate(..) in Sushy's auth.py.

3. All subsequent Ironic / Redfish operations to that specific iDRAC now fail with invalid credentials, suggesting that the iDRAC likely does not support Redfish with Basic Authentication.
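
The sequence above can be modelled with a small, self-contained simulation. This is only an illustrative sketch, not Sushy or Ironic code: `FakeBMC`, `CachedClient`, and every method on them are hypothetical names, and the BMC is assumed to reject Basic Authentication outright.

```python
class AuthError(Exception):
    """Stand-in for an HTTP 401 from the BMC."""


class NetworkError(Exception):
    """Stand-in for a dropped connection."""


class FakeBMC:
    """Hypothetical BMC that accepts session tokens only, never basic auth."""

    def __init__(self):
        self.valid_tokens = set()
        self.drop_next_session_request = False
        self._counter = 0

    def create_session(self):
        # Simulate a transitory network failure during session creation.
        if self.drop_next_session_request:
            self.drop_next_session_request = False
            raise NetworkError("connection reset")
        self._counter += 1
        token = f"token-{self._counter}"
        self.valid_tokens.add(token)
        return token

    def expire_sessions(self):
        self.valid_tokens.clear()

    def get_system(self, auth):
        kind, token = auth
        if kind == "session" and token in self.valid_tokens:
            return {"PowerState": "On"}
        raise AuthError("401 Unauthorized")


class CachedClient:
    """Hypothetical long-lived client that keeps its auth mode once chosen."""

    def __init__(self, bmc):
        self.bmc = bmc
        self.auth = ("session", bmc.create_session())

    def _reauthenticate(self):
        try:
            self.auth = ("session", self.bmc.create_session())
        except NetworkError:
            # Step 2: the fallback is recorded on the cached object and stays.
            self.auth = ("basic", None)

    def get_power_state(self):
        try:
            return self.bmc.get_system(self.auth)["PowerState"]
        except AuthError:
            if self.auth[0] != "session":
                return None  # basic mode: no session to renegotiate
            self._reauthenticate()
        try:
            return self.bmc.get_system(self.auth)["PowerState"]
        except AuthError:
            return None


bmc = FakeBMC()
client = CachedClient(bmc)
before = client.get_power_state()        # session auth works

bmc.expire_sessions()                    # step 1: token expires
bmc.drop_next_session_request = True     # step 2: network blip on renewal
during = client.get_power_state()

after = client.get_power_state()         # step 3: network is healthy again
restarted = CachedClient(bmc).get_power_state()  # a new client, as after a restart
```

In this model the cached client keeps reporting `None` after the network recovers, while a newly constructed client negotiates a session and reads the power state again.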

Looking earlier in the logs shows that session renewal works fine when there are no networking interruptions. Also, restarting Ironic brings all the Ironic nodes that were marked as power state None back to their correct power status, presumably because Sushy has returned to authenticating using the REST API with SessionService rather than basic authentication.

This is a significant issue for the customer as they are losing Ironic's node power status and manageability in production.

The behaviour suggests that Sushy, by falling back to Basic Auth, is assuming that the target Redfish device supports that mode, which is not necessarily the case.
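
One way to get the effect of a restart without restarting the service is to discard the cached client for a BMC when an operation fails and build a new one. The sketch below illustrates that idea under stated assumptions. `SessionCache`, `call_with_fresh_session`, and `StubClient` are hypothetical names and do not reflect the actual Ironic or Sushy implementation.

```python
class SessionCache:
    """Hypothetical per-BMC client cache whose entries can be rebuilt."""

    def __init__(self, factory):
        self._factory = factory   # callable: address -> new client
        self._clients = {}

    def get(self, address):
        if address not in self._clients:
            self._clients[address] = self._factory(address)
        return self._clients[address]

    def invalidate(self, address):
        self._clients.pop(address, None)


def call_with_fresh_session(cache, address, operation, failures=(Exception,)):
    """Run operation(client); on failure, rebuild the client and retry once."""
    try:
        return operation(cache.get(address))
    except failures:
        # Drop the stale client so the next get() builds a new one.
        cache.invalidate(address)
        return operation(cache.get(address))


class StubClient:
    """Test double: the first instance built is stuck with bad credentials."""

    built = 0

    def __init__(self, address):
        StubClient.built += 1
        self.generation = StubClient.built

    def power_state(self):
        if self.generation == 1:
            raise PermissionError("stale credentials")
        return "On"


cache = SessionCache(StubClient)
state = call_with_fresh_session(
    cache, "x.x.x.x", lambda c: c.power_state(), failures=(PermissionError,)
)
```

The retry is limited to one attempt so that a BMC that is genuinely unreachable or misconfigured still surfaces an error instead of looping.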

Version-Release number of selected component (if applicable):
OSP 16.1

How reproducible:
Occurring in multiple production and lab OpenStack regions at the customer. Requires transitory network issues to observe.

Steps to Reproduce:
1. Deploy OSP 16.1 using Redfish with Ironic for management
2. Induce transitory network issues that cause connection failures during Redfish operations
3. Observe Ironic node power state marked as None over time

Actual results:
Ironic node power states are None

Expected results:
Ironic node power state should reflect the actual power state of the node

Additional info:

Customer's Ironic debug logs are available from the associated support case.

Comment 4 Julia Kreger 2022-03-30 19:33:32 UTC
Patches have merged and RPMs are built.

Comment 16 errata-xmlrpc 2022-06-22 16:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Release of components for Red Hat OpenStack Platform 16.2.3 (Train)), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:4793