Bug 2027544
| Summary: | Ironic node power status reverts to None after transitory network errors while communicating to servers | |||
|---|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Mark Jones <marjones> | |
| Component: | openstack-ironic | Assignee: | Julia Kreger <jkreger> | |
| Status: | CLOSED ERRATA | QA Contact: | ||
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 16.1 (Train) | CC: | jelynch, jkreger, jparoly, sbaker | |
| Target Milestone: | z9 | Keywords: | OtherQA, Triaged | |
| Target Release: | 16.1 (Train on RHEL 8.2) | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | python-sushy-2.0.3-1.20220315023816.0241cd9.el8ost openstack-ironic-13.0.7-1.20220330213501.3d77e61.el8ost | Doc Type: | Bug Fix | |
| Doc Text: |
Before this update, if there were repeated transient connectivity issues between the ironic-conductor service and a remote Baseboard Management Controller (BMC) using the Redfish hardware type with session authentication, the intermittent loss of connectivity could collide with a point where authentication was retried because the in-memory credentials had expired. If this collision occurred, overall connectivity was lost, and the loss persisted because of the internal session cache built into the openstack-ironic-conductor service. With this update, support for detecting this error and renegotiating the session was added to the Python DMTF Redfish library, sushy, and to the openstack-ironic service. Intermittent connectivity failures colliding with session credential re-authentication no longer result in a complete loss of the ability to communicate with the BMC until the openstack-ironic-conductor service is restarted.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 2064017 2064019 (view as bug list) | Environment: | ||
| Last Closed: | 2022-12-07 20:25:32 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2064017, 2064019 | |||
|
Description
Mark Jones
2021-11-30 01:04:59 UTC
This change [1] *might* fix this issue, but we need to model this exact scenario in a unit test to validate that. Also, the change is in stable/train but has not been backported to any 16.x release.

[1] https://review.opendev.org/c/openstack/sushy/+/795558

This appears to be a mix of the patch Steve noted plus two other distinct issues. The first is that Ironic only disqualified cached sessions internally when a ConnectionError occurred. The second is a place in the sushy library where session location information lookups could quietly fail, albeit with lots of warnings being generated. This would continue until the service is restarted, and would not be caught by Ironic to disqualify the session from the cache. Ultimately, all of this was caused by invalid access data in the cache.

I'm going to link the two patches working their way through code review upstream. The hardware vendor in question has also been notified, and they intend to test these patches as well.

I've issued a downstream backport for the minimal sushy library change required, which should help prevent part of this issue, at least the more obvious cases the customer is hitting, *except* the downgrade of the sushy client to basic auth. Discussion and engagement are still going back and forth because of the complexity of the failure case. Ultimately, this case revealed four different behavioral issues in the client library.

A quick follow-up: we believe we've reached consensus on fixing the sushy session initiation issues, which boil down to four distinct issues, and we've started the process of backporting session invalidation checks to re-launch the client library. Combined, these should eliminate this issue entirely and make connection handling far more robust. Unfortunately, we're in the early stages of backporting.
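For readers following the cache discussion above, here is a minimal sketch of the general idea: drop a cached session on an access failure as well as on a connection failure, then retry with a fresh one. It is not the actual ironic or sushy code. `BMCConnectionError`, `BMCAccessError`, `SessionCache`, and `call_bmc` are hypothetical names invented for illustration, and the real patches may be structured quite differently.

```python
class BMCConnectionError(Exception):
    """Hypothetical stand-in for a transport-level failure."""


class BMCAccessError(Exception):
    """Hypothetical stand-in for a rejected or expired session."""


class SessionCache:
    """Keeps one session object per BMC address (illustrative only)."""

    def __init__(self, factory):
        self._factory = factory  # callable that builds a new session
        self._sessions = {}

    def get(self, address):
        # Create the session lazily and reuse it on later calls.
        if address not in self._sessions:
            self._sessions[address] = self._factory(address)
        return self._sessions[address]

    def invalidate(self, address):
        # Forget the session so the next get() builds a fresh one.
        self._sessions.pop(address, None)


def call_bmc(cache, address, operation, retries=1):
    """Run operation(session), dropping the cached session on failure.

    Both error types invalidate the cache entry. Handling only the
    connection error would leave a session with bad access data cached.
    """
    for attempt in range(retries + 1):
        session = cache.get(address)
        try:
            return operation(session)
        except (BMCConnectionError, BMCAccessError):
            cache.invalidate(address)
            if attempt == retries:
                raise
```

With a cache like this, a stale session costs one failed call and one re-authentication, and the service does not need a restart to recover.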
Adding the sushy version for that part of the fix, but this bug won't go to MODIFIED until the ironic (and the other sushy) change is also available.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795 |