The circumstances are still unclear, but we've observed cleaning failing with: Unexpected error when processing next clean step. SSLError: ('timed out',): ssl.SSLError: ('timed out',) Ironic retries usual connection errors, but not this one. It should retry.
I've seen an issue on my 4.8.2 deployment, IPI/IPv6/Disconnected, the 3 masters are HP with iLo5 and we are using redfish-virtualmedia with the provisioning network disabled. The thing is, 2 nodes are stuck on clean_wait state and the other it's in available state. Adding Ironic Conductor logs also
The fixed package is available in 4.9.
Couldn't reproduce the reported problem. Closing as OtherQA. Please, re-open if happens again
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759