Description of problem:
We have observed that the ironic.service systemd unit (which starts some provisioning-related containers via podman) sometimes reports as active even though some of its containers are not responsive.
The root cause appears to be that crio deletes some containers when it restarts, even those that were started via podman - discussion is in progress to figure out the best long-term fix for that.
As a workaround, we can improve the systemd exec script so that it detects when the podman containers become broken and triggers a systemd restart. This approach will continue to work if/when the crio issues are resolved.
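The idea behind the workaround can be sketched roughly as follows. This is only an illustration, not the script from the installer PR: the container names, the polling interval, and the function names are all hypothetical, and it assumes that `podman inspect` with the `{{.State.Running}}` template is a good enough health signal. The watcher returns a non-zero status as soon as one container is missing or stopped, so that a unit configured to restart on failure would bring everything back up.

```python
import subprocess
import sys
import time

# Hypothetical container names, used only for illustration.
CONTAINERS = ["ironic-api", "ironic-conductor"]


def is_running(name):
    """Return True if podman reports the named container as running."""
    try:
        result = subprocess.run(
            ["podman", "inspect", "--format", "{{.State.Running}}", name],
            capture_output=True,
            text=True,
        )
    except OSError:
        # podman itself could not be executed; treat as broken.
        return False
    # A deleted container makes inspect exit non-zero.
    return result.returncode == 0 and result.stdout.strip() == "true"


def first_broken(names, check=is_running):
    """Return the first container that fails the check, or None."""
    for name in names:
        if not check(name):
            return name
    return None


def watch(names, check=is_running, interval=10.0, max_checks=None):
    """Poll the containers; return 1 as soon as one is broken.

    max_checks bounds the loop (useful for testing); None polls forever.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        broken = first_broken(names, check)
        if broken is not None:
            print(f"container {broken} is not running", file=sys.stderr)
            return 1
        checks += 1
        time.sleep(interval)
    return 0
```

A wrapper script would end with something like `sys.exit(watch(CONTAINERS))`, so the non-zero exit makes the unit fail; this assumes the unit uses a restart policy such as `Restart=on-failure`. Because the check only looks at the containers themselves, it does not depend on why they disappeared.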
Related upstream issues:
https://github.com/openshift/installer/pull/2249 (the workaround and immediate fix to unblock testing)
These have some additional analysis and details of the crio issues:
Note that this is ready to test, but there's a mistake in a comment that I'd like to fix via https://github.com/openshift/installer/pull/2281
Since we need a valid bug for the upstream PR, I'll move this back to POST until that merges.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.