Description of problem
----------------------

Deployment of some of our bare-metal (BM) clusters is failing because some workers are not provisioned: ironic-agent.service does not start properly on them.

Version-Release number of selected component
--------------------------------------------

OpenShift 4.10.0

How reproducible
----------------

It occurs only on some workers of our fleet on BM clusters. It is not clear what those workers have in common.

Steps to Reproduce
------------------

1. Deploy an OpenShift BM cluster using the IPI installer.
2. Monitor the BMH resources.
3. Observe some BMH resources stuck in the inspecting state.
4. Check the status of ironic-agent.service on the node(s) stuck in the inspecting state.

Actual results
--------------

When connecting to a stuck host, this message is reported:

> [systemd]
> Failed Units: 1
> ironic-agent.service

Here are the logs of ironic-agent.service:

> Mar 04 17:26:51 localhost systemd[1]: Starting Ironic Agent...
> Mar 04 17:26:52 localhost podman[2592]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1a2e68265667bf616b9bc68ec255758dc60d85dae77a54391a820755a256f55...
> Mar 04 17:26:52 localhost podman[2592]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1a2e68265667bf616b9bc68ec255758dc60d85dae77a54391a820755a256f55: error pinging docker registry quay.io: Get "http://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:41281->[::1]:53: read: connection refused
> Mar 04 17:26:52 localhost systemd[1]: ironic-agent.service: Control process exited, code=exited status=125
> Mar 04 17:26:52 localhost systemd[1]: ironic-agent.service: Failed with result 'exit-code'.
> Mar 04 17:26:52 localhost systemd[1]: Failed to start Ironic Agent.

Expected results
----------------

All hosts should be provisioned properly.

Additional info
---------------

Manually restarting ironic-agent.service allows provisioning of the host to continue successfully.
It looks like the service needs to be restarted on failure. There is a patch at https://github.com/openshift/image-customization-controller/pull/34 that should resolve this. It needs to be backported to 4.10.
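For anyone hitting this before the backport lands, "restart on failure" can be sketched as a systemd drop-in for the unit. This is only an illustration of the idea, not the content of the linked patch: the drop-in path and the 5-second delay are assumptions, and the unit shipped in the image may be defined differently.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/ironic-agent.service.d/10-restart.conf
# (path and values are assumptions, not taken from the linked PR)
[Service]
# Restart the unit when one of its processes exits with a failure,
# such as the image pull failing while DNS is not yet available.
Restart=on-failure
# Wait a few seconds between attempts so the network has time to come up.
RestartSec=5
```

With something like this in place, a failed pull at boot would be retried by systemd instead of leaving the unit in the failed state until someone restarts it by hand.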
@lshilin what info should I provide?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069