Bug 2060968 - Installation failing due to ironic-agent.service not starting properly
Summary: Installation failing due to ironic-agent.service not starting properly
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Riccardo Pittau
QA Contact: Lubov
URL:
Whiteboard:
Depends On:
Blocks: 2061977
Reported: 2022-03-04 17:59 UTC by Denis Ollier
Modified: 2022-08-10 10:52 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2061977
Environment:
Last Closed: 2022-08-10 10:52:11 UTC
Target Upstream Version:
Embargoed:
rpittau: needinfo-


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift image-customization-controller pull 34 | 0 | None | Merged | Restart ironic-agent.service when it fails | 2022-03-08 17:25:00 UTC
Github | openshift image-customization-controller pull 51 | 0 | None | open | Bug 2060968: Add delay between ironic-agent restart, but restart forever | 2022-06-10 12:44:41 UTC
Red Hat Product Errata | RHSA-2022:5069 | 0 | None | None | None | 2022-08-10 10:52:30 UTC

Description Denis Ollier 2022-03-04 17:59:00 UTC
Description of problem
----------------------

Deployment of some of our BM clusters is failing: some workers are not provisioned because ironic-agent.service does not start properly.


Version-Release number of selected component
--------------------------------------------

OpenShift 4.10.0


How reproducible
----------------

It occurs only on some workers of our fleet on BM clusters. Not sure what those workers have in common.


Steps to Reproduce
------------------

1. Deploy an OpenShift BM cluster using the IPI installer.
2. Monitor the BMH resources.
3. Observe that some BMHs are stuck in the inspecting state.
4. Check the status of ironic-agent.service on the node(s) stuck in the inspecting state.
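
Steps 2 and 3 can be scripted. The following is a minimal sketch, not part of the original report. It assumes the output of `oc get bmh -A -o json` has been loaded into a dict and that each BareMetalHost reports its state under `status.provisioning.state`. The helper name `hosts_in_state` and the sample host names are hypothetical.

```python
import json


def hosts_in_state(bmh_list, state="inspecting"):
    """Return (namespace, name) pairs of BareMetalHosts in the given state."""
    matches = []
    for item in bmh_list.get("items", []):
        meta = item.get("metadata", {})
        # Assumption: the provisioning state lives at status.provisioning.state.
        current = item.get("status", {}).get("provisioning", {}).get("state")
        if current == state:
            matches.append((meta.get("namespace"), meta.get("name")))
    return matches


# Hypothetical sample shaped like the JSON printed by `oc get bmh -A -o json`.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"namespace": "openshift-machine-api", "name": "worker-0"},
     "status": {"provisioning": {"state": "provisioned"}}},
    {"metadata": {"namespace": "openshift-machine-api", "name": "worker-1"},
     "status": {"provisioning": {"state": "inspecting"}}}
  ]
}
""")

stuck = hosts_in_state(SAMPLE)
```

With real data, `SAMPLE` would be replaced by the parsed command output. Any host that stays in the returned list for a long time is a candidate for step 4.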


Actual results
--------------

When connecting to a stuck host, this message is reported. 

> [systemd]
> Failed Units: 1
>   ironic-agent.service

Here are the logs of the ironic-agent.service:

> Mar 04 17:26:51 localhost systemd[1]: Starting Ironic Agent...
> Mar 04 17:26:52 localhost podman[2592]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1a2e68265667bf616b9bc68ec255758dc60d85dae77a54391a820755a256f55...
> Mar 04 17:26:52 localhost podman[2592]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1a2e68265667bf616b9bc68ec255758dc60d85dae77a54391a820755a256f55: error pinging docker registry quay.io: Get "http://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:41281->[::1]:53: read: connection refused
> Mar 04 17:26:52 localhost systemd[1]: ironic-agent.service: Control process exited, code=exited status=125
> Mar 04 17:26:52 localhost systemd[1]: ironic-agent.service: Failed with result 'exit-code'.
> Mar 04 17:26:52 localhost systemd[1]: Failed to start Ironic Agent.


Expected results
----------------

All hosts should be provisioned properly.


Additional info
---------------

Restarting ironic-agent.service manually allows provisioning of the host to continue successfully.

Comment 2 Bob Fournier 2022-03-08 17:25:00 UTC
Looks like the service needs to be restarted on failure. There is a patch here https://github.com/openshift/image-customization-controller/pull/34 which should resolve this.

This needs to be backported to 4.10
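
The restart-on-failure behaviour discussed here can be illustrated in isolation. This is a sketch of the general retry pattern only, not the contents of the linked patch. `run_until_success`, its parameters, and the fake start function in the usage lines are hypothetical names introduced for illustration.

```python
import time


def run_until_success(start, delay_seconds=5.0, sleep=time.sleep):
    """Call start() until it returns exit code 0, pausing between attempts.

    Returns the number of attempts made. There is no retry limit, so a
    start function that never succeeds is retried forever.
    """
    attempts = 0
    while True:
        attempts += 1
        if start() == 0:
            return attempts
        # Wait before retrying so a transient failure (for example, name
        # resolution not being available yet) has time to clear.
        sleep(delay_seconds)


# Usage with a fake start function: fails twice with exit code 125, then succeeds.
exit_codes = iter([125, 125, 0])
waits = []
attempts = run_until_success(
    lambda: next(exit_codes), delay_seconds=5.0, sleep=waits.append
)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays. The pause between attempts avoids retrying in a tight loop while the underlying condition is still failing.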

Comment 5 Lubov 2022-03-28 20:43:17 UTC
@

Comment 6 Denis Ollier 2022-03-28 21:01:35 UTC
@lshilin@redhat.co

Comment 7 Denis Ollier 2022-03-28 21:03:20 UTC
@lshilin what info should I provide?

Comment 20 errata-xmlrpc 2022-08-10 10:52:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

