Bug 2060968

Summary: Installation failing due to ironic-agent.service not starting properly
Product: OpenShift Container Platform
Reporter: Denis Ollier <dollierp>
Component: Bare Metal Hardware Provisioning
Sub Component: ironic
Assignee: Riccardo Pittau <rpittau>
QA Contact: Lubov <lshilin>
Status: CLOSED ERRATA
Severity: high
Priority: medium
CC: bfournie, kbidarka, kmajcher, lshilin, rpittau
Version: 4.10
Keywords: OtherQA, Triaged
Flags: rpittau: needinfo-
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2022-08-10 10:52:11 UTC
Bug Blocks: 2061977

Description Denis Ollier 2022-03-04 17:59:00 UTC
Description of problem
----------------------

Deployment of some of our BM clusters is failing: some workers are not provisioned because ironic-agent.service does not start properly.


Version-Release number of selected component
--------------------------------------------

OpenShift 4.10.0


How reproducible
----------------

It occurs only on some workers in our fleet of BM clusters. It is not clear what those workers have in common.


Steps to Reproduce
------------------

1. Deploy an OpenShift BM cluster using the IPI installer.
2. Monitor the BMH resources.
3. Observe some BMHs being stuck in the inspecting state.
4. Check the status of ironic-agent.service on the node(s) stuck in the inspecting state.
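
Steps 2 and 3 can be scripted. The following is a minimal sketch, not part of the original report: it assumes the BMH list has been exported as JSON (for example with `oc get bmh -A -o json`) and that each host exposes its state under `status.provisioning.state`. The function name `hosts_in_state` is hypothetical.

```python
def hosts_in_state(bmh_list, state="inspecting"):
    """Return (namespace, name) pairs of BareMetalHosts in the given state.

    `bmh_list` is the parsed JSON of a BareMetalHost list. The field path
    status.provisioning.state is an assumption about the resource layout.
    """
    matches = []
    for item in bmh_list.get("items", []):
        metadata = item.get("metadata", {})
        provisioning = item.get("status", {}).get("provisioning", {})
        if provisioning.get("state") == state:
            matches.append((metadata.get("namespace"), metadata.get("name")))
    return matches
```

Feeding it the parsed output of the `oc` command, e.g. `hosts_in_state(json.load(sys.stdin))`, gives the list of hosts to log in to for step 4.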


Actual results
--------------

When connecting to a stuck host, the following message is reported:

> [systemd]
> Failed Units: 1
>   ironic-agent.service

Here are the logs of the ironic-agent.service:

> Mar 04 17:26:51 localhost systemd[1]: Starting Ironic Agent...
> Mar 04 17:26:52 localhost podman[2592]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1a2e68265667bf616b9bc68ec255758dc60d85dae77a54391a820755a256f55...
> Mar 04 17:26:52 localhost podman[2592]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e1a2e68265667bf616b9bc68ec255758dc60d85dae77a54391a820755a256f55: error pinging docker registry quay.io: Get "http://quay.io/v2/": dial tcp: lookup quay.io on [::1]:53: read udp [::1]:41281->[::1]:53: read: connection refused
> Mar 04 17:26:52 localhost systemd[1]: ironic-agent.service: Control process exited, code=exited status=125
> Mar 04 17:26:52 localhost systemd[1]: ironic-agent.service: Failed with result 'exit-code'.
> Mar 04 17:26:52 localhost systemd[1]: Failed to start Ironic Agent.
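
The podman error above indicates that the image pull failed at name resolution: the lookup of quay.io went to a resolver on localhost (`[::1]:53`), which refused the connection. A minimal sketch for spotting this pattern across several hosts follows. The regular expression is derived only from the single log line quoted here, and the name `parse_dns_failure` is hypothetical.

```python
import re

# Matches the Go resolver error seen in the podman log line, e.g.
# "dial tcp: lookup quay.io on [::1]:53: ... connection refused".
DNS_FAILURE = re.compile(
    r"dial tcp: lookup (?P<host>\S+) on (?P<resolver>\S+?): .*connection refused"
)


def parse_dns_failure(line):
    """Return (host, resolver) if the line is a refused DNS lookup, else None."""
    match = DNS_FAILURE.search(line)
    if match is None:
        return None
    return match.group("host"), match.group("resolver")
```

A line that yields a result points at DNS not being available when the unit started, rather than at a registry-side problem.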


Expected results
----------------

All hosts should be provisioned properly.


Additional info
---------------

Manually restarting ironic-agent.service allows provisioning of the host to continue successfully.

Comment 2 Bob Fournier 2022-03-08 17:25:00 UTC
It looks like the service needs to be restarted on failure. There is a patch at https://github.com/openshift/image-customization-controller/pull/34 that should resolve this.

This needs to be backported to 4.10.
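
To illustrate what "restarted on failure" means, here is a minimal sketch of the behaviour in Python. It does not reproduce the linked patch, whose contents are not quoted in this report. On a systemd host the same effect is normally obtained with the unit directives `Restart=on-failure` and `RestartSec=`. The function name, attempt limit, and delay are assumptions chosen for illustration.

```python
import subprocess
import time


def run_with_restart(cmd, max_attempts=5, delay_sec=5.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run `cmd`, re-running it after a delay for as long as it exits non-zero.

    Returns the number of attempts used. Raises RuntimeError if the command
    still fails after `max_attempts` tries. `run` and `sleep` are injectable
    so the loop can be exercised without starting real processes.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < max_attempts:
            # Give transient dependencies (such as DNS) time to come up.
            sleep(delay_sec)
    raise RuntimeError(f"{cmd!r} still failing after {max_attempts} attempts")
```

With a transient failure like the refused DNS lookup reported above, a later attempt would be expected to succeed once the resolver is up, which matches the observation that a manual restart unblocks provisioning.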

Comment 5 Lubov 2022-03-28 20:43:17 UTC
@

Comment 6 Denis Ollier 2022-03-28 21:01:35 UTC
@lshilin@redhat.co

Comment 7 Denis Ollier 2022-03-28 21:03:20 UTC
@lshilin what info should I provide?

Comment 20 errata-xmlrpc 2022-08-10 10:52:11 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069