1880351 – [OCP 4.5] openstack IPI installation stuck after machine-config-daemon-firstboot failed on masters


Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Ben Howard
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-09-18 10:40 UTC by Mario Abajo
Modified:	2023-12-15 19:25 UTC
CC List:	3 users
Fixed In Version:
Doc Type:
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-02-25 18:09:03 UTC
Target Upstream Version:
Embargoed:


Description Mario Abajo 2020-09-18 10:40:44 UTC

Description of problem:
The installation got stuck while waiting for the API. We accessed the masters via SSH and found that "machine-config-daemon-firstboot.service" had failed with the following error:

~~~
● machine-config-daemon-firstboot.service - Machine Config Daemon Firstboot
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-firstboot.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2020-09-18 07:58:24 UTC; 29min ago
  Process: 1831 ExecStart=/usr/libexec/machine-config-daemon firstboot-complete-machineconfig (code=exited, status=1/FAILURE)
 Main PID: 1831 (code=exited, status=1/FAILURE)
      CPU: 245ms

Sep 18 07:56:04 openshift-j56x6-master-2 machine-config-daemon[1831]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af0f67519dbd7ffe2732d89cfa342ee55557f0dc5e8ee8c674eed5ff209bb15a": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af0f67519dbd7ffe2732d89cfa342ee55557f0dc5e8ee8c674eed5ff209bb15a: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af0f67519dbd7ffe2732d89cfa342ee55557f0dc5e8ee8c674eed5ff209bb15a: error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp 52.201.127.208:443: i/o timeout
Sep 18 07:56:04 openshift-j56x6-master-2 machine-config-daemon[1831]: W0918 07:56:03.991548    1893 run.go:40] podman failed: exit status 125; retrying...
Sep 18 07:57:24 openshift-j56x6-master-2 machine-config-daemon[1831]: I0918 07:57:23.992009    1893 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af0f67519dbd7ffe2732d89cfa342ee55557f0dc5e8ee8c674eed5ff209bb15a
Sep 18 07:58:24 openshift-j56x6-master-2 machine-config-daemon[1831]: I0918 07:58:24.168875    1831 update.go:813] Updating files
Sep 18 07:58:24 openshift-j56x6-master-2 machine-config-daemon[1831]: I0918 07:58:24.169303    1831 update.go:850] Deleting stale data
Sep 18 07:58:24 openshift-j56x6-master-2 machine-config-daemon[1831]: error: failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1
Sep 18 07:58:24 openshift-j56x6-master-2 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Sep 18 07:58:24 openshift-j56x6-master-2 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Sep 18 07:58:24 openshift-j56x6-master-2 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Sep 18 07:58:24 openshift-j56x6-master-2 systemd[1]: machine-config-daemon-firstboot.service: Consumed 245ms CPU time
~~~

We checked the quay.io URL with curl and it worked, so the connection error appears to have been transient.
After rebooting the node, the service started correctly and the installation continued as expected. We had to do that on all three masters, so it looks to me like the networking on the OpenStack side was not fully operational when the machines were created and booted, and it only needed a little more time to come up.

Version-Release number of the following components:
OpenShift 4.5.9
OpenStack 16

How reproducible:
Only in customer environment

Steps to Reproduce:
1. 
2.
3.

Actual results:


Expected results:
This should not happen; the network should be fully working from the very beginning. Still, it would be useful for this service to have a more robust mechanism, such as retrying the image pull several times or auto-restarting the service, so that it tolerates these transient errors.
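
To illustrate the kind of retry behaviour requested here, the following is a minimal sketch in Python. It is not the machine-config-daemon's actual code (the daemon is a separate binary, and the log above shows it already retries podman at least once); the function names `retry` and `pull_image` and the attempt count, delay, and backoff factor are all hypothetical. The only detail taken from this report is the `podman pull -q --authfile /var/lib/kubelet/config.json` command line seen in the log.

```python
import subprocess
import time
from typing import Callable


def retry(action: Callable[[], bool],
          attempts: int = 5,
          initial_delay: float = 1.0,
          factor: float = 2.0,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call action until it returns True or the attempts run out.

    Waits initial_delay seconds after the first failure and multiplies
    the wait by factor after each further failure (exponential backoff).
    All defaults are illustrative assumptions, not values from the bug.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        if action():
            return True
        if attempt < attempts:
            # Give a slow network time to come up before trying again.
            sleep(delay)
            delay *= factor
    return False


def pull_image(image: str,
               authfile: str = "/var/lib/kubelet/config.json") -> bool:
    """Run one podman pull (same flags as in the log); True on exit 0."""
    cmd = ["podman", "pull", "-q", "--authfile", authfile, image]
    return subprocess.run(cmd).returncode == 0
```

A caller could wrap the pull as `retry(lambda: pull_image(image), attempts=10, initial_delay=5.0)`, where `image` is the release image reference. With enough attempts, the total wait would cover the short window in which the OpenStack networking was not yet reachable, rather than failing the unit outright.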


Additional info:

Comment 1 Martin André 2020-09-18 13:10:26 UTC

A similar issue was fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1870343.
We might need to implement a retry mechanism for machine-config-daemon-firstboot.service in 4.5, similar to the one that was implemented in https://github.com/openshift/machine-config-operator/pull/2055.

Comment 2 Martin André 2020-09-18 14:30:00 UTC

Re-assigning to machine config operator component.

Comment 6 Ben Howard 2021-02-25 18:09:03 UTC

Closing as won't fix since this is 4.5 and there have been no other reports for 4.6 or hits with 4.7.
