Description of problem:

The installation got stuck while waiting for the API. We accessed the masters via ssh and found that "machine-config-daemon-firstboot.service" had failed with this error:

~~~
● machine-config-daemon-firstboot.service - Machine Config Daemon Firstboot
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-firstboot.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Fri 2020-09-18 07:58:24 UTC; 29min ago
  Process: 1831 ExecStart=/usr/libexec/machine-config-daemon firstboot-complete-machineconfig (code=exited, status=1/FAILURE)
 Main PID: 1831 (code=exited, status=1/FAILURE)
      CPU: 245ms

Sep 18 07:56:04 openshift-j56x6-master-2 machine-config-daemon[1831]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af0f67519dbd7ffe2732d89cfa342ee55557f0dc5e8ee8c674eed5ff209bb15a": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af0f67519dbd7ffe2732d89cfa342ee55557f0dc5e8ee8c674eed5ff209bb15a: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af0f67519dbd7ffe2732d89cfa342ee55557f0dc5e8ee8c674eed5ff209bb15a: error pinging docker registry quay.io: Get https://quay.io/v2/: dial tcp 52.201.127.208:443: i/o timeout
Sep 18 07:56:04 openshift-j56x6-master-2 machine-config-daemon[1831]: W0918 07:56:03.991548 1893 run.go:40] podman failed: exit status 125; retrying...
Sep 18 07:57:24 openshift-j56x6-master-2 machine-config-daemon[1831]: I0918 07:57:23.992009 1893 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af0f67519dbd7ffe2732d89cfa342ee55557f0dc5e8ee8c674eed5ff209bb15a
Sep 18 07:58:24 openshift-j56x6-master-2 machine-config-daemon[1831]: I0918 07:58:24.168875 1831 update.go:813] Updating files
Sep 18 07:58:24 openshift-j56x6-master-2 machine-config-daemon[1831]: I0918 07:58:24.169303 1831 update.go:850] Deleting stale data
Sep 18 07:58:24 openshift-j56x6-master-2 machine-config-daemon[1831]: error: failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1
Sep 18 07:58:24 openshift-j56x6-master-2 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Sep 18 07:58:24 openshift-j56x6-master-2 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Sep 18 07:58:24 openshift-j56x6-master-2 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Sep 18 07:58:24 openshift-j56x6-master-2 systemd[1]: machine-config-daemon-firstboot.service: Consumed 245ms CPU time
~~~

We checked the quay.io URL with curl and it worked, so the connection error seems to have been transient. After rebooting the node, the service started correctly and the installation continued as expected. We had to do that on all three masters, so it looks to me like networking on the OpenStack side was not fully operational when the machines were created and booted, and needed only a little more time to come up.

Version-Release number of the following components:
OpenShift 4.5.9
OpenStack 16

How reproducible:
Only in the customer environment

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
This should not happen; the network should be fully working from the very beginning. Still, it would be useful for this service to have a different mechanism, such as retrying the image pull or auto-restarting the service, so that it is more robust to transient errors like this one.

Additional info:
A similar issue was fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1870343. For 4.5, we might need to give machine-config-daemon-firstboot.service a retry mechanism similar to the one implemented in https://github.com/openshift/machine-config-operator/pull/2055.
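For illustration only, here is a minimal sketch of the kind of retry being discussed. It is not the code from the linked PR, and the machine-config-daemon itself is not written in Python. The names retry_with_backoff and pull_image, and the attempt count and delays, are hypothetical; the podman command line is the one that appears in the log in the description.

```python
import subprocess
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 1.0,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call op() until it succeeds or the attempts are exhausted.

    Waits initial_delay seconds after the first failure and multiplies
    the wait by factor after each further failure. The last error is
    re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the real error
            sleep(delay)
            delay *= factor
    raise AssertionError("unreachable")


def pull_image(image: str, authfile: str = "/var/lib/kubelet/config.json") -> None:
    """Hypothetical helper: run the podman pull seen in the log.

    check=True raises CalledProcessError on a non-zero exit status,
    which is what makes the call retryable by retry_with_backoff.
    """
    subprocess.run(
        ["podman", "pull", "-q", "--authfile", authfile, image],
        check=True,
    )
```

A caller would wrap the pull as retry_with_backoff(lambda: pull_image(image)), where image is the release image reference. Backing off between attempts, rather than retrying immediately, gives the network on the infrastructure side time to come up, which is the failure mode described in this bug.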
Re-assigning to machine config operator component.
Closing as won't fix, since this was reported on 4.5 and there have been no other reports of it on 4.6 or 4.7.