Description of problem: Install ipi-on-vsphere against 4.11.0-0.nightly-2022-04-06-213816, machine-config-daemon-firstboot.service failed on master node. Check that service log on master node, got error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2614439046/srv/repo:da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f --custom-origin-description Managed by machine-config-operator: error: Timeout was reached [root@jima0407vspa-ns55d-master-0 ~]# journalctl -u machine-config-daemon-firstboot.service > mcd.log [root@jima0407vspa-ns55d-master-0 ~]# vi mcd.log [root@jima0407vspa-ns55d-master-0 ~]# cat mcd.log -- Logs begin at Thu 2022-04-07 08:03:07 UTC, end at Thu 2022-04-07 08:12:21 UTC. -- Apr 07 08:06:42 jima0407vspa-ns55d-master-0 systemd[1]: Starting Machine Config Daemon Firstboot... Apr 07 08:06:42 jima0407vspa-ns55d-master-0 sh[2819]: sed: can't read /etc/yum.repos.d/*.repo: No such file or directory Apr 07 08:06:42 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:42.632420 2821 update.go:1882] Running: systemctl start rpm-ostreed Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.250831 2821 rpm-ostree.go:325] Running captured: rpm-ostree status --json Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.447760 2821 rpm-ostree.go:325] Running captured: rpm-ostree status --json Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.484997 2821 daemon.go:233] Booted osImageURL: (411.85.202203181601-0) Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.486338 2821 update.go:1919] Adding SIGTERM protection Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.486586 2821 update.go:504] Checking Reconcilable for config mco-empty-mc to rendered-master-b19b4e8818e75cef029fe79f1bcd7edc Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.487157 2821 update.go:1897] Starting update from mco-empty-mc to rendered-master-b19b4e8818e75cef029fe79f1bcd7edc: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false} Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.490174 2821 update.go:1114] Updating files Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.490197 2821 update.go:1179] Deleting stale data Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.490295 2821 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-2614439046 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f Apr 07 08:07:21 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:21.877989 2821 update.go:1759] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f Apr 07 08:07:21 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:21.878022 2821 rpm-ostree.go:325] Running captured: rpm-ostree status --json Apr 07 08:07:21 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:21.916210 2821 rpm-ostree.go:238] Current origin is not custom Apr 07 08:07:22 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:22.522897 2821 rpm-ostree.go:265] Pivoting to: 411.85.202203242008-0 (da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4) Apr 07 08:07:22 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:22.522922 2821 rpm-ostree.go:297] Executing rebase from repo path /run/mco-machine-os-content/os-content-2614439046/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f and checksum da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 Apr 07 08:07:22 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:22.522930 2821 update.go:1882] Running: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2614439046/srv/repo:da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f --custom-origin-description Managed by machine-config-operator Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:08:53.068118 2821 update.go:1114] Updating files Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:08:53.068149 2821 update.go:1179] Deleting stale data Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:08:53.068158 2821 update.go:1927] Removing SIGTERM protection Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2614439046/srv/repo:da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f --custom-origin-description Managed by machine-config-operator: error: Timeout was reached Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: : exit status 1 Apr 07 08:08:53 jima0407vspa-ns55d-master-0 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE Apr 07 08:08:53 jima0407vspa-ns55d-master-0 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'. Apr 07 08:08:53 jima0407vspa-ns55d-master-0 systemd[1]: Failed to start Machine Config Daemon Firstboot. Apr 07 08:08:53 jima0407vspa-ns55d-master-0 systemd[1]: machine-config-daemon-firstboot.service: Consumed 20.396s CPU time Then try to update OS manually by running below command on all master nodes, os was updated to expected version, and reboot all nodes manually based on message printed by below command, installation is ongoing, but same issue is happened again on workers when worker is provisioned. [root@jima0407vspa-ns55d-master-0 ~]# /run/bin/machine-config-daemon pivot quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f I0407 08:32:31.738481 3439 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-3889429086 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f I0407 08:32:46.623172 3439 rpm-ostree.go:325] Running captured: rpm-ostree status --json I0407 08:32:46.732327 3439 rpm-ostree.go:238] Current origin is not custom I0407 08:32:47.331796 3439 rpm-ostree.go:265] Pivoting to: 411.85.202203242008-0 (da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4) I0407 08:32:47.331821 3439 rpm-ostree.go:297] Executing rebase from repo path /run/mco-machine-os-content/os-content-3889429086/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f and checksum da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 I0407 08:32:47.331834 3439 update.go:1882] Running: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-3889429086/srv/repo:da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f --custom-origin-description Managed by machine-config-operator 83 metadata, 76 content objects imported; 910.0 MB content written Scanning metadata: 1709... done Staging deployment... done Upgraded: cri-o 1.24.0-1.rhaos4.11.gitbbe7b1e.el8 -> 1.24.0-5.rhaos4.11.gitd020fdb.el8 openshift-clients 4.11.0-202203180646.p0.ge9566cc.assembly.stream.el8 -> 4.11.0-202203211330.p0.ge9566cc.assembly.stream.el8 openshift-hyperkube 4.11.0-202203180038.p0.g02aefbf.assembly.stream.el8 -> 4.11.0-202203221325.p0.g54654d2.assembly.stream.el8 openvswitch2.16 2.16.0-53.el8fdp -> 2.16.0-57.el8fdp tzdata 2021e-1.el8 -> 2022a-1.el8 Changes queued for next boot. Run "systemctl reboot" to start a reboot [root@jima0407vspa-ns55d-master-0 ~]# rpm-ostree status State: idle Deployments: pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f CustomOrigin: Managed by machine-config-operator Version: 411.85.202203242008-0 (2022-03-24T20:11:26Z) Diff: 5 upgraded ● 6bbf853bd9add1d22e4b40a674b978cc74cbcbd6ab8174430b554e4c977179bc Version: 411.85.202203181601-0 (2022-03-18T16:05:02Z) Version-Release number of MCO (Machine Config Operator) (if applicable): Platform: vsphere (one local env on QE side, it works well on VMC) Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): maybe How reproducible: Always on QE local vsphere ENV with 4.11.0-0.nightly-2022-04-06-213816, it works well with 4.11.0-0.nightly-2022-03-26-130745 on the same local ENV. And it also works on VMC. Steps to Reproduce: 1. Install ipi-on-vsphere against 4.11.0-0.nightly-2022-04-06-213816 2. installation failed at waiting for bootstrap completed. 3. Actual results: Installation is failed Expected results: Installation is successful. Additional info: Attached bootstrap log bundle.
OK basically we're missing retries on the firstboot side.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069