Bug 2073021 - [vsphere] Failed to update OS on master nodes
Summary: [vsphere] Failed to update OS on master nodes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 4.11.0
Assignee: Yu Qi Zhang
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-04-07 13:23 UTC by jima
Modified: 2022-08-10 11:04 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:04:38 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3070 0 None open Bug 2073021: firstboot: Retry on failure 2022-04-07 14:47:20 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:04:59 UTC

Description jima 2022-04-07 13:23:23 UTC
Description of problem:

Install ipi-on-vsphere against 4.11.0-0.nightly-2022-04-06-213816, machine-config-daemon-firstboot.service failed on master node. Check that service log on master node, got error:
failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2614439046/srv/repo:da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f --custom-origin-description Managed by machine-config-operator: error: Timeout was reached

[root@jima0407vspa-ns55d-master-0 ~]# journalctl -u machine-config-daemon-firstboot.service > mcd.log
[root@jima0407vspa-ns55d-master-0 ~]# vi mcd.log
[root@jima0407vspa-ns55d-master-0 ~]# cat mcd.log
-- Logs begin at Thu 2022-04-07 08:03:07 UTC, end at Thu 2022-04-07 08:12:21 UTC. --
Apr 07 08:06:42 jima0407vspa-ns55d-master-0 systemd[1]: Starting Machine Config Daemon Firstboot...
Apr 07 08:06:42 jima0407vspa-ns55d-master-0 sh[2819]: sed: can't read /etc/yum.repos.d/*.repo: No such file or directory
Apr 07 08:06:42 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:42.632420    2821 update.go:1882] Running: systemctl start rpm-ostreed
Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.250831    2821 rpm-ostree.go:325] Running captured: rpm-ostree status --json
Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.447760    2821 rpm-ostree.go:325] Running captured: rpm-ostree status --json
Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.484997    2821 daemon.go:233] Booted osImageURL:  (411.85.202203181601-0)
Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.486338    2821 update.go:1919] Adding SIGTERM protection
Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.486586    2821 update.go:504] Checking Reconcilable for config mco-empty-mc to rendered-master-b19b4e8818e75cef029fe79f1bcd7edc
Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.487157    2821 update.go:1897] Starting update from mco-empty-mc to rendered-master-b19b4e8818e75cef029fe79f1bcd7edc: &{osUpdate:true kargs:false fips:false passwd:false files:false units:false kernelType:false extensions:false}
Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.490174    2821 update.go:1114] Updating files
Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.490197    2821 update.go:1179] Deleting stale data
Apr 07 08:06:45 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:06:45.490295    2821 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-2614439046 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f
Apr 07 08:07:21 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:21.877989    2821 update.go:1759] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f
Apr 07 08:07:21 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:21.878022    2821 rpm-ostree.go:325] Running captured: rpm-ostree status --json
Apr 07 08:07:21 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:21.916210    2821 rpm-ostree.go:238] Current origin is not custom
Apr 07 08:07:22 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:22.522897    2821 rpm-ostree.go:265] Pivoting to: 411.85.202203242008-0 (da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4)
Apr 07 08:07:22 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:22.522922    2821 rpm-ostree.go:297] Executing rebase from repo path /run/mco-machine-os-content/os-content-2614439046/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f and checksum da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4
Apr 07 08:07:22 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:07:22.522930    2821 update.go:1882] Running: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2614439046/srv/repo:da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f --custom-origin-description Managed by machine-config-operator
Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:08:53.068118    2821 update.go:1114] Updating files
Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:08:53.068149    2821 update.go:1179] Deleting stale data
Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: I0407 08:08:53.068158    2821 update.go:1927] Removing SIGTERM protection
Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: error: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f : error running rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-2614439046/srv/repo:da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f --custom-origin-description Managed by machine-config-operator: error: Timeout was reached
Apr 07 08:08:53 jima0407vspa-ns55d-master-0 machine-config-daemon[2821]: : exit status 1
Apr 07 08:08:53 jima0407vspa-ns55d-master-0 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=exited, status=1/FAILURE
Apr 07 08:08:53 jima0407vspa-ns55d-master-0 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'exit-code'.
Apr 07 08:08:53 jima0407vspa-ns55d-master-0 systemd[1]: Failed to start Machine Config Daemon Firstboot.
Apr 07 08:08:53 jima0407vspa-ns55d-master-0 systemd[1]: machine-config-daemon-firstboot.service: Consumed 20.396s CPU time

Then try to update OS manually by running below command on all master nodes, os was updated to expected version, and reboot all nodes manually based on message printed by below command, installation is ongoing, but same issue is happened again on workers when worker is provisioned.

[root@jima0407vspa-ns55d-master-0 ~]# /run/bin/machine-config-daemon pivot quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f
I0407 08:32:31.738481    3439 run.go:18] Running: nice -- ionice -c 3 oc image extract --path /:/run/mco-machine-os-content/os-content-3889429086 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f
I0407 08:32:46.623172    3439 rpm-ostree.go:325] Running captured: rpm-ostree status --json
I0407 08:32:46.732327    3439 rpm-ostree.go:238] Current origin is not custom
I0407 08:32:47.331796    3439 rpm-ostree.go:265] Pivoting to: 411.85.202203242008-0 (da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4)
I0407 08:32:47.331821    3439 rpm-ostree.go:297] Executing rebase from repo path /run/mco-machine-os-content/os-content-3889429086/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f and checksum da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4
I0407 08:32:47.331834    3439 update.go:1882] Running: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-3889429086/srv/repo:da4f20b403a77d1c8143c50a8a439c7f30be19de07985c90652115acc3ad5ed4 --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f --custom-origin-description Managed by machine-config-operator
83 metadata, 76 content objects imported; 910.0 MB content written
Scanning metadata: 1709... done
Staging deployment... done
Upgraded:
  cri-o 1.24.0-1.rhaos4.11.gitbbe7b1e.el8 -> 1.24.0-5.rhaos4.11.gitd020fdb.el8
  openshift-clients 4.11.0-202203180646.p0.ge9566cc.assembly.stream.el8 -> 4.11.0-202203211330.p0.ge9566cc.assembly.stream.el8
  openshift-hyperkube 4.11.0-202203180038.p0.g02aefbf.assembly.stream.el8 -> 4.11.0-202203221325.p0.g54654d2.assembly.stream.el8
  openvswitch2.16 2.16.0-53.el8fdp -> 2.16.0-57.el8fdp
  tzdata 2021e-1.el8 -> 2022a-1.el8
Changes queued for next boot. Run "systemctl reboot" to start a reboot
[root@jima0407vspa-ns55d-master-0 ~]# rpm-ostree status
State: idle
Deployments:
  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d1adb8ced0b692db72ed74f8f547d769ef319e10d8917cb8ae97bafbb717955f
              CustomOrigin: Managed by machine-config-operator
                   Version: 411.85.202203242008-0 (2022-03-24T20:11:26Z)
                      Diff: 5 upgraded

● 6bbf853bd9add1d22e4b40a674b978cc74cbcbd6ab8174430b554e4c977179bc
                   Version: 411.85.202203181601-0 (2022-03-18T16:05:02Z)


Version-Release number of MCO (Machine Config Operator) (if applicable):

Platform: vsphere (one local env on QE side, it works well on VMC)

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): maybe

How reproducible:
Always on QE local vsphere ENV with 4.11.0-0.nightly-2022-04-06-213816, it works well with 4.11.0-0.nightly-2022-03-26-130745 on the same local ENV.
And it also works on VMC.

Steps to Reproduce:
1. Install ipi-on-vsphere against 4.11.0-0.nightly-2022-04-06-213816
2. installation failed at waiting for bootstrap completed.
3.

Actual results:
Installation is failed

Expected results:
Installation is successful.

Additional info:
Attached bootstrap log bundle.

Comment 2 Colin Walters 2022-04-07 14:43:35 UTC
OK basically we're missing retries on the firstboot side.

Comment 7 errata-xmlrpc 2022-08-10 11:04:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069


Note You need to log in before you can comment on or make changes to this bug.