Description of problem:
Upgrade from a fresh 4.1.0 installation to the latest 4.1.2 stable release failed with the error below (master and worker nodes impacted):

E0624 09:51:53.090052 8229 daemon.go:1111] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
E0624 09:51:53.090083 8229 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-481fd612b07ecc5a5e2a6fac937de410

Version-Release number of selected component (if applicable):
4.1.0 fresh install upgraded to 4.1.2

How reproducible:
Install a fresh 4.1.0 cluster on baremetal VMs running on top of libvirt. Once 4.1.0 shows 'completed' in the Cluster Settings overview and no more updates are running on the Cluster Operators page, start a new update to 4.1.2 by selecting "Update Now". Eventually the machine-config operator YAML shows this:

master: >-
  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync":
  "Node master03 is reporting: \"failed to run pivot: failed to start pivot.service: exit status 4\""
worker: >-
  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync":
  "Node worker01 is reporting: \"failed to run pivot: failed to start pivot.service: exit status 4\""

From `oc logs` of the MCD for master03:

I0624 09:50:53.382524 6526 update.go:836] controller syncing started
I0624 09:50:53.385133 6526 update.go:723] logger doesn't support --jounald, grepping the journal
I0624 09:50:53.503796 6526 update.go:836] error loading pending config open /etc/machine-config-daemon/state.json: no such file or directory
I0624 09:50:53.506415 6526 daemon.go:703] Current config: rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a
I0624 09:50:53.506450 6526 daemon.go:704] Desired config: rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 09:50:53.514790 6526 update.go:836] Validating against current config rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a
I0624 09:50:53.518092 6526 daemon.go:899] Validating against current config rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a
E0624 09:50:53.518155 6526 daemon.go:1111] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
E0624 09:50:53.518171 6526 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a

[root@master03 ~]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190612.0 (2019-06-12T19:24:13Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T22:55:04Z)

From `oc logs` of the MCD for worker01:

I0624 09:51:52.961061 8229 update.go:836] controller syncing started
I0624 09:51:52.963852 8229 update.go:723] logger doesn't support --jounald, grepping the journal
I0624 09:51:53.076187 8229 update.go:836] error loading pending config open /etc/machine-config-daemon/state.json: no such file or directory
I0624 09:51:53.078969 8229 daemon.go:703] Current config: rendered-worker-481fd612b07ecc5a5e2a6fac937de410
I0624 09:51:53.078988 8229 daemon.go:704] Desired config: rendered-worker-0ab23b331c3fe4cbc5d73d251cc8e41b
I0624 09:51:53.087002 8229 update.go:836] Validating against current config rendered-worker-481fd612b07ecc5a5e2a6fac937de410
I0624 09:51:53.089997 8229 daemon.go:899] Validating against current config rendered-worker-481fd612b07ecc5a5e2a6fac937de410
E0624 09:51:53.090052 8229 daemon.go:1111] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
E0624 09:51:53.090083 8229 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-481fd612b07ecc5a5e2a6fac937de410

[root@worker01 ~]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190612.0 (2019-06-12T19:24:13Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T22:55:04Z)

Steps to Reproduce:
1. Refer above.
2.
3.

Actual results:
The machine-config operator is degraded with the above error. In order to get things moving, I had to ssh to each impacted node and execute `rpm-ostree rollback --reboot`; after that the update went through smoothly.

Expected results:
The update should go through smoothly without any manual intervention.

Additional info:
Had looked at this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1706606, not sure whether that change shipped as part of 4.1.0 and is related to the error seen.
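For reference, a minimal sketch of the manual workaround described under "Actual results" above (node names are the ones from this report; it assumes SSH access as the standard RHCOS `core` user, which is environment-specific):

```
# On each node the machine-config pool reports as degraded,
# roll back to the previous rpm-ostree deployment and reboot.
ssh core@master03 sudo rpm-ostree rollback --reboot
ssh core@worker01 sudo rpm-ostree rollback --reboot

# Then watch the pools recover.
oc get machineconfigpools
```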
Logs for pivot on worker01:

-- Reboot --
Jun 24 09:43:36 localhost.localdomain systemd[1]: Starting Pivot Tool...
Jun 24 09:43:36 localhost.localdomain pivot[841]: pivot version 0.0.5
Jun 24 09:43:36 localhost.localdomain pivot[841]: I0624 09:43:36.804320 841 root.go:369] Using image pullspec from /etc/pivot/image-pullspec
Jun 24 09:43:36 localhost.localdomain pivot[841]: I0624 09:43:36.804610 841 run.go:16] Running: rpm-ostree status --json
Jun 24 09:43:36 localhost.localdomain pivot[841]: I0624 09:43:36.956285 841 root.go:263] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe
Jun 24 09:43:36 localhost.localdomain pivot[841]: I0624 09:43:36.957156 841 run.go:16] Running: podman rmi
Jun 24 09:43:37 worker01 pivot[841]: input name is blank
Jun 24 09:43:37 worker01 pivot[841]: no valid images to delete
Jun 24 09:43:37 worker01 pivot[841]: W0624 09:43:37.327483 841 run.go:75] (ignored) podman: exit status 125
Jun 24 09:43:37 worker01 pivot[841]: I0624 09:43:37.327730 841 root.go:398] unable to parse tuning file /etc/pivot/kernel-args: stat /etc/pivot/kernel-args: no such file or directory
Jun 24 09:43:37 worker01 pivot[841]: I0624 09:43:37.327747 841 root.go:411] Already at target pivot; exiting...
Jun 24 09:43:37 worker01 systemd[1]: Started Pivot Tool.
Jun 24 09:43:37 worker01 systemd[1]: pivot.service: Consumed 174ms CPU time
Side note: there is a typo in the flag name logged here ("--jounald" should be "--journald"):

I0624 09:51:52.963852 8229 update.go:723] logger doesn't support --jounald, grepping the journal
MCD logs (master03) after `rpm-ostree rollback --reboot` was executed and the node was rebooted twice by the MCO.

[root@ocp4bastion ~]# oc logs machine-config-daemon-cwjsw
I0624 10:09:16.778381 3944 start.go:67] Version: 4.1.1-201906061134-dirty (5b6fb8349a29735e48446d435962dec4547d3090)
I0624 10:09:16.779033 3944 start.go:100] Starting node writer
I0624 10:09:16.782284 3944 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0624 10:09:16.875186 3944 daemon.go:200] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe (410.8.20190612.0)
I0624 10:09:16.909890 3944 start.go:196] Calling chroot("/rootfs")
I0624 10:09:16.910629 3944 start.go:206] Starting MachineConfigDaemon
I0624 10:09:16.911314 3944 update.go:836] Starting to manage node: master03
I0624 10:09:16.936737 3944 run.go:22] Running captured: rpm-ostree status
I0624 10:09:17.122838 3944 daemon.go:741] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190612.0 (2019-06-12T19:24:13Z)

  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T22:55:04Z)
I0624 10:09:17.122858 3944 run.go:22] Running captured: journalctl --list-boots
I0624 10:09:17.297494 3944 daemon.go:748] journalctl --list-boots:
-3 9cb74a88ce9642e6a3cb8cf6b9992948 Mon 2019-06-24 08:50:59 UTC—Mon 2019-06-24 09:43:03 UTC
-2 e34d33f15404463d952b9c4e6b3d759b Mon 2019-06-24 09:43:11 UTC—Mon 2019-06-24 10:03:53 UTC
-1 4f5b348bdc0d47a794a2af04fa7096c2 Mon 2019-06-24 10:04:02 UTC—Mon 2019-06-24 10:08:04 UTC
 0 fb9320a76d984388a74114ed891725a4 Mon 2019-06-24 10:08:13 UTC—Mon 2019-06-24 10:09:17 UTC
I0624 10:09:17.297511 3944 daemon.go:495] Enabling Kubelet Healthz Monitor
E0624 10:09:21.002487 3944 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0624 10:09:21.003207 3944 reflector.go:134] k8s.io/client-go/informers/factory.go:132: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I0624 10:09:24.306599 3944 update.go:836] controller syncing started
I0624 10:09:24.311681 3944 update.go:723] logger doesn't support --jounald, grepping the journal
I0624 10:09:24.770597 3944 daemon.go:703] Current config: rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a
I0624 10:09:24.770615 3944 daemon.go:704] Desired config: rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.770623 3944 update.go:836] using pending config same as desired
I0624 10:09:24.789056 3944 daemon.go:878] Using current config on disk rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.789203 3944 update.go:836] Using current config on disk rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.792975 3944 update.go:836] Validating against pending config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.796028 3944 daemon.go:895] Validating against pending config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.800180 3944 daemon.go:905] Validated on-disk state
I0624 10:09:24.867283 3944 update.go:801] logger doesn't support --jounald, logging json directly
I0624 10:09:24.872012 3944 daemon.go:939] Completing pending config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.911506 3944 update.go:836] completed update for config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.916541 3944 daemon.go:945] In desired config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:25.867925 3944 update.go:836] controller syncing started
I0624 10:09:27.926859 3944 update.go:836] controller syncing started
Can you provide full `oc adm must-gather` logs? This has already been reported on Google Chat and it looked like pivot wrongly rebooted the system. It reconciled itself, though (still investigating).
(In reply to Antonio Murdaca from comment #4)
> Can you provide full `oc adm must-gather` logs? This has already been
> reported on Google Chat and it looked like pivot wrongly rebooted the
> system. It reconciled itself, though (still investigating)

Indeed, I was in the chat where Jeff asked for a BZ as well. I'm running `oc adm must-gather` at the moment. Will update soon.
(In reply to Muhammad Aizuddin Zali from comment #5)
> (In reply to Antonio Murdaca from comment #4)
> > Can you provide full `oc adm must-gather` logs? This has already been
> > reported on Google Chat and it looked like pivot wrongly rebooted the
> > system. It reconciled itself, though (still investigating)
>
> Indeed, I was in the chat where Jeff asked for a BZ as well. I'm running
> `oc adm must-gather` at the moment. Will update soon.

The `oc adm must-gather` output has been uploaded to this BZ as a private attachment.
We know what's happening. Discussing a fix.
Can you provide the output of:
- oc get machineconfigpools
- oc get machineconfigs
- oc logs {MCD instances}
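A sketch for collecting those, using the same label selector that appears in later comments (pod names vary per cluster):

```
oc get machineconfigpools
oc get machineconfigs

# Dump the logs of every machine-config-daemon pod.
for pod in $(oc get pods -n openshift-machine-config-operator \
    -l k8s-app=machine-config-daemon -o name); do
  echo "===== $pod ====="
  oc logs -n openshift-machine-config-operator "$pod"
done
```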
Daniel, did you "rpm-ostree rollback --reboot" on all workers as well?
Created attachment 1584285 [details]
Output as requested in comment #10
worker   rendered-worker-e276ebdd5b52ee242e69568a885dec85   False   True   True

The worker pool is Updating, so it's expected that the upgrade goes through (all masters are already upgraded with the new ostree as well). The upgrade is still rolling out to the workers.
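For readers following along, the three boolean columns in that `oc get machineconfigpools` row should correspond to UPDATED, UPDATING and DEGRADED (as printed by the command), so the rollout can be watched with something like:

```
# Watch until the worker pool reports UPDATED=True and DEGRADED=False.
oc get machineconfigpools -w

# Or inspect the pool's conditions in detail.
oc get machineconfigpool worker -o yaml
```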
https://github.com/openshift/machine-config-operator/pull/892
> - Q: is this expected? And do we have somewhere a detailed description of how the upgrade is supposed to work?

Pivot failing there is expected; it basically means there is no OS update available (the image-pullspec file is only written when there is an OS update that pivot needs to apply). The bug and fix behind all of this are linked in https://bugzilla.redhat.com/show_bug.cgi?id=1723327#c17. The normal upgrade flow is transparent to all of this, of course.
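For anyone debugging a node in this state, a minimal sketch of checking whether pivot actually has pending work, using the path that appears in the pivot log above (assumes node access via `oc debug node/<name>`):

```
# From a debug shell on the node (e.g. `oc debug node/worker01`, then `chroot /host`):

# pivot only has work to do when an image pullspec has been written for it.
if [ -f /etc/pivot/image-pullspec ]; then
  echo "pending OS update:"
  cat /etc/pivot/image-pullspec
else
  echo "no pending OS update for pivot"
fi

# Compare with the deployments that are actually on disk/booted.
rpm-ostree status
```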
Reconciling a failing upgrade in 4.1.x requires the following:

for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do oc rsh -n openshift-machine-config-operator $i chroot /rootfs rm -rf /run/pivot/reboot-needed; done

Then jump on the nodes that are currently Degraded and run:

rpm-ostree rollback --reboot

The nodes will then reconcile themselves and the machine-config operator will continue rolling the upgrade to 4.1.x+1.
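For readability, the same loop broken out over multiple lines, plus a check that the sentinel file is actually gone (the `ls /run/pivot` check mirrors the one used in the verification steps below):

```
for i in $(oc get pods -n openshift-machine-config-operator \
      -l k8s-app=machine-config-daemon \
      -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'); do
  # Remove the stale reboot-needed sentinel through the node's MCD pod.
  oc rsh -n openshift-machine-config-operator "$i" \
      chroot /rootfs rm -rf /run/pivot/reboot-needed
  # Should print nothing (or fail if /run/pivot itself is absent).
  oc rsh -n openshift-machine-config-operator "$i" \
      chroot /rootfs ls /run/pivot || true
done
```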
To verify this BZ on AWS:

- bring up a 4.1.0 or 4.1.2 cluster on AWS
- run the following:

  for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do oc rsh -n openshift-machine-config-operator $i chroot /rootfs mkdir -p /run/pivot; oc rsh -n openshift-machine-config-operator $i chroot /rootfs touch /run/pivot/reboot-needed; done

- run an upgrade to any payload that contains https://github.com/openshift/machine-config-operator/pull/892 (a sketch of kicking off and watching the rollout follows below)
- the upgrade goes through wrt the machine-config operator
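A minimal sketch of the upgrade step and of watching the rollout; the payload pullspec below is a hypothetical placeholder, and `--force` is only needed for unsigned/CI payloads:

```
# Hypothetical payload pullspec; substitute one that contains PR #892.
PAYLOAD=registry.svc.ci.openshift.org/ocp/release:<nightly-with-pr-892>

oc adm upgrade --force --to-image="$PAYLOAD"

# Watch the ClusterVersion, the pools and the machine-config ClusterOperator
# until everything reports the new version and nothing is Degraded.
watch -n 30 'oc get clusterversion; echo; oc get machineconfigpools; echo; oc get co machine-config'
```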
*** Bug 1722849 has been marked as a duplicate of this bug. ***
*** Bug 1724858 has been marked as a duplicate of this bug. ***
PR merged, steps to verify here https://bugzilla.redhat.com/show_bug.cgi?id=1723327#c21
Added `OSE41z_next` keyword as we want to ship this in 4.1.4
Looks verified with 4.1.0-0.nightly-2019-07-02-101812 - thanks Antonio for the instructions!

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE     STATUS
version   4.1.2     True        False         7m19s     Cluster version is 4.1.2

$ for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do oc rsh -n openshift-machine-config-operator $i chroot /rootfs mkdir -p /run/pivot; oc rsh -n openshift-machine-config-operator $i chroot /rootfs touch /run/pivot/reboot-needed; done
$
$ for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do oc rsh -n openshift-machine-config-operator $i chroot /rootfs ls /run/pivot; done
reboot-needed
reboot-needed
reboot-needed
reboot-needed
reboot-needed

$ oc image info -a ../all-the-pull-secrets.json $(oc adm release info --image-for=machine-config-operator -a ../all-the-pull-secrets.json registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812) | grep commit
io.openshift.build.commit.id=84b28f4868a73ff54c663a4972b78805c9befa6d
io.openshift.build.commit.url=https://github.com/openshift/machine-config-operator/commit/84b28f4868a73ff54c663a4972b78805c9befa6d

$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched

$ oc adm upgrade --force --to-image=registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             True        True          1s      Working towards registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812: downloading update

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-07-02-101812   True        False         22m     Cluster version is 4.1.0-0.nightly-2019-07-02-101812

$ oc get co
NAME                                 VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                       4.1.0-0.nightly-2019-07-02-101812   True        False         False      79m
cloud-credential                     4.1.0-0.nightly-2019-07-02-101812   True        False         False      88m
cluster-autoscaler                   4.1.0-0.nightly-2019-07-02-101812   True        False         False      88m
console                              4.1.0-0.nightly-2019-07-02-101812   True        False         False      80m
dns                                  4.1.0-0.nightly-2019-07-02-101812   True        False         False      88m
image-registry                       4.1.0-0.nightly-2019-07-02-101812   True        False         False      82m
ingress                              4.1.0-0.nightly-2019-07-02-101812   True        False         False      82m
kube-apiserver                       4.1.0-0.nightly-2019-07-02-101812   True        False         False      85m
kube-controller-manager              4.1.0-0.nightly-2019-07-02-101812   True        False         False      86m
kube-scheduler                       4.1.0-0.nightly-2019-07-02-101812   True        False         False      86m
machine-api                          4.1.0-0.nightly-2019-07-02-101812   True        False         False      88m
machine-config                       4.1.0-0.nightly-2019-07-02-101812   True        False         False      87m
marketplace                          4.1.0-0.nightly-2019-07-02-101812   True        False         False      29m
monitoring                           4.1.0-0.nightly-2019-07-02-101812   True        False         False      28m
network                              4.1.0-0.nightly-2019-07-02-101812   True        False         False      89m
node-tuning                          4.1.0-0.nightly-2019-07-02-101812   True        False         False      54m
openshift-apiserver                  4.1.0-0.nightly-2019-07-02-101812   True        False         False      30m
openshift-controller-manager         4.1.0-0.nightly-2019-07-02-101812   True        False         False      87m
openshift-samples                    4.1.0-0.nightly-2019-07-02-101812   True        False         False      52m
operator-lifecycle-manager           4.1.0-0.nightly-2019-07-02-101812   True        False         False      86m
operator-lifecycle-manager-catalog   4.1.0-0.nightly-2019-07-02-101812   True        False         False      86m
service-ca                           4.1.0-0.nightly-2019-07-02-101812   True        False         False      88m
service-catalog-apiserver            4.1.0-0.nightly-2019-07-02-101812   True        False         False      84m
service-catalog-controller-manager   4.1.0-0.nightly-2019-07-02-101812   True        False         False      85m
storage                              4.1.0-0.nightly-2019-07-02-101812   True        False         False      53m
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1635