Bug 1723327
| Summary: | Upgrade failure with "Marking Degraded due to: during bootstrap: unexpected on-disk state validating" error. | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Muhammad Aizuddin Zali <mzali> |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> |
| Severity: | high | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.1.0 | CC: | amurdaca, asonmez, bmilne, cmarches, dmoessne, erjones, jmathian, mzali, sponnaga |
| Target Milestone: | --- | Keywords: | OSE41z_next |
| Target Release: | 4.1.z | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | 4.1.6 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| : | 1724155, 1729345 (view as bug list) | | |
| Last Closed: | 2019-07-04 09:01:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1724155, 1729345 | | |
| Attachments: | | | |
Pivot logs from worker01:

```
-- Reboot --
Jun 24 09:43:36 localhost.localdomain systemd[1]: Starting Pivot Tool...
Jun 24 09:43:36 localhost.localdomain pivot[841]: pivot version 0.0.5
Jun 24 09:43:36 localhost.localdomain pivot[841]: I0624 09:43:36.804320 841 root.go:369] Using image pullspec from /etc/pivot/image-pullspec
Jun 24 09:43:36 localhost.localdomain pivot[841]: I0624 09:43:36.804610 841 run.go:16] Running: rpm-ostree status --json
Jun 24 09:43:36 localhost.localdomain pivot[841]: I0624 09:43:36.956285 841 root.go:263] Previous pivot: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe
Jun 24 09:43:36 localhost.localdomain pivot[841]: I0624 09:43:36.957156 841 run.go:16] Running: podman rmi
Jun 24 09:43:37 worker01 pivot[841]: input name is blank
Jun 24 09:43:37 worker01 pivot[841]: no valid images to delete
Jun 24 09:43:37 worker01 pivot[841]: W0624 09:43:37.327483 841 run.go:75] (ignored) podman: exit status 125
Jun 24 09:43:37 worker01 pivot[841]: I0624 09:43:37.327730 841 root.go:398] unable to parse tuning file /etc/pivot/kernel-args: stat /etc/pivot/kernel-args: no such file or directory
Jun 24 09:43:37 worker01 pivot[841]: I0624 09:43:37.327747 841 root.go:411] Already at target pivot; exiting...
Jun 24 09:43:37 worker01 systemd[1]: Started Pivot Tool.
Jun 24 09:43:37 worker01 systemd[1]: pivot.service: Consumed 174ms CPU time
```

Note the typo in the argument name here ("--jounald"):

```
I0624 09:51:52.963852 8229 update.go:723] logger doesn't support --jounald, grepping the journal
```

MCD logs after `rpm-ostree rollback --reboot` was executed and the node was rebooted twice by the MCO:

```
[root@ocp4bastion ~]# oc logs machine-config-daemon-cwjsw
I0624 10:09:16.778381 3944 start.go:67] Version: 4.1.1-201906061134-dirty (5b6fb8349a29735e48446d435962dec4547d3090)
I0624 10:09:16.779033 3944 start.go:100] Starting node writer
I0624 10:09:16.782284 3944 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0624 10:09:16.875186 3944 daemon.go:200] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe (410.8.20190612.0)
I0624 10:09:16.909890 3944 start.go:196] Calling chroot("/rootfs")
I0624 10:09:16.910629 3944 start.go:206] Starting MachineConfigDaemon
I0624 10:09:16.911314 3944 update.go:836] Starting to manage node: master03
I0624 10:09:16.936737 3944 run.go:22] Running captured: rpm-ostree status
I0624 10:09:17.122838 3944 daemon.go:741] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe
CustomOrigin: Managed by pivot tool
Version: 410.8.20190612.0 (2019-06-12T19:24:13Z)
pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
CustomOrigin: Provisioned from oscontainer
Version: 410.8.20190520.0 (2019-05-20T22:55:04Z)
I0624 10:09:17.122858 3944 run.go:22] Running captured: journalctl --list-boots
I0624 10:09:17.297494 3944 daemon.go:748] journalctl --list-boots:
-3 9cb74a88ce9642e6a3cb8cf6b9992948 Mon 2019-06-24 08:50:59 UTC—Mon 2019-06-24 09:43:03 UTC
-2 e34d33f15404463d952b9c4e6b3d759b Mon 2019-06-24 09:43:11 UTC—Mon 2019-06-24 10:03:53 UTC
-1 4f5b348bdc0d47a794a2af04fa7096c2 Mon 2019-06-24 10:04:02 UTC—Mon 2019-06-24 10:08:04 UTC
0 fb9320a76d984388a74114ed891725a4 Mon 2019-06-24 10:08:13 UTC—Mon 2019-06-24 10:09:17 UTC
I0624 10:09:17.297511 3944 daemon.go:495] Enabling Kubelet Healthz Monitor
E0624 10:09:21.002487 3944 reflector.go:134] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to list *v1.MachineConfig: Get https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
E0624 10:09:21.003207 3944 reflector.go:134] k8s.io/client-go/informers/factory.go:132: Failed to list *v1.Node: Get https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0: dial tcp 172.30.0.1:443: connect: no route to host
I0624 10:09:24.306599 3944 update.go:836] controller syncing started
I0624 10:09:24.311681 3944 update.go:723] logger doesn't support --jounald, grepping the journal
I0624 10:09:24.770597 3944 daemon.go:703] Current config: rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a
I0624 10:09:24.770615 3944 daemon.go:704] Desired config: rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.770623 3944 update.go:836] using pending config same as desired
I0624 10:09:24.789056 3944 daemon.go:878] Using current config on disk rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.789203 3944 update.go:836] Using current config on disk rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.792975 3944 update.go:836] Validating against pending config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.796028 3944 daemon.go:895] Validating against pending config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.800180 3944 daemon.go:905] Validated on-disk state
I0624 10:09:24.867283 3944 update.go:801] logger doesn't support --jounald, logging json directly
I0624 10:09:24.872012 3944 daemon.go:939] Completing pending config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.911506 3944 update.go:836] completed update for config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:24.916541 3944 daemon.go:945] In desired config rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 10:09:25.867925 3944 update.go:836] controller syncing started
I0624 10:09:27.926859 3944 update.go:836] controller syncing started
```
Can you provide full `oc adm must-gather` logs? This has already been reported on Google Chat, and it looked like pivot wrongly rebooted the system. It reconciled itself, though (still investigating).

(In reply to Antonio Murdaca from comment #4)
> Can you provide full `oc adm must-gather` logs? This has already been
> reported on Google Chat, and it looked like pivot wrongly rebooted the
> system. It reconciled itself, though (still investigating)

Indeed, I was in the chat where Jeff asked for a BZ as well. I'm running `oc adm must-gather` at the moment. Will update soon.

(In reply to Muhammad Aizuddin Zali from comment #5)
> Indeed, I was in the chat where Jeff asked for a BZ as well. I'm running
> `oc adm must-gather` at the moment. Will update soon.

`oc adm must-gather` output uploaded to this BZ as a private attachment.

We know what's happening. Discussing a fix.

Can you provide the output of:
- oc get machineconfigpools
- oc get machineconfigs
- oc logs {MCD instances} (a one-pass collection sketch follows this list)
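For convenience, a minimal sketch that collects all three in one pass; it reuses the `k8s-app=machine-config-daemon` label selector that appears in the commands further down this BZ (adjust the namespace if yours differs):

```
oc get machineconfigpools
oc get machineconfigs
# Dump logs from every machine-config-daemon pod:
for pod in $(oc get pods -n openshift-machine-config-operator \
    -l k8s-app=machine-config-daemon -o name); do
  echo "=== $pod ==="
  oc logs -n openshift-machine-config-operator "$pod"
done
```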
Daniel, did you run `rpm-ostree rollback --reboot` on all workers as well?

Created attachment 1584285 [details]
Output as requested in comment #10.
```
worker   rendered-worker-e276ebdd5b52ee242e69568a885dec85   False   True   True
```

The worker pool is Updating, so it's expected that the upgrade goes through (all masters are upgraded with ostree as well). The upgrade is still rolling out to the workers.

> Q: is this expected? And do we have somewhere a detailed description of how the upgrade is supposed to work?

Pivot failing there is expected; it basically means there's no OS update available (image-pullspec is written when there's an OS update that pivot needs to apply). The bug and fix behind all of this are linked at https://bugzilla.redhat.com/show_bug.cgi?id=1723327#c17 - the normal upgrade flow is transparent to all of this, of course.

Reconciling a failing upgrade in 4.1.x requires the following:
```
for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon \
    -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do
  oc rsh -n openshift-machine-config-operator $i chroot /rootfs rm -rf /run/pivot/reboot-needed
done
```
Then jump on the nodes that are currently Degraded and run:
```
rpm-ostree rollback --reboot
```
The nodes will then reconcile themselves and the machine-config operator will continue rolling the upgrade to 4.1.x+1
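As context for the workaround: pivot only considers an OS update pending when `/etc/pivot/image-pullspec` exists (the pivot log above shows it reading that path). A minimal sketch for checking this on a node, assuming SSH access:

```
# Run on the node. The pullspec file is only written when pivot
# has an OS update to apply; absence means nothing is pending.
if [ -f /etc/pivot/image-pullspec ]; then
  echo "pending OS update:"
  cat /etc/pivot/image-pullspec
else
  echo "no pending OS update for pivot"
fi
```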
To verify this BZ on AWS:
- bring up a 4.1.0 or 4.1.2 cluster on AWS
- run the following:

  ```
  for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon \
      -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do
    oc rsh -n openshift-machine-config-operator $i chroot /rootfs mkdir -p /run/pivot
    oc rsh -n openshift-machine-config-operator $i chroot /rootfs touch /run/pivot/reboot-needed
  done
  ```
- run an upgrade to any payload that contains https://github.com/openshift/machine-config-operator/pull/892
- the upgrade goes through wrt the machine-config operator (a spot-check sketch follows this list)
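One way to spot-check that the machine-config side converged, assuming a working `oc` session against the cluster (a sketch; exact column output varies by version):

```
# All pools should report UPDATED=True, UPDATING=False, DEGRADED=False.
oc get machineconfigpools

# The machine-config cluster operator should be Available and not Degraded.
oc get co machine-config
```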
*** Bug 1722849 has been marked as a duplicate of this bug. ***

*** Bug 1724858 has been marked as a duplicate of this bug. ***

PR merged, steps to verify here: https://bugzilla.redhat.com/show_bug.cgi?id=1723327#c21

Added `OSE41z_next` keyword as we want to ship this in 4.1.4.

Looks verified with 4.1.0-0.nightly-2019-07-02-101812 - thanks Antonio for instructions!
```
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.2 True False 7m19s Cluster version is 4.1.2
$ for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do oc rsh -n openshift-machine-config-operator $i chroot /rootfs mkdir -p /run/pivot; oc rsh -n openshift-machine-config-operator $i chroot /rootfs touch /run/pivot/reboot-needed; done
$
$ for i in $(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' | xargs); do oc rsh -n openshift-machine-config-operator $i chroot /rootfs ls /run/pivot; done
reboot-needed
reboot-needed
reboot-needed
reboot-needed
reboot-needed
$ oc image info -a ../all-the-pull-secrets.json $(oc adm release info --image-for=machine-config-operator -a ../all-the-pull-secrets.json registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812) | grep commit
io.openshift.build.commit.id=84b28f4868a73ff54c663a4972b78805c9befa6d
io.openshift.build.commit.url=https://github.com/openshift/machine-config-operator/commit/84b28f4868a73ff54c663a4972b78805c9befa6d
$ oc patch clusterversion/version --patch '{"spec":{"upstream":"https://openshift-release.svc.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched
$ oc adm upgrade --force --to-image=registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812
Updating to release image registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version True True 1s Working towards registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-07-02-101812: downloading update
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-07-02-101812 True False 22m Cluster version is 4.1.0-0.nightly-2019-07-02-101812
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
authentication 4.1.0-0.nightly-2019-07-02-101812 True False False 79m
cloud-credential 4.1.0-0.nightly-2019-07-02-101812 True False False 88m
cluster-autoscaler 4.1.0-0.nightly-2019-07-02-101812 True False False 88m
console 4.1.0-0.nightly-2019-07-02-101812 True False False 80m
dns 4.1.0-0.nightly-2019-07-02-101812 True False False 88m
image-registry 4.1.0-0.nightly-2019-07-02-101812 True False False 82m
ingress 4.1.0-0.nightly-2019-07-02-101812 True False False 82m
kube-apiserver 4.1.0-0.nightly-2019-07-02-101812 True False False 85m
kube-controller-manager 4.1.0-0.nightly-2019-07-02-101812 True False False 86m
kube-scheduler 4.1.0-0.nightly-2019-07-02-101812 True False False 86m
machine-api 4.1.0-0.nightly-2019-07-02-101812 True False False 88m
machine-config 4.1.0-0.nightly-2019-07-02-101812 True False False 87m
marketplace 4.1.0-0.nightly-2019-07-02-101812 True False False 29m
monitoring 4.1.0-0.nightly-2019-07-02-101812 True False False 28m
network 4.1.0-0.nightly-2019-07-02-101812 True False False 89m
node-tuning 4.1.0-0.nightly-2019-07-02-101812 True False False 54m
openshift-apiserver 4.1.0-0.nightly-2019-07-02-101812 True False False 30m
openshift-controller-manager 4.1.0-0.nightly-2019-07-02-101812 True False False 87m
openshift-samples 4.1.0-0.nightly-2019-07-02-101812 True False False 52m
operator-lifecycle-manager 4.1.0-0.nightly-2019-07-02-101812 True False False 86m
operator-lifecycle-manager-catalog 4.1.0-0.nightly-2019-07-02-101812 True False False 86m
service-ca 4.1.0-0.nightly-2019-07-02-101812 True False False 88m
service-catalog-apiserver 4.1.0-0.nightly-2019-07-02-101812 True False False 84m
service-catalog-controller-manager 4.1.0-0.nightly-2019-07-02-101812 True False False 85m
storage 4.1.0-0.nightly-2019-07-02-101812 True False False 53m
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1635
Description of problem:

Upgrade from a fresh 4.1.0 installation to the latest 4.1.2 stable release failed with the below error (master and worker nodes impacted):

```
E0624 09:51:53.090052 8229 daemon.go:1111] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
E0624 09:51:53.090083 8229 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-481fd612b07ecc5a5e2a6fac937de410
```

Version-Release number of selected component (if applicable):

4.1.0 fresh install to 4.1.2

How reproducible:

Install a fresh 4.1.0 cluster on bare-metal VMs running on top of libvirt. Once the 4.1.0 state shows 'completed' in the cluster settings overview and no more updates are running on the cluster operators page, start a new update by selecting "Update Now" to 4.1.2. Eventually the machine-config YAML will show this:

```
master: >-
  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node master03 is reporting: \"failed to run pivot: failed to start pivot.service: exit status 4\""
worker: >-
  pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node worker01 is reporting: \"failed to run pivot: failed to start pivot.service: exit status 4\""
```

From `oc logs` of the MCD for master03:

```
I0624 09:50:53.382524 6526 update.go:836] controller syncing started
I0624 09:50:53.385133 6526 update.go:723] logger doesn't support --jounald, grepping the journal
I0624 09:50:53.503796 6526 update.go:836] error loading pending config open /etc/machine-config-daemon/state.json: no such file or directory
I0624 09:50:53.506415 6526 daemon.go:703] Current config: rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a
I0624 09:50:53.506450 6526 daemon.go:704] Desired config: rendered-master-e1fca7e66d845fc65f048a4d8ef3380c
I0624 09:50:53.514790 6526 update.go:836] Validating against current config rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a
I0624 09:50:53.518092 6526 daemon.go:899] Validating against current config rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a
E0624 09:50:53.518155 6526 daemon.go:1111] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
E0624 09:50:53.518171 6526 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-master-66f7b5a80c61a0ddd408c4cc06e1728a
```

```
[root@master03 ~]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190612.0 (2019-06-12T19:24:13Z)
  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T22:55:04Z)
```

From `oc logs` of the MCD for worker01:

```
I0624 09:51:52.961061 8229 update.go:836] controller syncing started
I0624 09:51:52.963852 8229 update.go:723] logger doesn't support --jounald, grepping the journal
I0624 09:51:53.076187 8229 update.go:836] error loading pending config open /etc/machine-config-daemon/state.json: no such file or directory
I0624 09:51:53.078969 8229 daemon.go:703] Current config: rendered-worker-481fd612b07ecc5a5e2a6fac937de410
I0624 09:51:53.078988 8229 daemon.go:704] Desired config: rendered-worker-0ab23b331c3fe4cbc5d73d251cc8e41b
I0624 09:51:53.087002 8229 update.go:836] Validating against current config rendered-worker-481fd612b07ecc5a5e2a6fac937de410
I0624 09:51:53.089997 8229 daemon.go:899] Validating against current config rendered-worker-481fd612b07ecc5a5e2a6fac937de410
E0624 09:51:53.090052 8229 daemon.go:1111] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
E0624 09:51:53.090083 8229 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-481fd612b07ecc5a5e2a6fac937de410
```

```
[root@worker01 ~]# rpm-ostree status
State: idle
AutomaticUpdates: disabled
Deployments:
● pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d5f430248df68b2ec1cd2816eebe990924b4cc638891b2a529b7dc183e5b66fe
              CustomOrigin: Managed by pivot tool
                   Version: 410.8.20190612.0 (2019-06-12T19:24:13Z)
  pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:53389c9b4a00d7afebb98f7bd9d20348deb1d77ca4baf194f0ae1b582b7e965b
              CustomOrigin: Provisioned from oscontainer
                   Version: 410.8.20190520.0 (2019-05-20T22:55:04Z)
```

Steps to Reproduce:
1. Refer above.

Actual results:

The machine-config update is degraded with the above error. In order to get things moving, SSH to each impacted node and execute `rpm-ostree rollback --reboot`; after that it went through smoothly.

Expected results:

The update should go through smoothly without any intervention.

Additional info:

Had looked at this BZ: https://bugzilla.redhat.com/show_bug.cgi?id=1706606; not sure if that fix was deployed as part of the 4.1.0 with the error seen.
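For reproducing from the CLI rather than the console's "Update Now" button, a sketch of the equivalent trigger, assuming 4.1.2 is visible as an update target in the cluster's channel:

```
# List the updates the cluster currently sees, then request 4.1.2.
oc adm upgrade
oc adm upgrade --to 4.1.2
```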