Bug 1752508
| Summary: | upgrade hanging with "during bootstrap: unexpected on-disk state validating against rendered .." | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | daniel <dmoessne> |
| Component: | Machine Config Operator | Assignee: | Sinny Kumari <skumari> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | low | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.1.z | CC: | amurdaca, brad.williams, kgarriso, rdiazgav, skumari, smilner, wkulhane |
| Target Milestone: | --- | | |
| Target Release: | 4.4.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-03-02 17:16:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
[root@bastion bz]# oc adm upgrade
info: An upgrade is in progress. Working towards 4.1.15: 89% complete
No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.
[root@bastion bz]#
The upgrade hangs at this point indefinitely.
[root@bastion bz]# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.15-201909041605+63712ea-dirty", GitCommit:"63712ea", GitTreeState:"dirty", BuildDate:"2019-09-04T23:58:30Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+5909ca9", GitCommit:"5909ca9", GitTreeState:"clean", BuildDate:"2019-09-05T00:06:19Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
[root@bastion bz]#
[root@bastion bz]# oc get nodes
NAME STATUS ROLES AGE VERSION
master0.coreos.local Ready master 101d v1.13.4+205da2b4a
master1.coreos.local Ready master 101d v1.13.4+205da2b4a
master2.coreos.local Ready master 101d v1.13.4+205da2b4a
worker0.coreos.local Ready worker 101d v1.13.4+d81afa6ba
worker1.coreos.local Ready worker 101d v1.13.4+d81afa6ba
worker2.coreos.local Ready worker 101d v1.13.4+d81afa6ba
worker3.coreos.local Ready worker 101d v1.13.4+d81afa6ba
worker4.coreos.local Ready worker 101d v1.13.4+d81afa6ba
[root@bastion bz]#
[root@bastion bz]# oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master0.coreos.local Ready master 101d v1.13.4+205da2b4a 192.168.5.20 <none> Red Hat Enterprise Linux CoreOS 410.8.20190718.1 (Ootpa) 4.18.0-80.4.2.el8_0.x86_64 cri-o://1.13.9-1.rhaos4.1.gitd70609a.el8
master1.coreos.local Ready master 101d v1.13.4+205da2b4a 192.168.5.21 <none> Red Hat Enterprise Linux CoreOS 410.8.20190718.1 (Ootpa) 4.18.0-80.4.2.el8_0.x86_64 cri-o://1.13.9-1.rhaos4.1.gitd70609a.el8
master2.coreos.local Ready master 101d v1.13.4+205da2b4a 192.168.5.22 <none> Red Hat Enterprise Linux CoreOS 410.8.20190718.1 (Ootpa) 4.18.0-80.4.2.el8_0.x86_64 cri-o://1.13.9-1.rhaos4.1.gitd70609a.el8
worker0.coreos.local Ready worker 101d v1.13.4+d81afa6ba 192.168.5.30 <none> Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa) 4.18.0-80.7.2.el8_0.x86_64 cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
worker1.coreos.local Ready worker 101d v1.13.4+d81afa6ba 192.168.5.31 <none> Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa) 4.18.0-80.7.2.el8_0.x86_64 cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
worker2.coreos.local Ready worker 101d v1.13.4+d81afa6ba 192.168.5.32 <none> Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa) 4.18.0-80.7.2.el8_0.x86_64 cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
worker3.coreos.local Ready worker 101d v1.13.4+d81afa6ba 192.168.5.33 <none> Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa) 4.18.0-80.7.2.el8_0.x86_64 cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
worker4.coreos.local Ready worker 101d v1.13.4+d81afa6ba 192.168.5.34 <none> Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa) 4.18.0-80.7.2.el8_0.x86_64 cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
[root@bastion bz]#
[root@bastion bz]# oc get machineconfig
NAME GENERATEDBYCONTROLLER IGNITIONVERSION CREATED
00-master 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 101d
00-worker 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 101d
01-master-container-runtime 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 101d
01-master-kubelet 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 101d
01-worker-container-runtime 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 101d
01-worker-kubelet 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 101d
99-master-23eb5e1e-8876-11e9-8939-5254002cde16-registries 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 101d
99-master-ssh 2.2.0 101d
99-worker-249fdd64-8876-11e9-8939-5254002cde16-registries 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 101d
99-worker-ssh 2.2.0 101d
rendered-master-1a9f1d91cd3c23364d257287499eb8ee e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab 2.2.0 6d3h
rendered-master-3444cb84703a72295e73662a1cd4d91f 83392b13a5c17e56656acf3f7b0031e3303fb5c0 2.2.0 28d
rendered-master-4c11795011d74ae70ca85b4b2ae96310 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 2d23h
rendered-master-92547a4268e468d5729ab88062d7c148 83392b13a5c17e56656acf3f7b0031e3303fb5c0 2.2.0 29d
rendered-master-cf14b9b48f8f4a54fddd62c16c591286 83392b13a5c17e56656acf3f7b0031e3303fb5c0 2.2.0 31d
rendered-worker-133dcde3bd8c57902329bf61ce15a1ac e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab 2.2.0 6d3h
rendered-worker-402cf68854ee4cd5fcab6cf92bdc0497 83392b13a5c17e56656acf3f7b0031e3303fb5c0 2.2.0 28d
rendered-worker-55910a8e1ba5fe41102e3ef54cd93f8d 916adbf5fdc1381714fadc327c17000a8b3707e1 2.2.0 2d23h
rendered-worker-7cfa497d4fe265dd6418422d6551212e 83392b13a5c17e56656acf3f7b0031e3303fb5c0 2.2.0 29d
rendered-worker-d2c85a97c31fb0bfb5847a7ab6ecab0b 83392b13a5c17e56656acf3f7b0031e3303fb5c0 2.2.0 31d
[root@bastion bz]#
[root@bastion bz]# oc get machineconfigpools.machineconfiguration.openshift.io
NAME CONFIG UPDATED UPDATING DEGRADED
master rendered-master-8bc942b7163ff02019d5f1c7b76103f1 False True True
worker rendered-worker-8cd661b4a862c17e332314621bce67f4 False True True
[root@bastion bz]#
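Note that the rendered configs named in the pools' CONFIG column above (`rendered-master-8bc942b7163ff02019d5f1c7b76103f1`, `rendered-worker-8cd661b4a862c17e332314621bce67f4`) do not appear in the preceding `oc get machineconfig` listing. A minimal sketch for checking this mechanically follows; the function name `missing_rendered_configs` and the two input files are hypothetical, and the `oc` invocations in the comments are assumptions about how the lists could be produced.

```shell
#!/bin/sh
# Hypothetical helper: print every name in the "wanted" list that is
# absent from the "existing" list.
#   $1: file with existing MachineConfig names, one per line
#   $2: file with rendered config names referenced by the pools, one per line
missing_rendered_configs() {
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
  grep -vxFf "$1" "$2"
}

# Assumed way to build the two lists on a live cluster:
#   oc get machineconfig -o name | sed 's|.*/||' > have.txt
#   oc get machineconfigpool \
#     -o jsonpath='{range .items[*]}{.status.configuration.name}{"\n"}{end}' > want.txt
#   missing_rendered_configs have.txt want.txt
```

Any name printed is referenced by a pool but has no matching MachineConfig object, which would be consistent with the "not found" error reported for worker4.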
[root@bastion bz]# oc describe machineconfigpools.machineconfiguration.openshift.io master
[...]
Status:
Conditions:
Last Transition Time: 2019-06-06T16:15:54Z
Message:
Reason:
Status: False
Type: RenderDegraded
Last Transition Time: 2019-08-06T08:51:22Z
Message:
Reason:
Status: False
Type: Updated
Last Transition Time: 2019-08-08T06:58:27Z
Message:
Reason:
Status: True
Type: Degraded
Last Transition Time: 2019-08-08T06:58:27Z
Message: Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
Reason: 3 nodes are reporting degraded status on sync
Status: True
Type: NodeDegraded
Last Transition Time: 2019-08-06T08:51:22Z
Message:
Reason: All nodes are updating to rendered-master-4c11795011d74ae70ca85b4b2ae96310
Status: True
Type: Updating
Configuration:
Name: rendered-master-8bc942b7163ff02019d5f1c7b76103f1
[...]
[root@bastion bz]# oc describe machineconfigpools.machineconfiguration.openshift.io worker
[...]
Status:
Conditions:
Last Transition Time: 2019-06-06T16:15:54Z
Message:
Reason:
Status: False
Type: RenderDegraded
Last Transition Time: 2019-08-16T07:19:54Z
Message:
Reason:
Status: False
Type: Updated
Last Transition Time: 2019-08-16T07:44:03Z
Message:
Reason:
Status: True
Type: Degraded
Last Transition Time: 2019-08-16T07:44:03Z
Message: Node worker0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker3.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker4.coreos.local is reporting: "during bootstrap: machineconfig.machineconfiguration.openshift.io \"rendered-worker-8cd661b4a862c17e332314621bce67f4\" not found"
Reason: 5 nodes are reporting degraded status on sync
Status: True
Type: NodeDegraded
Last Transition Time: 2019-08-16T07:19:54Z
Message:
Reason: All nodes are updating to rendered-worker-55910a8e1ba5fe41102e3ef54cd93f8d
Status: True
Type: Updating
[...]
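The NodeDegraded messages above pack every node's error into one long line. A small sketch for splitting such a message into one line per node is shown here; the function name `summarize_degraded` is hypothetical, and the jsonpath query in the comment is an assumption about how the message could be fetched.

```shell
#!/bin/sh
# Hypothetical helper: read a NodeDegraded message on stdin and print
# "<node>: <reported error>" for each node mentioned in it.
summarize_degraded() {
  # Match each 'Node X is reporting: "..."' chunk, allowing \" inside the quotes.
  grep -oE 'Node [^ ]+ is reporting: "([^"\\]|\\.)*"' |
    sed -E 's/^Node ([^ ]+) is reporting: "(.*)"$/\1: \2/'
}

# Assumed usage against a live cluster:
#   oc get machineconfigpool worker \
#     -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}' \
#     | summarize_degraded
```

With the worker message above this would yield five lines, making it easier to see that worker4 reports a different error ("not found") than the other four nodes.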
Can you confirm that manually rebooting the nodes doesn't solve the issue? (as a workaround, we're still trying to understand what's going on) - do you have a cluster kubeconfig to share also?

(In reply to Antonio Murdaca from comment #3)
> Can you confirm that manually rebooting the nodes doesn't solve the issue?

I have rebooted all nodes of the cluster multiple times in the past as well as before opening this bz, I have rebooted all the nodes (masters and workers) but this changed nothing, i.e. the problem persists

> (as a workaround, we're still trying to understand what's going on) - do you
> have a cluster kubeconfig to share also?

Not sure if I am getting what you need exactly, really the kubeconfig, or do you want access to the cluster or can you kindly let me know what exactly you'd like to have ?

thx,
daniel

(In reply to daniel from comment #4)
> (In reply to Antonio Murdaca from comment #3)
> > Can you confirm that manually rebooting the nodes doesn't solve the issue?
> I have rebooted all nodes of the cluster multiple times in the past as well
> as before opening this bz, I have rebooted all the nodes (masters and
> workers) but this changed nothing, i.e. the problem persists

uhm, ok might be something else, can you perhaps tar up the whole systemd journal for a node that's currently failing? (that's unfortunately not included in must-gather)

>
> > (as a workaround, we're still trying to understand what's going on) - do you
> > have a cluster kubeconfig to share also?
> Not sure if I am getting what you need exactly, really the kubeconfig, or do
> you want access to the cluster or can you kindly let me know what exactly
> you'd like to have ?
>
> thx,
> daniel

lowering priority as this is libvirt tho. But I'm keeping an eye on this anyway.

Are you still hitting this? Could you provide a new must-gather. your link (http://inf3.coe.muc.redhat.com/pub/ocp/debug/must-gather-20190916.tar.gz) just 404s...
@brad briefly looking this seems to be different than the original post. i see in mcc:
```
2019-12-13T20:50:58.681266227Z E1213 20:50:58.680777 1 render_controller.go:216] error finding pools for machineconfig: could not find any MachineConfigPool set for MachineConfig managed-ssh-keys-infra with labels: map[cr.applier/hash:00338f448a68ac10398cab84da147ec6eb29fb7b cr.applier/unit:machineconfig machineconfiguration.openshift.io/role:infra]
```
so perhaps check that your pools are set up correctly. Also seeing:
```
2019-12-13T20:56:22.353291917Z E1213 20:56:22.353269 21974 daemon.go:1136] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8895eadc29a0aca62c7be9343dd547a60700be5185151d6a959728a28ac52003
2019-12-13T20:56:22.353308527Z E1213 20:56:22.353294 21974 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-e4183503741870a5d8f36d4f4135b773
```
Digging a bit this looks the same as: https://bugzilla.redhat.com/show_bug.cgi?id=1763700#c12 which is being backported to 4.1 via https://bugzilla.redhat.com/show_bug.cgi?id=1764719

You might be able to unstick the upgrade with `touch /run/machine-config-daemon-force`

@brad Also this should probably be moved to a separate BZ as it seems to have different symptoms than the original BZ.

@brad when you open the new BZ can you also attach the `journalctl -u pivot.service` logs from the failing node if there are any?

I have opened a separate BZ for the starter issues: https://bugzilla.redhat.com/show_bug.cgi?id=1783621

I can confirm that `touch /run/machine-config-daemon-force` fixed the stuck machineconfigpool on my cluster when I saw this error upgrading from 4.1.30 to 4.1.31.
The error I saw was this one (just copied from above - but it was the same):
```
2019-12-13T20:56:22.353291917Z E1213 20:56:22.353269 21974 daemon.go:1136] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8895eadc29a0aca62c7be9343dd547a60700be5185151d6a959728a28ac52003
2019-12-13T20:56:22.353308527Z E1213 20:56:22.353294 21974 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-e4183503741870a5d8f36d4f4135b773
```

Closing due to missing requested information. If this issue persists please provide the requested information and reopen the bug.
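For reference, the `touch /run/machine-config-daemon-force` workaround mentioned in the comments could be applied to a single node roughly as sketched below. This is only a sketch under assumptions: the helper name `force_mcd_revalidate` and the `DRY_RUN` variable are hypothetical, and it assumes `oc debug node/...` works against the cluster (logging in to the node over ssh and running the `touch` with sudo would be an alternative). By default it only prints the command it would run.

```shell
#!/bin/sh
# Hypothetical helper: create the force file on one node so the
# machine-config-daemon skips its on-disk validation (workaround from this BZ).
#   $1: node name, e.g. worker0.coreos.local
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually run the command.
force_mcd_revalidate() {
  node=$1
  # Assumption: the debug pod exposes the host filesystem under /host.
  set -- oc debug "node/${node}" -- chroot /host touch /run/machine-config-daemon-force
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example (dry run): force_mcd_revalidate worker0.coreos.local
```

Printing the command first makes it possible to review it for each degraded node before touching anything on the cluster.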
Description of problem:
Upgrading OCP 4.1.8 through all intermediate versions up to the current one (4.1.15) does not upgrade the machines.

Version-Release number of selected component (if applicable):
# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.15-201909041605+63712ea-dirty", GitCommit:"63712ea", GitTreeState:"dirty", BuildDate:"2019-09-04T23:58:30Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+5909ca9", GitCommit:"5909ca9", GitTreeState:"clean", BuildDate:"2019-09-05T00:06:19Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}

How reproducible:
Upgrading OCP 4.1.8 to a later version does not update the machines/OS.

Steps to Reproduce:
1. Install OCP 4.1.1 on libvirt.
2. Upgrade through all minor versions:
~~~
[root@bastion bz]# oc adm release info
Name: 4.1.15
Digest: sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
Created: 2019-09-06T20:10:43Z
OS/Arch: linux/amd64
Manifests: 287
Pull From: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
Release Metadata:
Version: 4.1.15
Upgrades: 4.1.2, 4.1.3, 4.1.4, 4.1.6, 4.1.7, 4.1.8, 4.1.9, 4.1.11, 4.1.13, 4.1.14
Metadata:
description:
Metadata:
url: https://access.redhat.com/errata/RHBA-2019:2681
Component Versions:
Kubernetes 1.13.4
~~~
3. Observe that the machines are never updated; there is no progress even after weeks and after upgrading to later versions:
~~~
[root@bastion bz]# oc get machineconfigpools.machineconfiguration.openshift.io
NAME CONFIG UPDATED UPDATING DEGRADED
master rendered-master-8bc942b7163ff02019d5f1c7b76103f1 False True True
worker rendered-worker-8cd661b4a862c17e332314621bce67f4 False True True
[root@bastion bz]#
~~~

Actual results:
Looking at the machineconfigpools, I find the following for masters and workers:
~~~
[root@bastion bz]# oc describe machineconfigpools.machineconfiguration.openshift.io master
[...]
Status:
Conditions:
Last Transition Time: 2019-06-06T16:15:54Z
Message:
Reason:
Status: False
Type: RenderDegraded
Last Transition Time: 2019-08-06T08:51:22Z
Message:
Reason:
Status: False
Type: Updated
Last Transition Time: 2019-08-08T06:58:27Z
Message:
Reason:
Status: True
Type: Degraded
Last Transition Time: 2019-08-08T06:58:27Z
Message: Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
Reason: 3 nodes are reporting degraded status on sync
Status: True
Type: NodeDegraded
Last Transition Time: 2019-08-06T08:51:22Z
Message:
Reason: All nodes are updating to rendered-master-4c11795011d74ae70ca85b4b2ae96310
Status: True
Type: Updating
Configuration:
Name: rendered-master-8bc942b7163ff02019d5f1c7b76103f1
[...]
[root@bastion bz]# oc describe machineconfigpools.machineconfiguration.openshift.io worker
[...]
Status:
Conditions:
Last Transition Time: 2019-06-06T16:15:54Z
Message:
Reason:
Status: False
Type: RenderDegraded
Last Transition Time: 2019-08-16T07:19:54Z
Message:
Reason:
Status: False
Type: Updated
Last Transition Time: 2019-08-16T07:44:03Z
Message:
Reason:
Status: True
Type: Degraded
Last Transition Time: 2019-08-16T07:44:03Z
Message: Node worker0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker3.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker4.coreos.local is reporting: "during bootstrap: machineconfig.machineconfiguration.openshift.io \"rendered-worker-8cd661b4a862c17e332314621bce67f4\" not found"
Reason: 5 nodes are reporting degraded status on sync
Status: True
Type: NodeDegraded
Last Transition Time: 2019-08-16T07:19:54Z
Message:
Reason: All nodes are updating to rendered-worker-55910a8e1ba5fe41102e3ef54cd93f8d
Status: True
Type: Updating
[...]
~~~
As can be seen above, the pools have been stuck since mid-August despite additional updates.

Expected results:
Machines are updated with the latest image.

Additional info:
I previously ran into https://bugzilla.redhat.com/show_bug.cgi?id=1723327, which was solved by rolling back ostree on the affected nodes, but that does not do the trick this time and caused hosts being on different oc levels