Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1752508

Summary: upgrade hanging with "during bootstrap: unexpected on-disk state validating against rendered .."
Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.1.z
Status: CLOSED INSUFFICIENT_DATA
Severity: low
Priority: low
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Reporter: daniel <dmoessne>
Assignee: Sinny Kumari <skumari>
QA Contact: Michael Nguyen <mnguyen>
CC: amurdaca, brad.williams, kgarriso, rdiazgav, skumari, smilner, wkulhane
Last Closed: 2020-03-02 17:16:25 UTC

Description daniel 2019-09-16 13:25:21 UTC
Description of problem:

Upgrading OCP from 4.1.8 through every intermediate version up to the current one (4.1.15) does not update the machines.


Version-Release number of selected component (if applicable):
# oc version 
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.15-201909041605+63712ea-dirty", GitCommit:"63712ea", GitTreeState:"dirty", BuildDate:"2019-09-04T23:58:30Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+5909ca9", GitCommit:"5909ca9", GitTreeState:"clean", BuildDate:"2019-09-05T00:06:19Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}


How reproducible:

Upgrading OCP 4.1.8 to a later version does not update the machines/OS.

Steps to Reproduce:
1. Install OCP 4.1.1 on libvirt.
2. Upgrade through all minor versions:
~~~
[root@bastion bz]# oc adm release info
Name:      4.1.15  
Digest:    sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef
Created:   2019-09-06T20:10:43Z
OS/Arch:   linux/amd64
Manifests: 287

Pull From: quay.io/openshift-release-dev/ocp-release@sha256:0a7f743a98e4d0937f44561138a03db8c09cdc4817a771a67f154e032435bcef

Release Metadata:  
  Version:  4.1.15 
  Upgrades: 4.1.2, 4.1.3, 4.1.4, 4.1.6, 4.1.7, 4.1.8, 4.1.9, 4.1.11, 4.1.13, 4.1.14
  Metadata:
    description:   
  Metadata:
    url:  https://access.redhat.com/errata/RHBA-2019:2681

Component Versions:
  Kubernetes 1.13.4

~~~
3. Observe that the machine config pools never finish updating (this happens every time; there is no progress even after weeks and after upgrading to later versions):

~~~
[root@bastion bz]# oc get machineconfigpools.machineconfiguration.openshift.io 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-8bc942b7163ff02019d5f1c7b76103f1   False     True       True
worker   rendered-worker-8cd661b4a862c17e332314621bce67f4   False     True       True
[root@bastion bz]# 
~~~
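A degraded pool can be picked out of that table mechanically. The following is a minimal sketch, assuming the five-column layout shown above (NAME, CONFIG, UPDATED, UPDATING, DEGRADED). The function name `stuck_pools` is hypothetical, and the sample text is copied from the output above rather than fetched from a cluster.

```python
# Sketch only: parses the text of `oc get machineconfigpools` as shown above.
# Assumes the five columns NAME, CONFIG, UPDATED, UPDATING, DEGRADED.

SAMPLE = """\
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-8bc942b7163ff02019d5f1c7b76103f1   False     True       True
worker   rendered-worker-8cd661b4a862c17e332314621bce67f4   False     True       True
"""


def stuck_pools(table_text):
    """Return (pool, config) pairs that are degraded and not yet updated."""
    lines = [ln for ln in table_text.splitlines() if ln.strip()]
    header = lines[0].split()
    stuck = []
    for line in lines[1:]:
        # Map each column name to the value in the same position.
        row = dict(zip(header, line.split()))
        if row["DEGRADED"] == "True" and row["UPDATED"] == "False":
            stuck.append((row["NAME"], row["CONFIG"]))
    return stuck


if __name__ == "__main__":
    for name, config in stuck_pools(SAMPLE):
        print(f"{name}: degraded while on {config}")
```

On the sample above, both pools are reported, since each is degraded and not updated.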


Actual results:
Looking at the machineconfigpools, I find the following for masters and workers:
~~~
[root@bastion bz]# oc describe machineconfigpools.machineconfiguration.openshift.io master
[...]
Status:
  Conditions:
    Last Transition Time:  2019-06-06T16:15:54Z
    Message:               
    Reason:                
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:               
    Reason:                
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:               Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
    Reason:                3 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:               
    Reason:                All nodes are updating to rendered-master-4c11795011d74ae70ca85b4b2ae96310
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-master-8bc942b7163ff02019d5f1c7b76103f1
[...]

[root@bastion bz]# oc describe machineconfigpools.machineconfiguration.openshift.io worker 
[...]
Status:
  Conditions:
    Last Transition Time:  2019-06-06T16:15:54Z
    Message:               
    Reason:                
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-16T07:19:54Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-08-16T07:44:03Z
    Message:               
    Reason:                
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-08-16T07:44:03Z
    Message:               Node worker0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker3.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker4.coreos.local is reporting: "during bootstrap: machineconfig.machineconfiguration.openshift.io \"rendered-worker-8cd661b4a862c17e332314621bce67f4\" not found"
    Reason:                5 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-16T07:19:54Z
    Message:               
    Reason:                All nodes are updating to rendered-worker-55910a8e1ba5fe41102e3ef54cd93f8d
    Status:                True
    Type:                  Updating
[...]
~~~

As can be seen above, both pools have been stuck since August (masters degraded since 2019-08-08, workers since 2019-08-16), despite additional updates.
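The long NodeDegraded message is easier to read when split per node. This is a minimal sketch, assuming the message keeps the `Node <name> is reporting: "<text>"` shape seen above, including the backslash-escaped quotes in the worker4 entry. The function name `node_reports` is hypothetical, and the sample is a shortened copy of the worker message.

```python
import re

# Sketch only: shortened copy of the worker pool's NodeDegraded message.
MESSAGE = r'''Node worker0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker4.coreos.local is reporting: "during bootstrap: machineconfig.machineconfiguration.openshift.io \"rendered-worker-8cd661b4a862c17e332314621bce67f4\" not found"'''

# The quoted text may contain backslash-escaped quotes, so match either an
# escaped character or any character that is not a quote or a backslash.
NODE_MSG = re.compile(r'Node (\S+) is reporting: "((?:\\.|[^"\\])*)"')


def node_reports(message):
    """Map each node name to the error text it reports."""
    return {
        node: text.replace('\\"', '"')
        for node, text in NODE_MSG.findall(message)
    }


if __name__ == "__main__":
    for node, text in node_reports(MESSAGE).items():
        print(f"{node}: {text}")
```

Split this way, it is visible that worker4 reports a different error ("not found") from the other nodes ("unexpected on-disk state").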


Expected results:
machines are updated with the latest image

Additional info:
I previously ran into https://bugzilla.redhat.com/show_bug.cgi?id=1723327, which was solved by rolling back ostree on the affected nodes,
but that does not do the trick this time and left the hosts at different version levels.

Comment 1 daniel 2019-09-16 13:25:51 UTC
[root@bastion bz]# oc adm upgrade
info: An upgrade is in progress. Working towards 4.1.15: 89% complete

No updates available. You may force an upgrade to a specific release image, but doing so may not be supported and result in downtime or data loss.
[root@bastion bz]# 

This hangs forever.


[root@bastion bz]# oc version 
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.15-201909041605+63712ea-dirty", GitCommit:"63712ea", GitTreeState:"dirty", BuildDate:"2019-09-04T23:58:30Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+5909ca9", GitCommit:"5909ca9", GitTreeState:"clean", BuildDate:"2019-09-05T00:06:19Z", GoVersion:"go1.11.13", Compiler:"gc", Platform:"linux/amd64"}
[root@bastion bz]# 
[root@bastion bz]# oc get nodes
NAME                   STATUS   ROLES    AGE    VERSION
master0.coreos.local   Ready    master   101d   v1.13.4+205da2b4a
master1.coreos.local   Ready    master   101d   v1.13.4+205da2b4a
master2.coreos.local   Ready    master   101d   v1.13.4+205da2b4a
worker0.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba
worker1.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba
worker2.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba
worker3.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba
worker4.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba
[root@bastion bz]# 
[root@bastion bz]# oc get nodes -o wide
NAME                   STATUS   ROLES    AGE    VERSION             INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                                   KERNEL-VERSION               CONTAINER-RUNTIME
master0.coreos.local   Ready    master   101d   v1.13.4+205da2b4a   192.168.5.20   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190718.1 (Ootpa)   4.18.0-80.4.2.el8_0.x86_64   cri-o://1.13.9-1.rhaos4.1.gitd70609a.el8
master1.coreos.local   Ready    master   101d   v1.13.4+205da2b4a   192.168.5.21   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190718.1 (Ootpa)   4.18.0-80.4.2.el8_0.x86_64   cri-o://1.13.9-1.rhaos4.1.gitd70609a.el8
master2.coreos.local   Ready    master   101d   v1.13.4+205da2b4a   192.168.5.22   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190718.1 (Ootpa)   4.18.0-80.4.2.el8_0.x86_64   cri-o://1.13.9-1.rhaos4.1.gitd70609a.el8
worker0.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba   192.168.5.30   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa)   4.18.0-80.7.2.el8_0.x86_64   cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
worker1.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba   192.168.5.31   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa)   4.18.0-80.7.2.el8_0.x86_64   cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
worker2.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba   192.168.5.32   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa)   4.18.0-80.7.2.el8_0.x86_64   cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
worker3.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba   192.168.5.33   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa)   4.18.0-80.7.2.el8_0.x86_64   cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
worker4.coreos.local   Ready    worker   101d   v1.13.4+d81afa6ba   192.168.5.34   <none>        Red Hat Enterprise Linux CoreOS 410.8.20190807.0 (Ootpa)   4.18.0-80.7.2.el8_0.x86_64   cri-o://1.13.10-0.1.dev.rhaos4.1.git9e2e1de.el8-dev
[root@bastion bz]# 
[root@bastion bz]# oc get machineconfig
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   CREATED
00-master                                                   916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             101d
00-worker                                                   916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             101d
01-master-container-runtime                                 916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             101d
01-master-kubelet                                           916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             101d
01-worker-container-runtime                                 916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             101d
01-worker-kubelet                                           916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             101d
99-master-23eb5e1e-8876-11e9-8939-5254002cde16-registries   916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             101d
99-master-ssh                                                                                          2.2.0             101d
99-worker-249fdd64-8876-11e9-8939-5254002cde16-registries   916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             101d
99-worker-ssh                                                                                          2.2.0             101d
rendered-master-1a9f1d91cd3c23364d257287499eb8ee            e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             6d3h
rendered-master-3444cb84703a72295e73662a1cd4d91f            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             28d
rendered-master-4c11795011d74ae70ca85b4b2ae96310            916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             2d23h
rendered-master-92547a4268e468d5729ab88062d7c148            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             29d
rendered-master-cf14b9b48f8f4a54fddd62c16c591286            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             31d
rendered-worker-133dcde3bd8c57902329bf61ce15a1ac            e17ddba2f24258f7ab7bb0eb034208cd3a0d1bab   2.2.0             6d3h
rendered-worker-402cf68854ee4cd5fcab6cf92bdc0497            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             28d
rendered-worker-55910a8e1ba5fe41102e3ef54cd93f8d            916adbf5fdc1381714fadc327c17000a8b3707e1   2.2.0             2d23h
rendered-worker-7cfa497d4fe265dd6418422d6551212e            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             29d
rendered-worker-d2c85a97c31fb0bfb5847a7ab6ecab0b            83392b13a5c17e56656acf3f7b0031e3303fb5c0   2.2.0             31d
[root@bastion bz]# 
[root@bastion bz]# oc get machineconfigpools.machineconfiguration.openshift.io 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-8bc942b7163ff02019d5f1c7b76103f1   False     True       True
worker   rendered-worker-8cd661b4a862c17e332314621bce67f4   False     True       True
[root@bastion bz]# 
[root@bastion bz]# oc describe machineconfigpools.machineconfiguration.openshift.io master
[...]
Status:
  Conditions:
    Last Transition Time:  2019-06-06T16:15:54Z
    Message:               
    Reason:                
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:               
    Reason:                
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-08-08T06:58:27Z
    Message:               Node master0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1", Node master2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-master-8bc942b7163ff02019d5f1c7b76103f1"
    Reason:                3 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-06T08:51:22Z
    Message:               
    Reason:                All nodes are updating to rendered-master-4c11795011d74ae70ca85b4b2ae96310
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-master-8bc942b7163ff02019d5f1c7b76103f1
[...]

[root@bastion bz]# oc describe machineconfigpools.machineconfiguration.openshift.io worker 
[...]
Status:
  Conditions:
    Last Transition Time:  2019-06-06T16:15:54Z
    Message:               
    Reason:                
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-08-16T07:19:54Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-08-16T07:44:03Z
    Message:               
    Reason:                
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-08-16T07:44:03Z
    Message:               Node worker0.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker1.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker2.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker3.coreos.local is reporting: "during bootstrap: unexpected on-disk state validating against rendered-worker-7cfa497d4fe265dd6418422d6551212e", Node worker4.coreos.local is reporting: "during bootstrap: machineconfig.machineconfiguration.openshift.io \"rendered-worker-8cd661b4a862c17e332314621bce67f4\" not found"
    Reason:                5 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2019-08-16T07:19:54Z
    Message:               
    Reason:                All nodes are updating to rendered-worker-55910a8e1ba5fe41102e3ef54cd93f8d
    Status:                True
    Type:                  Updating
[...]
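One thing that can be cross-checked from the outputs in this comment is whether the rendered configs named by the pools actually appear in the `oc get machineconfig` listing. The sketch below does that comparison on names copied from the output above. The function name `missing_configs` and the role labels are hypothetical, and a missing name is only an observation here, not an established root cause.

```python
# Sketch only: all names are copied from the outputs in this comment.

# rendered-* entries listed by `oc get machineconfig`
EXISTING = {
    "rendered-master-1a9f1d91cd3c23364d257287499eb8ee",
    "rendered-master-3444cb84703a72295e73662a1cd4d91f",
    "rendered-master-4c11795011d74ae70ca85b4b2ae96310",
    "rendered-master-92547a4268e468d5729ab88062d7c148",
    "rendered-master-cf14b9b48f8f4a54fddd62c16c591286",
    "rendered-worker-133dcde3bd8c57902329bf61ce15a1ac",
    "rendered-worker-402cf68854ee4cd5fcab6cf92bdc0497",
    "rendered-worker-55910a8e1ba5fe41102e3ef54cd93f8d",
    "rendered-worker-7cfa497d4fe265dd6418422d6551212e",
    "rendered-worker-d2c85a97c31fb0bfb5847a7ab6ecab0b",
}

# Configs referenced by the pool table and the pool conditions
REFERENCED = {
    "master pool CONFIG": "rendered-master-8bc942b7163ff02019d5f1c7b76103f1",
    "worker pool CONFIG": "rendered-worker-8cd661b4a862c17e332314621bce67f4",
    "master update target": "rendered-master-4c11795011d74ae70ca85b4b2ae96310",
    "worker update target": "rendered-worker-55910a8e1ba5fe41102e3ef54cd93f8d",
    "worker nodes validate against": "rendered-worker-7cfa497d4fe265dd6418422d6551212e",
}


def missing_configs(referenced, existing):
    """Return the referenced configs that are absent from the listing."""
    return {
        role: name for role, name in referenced.items() if name not in existing
    }


if __name__ == "__main__":
    for role, name in missing_configs(REFERENCED, EXISTING).items():
        print(f"{role}: {name} is not in the machineconfig listing")
```

With the names above, the two configs shown in the pools' CONFIG column are the ones absent from the listing, which is consistent with the "not found" error that worker4 reports.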

Comment 3 Antonio Murdaca 2019-09-16 15:05:15 UTC
Can you confirm that manually rebooting the nodes doesn't solve the issue? (as a workaround, we're still trying to understand what's going on) - do you have a cluster kubeconfig to share also?

Comment 4 daniel 2019-09-16 15:26:27 UTC
(In reply to Antonio Murdaca from comment #3)
> Can you confirm that manually rebooting the nodes doesn't solve the issue?
I have rebooted all nodes of the cluster (masters and workers) multiple times in the past, and again before opening this BZ, but this changed nothing; the problem persists.

> (as a workaround, we're still trying to understand what's going on) - do you
> have a cluster kubeconfig to share also?
I am not sure what exactly you need: the kubeconfig itself, or access to the cluster? Could you kindly let me know what exactly you'd like to have?

thx,
daniel

Comment 5 Antonio Murdaca 2019-09-16 15:46:23 UTC
(In reply to daniel from comment #4)
> (In reply to Antonio Murdaca from comment #3)
> > Can you confirm that manually rebooting the nodes doesn't solve the issue?
> I have rebooted all nodes of the cluster (masters and workers) multiple times
> in the past, and again before opening this BZ, but this changed nothing; the
> problem persists.

Hmm, OK, it might be something else. Can you perhaps tar up the whole systemd journal for a node that's currently failing? (That's unfortunately not included in must-gather.)

> 
> > (as a workaround, we're still trying to understand what's going on) - do you
> > have a cluster kubeconfig to share also?
> I am not sure what exactly you need: the kubeconfig itself, or access to the
> cluster? Could you kindly let me know what exactly you'd like to have?
> 
> thx,
> daniel

Comment 6 Antonio Murdaca 2019-09-16 16:19:15 UTC
Lowering priority as this is libvirt, though I'm keeping an eye on this anyway.

Comment 9 Kirsten Garrison 2019-11-07 21:03:56 UTC
Are you still hitting this? Could you provide a new must-gather? Your link (http://inf3.coe.muc.redhat.com/pub/ocp/debug/must-gather-20190916.tar.gz) just 404s.

Comment 16 Kirsten Garrison 2019-12-13 23:44:27 UTC
@brad From a brief look, this seems to be different from the original post.

I see in the MCC logs:
```
2019-12-13T20:50:58.681266227Z E1213 20:50:58.680777       1 render_controller.go:216] error finding pools for machineconfig: could not find any MachineConfigPool set for MachineConfig managed-ssh-keys-infra with labels: map[cr.applier/hash:00338f448a68ac10398cab84da147ec6eb29fb7b cr.applier/unit:machineconfig machineconfiguration.openshift.io/role:infra]
```

So perhaps check that your pools are set up correctly.

Also seeing:

```
2019-12-13T20:56:22.353291917Z E1213 20:56:22.353269   21974 daemon.go:1136] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8895eadc29a0aca62c7be9343dd547a60700be5185151d6a959728a28ac52003
2019-12-13T20:56:22.353308527Z E1213 20:56:22.353294   21974 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-e4183503741870a5d8f36d4f4135b773
```


Digging a bit, this looks the same as https://bugzilla.redhat.com/show_bug.cgi?id=1763700#c12,

which is being backported to 4.1 via https://bugzilla.redhat.com/show_bug.cgi?id=1764719

You might be able to unstick the upgrade with `touch /run/machine-config-daemon-force`
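
To check whether a node shows the same signature as the log lines quoted above, the two lines can be matched mechanically. This is a minimal sketch, assuming the machine-config-daemon log format stays as quoted in this comment. The function name `summarize` is hypothetical, and the sample text is copied from the log excerpt above.

```python
import re

# Sketch only: the two log lines are copied from the excerpt above.
LOG = """\
2019-12-13T20:56:22.353291917Z E1213 20:56:22.353269   21974 daemon.go:1136] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8895eadc29a0aca62c7be9343dd547a60700be5185151d6a959728a28ac52003
2019-12-13T20:56:22.353308527Z E1213 20:56:22.353294   21974 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-e4183503741870a5d8f36d4f4135b773
"""

OS_IMAGE = re.compile(r"expected target osImageURL (\S+)")
ON_DISK = re.compile(r"unexpected on-disk state validating against (\S+)")


def summarize(log_text):
    """Extract the expected OS image and the rendered config being validated."""
    image = OS_IMAGE.search(log_text)
    config = ON_DISK.search(log_text)
    return {
        "expected_os_image": image.group(1) if image else None,
        "rendered_config": config.group(1) if config else None,
    }


if __name__ == "__main__":
    for key, value in summarize(LOG).items():
        print(f"{key}: {value}")
```

If both fields come back non-empty for a node's log, the node shows the same pair of messages as the excerpt; a `None` in either field means that line was not found.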

Comment 17 Kirsten Garrison 2019-12-14 00:01:50 UTC
@brad Also this should probably be moved to a separate BZ as it seems to have different symptoms than the original BZ.

Comment 18 Kirsten Garrison 2019-12-14 00:22:15 UTC
@brad when you open the new BZ can you also attach the `journalctl -u pivot.service`  logs from the failing node if there are any?

Comment 19 brad.williams 2019-12-14 05:13:47 UTC
I have opened a separate BZ for the starter issues:
https://bugzilla.redhat.com/show_bug.cgi?id=1783621

Comment 20 Wolfgang Kulhanek 2020-02-04 08:52:50 UTC
I can confirm that `touch /run/machine-config-daemon-force` fixed the stuck machineconfigpool on my cluster when I saw this error upgrading from 4.1.30 to 4.1.31.

The error I saw was this one (just copied from above, but it was the same):
```
2019-12-13T20:56:22.353291917Z E1213 20:56:22.353269   21974 daemon.go:1136] expected target osImageURL quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8895eadc29a0aca62c7be9343dd547a60700be5185151d6a959728a28ac52003
2019-12-13T20:56:22.353308527Z E1213 20:56:22.353294   21974 writer.go:132] Marking Degraded due to: during bootstrap: unexpected on-disk state validating against rendered-worker-e4183503741870a5d8f36d4f4135b773
```

Comment 26 Steve Milner 2020-03-02 17:16:25 UTC
Closing due to missing requested information. If this issue persists, please provide the requested information and reopen the bug.