Bug 1895371

Summary: 4.5.17 to 4.6.3 upgrade stuck with MCO controller version mismatch error
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Machine Config Operator
Assignee: MCO Team <team-mco>
Sub Component: Machine Config Operator
QA Contact: Jian Zhang <jiazha>
Status: CLOSED NOTABUG
Docs Contact:
Severity: high
Priority: unspecified
CC: aos-bugs, dahernan, jerzhang, mkrejci, wlewis, yanyang
Version: 4.6.z   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-11 16:17:58 UTC
Type: Bug

Description Mike Fiedler 2020-11-06 14:11:13 UTC
Description of problem:

The 4.5.17 -> 4.6.3 upgrade for profile 22_Disconnected UPI on AWS & Private cluster & Etcd Encryption on failed, with the MCO stuck on:

[2020-11-06T05:41:02.370Z]     Last Sync Error:  pool master has not progressed to latest configuration: controller version mismatch for rendered-master-3159f2bce37fc868a9b3f52213073ff5 expected 4f29aba0ccd1d7c04f839f5c9ceaa3dd63a04960 has 3b284da53fa0fffd88d2825457ccc346a08a01e8: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-900812ab7dccd9d8167433a3bf8b122a, retrying
[2020-11-06T05:41:02.370Z]     Master:           0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-900812ab7dccd9d8167433a3bf8b122a
[2020-11-06T05:41:02.370Z]     Worker:           all 3 nodes are at latest configuration rendered-worker-bf4dc6aae7780a05e9ba22b644e8f2a5



Version-Release number of selected component (if applicable): 4.6.3


Additional info:

Will add a must-gather while the install is stalled.

Comment 3 Yu Qi Zhang 2020-12-08 17:12:41 UTC
Hi Mike,

Just to make sure: has the job you described (22_Disconnected UPI on AWS & Private cluster & Etcd Encryption) been failing consistently since you reported this bug, or was this more of a one-time flake?

Comment 4 Mike Fiedler 2020-12-08 21:23:13 UTC
That specific upgrade failed, as represented in the must-gather.

Comment 5 Yu Qi Zhang 2021-02-03 01:45:43 UTC
Sorry for the delay. I took a look at the cluster and it seems to be in a very odd state; let me try to break it down a bit:

1. The cluster has a whopping 7 rendered master configs, 6 from 4.5.17 and one from 4.6.3, indicating that the job has updated the cluster 6 times total (is this expected?)
2. The cluster currently is:
   on rendered-master-3159f2bce37fc868a9b3f52213073ff5 (2020-11-06T03:30:27Z)                            <- some 4.5 version
   in the middle of updating to rendered-master-e028910daefac169b0921d0e7cd02c92 (2020-11-06T04:21:48Z)  <- some 4.5 version
   desires to be on rendered-master-900812ab7dccd9d8167433a3bf8b122a (2020-11-06T04:37:07Z)              <- this is 4.6

It sounds to me like one of the previously configured changes never applied, and for some reason the cluster triggered an update to 4.6.3 anyway.

Thus the master pool has:
1 node on rendered-master-3159f2bce37fc868a9b3f52213073ff5
1 node on rendered-master-3159f2bce37fc868a9b3f52213073ff5 but updating to rendered-master-e028910daefac169b0921d0e7cd02c92, failing
1 node on rendered-master-e028910daefac169b0921d0e7cd02c92 (completed this update)
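
For reference, that per-node breakdown comes straight from the MCD's per-node annotations (currentConfig / desiredConfig / state). A minimal sketch of pulling the same data with client-go, assuming only a local kubeconfig and the standard master node-role label (nothing here is specific to this cluster's must-gather):

```
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: a local kubeconfig with access to the cluster.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The MCD records the per-node picture in these annotations.
	nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{
		LabelSelector: "node-role.kubernetes.io/master",
	})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		a := n.Annotations
		fmt.Printf("%s\n  current: %s\n  desired: %s\n  state:   %s\n",
			n.Name,
			a["machineconfiguration.openshift.io/currentConfig"],
			a["machineconfiguration.openshift.io/desiredConfig"],
			a["machineconfiguration.openshift.io/state"])
	}
}
```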

The master pool was stuck because of this:

2020-11-06T04:39:03.514189389Z I1106 04:39:03.514140  154705 daemon.go:344] evicting pod openshift-etcd/etcd-quorum-guard-58cf5c7864-vwnvh
2020-11-06T04:39:03.523997093Z E1106 04:39:03.523957  154705 daemon.go:344] error when evicting pod "etcd-quorum-guard-58cf5c7864-vwnvh" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

And I believe that's really the main thing blocking this upgrade.
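
For context on that retry loop: the daemon drains via the eviction API, and an eviction that would violate a PodDisruptionBudget is answered with an HTTP 429 that the caller retries. A rough sketch of that call, reusing the pod name and namespace from the log above (the retry handling is illustrative, not the MCD's actual code path):

```
package example

import (
	"context"
	"fmt"

	policyv1beta1 "k8s.io/api/policy/v1beta1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictQuorumGuard issues the same kind of eviction a drain does.
// Pod name and namespace are copied from the log above.
func evictQuorumGuard(client kubernetes.Interface) error {
	eviction := &policyv1beta1.Eviction{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "etcd-quorum-guard-58cf5c7864-vwnvh",
			Namespace: "openshift-etcd",
		},
	}
	err := client.PolicyV1beta1().Evictions("openshift-etcd").Evict(context.TODO(), eviction)
	if apierrors.IsTooManyRequests(err) {
		// The API server answers 429 when the eviction would violate the
		// PodDisruptionBudget: the "Cannot evict pod as it would violate
		// the pod's disruption budget" message in the daemon log.
		fmt.Println("eviction blocked by PDB, retry later:", err)
	}
	return err
}
```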



As for your original error:

[2020-11-06T05:41:02.370Z]     Last Sync Error:  pool master has not progressed to latest configuration: controller version mismatch for rendered-master-3159f2bce37fc868a9b3f52213073ff5 expected 4f29aba0ccd1d7c04f839f5c9ceaa3dd63a04960 has 3b284da53fa0fffd88d2825457ccc346a08a01e8: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-900812ab7dccd9d8167433a3bf8b122a, retrying
[2020-11-06T05:41:02.370Z]     Master:           0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-900812ab7dccd9d8167433a3bf8b122a
[2020-11-06T05:41:02.370Z]     Worker:           all 3 nodes are at latest configuration rendered-worker-bf4dc6aae7780a05e9ba22b644e8f2a5

I actually don't see this directly reported by the operator pod. I will look into how this status was the final reported one.



So as I see it there are four issues here:
1. The MCO didn't properly block the upgrade while the previous pool update was still incomplete (we should now, after https://github.com/openshift/machine-config-operator/pull/2231)
2. The MCO didn't report the correct error message (the provided one was technically not wrong, although not at all indicative of what happened... it might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1915499, so I will investigate as part of that BZ)
3. The job profile in question kicked off an upgrade without checking whether the previous MC updates had completed successfully (this might not actually be an issue if the MCO were a bit more robust here, but I think it would be good for the jobs to check as well)
4. The etcd quorum guard pod cannot be evicted for some reason, which blocks the node from progressing with its update (and the master pool as a result)

For now I'm going to pass this off to the etcd team to check why the pod cannot be drained (I am guessing it is because we don't currently have 3 pods up, so this one cannot be drained as it would lose quorum, but I'm not sure why that's the case). If that progresses, the operator-level failure should also go away.
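
If it helps whoever picks this up, the quick way to confirm that guess is the quorum-guard PDB's status accounting; a sketch along these lines, assuming the PDB is named etcd-quorum-guard in openshift-etcd (I haven't verified the name against this cluster):

```
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// dumpQuorumGuardPDB prints the budget's health accounting. If
// disruptionsAllowed is 0, every quorum-guard eviction will be refused.
func dumpQuorumGuardPDB(client kubernetes.Interface) error {
	pdb, err := client.PolicyV1beta1().PodDisruptionBudgets("openshift-etcd").Get(
		context.TODO(), "etcd-quorum-guard", metav1.GetOptions{})
	if err != nil {
		return err
	}
	fmt.Printf("currentHealthy=%d desiredHealthy=%d expectedPods=%d disruptionsAllowed=%d\n",
		pdb.Status.CurrentHealthy, pdb.Status.DesiredHealthy,
		pdb.Status.ExpectedPods, pdb.Status.PodDisruptionsAllowed)
	return nil
}
```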

Comment 8 Sam Batschelet 2021-03-05 20:58:12 UTC
# etcd-peer-ip-10-0-55-47.us-east-2.compute.internal gets SIGTERM
> 2020-11-06T04:23:09.020464143Z 2020-11-06 04:23:09.020437 N | pkg/osutil: received terminated signal, shutting down...

When etcd comes back, there are no further health checks from quorum-guard for the remainder of the test... So while etcd was running, quorum-guard was not reporting.

It looks like we are trying to reboot after the drain while kubelet is stopped?

>Nov 06 04:23:08.925917 ip-10-0-55-47 hyperkube[1435]: I1106 04:23:08.925850    1435 factory.go:177] Factory "crio" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
>Nov 06 04:23:08.937423 ip-10-0-55-47 systemd[1]: Stopping MCO environment configuration...
>Nov 06 04:23:08.937537 ip-10-0-55-47 hyperkube[1435]: I1106 04:23:08.937430    1435 dynamic_cafile_content.go:182] Shutting down client-ca-bundle::/etc/kubernetes/kubelet-ca.crt
>Nov 06 04:23:08.957489 ip-10-0-55-47 systemd[1]: Stopped MCO environment configuration.
>Nov 06 04:23:08.957587 ip-10-0-55-47 systemd[1]: kubelet.service: Consumed 5min 26.214s CPU time

> Nov 06 04:24:49.949676 ip-10-0-56-242 hyperkube[1435]: I1106 04:23:42.068389       1 node_controller.go:434] Pool master: node ip-10-0-55-47.us-east-2.compute.internal is now reporting unready: node ip-10-0-55-47.us-east-2.compute.internal is reporting OutOfDisk=Unknown 

So it looks like a drain took out quorum-guard and the node went unready. Then all we see is:

>Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: I1106 04:23:09.112642   96127 main.go:356] received interrupt, shutting down
>Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: E1106 04:23:09.615695   96127 main.go:309] failed to gracefully close secure listener: close tcp [::]:9101: use of closed network connection

# interesting traceback
```
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: E1106 04:23:09.615695   96127 main.go:309] failed to gracefully close secure listener: close tcp [::]:9101: use of closed network connection
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: Traceback (most recent call last):
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "<string>", line 1, in <module>
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "/usr/lib64/python3.6/json/__init__.py", line 299, in load
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:     return _default_decoder.decode(s)
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:     raise JSONDecodeError("Expecting value", s, err.value) from None
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```

Moving this back to the MCO team, as it appears that the server got wedged on reboot. quorum-guard was not able to report health because the node was drained and kubelet was stopped.

> 2020-11-06T04:39:03.523997093Z E1106 04:39:03.523957  154705 daemon.go:344] error when evicting pod "etcd-quorum-guard-58cf5c7864-vwnvh" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

So when you tried to drain ip-10-0-56-242, quorum-guard did its job and blocked the drain.
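
Back-of-the-envelope on why that is the correct behaviour (the counts below are my read of the logs, not numbers pulled from the must-gather): the quorum guard runs 3 replicas and its budget wants 2 of them healthy; with the replica on the wedged ip-10-0-55-47 not reporting, only 2 are healthy, so no disruptions are allowed and the eviction on ip-10-0-56-242 is refused.

```
package main

import "fmt"

// disruptionsAllowed mirrors how the disruption budget is accounted:
// healthy pods minus the number that must stay healthy, floored at zero.
func disruptionsAllowed(currentHealthy, desiredHealthy int) int {
	if d := currentHealthy - desiredHealthy; d > 0 {
		return d
	}
	return 0
}

func main() {
	// Assumed numbers: 3 quorum-guard replicas, budget wants 2 healthy, and
	// the replica on the wedged ip-10-0-55-47 is not reporting healthy.
	fmt.Println(disruptionsAllowed(2, 2)) // 0: evicting the pod on ip-10-0-56-242 is refused
}
```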