Bug 1895371
| Summary: | 4.5.17 to 4.6.3 upgrade stuck with MCO controller version mismatch error | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> |
| Component: | Machine Config Operator | Assignee: | MCO Team <team-mco> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Jian Zhang <jiazha> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | aos-bugs, dahernan, jerzhang, mkrejci, wlewis, yanyang |
| Version: | 4.6.z | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-10-11 16:17:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Mike Fiedler
2020-11-06 14:11:13 UTC
Hi Mike,

Just to make sure: the job you described, 22_Disconnected UPI on AWS & Private cluster & Etcd Encryption, has that been failing consistently since you reported this bug, or was this more of a one-time flake?

That specific upgrade failed - as represented in the must-gather.

Sorry for the delay. I took a look at the cluster and it seems to be in a very odd state; let me try to break it down a bit:

1. The cluster has a whopping 7 rendered master configs, 6 from 4.5.17 and one from 4.6.3, indicating that the job has updated the cluster 6 times total (is this expected?)
2. The cluster currently:
   - is on rendered-master-3159f2bce37fc868a9b3f52213073ff5 (2020-11-06T03:30:27Z) <- some 4.5 version
   - is in the middle of updating to rendered-master-e028910daefac169b0921d0e7cd02c92 (2020-11-06T04:21:48Z) <- some 4.5 version
   - desires to be on rendered-master-900812ab7dccd9d8167433a3bf8b122a (2020-11-06T04:37:07Z) <- this is 4.6

It sounds to me like one of the previously configured changes never applied, and for some reason the cluster triggered an update to 4.6.3 anyway. Thus the master pool has:

- 1 node on rendered-master-3159f2bce37fc868a9b3f52213073ff5
- 1 node on rendered-master-3159f2bce37fc868a9b3f52213073ff5 but updating to rendered-master-e028910daefac169b0921d0e7cd02c92, failing
- 1 node on rendered-master-e028910daefac169b0921d0e7cd02c92 (completed this update)

The reason the master pool was stuck was because of this:

> 2020-11-06T04:39:03.514189389Z I1106 04:39:03.514140 154705 daemon.go:344] evicting pod openshift-etcd/etcd-quorum-guard-58cf5c7864-vwnvh
> 2020-11-06T04:39:03.523997093Z E1106 04:39:03.523957 154705 daemon.go:344] error when evicting pod "etcd-quorum-guard-58cf5c7864-vwnvh" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

And that's really the main thing blocking this upgrade, I believe. As to your original error:

> [2020-11-06T05:41:02.370Z] Last Sync Error: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-3159f2bce37fc868a9b3f52213073ff5 expected 4f29aba0ccd1d7c04f839f5c9ceaa3dd63a04960 has 3b284da53fa0fffd88d2825457ccc346a08a01e8: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-900812ab7dccd9d8167433a3bf8b122a, retrying
> [2020-11-06T05:41:02.370Z] Master: 0 (ready 0) out of 3 nodes are updating to latest configuration rendered-master-900812ab7dccd9d8167433a3bf8b122a
> [2020-11-06T05:41:02.370Z] Worker: all 3 nodes are at latest configuration rendered-worker-bf4dc6aae7780a05e9ba22b644e8f2a5

I actually don't see this directly reported by the operator pod. I will look into how this status was the final reported one.

So as I see it there are four issues here:

1. The MCO didn't properly block the update from the previous pool (we should now, after https://github.com/openshift/machine-config-operator/pull/2231).
2. The MCO didn't properly report the correct error message (the provided one was technically not wrong, although not indicative of what happened at all... might be related to https://bugzilla.redhat.com/show_bug.cgi?id=1915499, so I will investigate as part of that BZ).
3. The job profile in question kicked off an upgrade without checking whether the previous MC updates had successfully completed (might not actually be an issue if the MCO were a bit more robust on this, but I think it would be good if the jobs checked too).
4. The etcd quorum guard cannot be evicted for some reason, which blocks the node from progressing its update (and the master pool as a result).
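For anyone reproducing this triage, the pool/node mismatch and the blocking PDB described above can be checked directly against the cluster. A minimal sketch, assuming a recent `oc` client and cluster-admin access (exact output formatting varies by version):

```
# Which rendered config the master pool reports, and whether it is degraded
oc get machineconfigpool master
oc describe machineconfigpool master

# Per-node current vs. desired rendered config; these annotations are
# maintained by the machine-config daemon on each node
oc describe nodes -l node-role.kubernetes.io/master | grep 'machineconfiguration.openshift.io/'

# The PodDisruptionBudget guarding etcd-quorum-guard; evictions are refused
# whenever allowed disruptions drops to 0
oc get pdb -n openshift-etcd
```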
For now I'm going to pass this off to the etcd team to check why the pod cannot be drained (I am guessing it is because we don't currently have 3 pods up, so this one cannot be drained as it would lose quorum, but I'm not sure why that's the case). If that progresses, the operator-level failure should also go away.

# etcd-peer-ip-10-0-55-47.us-east-2.compute.internal gets SIGTERM

> 2020-11-06T04:23:09.020464143Z 2020-11-06 04:23:09.020437 N | pkg/osutil: received terminated signal, shutting down...

When etcd comes back there are no further health checks from quorum-guard for the remainder of the test... So while etcd was running, quorum-guard was not reporting.

Looks like we are trying to reboot after drain and kubelet is stopped?

> Nov 06 04:23:08.925917 ip-10-0-55-47 hyperkube[1435]: I1106 04:23:08.925850 1435 factory.go:177] Factory "crio" was unable to handle container "/system.slice/machine-config-daemon-reboot.service"
> Nov 06 04:23:08.937423 ip-10-0-55-47 systemd[1]: Stopping MCO environment configuration...
> Nov 06 04:23:08.937537 ip-10-0-55-47 hyperkube[1435]: I1106 04:23:08.937430 1435 dynamic_cafile_content.go:182] Shutting down client-ca-bundle::/etc/kubernetes/kubelet-ca.crt
> Nov 06 04:23:08.957489 ip-10-0-55-47 systemd[1]: Stopped MCO environment configuration.
> Nov 06 04:23:08.957587 ip-10-0-55-47 systemd[1]: kubelet.service: Consumed 5min 26.214s CPU time
> Nov 06 04:24:49.949676 ip-10-0-56-242 hyperkube[1435]: I1106 04:23:42.068389 1 node_controller.go:434] Pool master: node ip-10-0-55-47.us-east-2.compute.internal is now reporting unready: node ip-10-0-55-47.us-east-2.compute.internal is reporting OutOfDisk=Unknown

So it looks like a drain took out quorum-guard and the node went unready.
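To confirm the sequence above (drain, kubelet stopped, node unready, quorum-guard silent), the node-level logs can be pulled without SSH. A sketch, assuming the node name from the logs and a still-reachable cluster; flags may differ slightly between `oc` versions:

```
# Node readiness and where the quorum-guard replicas are currently scheduled
oc get nodes
oc get pods -n openshift-etcd -o wide | grep etcd-quorum-guard

# Kubelet and MCD reboot activity on the node that went unready
oc adm node-logs ip-10-0-55-47.us-east-2.compute.internal -u kubelet | tail -n 200
oc debug node/ip-10-0-55-47.us-east-2.compute.internal -- chroot /host \
  journalctl -u kubelet -u machine-config-daemon-reboot.service \
  --since '2020-11-06 04:20:00' --until '2020-11-06 04:45:00'
```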
Then all we see is:

> Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: I1106 04:23:09.112642 96127 main.go:356] received interrupt, shutting down
> Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: E1106 04:23:09.615695 96127 main.go:309] failed to gracefully close secure listener: close tcp [::]:9101: use of closed network connection

# interesting traceback

```
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: E1106 04:23:09.615695 96127 main.go:309] failed to gracefully close secure listener: close tcp [::]:9101: use of closed network connection
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: Traceback (most recent call last):
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "<string>", line 1, in <module>
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "/usr/lib64/python3.6/json/__init__.py", line 299, in load
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:     parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "/usr/lib64/python3.6/json/__init__.py", line 354, in loads
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:     return _default_decoder.decode(s)
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "/usr/lib64/python3.6/json/decoder.py", line 339, in decode
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:   File "/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]:     raise JSONDecodeError("Expecting value", s, err.value) from None
Nov 06 04:30:41.211942 ip-10-0-55-47 hyperkube[1430]: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```

Moving back to the MCO team: it appears that the server got wedged on reboot. quorum-guard was not able to report health because the node was drained and kubelet was stopped.

> 2020-11-06T04:39:03.523997093Z E1106 04:39:03.523957 154705 daemon.go:344] error when evicting pod "etcd-quorum-guard-58cf5c7864-vwnvh" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

So when you tried to drain ip-10-0-56-242, quorum-guard did its job and blocked the drain.
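As a closing note on that last point: the refusal is exactly what the quorum-guard PDB is for. A hypothetical manual drain of that master (the full node name is inferred from the shortened hostname in the log) would be blocked the same way; a sketch assuming cluster-admin access and a 4.6-era `oc` client:

```
# Allowed disruptions drops to 0 when one of the three guard pods is already down
oc get pdb -n openshift-etcd

# A manual drain hits the same eviction refusal the MCD logged above
oc adm drain ip-10-0-56-242.us-east-2.compute.internal \
  --ignore-daemonsets --delete-local-data --force
# expected refusal: error when evicting pod "etcd-quorum-guard-...":
# Cannot evict pod as it would violate the pod's disruption budget.
```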