Bug 1899316
| Summary: | 4.5.18 -> 4.6.4 stuck, etcd-quorum-guard pending ContainerCreating on updated control-plane node | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | RHCOS | Assignee: | Sam Batschelet <sbatsche> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | ge liu <geliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 4.5 | CC: | aos-bugs, bbreard, bshirren, cpassare, cverna, dmace, dornelas, dustymabe, imcleod, jligon, kgarriso, lmohanty, miabbott, mrobson, nstielau, plarsen, sbatsche, sdodson, smilner, thaller, wking |
| Target Milestone: | --- | Keywords: | UpcomingSprint, Upgrades |
| Target Release: | 4.7.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-26 08:15:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Bumping the severity to high as we have seen this issue on a customer's cluster. After the root-cause analysis, please set the appropriate severity.

https://bugzilla.redhat.com/show_bug.cgi?id=1899600 is the canonical tracker for an I/O performance regression which probably explains the problem reported in this bug, and we intend to mark examples (e.g. this bug) as duplicates of that tracker.

If I understand correctly, we think there are issues cropping up because some machines seem to be getting a different IP address from DHCP after the upgrade. It is true that in 4.6 we switched to NetworkManager's internal DHCP implementation, but I'm not sure that's the culprit here, at least not in the default configuration. I just dug in to try to find out. Specifically:

- "Does RHCOS in 4.5 use a different DHCP client ID scheme than 4.6?" According to some upstream commits [1] [2], dhclient is configured to use the MAC address as the client ID, and we've also configured NM to do the same.

[1] https://github.com/coreos/fedora-coreos-config/pull/82
[2] https://github.com/coreos/coreos-assembler/pull/367

I just did some local testing and can confirm that the client ID in both 4.5 and 4.6 uses the MAC address. I did this by taking a peek at the entries in `/var/lib/libvirt/dnsmasq/virbr0.status` for the VMs on my local machine. I did an upgrade from 4.5 to 4.6 and got the same IP with no problem. However, if I manually changed the client ID strategy with something like:

    echo -e "[connection]\nipv4.dhcp-client-id=duid" | sudo tee /etc/NetworkManager/conf.d/20-client-id.conf

then I watch the dnsmasq logs (`sudo journalctl -f -t dnsmasq-dhcp`) and see a DHCPNAK and a new address get handed out because the client ID has changed.

I guess all of this is to say that with the default configuration I would not have expected the IP address to change, even with the switch from dhclient to the internal NM DHCP client.
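The lease check described above can be scripted. A minimal sketch, assuming libvirt's JSON lease format (the `mac-address` / `client-id` field names are an assumption; compare against your actual `/var/lib/libvirt/dnsmasq/virbr0.status`) and using an inline sample rather than a live file:

```shell
# Hypothetical sample in libvirt's JSON lease format (field names are an
# assumption; verify against a real virbr0.status).
cat > /tmp/virbr0.status <<'EOF'
[
  {"ip-address": "192.168.122.50", "mac-address": "52:54:00:ab:cd:ef",
   "client-id": "52:54:00:ab:cd:ef", "hostname": "master-0", "expiry-time": 1605700000}
]
EOF

# Print MAC vs. DHCP client ID per lease; matching columns mean the
# MAC-as-client-ID scheme is in effect.
jq -r '.[] | .["mac-address"] + " " + .["client-id"]' /tmp/virbr0.status
```

On a default 4.5 or 4.6 node the two columns should match; a mismatch would point at a changed client-ID strategy.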
Some questions:

- Was there any significant networking configuration on these machines?
- Is there anything common across the various instances of this failure that could help us find the culprit?
- If bonding is common across all of the instances, is it possible one NIC failed and the interface for the other NIC was used for the bond?

This reminds me of https://bugzilla.redhat.com/show_bug.cgi?id=1823602#c39

NetworkManager would (debug) log which DHCP client-id it chooses (and why). In general, if there is a networking issue with NetworkManager, collecting trace logs is a good first step. Could somebody please provide a full log of NetworkManager that shows the issue? Usually, you would configure level=TRACE in NetworkManager.conf (https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/contrib/fedora/rpm/NetworkManager.conf#n28). But for NetworkManager-in-initrd, just set "rd.debug" on the kernel command line, boot, and provide the logging output.

Higher-priority work has prevented this issue from being solved; adding the UpcomingSprint keyword.

Closing this as INSUFFICIENT_DATA since we can't reproduce it. Please feel free to reopen if this problem happens again.

*** Bug 1915853 has been marked as a duplicate of this bug. ***
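For reference, the trace-logging drop-in mentioned above could look like the following. This is a sketch only: the file name is arbitrary, and it is written under /tmp here purely for illustration; on a real host it would go in /etc/NetworkManager/conf.d/ followed by a NetworkManager restart.

```shell
# Sketch of a NetworkManager trace-logging drop-in (written to /tmp for
# illustration; the real location would be /etc/NetworkManager/conf.d/).
cat > /tmp/99-trace-logging.conf <<'EOF'
[logging]
level=TRACE
EOF
cat /tmp/99-trace-logging.conf
```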
Seen in reported Insights on an external cluster (so we don't have access to a full must-gather at the moment). The cluster is stuck updating from 4.5.18 to 4.6.4:

    $ jq -r '.status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version + " " + (.verified | tostring)' config/version.json
    2020-11-17T08:39:00Z - Partial 4.6.4 false
    2020-11-16T14:46:34Z 2020-11-16T22:00:00Z Completed 4.5.18 true
    2020-10-01T14:05:50Z 2020-10-01T16:21:00Z Completed 4.5.13 true
    2020-09-29T12:44:15Z 2020-09-29T16:05:53Z Completed 4.5.11 false

The cluster-version operator is flipping between seeming happy-but-stuck:

    $ jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' config/version.json | sort
    2020-09-29T16:05:53Z Available=True -: Done applying 4.5.18
    2020-11-17T08:39:00Z Progressing=True -: Working towards 4.6.4: 11% complete
    2020-11-17T10:15:06Z RetrievedUpdates=True -: -
    2020-11-18T16:03:23Z Failing=False -: -

and complaining about etcd:

    2020-11-18T07:36:32Z Failing=True ClusterOperatorDegraded: Cluster operator etcd is reporting a failure: EtcdMembersDegraded: 2 of 3 members are available, ocp2-9nws9-master-0 is unhealthy

The etcd operator is sad:

    $ jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' config/clusteroperator/etcd.json
    2020-11-17T12:39:50Z Degraded=True EtcdMembers_UnhealthyMembers: EtcdMembersDegraded: 2 of 3 members are available, ocp2-9nws9-master-0 is unhealthy
    2020-11-17T13:27:48Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 8 EtcdMembersProgressing: No unstarted etcd members found
    2020-09-29T14:03:04Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 8 EtcdMembersAvailable: 2 of 3 members are available, ocp2-9nws9-master-0 is unhealthy
    2020-09-29T13:56:51Z Upgradeable=True AsExpected: -

So is the machine-config operator:

    $ jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' config/clusteroperator/machine-config.json
    2020-11-17T11:33:35Z Progressing=True -: Working towards 4.6.4
    2020-11-17T14:36:21Z Degraded=True RequiredPoolsFailed: Unable to apply 4.6.4: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-a8083f3c68269d9fa3239729f236f883 expected 054f6197a19ceffff44f361674bd24644d1a2bcb has 3b284da53fa0fffd88d2825457ccc346a08a01e8: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ocp2-9nws9-master-1 is reporting: \"failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod \\\"etcd-quorum-guard-848f9f765-tgzj2\\\": global timeout reached: 1m30s\"", retrying
    2020-11-17T11:30:16Z Available=False -: Cluster not available for 4.6.4
    2020-09-29T14:03:37Z Upgradeable=True AsExpected: -

And a few other operators. The reason most of the operators are sad is that the master-1 node is cordoned and mostly drained, and the operators are missing the workloads they would like to schedule on that node.
    $ cat config/node/ocp2-9nws9-master-* | jq -r '[.status.conditions[] | select(.type == "Ready")][0] as $ready | .metadata.creationTimestamp + " " + $ready.lastTransitionTime + " " + $ready.status + " " + .metadata.name + " " + .status.nodeInfo.kubeletVersion + " " + .metadata.annotations["machineconfiguration.openshift.io/currentConfig"] + "->" + .metadata.annotations["machineconfiguration.openshift.io/desiredConfig"] + " " + ([(.spec.taints // [])[].key] | sort | tostring)' | sort
    2020-09-29T13:41:38Z 2020-11-17T11:48:56Z True ocp2-9nws9-master-2 v1.18.3+ac53d20 rendered-master-a8083f3c68269d9fa3239729f236f883->rendered-master-a8083f3c68269d9fa3239729f236f883 ["node-role.kubernetes.io/master"]
    2020-09-29T13:41:38Z 2020-11-17T12:51:07Z True ocp2-9nws9-master-0 v1.19.0+9f84db3 rendered-master-30a3a1dc24e52e13ee5b32e096d07c34->rendered-master-30a3a1dc24e52e13ee5b32e096d07c34 ["node-role.kubernetes.io/master"]
    2020-09-29T13:41:47Z 2020-11-17T11:45:19Z True ocp2-9nws9-master-1 v1.18.3+ac53d20 rendered-master-a8083f3c68269d9fa3239729f236f883->rendered-master-30a3a1dc24e52e13ee5b32e096d07c34 ["node-role.kubernetes.io/master","node.kubernetes.io/unschedulable"]

master-0 is already updated. master-1 is being drained by the machine-config daemon, but the drain is blocked by etcd-quorum-guard's PodDisruptionBudget (PDB).
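The same currentConfig/desiredConfig comparison can be scripted to flag nodes that are still mid-update. A minimal sketch against an inline sample node object (the shortened rendered-config names are hypothetical stand-ins):

```shell
# Hypothetical sample node carrying the machine-config operator's
# current/desired annotations (values shortened for readability).
cat > /tmp/node.json <<'EOF'
{"metadata": {"name": "ocp2-9nws9-master-1",
  "annotations": {
    "machineconfiguration.openshift.io/currentConfig": "rendered-master-aaaa",
    "machineconfiguration.openshift.io/desiredConfig": "rendered-master-bbbb"}}}
EOF

# Emit the node name only when current != desired, i.e. the node is still
# rolling to a new MachineConfig.
jq -r 'select(.metadata.annotations["machineconfiguration.openshift.io/currentConfig"]
  != .metadata.annotations["machineconfiguration.openshift.io/desiredConfig"])
  | .metadata.name + " is mid-update"' /tmp/node.json
```

Run against the real `config/node/*` files, this would print only master-1, matching the table above.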
etcd-quorum-guard's PDB is blocked because the etcd-quorum-guard pod on the already-updated master-0 is pending:

    $ jq '{metadata: (.metadata | {creationTimestamp}), status: (.status | {phase, conditions, containerStatuses})}' config/pod/openshift-etcd/etcd-quorum-guard-848f9f765-r6qn2.json
    {
      "metadata": {
        "creationTimestamp": "2020-11-17T11:53:21Z"
      },
      "status": {
        "phase": "Pending",
        "conditions": [
          {
            "type": "Initialized",
            "status": "True",
            "lastProbeTime": null,
            "lastTransitionTime": "2020-11-17T13:00:10Z"
          },
          {
            "type": "Ready",
            "status": "False",
            "lastProbeTime": null,
            "lastTransitionTime": "2020-11-17T13:00:10Z",
            "reason": "ContainersNotReady",
            "message": "containers with unready status: [guard]"
          },
          {
            "type": "ContainersReady",
            "status": "False",
            "lastProbeTime": null,
            "lastTransitionTime": "2020-11-17T13:00:10Z",
            "reason": "ContainersNotReady",
            "message": "containers with unready status: [guard]"
          },
          {
            "type": "PodScheduled",
            "status": "True",
            "lastProbeTime": null,
            "lastTransitionTime": "2020-11-17T13:00:10Z"
          }
        ],
        "containerStatuses": [
          {
            "name": "guard",
            "state": {
              "waiting": {
                "reason": "ContainerCreating"
              }
            },
            "lastState": {},
            "ready": false,
            "restartCount": 0,
            "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a95399379b48916216c49b1e4d4786b19081378ccd5821b27d97b496edd56a86",
            "imageID": "",
            "started": false
          }
        ]
      }
    }

It is not clear to me why that pod has been pending for so long: the pod was created at 11:53:21Z, most conditions changed an hour later at 13:00:10Z, and, judging by the Failing=True condition, it is now at least 2020-11-18T16:03:23Z, over 27 hours later, and the pod is still pending. Assigning to the node folks in case they have insights into the long Pending phase. Searching for possible existing bugs turned up bug 1883521, bug 1890256, and bug 1897026. I'm not clear enough on the specific mechanism in those bugs to know whether this one is a dup or not.
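A quick way to surface the same diagnosis from pod JSON is to pull the waiting reason out of `containerStatuses`. A sketch against an inline sample mirroring the shape above:

```shell
# Minimal pod-status sample mirroring the stuck etcd-quorum-guard pod above.
cat > /tmp/pod.json <<'EOF'
{"status": {"phase": "Pending",
  "containerStatuses": [
    {"name": "guard", "ready": false,
     "state": {"waiting": {"reason": "ContainerCreating"}}}]}}
EOF

# For each not-ready container, print its name and why it is waiting.
jq -r '.status.containerStatuses[] | select(.ready | not)
  | .name + ": " + (.state.waiting.reason // "unknown")' /tmp/pod.json
```

Against the real pod JSON this surfaces `guard: ContainerCreating`, which is the symptom blocking the PDB and hence the master-1 drain.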