Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1899316

Summary: 4.5.18 -> 4.6.4 stuck, etcd-quorum-guard pending ContainerCreating on updated control-plane node
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: RHCOS
Assignee: Sam Batschelet <sbatsche>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: ge liu <geliu>
Severity: high
Priority: high
Version: 4.5
CC: aos-bugs, bbreard, bshirren, cpassare, cverna, dmace, dornelas, dustymabe, imcleod, jligon, kgarriso, lmohanty, miabbott, mrobson, nstielau, plarsen, sbatsche, sdodson, smilner, thaller, wking
Target Milestone: ---
Keywords: UpcomingSprint, Upgrades
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2021-01-26 08:15:27 UTC
Type: Bug

Description W. Trevor King 2020-11-18 21:07:06 UTC
Seen in Insights reports for an external cluster (so we don't have access to a full must-gather at the moment).  The cluster is stuck updating from 4.5.18 to 4.6.4:

$ jq -r '.status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version + " " + (.verified | tostring)' config/version.json 
2020-11-17T08:39:00Z - Partial 4.6.4 false
2020-11-16T14:46:34Z 2020-11-16T22:00:00Z Completed 4.5.18 true
2020-10-01T14:05:50Z 2020-10-01T16:21:00Z Completed 4.5.13 true
2020-09-29T12:44:15Z 2020-09-29T16:05:53Z Completed 4.5.11 false

The cluster-version operator is flipping between seeming happy-but-stuck:

$ jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' config/version.json | sort
2020-09-29T16:05:53Z Available=True -: Done applying 4.5.18
2020-11-17T08:39:00Z Progressing=True -: Working towards 4.6.4: 11% complete
2020-11-17T10:15:06Z RetrievedUpdates=True -: -
2020-11-18T16:03:23Z Failing=False -: -

and complaining about etcd:

2020-11-18T07:36:32Z Failing=True ClusterOperatorDegraded: Cluster operator etcd is reporting a failure: EtcdMembersDegraded: 2 of 3 members are available, ocp2-9nws9-master-0 is unhealthy

etcd operator is sad:

$ jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' config/clusteroperator/etcd.json       
2020-11-17T12:39:50Z Degraded=True EtcdMembers_UnhealthyMembers: EtcdMembersDegraded: 2 of 3 members are available, ocp2-9nws9-master-0 is unhealthy
2020-11-17T13:27:48Z Progressing=False AsExpected: NodeInstallerProgressing: 3 nodes are at revision 8
EtcdMembersProgressing: No unstarted etcd members found
2020-09-29T14:03:04Z Available=True AsExpected: StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 8
EtcdMembersAvailable: 2 of 3 members are available, ocp2-9nws9-master-0 is unhealthy
2020-09-29T13:56:51Z Upgradeable=True AsExpected: -

So is the machine-config operator:

$ jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + (.reason // "-") + ": " + (.message // "-")' config/clusteroperator/machine-config.json
2020-11-17T11:33:35Z Progressing=True -: Working towards 4.6.4
2020-11-17T14:36:21Z Degraded=True RequiredPoolsFailed: Unable to apply 4.6.4: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: controller version mismatch for rendered-master-a8083f3c68269d9fa3239729f236f883 expected 054f6197a19ceffff44f361674bd24644d1a2bcb has 3b284da53fa0fffd88d2825457ccc346a08a01e8: pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ocp2-9nws9-master-1 is reporting: \"failed to drain node (5 tries): timed out waiting for the condition: error when evicting pod \\\"etcd-quorum-guard-848f9f765-tgzj2\\\": global timeout reached: 1m30s\"", retrying
2020-11-17T11:30:16Z Available=False -: Cluster not available for 4.6.4
2020-09-29T14:03:37Z Upgradeable=True AsExpected: -

And a few other operators.  The reason most operators are sad is that the master-1 node is cordoned and mostly drained, so the operators are missing the workloads they would like to schedule on that node.

$ cat config/node/ocp2-9nws9-master-* | jq -r '[.status.conditions[] | select(.type == "Ready")][0] as $ready | .metadata.creationTimestamp + " " + $ready.lastTransitionTime + " " + $ready.status + " " + .metadata.name + " " + .status.nodeInfo.kubeletVersion + " " + .metadata.annotations["machineconfiguration.openshift.io/currentConfig"] + "->" + .metadata.annotations["machineconfiguration.openshift.io/desiredConfig"] + " " + ([(.spec.taints // [])[].key] | sort | tostring)' | sort
2020-09-29T13:41:38Z 2020-11-17T11:48:56Z True ocp2-9nws9-master-2 v1.18.3+ac53d20 rendered-master-a8083f3c68269d9fa3239729f236f883->rendered-master-a8083f3c68269d9fa3239729f236f883 ["node-role.kubernetes.io/master"]
2020-09-29T13:41:38Z 2020-11-17T12:51:07Z True ocp2-9nws9-master-0 v1.19.0+9f84db3 rendered-master-30a3a1dc24e52e13ee5b32e096d07c34->rendered-master-30a3a1dc24e52e13ee5b32e096d07c34 ["node-role.kubernetes.io/master"]
2020-09-29T13:41:47Z 2020-11-17T11:45:19Z True ocp2-9nws9-master-1 v1.18.3+ac53d20 rendered-master-a8083f3c68269d9fa3239729f236f883->rendered-master-30a3a1dc24e52e13ee5b32e096d07c34 ["node-role.kubernetes.io/master","node.kubernetes.io/unschedulable"]

master-0 is already updated.  master-1 is being drained by the machine-config daemon, but the drain is blocked by etcd-quorum-guard's PDB, and the PDB permits no disruptions because the etcd-quorum-guard pod on the already-updated master-0 is pending:

$ jq '{metadata: (.metadata | {creationTimestamp}), status: (.status | {phase, conditions, containerStatuses})}' config/pod/openshift-etcd/etcd-quorum-guard-848f9f765-r6qn2.json
{
  "metadata": {
    "creationTimestamp": "2020-11-17T11:53:21Z"
  },
  "status": {
    "phase": "Pending",
    "conditions": [
      {
        "type": "Initialized",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-17T13:00:10Z"
      },
      {
        "type": "Ready",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-17T13:00:10Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [guard]"
      },
      {
        "type": "ContainersReady",
        "status": "False",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-17T13:00:10Z",
        "reason": "ContainersNotReady",
        "message": "containers with unready status: [guard]"
      },
      {
        "type": "PodScheduled",
        "status": "True",
        "lastProbeTime": null,
        "lastTransitionTime": "2020-11-17T13:00:10Z"
      }
    ],
    "containerStatuses": [
      {
        "name": "guard",
        "state": {
          "waiting": {
            "reason": "ContainerCreating"
          }
        },
        "lastState": {},
        "ready": false,
        "restartCount": 0,
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a95399379b48916216c49b1e4d4786b19081378ccd5821b27d97b496edd56a86",
        "imageID": "",
        "started": false
      }
    ]
  }
}

Not clear to me why that pod has been pending for so long (the pod was created at 11:53:21Z, most conditions last changed about an hour later at 13:00:10Z, and by the Failing=True condition it is now at least 2020-11-18T16:03:23Z, over 27 hours later, and the pod is still pending).  Assigning to the node folks in case they have insights into the long pending phase.

Searching for possible existing bugs turned up bug 1883521, bug 1890256, and bug 1897026.  I'm not clear enough on the specific mechanism in those bugs to know if this one is a dup or not.
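
The drain deadlock described above falls straight out of PodDisruptionBudget arithmetic. A minimal sketch of the eviction check (my own simplified model, not the actual disruption-controller code, assuming the quorum guard's budget requires 2 of 3 pods healthy, whether expressed as minAvailable: 2 or maxUnavailable: 1):

```python
# Toy model of the PodDisruptionBudget eviction check (illustration only;
# the real logic lives in the Kubernetes disruption controller).

def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Evictions permitted while keeping at least min_available pods healthy."""
    return max(0, healthy_pods - min_available)

# etcd-quorum-guard runs 3 replicas; master-0's guard is Pending
# (ContainerCreating), so only 2 of the 3 are healthy.
allowed = disruptions_allowed(healthy_pods=2, min_available=2)
print(allowed)  # 0: evicting master-1's guard is refused, so the drain stalls
```

Only once the Pending guard on master-0 becomes Ready does disruptions_allowed go back to 1 and the machine-config daemon's drain of master-1 can proceed.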

Comment 2 Lalatendu Mohanty 2020-11-19 12:07:13 UTC
Bumping the severity to high as we have seen this issue on customers' clusters. After the root-cause analysis, please set the appropriate severity.

Comment 3 Dan Mace 2020-11-20 16:13:28 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1899600 is the canonical tracker for an IO performance regression which probably explains the problem reported in this bug, and we intend to mark examples (e.g. this bug) as duplicates of https://bugzilla.redhat.com/show_bug.cgi?id=1899600.

Comment 21 Dusty Mabe 2021-01-04 21:18:23 UTC
If I understand correctly we think there are issues that are cropping
up because some machines seem to be getting a different IP address
from DHCP after upgrade.

It is true that in 4.6 we switched to NetworkManager's internal DHCP
implementation, but I'm not sure that's the culprit here. At least not
in the default configuration. I just dug in to try to find out.
Specifically:

- "Does RHCOS in 4.5 use a different DHCP client ID scheme than 4.6?"

According to some upstream commits [1] [2] it seems as if dhclient 
is configured to use the MAC address as the client ID and we've also
configured NM to do the same.

[1] https://github.com/coreos/fedora-coreos-config/pull/82
[2] https://github.com/coreos/coreos-assembler/pull/367

I just did some local testing and can confirm that the client ID in
4.5 and 4.6 uses the MAC address. I did this by taking a peek at the
entries in `/var/lib/libvirt/dnsmasq/virbr0.status` for the VMs on my
local machine.
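
For the record, that peek can be scripted. A sketch, assuming libvirt's dnsmasq status file is a JSON array whose lease entries carry mac-address and client-id fields (field names from memory; verify against a real /var/lib/libvirt/dnsmasq/virbr0.status before relying on them):

```python
import json

# Hypothetical lease entries shaped like libvirt's dnsmasq status file
# (field names assumed, not verified against a real file).
status = """
[
  {"ip-address": "192.168.122.10", "mac-address": "52:54:00:aa:bb:cc",
   "client-id": "52:54:00:aa:bb:cc", "hostname": "master-0"}
]
"""

for lease in json.loads(status):
    # If client-id equals the MAC, the MAC-based client ID scheme is in use.
    scheme = "MAC" if lease.get("client-id") == lease["mac-address"] else "other"
    print(lease["hostname"], lease["ip-address"], "client-id scheme:", scheme)
```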

I did an upgrade from 4.5 to 4.6 and was able to get the same IP with
no problem. However, if I manually changed the client ID strategy with
something like:

echo -e "[connection]\nipv4.dhcp-client-id=duid" | sudo tee /etc/NetworkManager/conf.d/20-client-id.conf

then I watched the dnsmasq logs (sudo journalctl -f -t dnsmasq-dhcp) and
saw a DHCPNAK and a new address handed out because the client ID had
changed.


I guess all of this is to say that with the default configuration I
would not have expected the IP address to change even with the switch
from dhclient to the internal NM dhcp client. Some questions:

- Was there any significant Networking configuration on these machines?
- Is there anything common across the various instances of this failure
  that could help us find the culprit? 
- If bonding is common across all of the instances, is it possible
  one NIC failed and the interface for the other NIC was used for the
  bond? This reminds me of https://bugzilla.redhat.com/show_bug.cgi?id=1823602#c39

Comment 24 Thomas Haller 2021-01-11 07:05:56 UTC
NetworkManager would (debug) log which DHCP client-id it chooses (and why). In general, if there is a networking issue with NetworkManager, collecting trace logs is a good first step.

Could somebody please provide a full log of NetworkManager that shows the issue?

Usually, you would configure level=TRACE in NetworkManager.conf (https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/contrib/fedora/rpm/NetworkManager.conf#n28).
But for NetworkManager-in-initrd, just set "rd.debug" on the kernel command line, boot and provide the logging output.
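
For convenience, the persistent form of that setting is a logging section in NetworkManager.conf; a sketch (the exact level/domain syntax is documented in the example file Thomas links, and a drop-in under /etc/NetworkManager/conf.d/ should work as well):

```
[logging]
level=TRACE
```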

Comment 29 Micah Abbott 2021-01-15 20:55:33 UTC
Higher priority work has prevented this issue from being solved; adding UpcomingSprint keyword

Comment 37 Clement Verna 2021-01-26 08:15:27 UTC
Closing this as INSUFFICIENT_DATA since we can't reproduce it. Please feel free to reopen if this problem happens again.

Comment 39 Suresh Kolichala 2021-04-09 15:38:31 UTC
*** Bug 1915853 has been marked as a duplicate of this bug. ***