Description of problem:

In starter, after upgrading to 4.3.2, I have observed that a handful of nodes are periodically going NotReady. Logging onto a node shows multiple failed services:

[systemd]
Failed Units: 8
  chronyd.service
  irqbalance.service
  polkit.service
  rhsmcertd.service
  rpc-statd.service
  rpcbind.service
  sssd.service
  systemd-hostnamed.service

The operators are sometimes able to work through the issues and clear up the failures, but occasionally the system gets wedged with multiple operators going Degraded (machine-config, network, and monitoring).

Version-Release number of selected component (if applicable):
4.3.2

How reproducible:
This issue impacted both starter clusters running 4.3.2.

Steps to Reproduce:
1. Start with a long-running OpenShift cluster with 53 nodes
2. Upgrade from 4.3.1 to 4.3.2
3. Observe cycling nodes and eventually wedged operators

Additional info:

I have been able to grab some events from the affected nodes:

Events:
  Type     Reason                   Age                  From                                                  Message
  ----     ------                   ----                 ----                                                  -------
  Warning  SystemOOM                162m                 kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: polkitd, pid: 3950
  Warning  SystemOOM                162m                 kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: NetworkManager, pid: 2628
  Warning  ContainerGCFailed        73m (x2 over 114m)   kubelet, ip-10-0-159-213.us-west-1.compute.internal   rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   NodeNotReady             54m (x13 over 8d)    kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeNotReady
  Normal   NodeHasSufficientPID     25m (x12 over 8d)    kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeReady                25m (x15 over 8d)    kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeReady
  Normal   NodeHasNoDiskPressure    25m (x12 over 8d)    kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeHasNoDiskPressure
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: agetty, pid: 2739
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: NetworkManager, pid: 25120
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: systemd-logind, pid: 2656
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: sssd_be, pid: 7898
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: sssd, pid: 2629
  Warning  SystemOOM                9m48s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: agetty, pid: 2740
  Normal   NodeHasSufficientMemory  7m17s (x14 over 8d)  kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeHasSufficientMemory

Events:
  Type     Reason                   Age                  From                                                  Message
  ----     ------                   ----                 ----                                                  -------
  Warning  ContainerGCFailed        54m (x2 over 79m)    kubelet, ip-10-0-140-219.us-west-1.compute.internal   rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  SystemOOM                30m                  kubelet, ip-10-0-140-219.us-west-1.compute.internal   System OOM encountered, victim process: podman pause, pid: 21869
  Warning  SystemOOM                30m                  kubelet, ip-10-0-140-219.us-west-1.compute.internal   System OOM encountered, victim process: bash, pid: 21639
  Normal   NodeHasSufficientPID     13m (x9 over 121m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeNotReady             13m (x6 over 108m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeNotReady
  Normal   NodeHasSufficientMemory  13m (x9 over 121m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    13m (x9 over 121m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeReady                13m (x9 over 121m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeReady

Other observations:

The problem seems to impact the nodes where openshift-monitoring payloads are running, more specifically the prometheus-k8s pods themselves.
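In case it helps with triage, here is a minimal sketch of how the data above can be gathered (assuming cluster-admin access; the node name is one of the affected nodes from this report, and these commands are only one way to collect the same information):

  # List nodes and look for any currently flapping into NotReady
  $ oc get nodes | grep -i notready

  # Pull the node-level events shown above (SystemOOM, ContainerGCFailed, NodeNotReady)
  $ oc describe node ip-10-0-159-213.us-west-1.compute.internal | sed -n '/^Events:/,$p'

  # From a debug pod on the node, list the systemd units that failed after the OOM kills
  $ oc debug node/ip-10-0-159-213.us-west-1.compute.internal -- chroot /host systemctl list-units --state=failed

  # Confirm kernel OOM-killer activity on the host
  $ oc debug node/ip-10-0-159-213.us-west-1.compute.internal -- chroot /host journalctl -k --no-pager | grep -i 'out of memory'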
To help with impact analysis we need to find answers to the following questions. It is fine if we do not answer some of these questions at this point in time, but we should try to get answers.

- What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
- What kinds of clusters are impacted by the bug?
- What cluster functionality is degraded while hitting the bug? Does the upgrade complete?
- What is the expected rate of failure (%) for vulnerable clusters which attempt the update? What is the observed rate of failure we see in CI?
- Can this bug cause data loss? (Data loss = API server data loss, CRD state information loss, etc.)
- Is it possible to recover the cluster from the bug?
  - Is recovery automatic without intervention, i.e. is the condition transient?
  - Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix? (See the sketch after this list.)
  - Is there a manual workaround to recover from the bug? What are the manual steps?
- How long before the bug is fixed?
- Is this a regression? From which version does this regression exist?
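For the 'oc adm upgrade …' question above, a hedged sketch of what that recovery path would look like (the target version is a placeholder for a release carrying a fix, not a confirmed fixed release):

  # Confirm the current version and see which operators are reporting Degraded
  $ oc get clusterversion
  $ oc get clusteroperators

  # If a fixed release is available, move to it; <4.3.z-with-fix> is a placeholder
  $ oc adm upgrade --to <4.3.z-with-fix>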
This should be fixed with https://github.com/openshift/origin/pull/24611 and https://bugzilla.redhat.com/show_bug.cgi?id=1800319
*** This bug has been marked as a duplicate of bug 1808429 ***
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475