Description of problem:

In starter, after upgrading to 4.3.2, I have observed that a handful of nodes are periodically going NotReady. Logging onto a node shows multiple failed services:

[systemd]
Failed Units: 8
  chronyd.service
  irqbalance.service
  polkit.service
  rhsmcertd.service
  rpc-statd.service
  rpcbind.service
  sssd.service
  systemd-hostnamed.service

The operators are sometimes able to work through the issues and clear up the failures, but occasionally the system gets wedged with multiple operators going Degraded (machine-config, network, and monitoring).

Version-Release number of selected component (if applicable):
4.3.2

How reproducible:
This issue impacted both starter clusters running 4.3.2.

Steps to Reproduce:
1. Start with a long-running OpenShift cluster with 53 nodes
2. Upgrade from 4.3.1 to 4.3.2
3. Observe cycling nodes and eventually wedged operators

Additional info:

I have been able to grab some events from the affected nodes:

Events:
  Type     Reason                   Age                  From                                                  Message
  ----     ------                   ----                 ----                                                  -------
  Warning  SystemOOM                162m                 kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: polkitd, pid: 3950
  Warning  SystemOOM                162m                 kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: NetworkManager, pid: 2628
  Warning  ContainerGCFailed        73m (x2 over 114m)   kubelet, ip-10-0-159-213.us-west-1.compute.internal   rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Normal   NodeNotReady             54m (x13 over 8d)    kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeNotReady
  Normal   NodeHasSufficientPID     25m (x12 over 8d)    kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeReady                25m (x15 over 8d)    kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeReady
  Normal   NodeHasNoDiskPressure    25m (x12 over 8d)    kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeHasNoDiskPressure
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: agetty, pid: 2739
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: NetworkManager, pid: 25120
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: systemd-logind, pid: 2656
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: sssd_be, pid: 7898
  Warning  SystemOOM                9m49s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: sssd, pid: 2629
  Warning  SystemOOM                9m48s                kubelet, ip-10-0-159-213.us-west-1.compute.internal   System OOM encountered, victim process: agetty, pid: 2740
  Normal   NodeHasSufficientMemory  7m17s (x14 over 8d)  kubelet, ip-10-0-159-213.us-west-1.compute.internal   Node ip-10-0-159-213.us-west-1.compute.internal status is now: NodeHasSufficientMemory

Events:
  Type     Reason                   Age                  From                                                  Message
  ----     ------                   ----                 ----                                                  -------
  Warning  ContainerGCFailed        54m (x2 over 79m)    kubelet, ip-10-0-140-219.us-west-1.compute.internal   rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  SystemOOM                30m                  kubelet, ip-10-0-140-219.us-west-1.compute.internal   System OOM encountered, victim process: podman pause, pid: 21869
  Warning  SystemOOM                30m                  kubelet, ip-10-0-140-219.us-west-1.compute.internal   System OOM encountered, victim process: bash, pid: 21639
  Normal   NodeHasSufficientPID     13m (x9 over 121m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeNotReady             13m (x6 over 108m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeNotReady
  Normal   NodeHasSufficientMemory  13m (x9 over 121m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    13m (x9 over 121m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeReady                13m (x9 over 121m)   kubelet, ip-10-0-140-219.us-west-1.compute.internal   Node ip-10-0-140-219.us-west-1.compute.internal status is now: NodeReady

Other observations:

The problem seems to impact the nodes where openshift-monitoring payloads are running, more specifically the prometheus-k8s pods themselves.
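In case it helps with triage, here is a minimal sketch of how the data above can be gathered (assuming cluster-admin access; the node name is one of the affected nodes from this report, and these commands are only one way to collect the same information):

  # List nodes and look for any currently flapping into NotReady
  $ oc get nodes | grep -i notready

  # Pull the node-level events shown above (SystemOOM, ContainerGCFailed, NodeNotReady)
  $ oc describe node ip-10-0-159-213.us-west-1.compute.internal | sed -n '/^Events:/,$p'

  # From a debug pod on the node, list the systemd units that failed after the OOM kills
  $ oc debug node/ip-10-0-159-213.us-west-1.compute.internal -- chroot /host systemctl list-units --state=failed

  # Confirm kernel OOM-killer activity on the host
  $ oc debug node/ip-10-0-159-213.us-west-1.compute.internal -- chroot /host journalctl -k --no-pager | grep -i 'out of memory'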
To help with impact analysis we need to find answers to the following questions. It is fine if we do not answer some of these questions at this point in time, but we should try to get answers.

- What symptoms (in Telemetry, Insights, etc.) does a cluster experiencing this bug exhibit?
- What kinds of clusters are impacted by the bug?
- What cluster functionality is degraded while hitting the bug? Does the upgrade complete?
- What is the expected rate of failure (%) for vulnerable clusters which attempt the update? What is the observed rate of failure we see in CI?
- Can this bug cause data loss? (Data loss = API server data loss, CRD state information loss, etc.)
- Is it possible to recover the cluster from the bug?
  - Is recovery automatic without intervention, i.e. is the condition transient?
  - Is recovery possible with the only intervention being 'oc adm upgrade …' to a new release image with a fix? (See the sketch after this list.)
  - Is there a manual workaround to recover from the bug? What are the manual steps?
- How long before the bug is fixed?
- Is this a regression? From which version does this regression exist?
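For the 'oc adm upgrade …' question above, a hedged sketch of what that recovery path would look like (the target version is a placeholder for a release carrying a fix, not a confirmed fixed release):

  # Confirm the current version and see which operators are reporting Degraded
  $ oc get clusterversion
  $ oc get clusteroperators

  # If a fixed release is available, move to it; <4.3.z-with-fix> is a placeholder
  $ oc adm upgrade --to <4.3.z-with-fix>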
This should be fixed with https://github.com/openshift/origin/pull/24611 and https://bugzilla.redhat.com/show_bug.cgi?id=1800319
*** This bug has been marked as a duplicate of bug 1808429 ***
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475