Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1848419

Summary: Root cause desired: Nodes are going into NotReady state intermittently.
Product: OpenShift Container Platform
Component: Node
Sub component: CRI-O
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Reporter: manisha <mdhanve>
Assignee: Peter Hunt <pehunt>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, dornelas, jokerman, mpatel, pehunt
Type: Bug
Last Closed: 2020-11-13 21:20:33 UTC

Description manisha 2020-06-18 10:56:39 UTC
Description of problem:

Nodes are going into NotReady state intermittently.

We could see PLEG errors in the atomic-openshift-node service logs:

  ~~~
Jun 15 10:30:15 abcd atomic-openshift-node[27628]: E0615 10:30:15.865002   27628 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8396465 vs. 8388608)
  ~~~
The nodes were not able to resolve themselves via DNS, and all readiness and liveness probes on those nodes were failing.

  ~~~
Jun 15 08:13:45 abcd atomic-openshift-node[3036]: I0615 08:13:45.367667    3036 prober.go:111] Readiness probe for "profile-service-workplace-services:profile-service" failed (failure): Get http://10.0.0.x:8081/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  ~~~

It looks like the process behind the socket at /var/run/openshift-sdn/cni-server.sock could not connect to Open vSwitch during this time. The SDN pod logs show this as well:

  ~~~
I0612 11:47:44.154711    2548 healthcheck.go:62] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0612 11:47:44.254705    2548 healthcheck.go:62] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
  ~~~

However, the SDN and OVS pods were running fine on the affected nodes.

After manually removing the dead containers, dangling images, and orphaned volumes, and restarting the docker and atomic-openshift-node services, the nodes are stable and under observation now. However, the root cause is still not known.
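For reference, the remediation described above can be sketched as a shell script. The bug does not record the exact commands used, so everything below is an assumption for a docker-based OpenShift 3.11 node; it defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical reconstruction of the manual node cleanup described above.
# The exact commands are not recorded in this bug; these are assumptions
# for a docker-based OCP 3.11 node. DRY_RUN=1 (the default) only prints.
DRY_RUN=${DRY_RUN:-1}
run() { [ "$DRY_RUN" = 1 ] && echo "+ $*" || "$@"; }

run sh -c 'docker rm $(docker ps -aq -f status=dead)'               # dead containers
run sh -c 'docker rmi $(docker images -qf dangling=true)'           # dangling images
run sh -c 'docker volume rm $(docker volume ls -qf dangling=true)'  # orphaned volumes
run systemctl restart docker.service atomic-openshift-node.service
```

Setting DRY_RUN=0 would execute the commands instead of printing them; on a production node the prune steps should be reviewed first, since `docker rm`/`rmi` on the wrong filter can remove wanted state.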

Supporting logs:

The SDN also logs frequent roundrobin LoadBalancerRR messages:
  
  ~~~
I0612 11:46:50.332004    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8778-tcp
I0612 11:46:50.332012    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8081-tcp
I0612 11:46:50.332020    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8443-tcp
I0612 11:46:50.332027    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8080-tcp
I0612 11:46:50.332034    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:9799-tcp
I0612 11:46:50.332055    2548 roundrobin.go:310] LoadBalancerRR: Setting endpoints for xyz: to [10.0.0.y:8080]
I0612 11:46:50.332063    2548 roundrobin.go:240] Delete endpoint 10.0.0.y:8080 for service "xyz:"
  ~~~


Bug Reference: 1468420

Comment 1 Ben Bennett 2020-06-18 13:09:04 UTC
Setting the target release to the development branch so we can investigate and fix.  Once we understand the issue we can consider a backport.

Comment 2 Casey Callendrello 2020-06-22 14:14:14 UTC
Network error messages aren't related; they're normal and uninteresting.

The PLEG failing is what matters.

Over to node.

Comment 3 Ryan Phillips 2020-07-08 18:36:50 UTC
What size disks are you running? I suspect you are hitting an IOPS throttle. Do you see high I/O wait times? High memory utilization?

Comment 5 Seth Jennings 2020-07-30 18:39:41 UTC
> GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8396465 vs. 8388608)

The customer hit the new gRPC message size limit.

We increased this limit a while back:

https://github.com/kubernetes/kubernetes/pull/63977
https://access.redhat.com/solutions/3803411

Not sure how to avoid hitting it unless we figure out some way to do more aggressive GC.
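For context, kubelet container garbage collection can be tuned; on OCP 3.11 this is done through `kubeletArguments` in node-config.yaml. A sketch of more aggressive settings (the values here are illustrative assumptions, not recommendations from this bug):

```yaml
# node-config.yaml (excerpt) -- illustrative values, not from this bug
kubeletArguments:
  maximum-dead-containers:                # total dead containers kept node-wide
  - "100"
  maximum-dead-containers-per-container:  # dead containers kept per container
  - "1"
  minimum-container-ttl-duration:         # minimum age before a dead container is GC-eligible
  - "30s"
```

Keeping fewer dead containers around would shrink the pod list the runtime returns, which is what overflows the gRPC message limit here.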

I assume that cleaning up dead containers reduced the size of the message and restored connectivity between the kubelet and crio.
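As a sanity check, the byte counts in the PLEG error line up with an 8 MiB cap (a quick sketch; both numbers are taken from the log line quoted above):

```shell
# 8388608 in the error is exactly 8 MiB; the pod-list response
# (8396465 bytes) overflowed that cap by 7857 bytes.
limit=$((8 * 1024 * 1024))
actual=8396465
echo "limit=$limit actual=$actual over_by=$((actual - limit))"
```

So the response was only ~7.7 KiB over the limit, which is why removing a batch of dead containers was enough to get it back under the cap.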

Comment 6 Peter Hunt 2020-08-03 19:46:04 UTC
Given the above root cause (un-GC'd pods causing a gRPC overflow), is there more action needed for this bug?