Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1848419

Summary: Root cause desired: Nodes are going into NotReady state intermittently.
Product: OpenShift Container Platform
Component: Node
Sub component: CRI-O
Version: 3.11.0
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: medium
Priority: medium
Reporter: manisha <mdhanve>
Assignee: Peter Hunt <pehunt>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, dornelas, jokerman, mpatel, pehunt
Type: Bug
Last Closed: 2020-11-13 21:20:33 UTC

Description manisha 2020-06-18 10:56:39 UTC
Description of problem:

Nodes are going into NotReady state intermittently.

We could see PLEG errors in the atomic-openshift-node service logs:

  ~~~
Jun 15 10:30:15 abcd atomic-openshift-node[27628]: E0615 10:30:15.865002   27628 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8396465 vs. 8388608)
  ~~~
The nodes were not able to resolve themselves via DNS, and all readiness and liveness probes on those nodes were failing.

  ~~~
Jun 15 08:13:45 abcd atomic-openshift-node[3036]: I0615 08:13:45.367667    3036 prober.go:111] Readiness probe for "profile-service-workplace-services:profile-service" failed (failure): Get http://10.0.0.x:8081/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  ~~~

It looks like the process behind the socket at /var/run/openshift-sdn/cni-server.sock could not connect to Open vSwitch during this time. The SDN pod logs show this as well:

  ~~~
I0612 11:47:44.154711    2548 healthcheck.go:62] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0612 11:47:44.254705    2548 healthcheck.go:62] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
  ~~~

However, the SDN and OVS pods were running fine on the affected nodes.

After manually removing the dead containers, dangling images, and orphaned volumes, and restarting the docker and atomic-openshift-node services, the nodes are stable and under observation now. However, the root cause is still not known.
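For reference, the remediation described above can be sketched as a shell script. The bug does not record the exact commands used, so everything below is an assumption for a docker-based OpenShift 3.11 node; it defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical reconstruction of the manual node cleanup described above.
# The exact commands are not recorded in this bug; these are assumptions
# for a docker-based OCP 3.11 node. DRY_RUN=1 (the default) only prints.
DRY_RUN=${DRY_RUN:-1}
run() { [ "$DRY_RUN" = 1 ] && echo "+ $*" || "$@"; }

run sh -c 'docker rm $(docker ps -aq -f status=dead)'               # dead containers
run sh -c 'docker rmi $(docker images -qf dangling=true)'           # dangling images
run sh -c 'docker volume rm $(docker volume ls -qf dangling=true)'  # orphaned volumes
run systemctl restart docker.service atomic-openshift-node.service
```

Setting DRY_RUN=0 would execute the commands instead of printing them; on a production node the prune steps should be reviewed first, since `docker rm`/`rmi` on the wrong filter can remove wanted state.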

Supporting logs:

The SDN also logs frequent roundrobin LoadBalancerRR messages:
  
  ~~~
I0612 11:46:50.332004    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8778-tcp
I0612 11:46:50.332012    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8081-tcp
I0612 11:46:50.332020    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8443-tcp
I0612 11:46:50.332027    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8080-tcp
I0612 11:46:50.332034    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:9799-tcp
I0612 11:46:50.332055    2548 roundrobin.go:310] LoadBalancerRR: Setting endpoints for xyz: to [10.0.0.y:8080]
I0612 11:46:50.332063    2548 roundrobin.go:240] Delete endpoint 10.0.0.y:8080 for service "xyz:"
  ~~~


Bug Reference: 1468420

Comment 1 Ben Bennett 2020-06-18 13:09:04 UTC
Setting the target release to the development branch so we can investigate and fix.  Once we understand the issue we can consider a backport.

Comment 2 Casey Callendrello 2020-06-22 14:14:14 UTC
Network error messages aren't related; they're normal and uninteresting.

The PLEG failing is what matters.

Over to node.

Comment 3 Ryan Phillips 2020-07-08 18:36:50 UTC
What size disks are you running? I suspect you are hitting an IOPS throttle. Do you see high I/O wait times? High memory utilization?

Comment 5 Seth Jennings 2020-07-30 18:39:41 UTC
> GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8396465 vs. 8388608)

The customer hit the new gRPC message size limit.

We increased this limit a while back:

https://github.com/kubernetes/kubernetes/pull/63977
https://access.redhat.com/solutions/3803411

Not sure how to avoid hitting it unless we figure out some way to do more aggressive GC.
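For context, kubelet container garbage collection can be tuned; on OCP 3.11 this is done through `kubeletArguments` in node-config.yaml. A sketch of more aggressive settings (the values here are illustrative assumptions, not recommendations from this bug):

```yaml
# node-config.yaml (excerpt) -- illustrative values, not from this bug
kubeletArguments:
  maximum-dead-containers:                # total dead containers kept node-wide
  - "100"
  maximum-dead-containers-per-container:  # dead containers kept per container
  - "1"
  minimum-container-ttl-duration:         # minimum age before a dead container is GC-eligible
  - "30s"
```

Keeping fewer dead containers around would shrink the pod list the runtime returns, which is what overflows the gRPC message limit here.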

I assume that cleaning up dead containers reduced the size of the message and restored connectivity between the kubelet and crio.
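As a sanity check, the byte counts in the PLEG error line up with an 8 MiB cap (a quick sketch; both numbers are taken from the log line quoted above):

```shell
# 8388608 in the error is exactly 8 MiB; the pod-list response
# (8396465 bytes) overflowed that cap by 7857 bytes.
limit=$((8 * 1024 * 1024))
actual=8396465
echo "limit=$limit actual=$actual over_by=$((actual - limit))"
```

So the response was only ~7.7 KiB over the limit, which is why removing a batch of dead containers was enough to get it back under the cap.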

Comment 6 Peter Hunt 2020-08-03 19:46:04 UTC
Given the above root cause (un-GC'd pods causing a gRPC overflow), is there more action needed for this bug?