Bug 1848419 - Root cause desired: Nodes are going into NotReady state intermittently.
Summary: Root cause desired: Nodes are going into NotReady state intermittently.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.11.z
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-06-18 10:56 UTC by manisha
Modified: 2020-11-13 21:20 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-13 21:20:33 UTC
Target Upstream Version:
Embargoed:



Description manisha 2020-06-18 10:56:39 UTC
Description of problem:

Nodes are going into NotReady state intermittently.

We could see PLEG errors in the node service (atomic-openshift-node) logs.

  ~~~
Jun 15 10:30:15 abcd atomic-openshift-node[27628]: E0615 10:30:15.865002   27628 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8396465 vs. 8388608)
  ~~~
The nodes were not able to resolve themselves via DNS, and all readiness and liveness probes on those nodes were failing.

  ~~~
Jun 15 08:13:45 abcd atomic-openshift-node[3036]: I0615 08:13:45.367667    3036 prober.go:111] Readiness probe for "profile-service-workplace-services:profile-service" failed (failure): Get http://10.0.0.x:8081/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  ~~~

It looks like the socket at /var/run/openshift-sdn/cni-server.sock could not connect to Open vSwitch during this time. This is also shown in the SDN pod logs:

  ~~~
I0612 11:47:44.154711    2548 healthcheck.go:62] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0612 11:47:44.254705    2548 healthcheck.go:62] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
  ~~~

However, the SDN and OVS pods were running fine on the affected nodes.

After manually removing the dead containers, images, and orphaned volumes and restarting the docker and atomic-openshift-node services, the nodes are stable and under observation now. However, the root cause is still not known.
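
For illustration only, here is a minimal Go sketch of the kind of cleanup that was done by hand, using the Docker Engine Go SDK (github.com/docker/docker/client, older types.* options API): list exited containers and dangling images and remove them. This is an assumption about the procedure, not the exact commands that were run on these nodes.

~~~
// cleanup.go - illustrative sketch only: remove exited containers (with
// their anonymous volumes) and dangling images on a docker-based node,
// roughly mirroring the manual cleanup described above.
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}

	// Remove dead/exited containers and their anonymous volumes.
	exited, err := cli.ContainerList(ctx, types.ContainerListOptions{
		All:     true,
		Filters: filters.NewArgs(filters.Arg("status", "exited")),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, c := range exited {
		if err := cli.ContainerRemove(ctx, c.ID, types.ContainerRemoveOptions{RemoveVolumes: true}); err != nil {
			log.Printf("remove container %s: %v", c.ID, err)
		}
	}

	// Remove dangling images.
	imgs, err := cli.ImageList(ctx, types.ImageListOptions{
		Filters: filters.NewArgs(filters.Arg("dangling", "true")),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, img := range imgs {
		if _, err := cli.ImageRemove(ctx, img.ID, types.ImageRemoveOptions{}); err != nil {
			log.Printf("remove image %s: %v", img.ID, err)
		}
	}
}
~~~

In practice the equivalent docker CLI commands were presumably used on the nodes; the sketch just makes the individual steps explicit.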

Supporting logs:

The SDN also frequently reports roundrobin (LoadBalancerRR) messages.
  
  ~~~
I0612 11:46:50.332004    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8778-tcp
I0612 11:46:50.332012    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8081-tcp
I0612 11:46:50.332020    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8443-tcp
I0612 11:46:50.332027    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8080-tcp
I0612 11:46:50.332034    2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:9799-tcp
I0612 11:46:50.332055    2548 roundrobin.go:310] LoadBalancerRR: Setting endpoints for xyz: to [10.0.0.y:8080]
I0612 11:46:50.332063    2548 roundrobin.go:240] Delete endpoint 10.0.0.y:8080 for service "xyz:"
  ~~~


Bug Reference: 1468420

Comment 1 Ben Bennett 2020-06-18 13:09:04 UTC
Setting the target release to the development branch so we can investigate and fix. Once we understand the issue, we can consider a backport.

Comment 2 Casey Callendrello 2020-06-22 14:14:14 UTC
Network error messages aren't related; they're normal and uninteresting.

The PLEG failing is what matters.

Over to node.

Comment 3 Ryan Phillips 2020-07-08 18:36:50 UTC
What size disks are you running? I suspect you are hitting an IOPS throttle. Do you see high I/O wait times? High memory utilization?

Comment 5 Seth Jennings 2020-07-30 18:39:41 UTC
> GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8396465 vs. 8388608)

The customer hit the gRPC message size limit.

We increased this limit a while back:

https://github.com/kubernetes/kubernetes/pull/63977
https://access.redhat.com/solutions/3803411

I'm not sure how to avoid hitting it unless we figure out some way to do more aggressive garbage collection of dead containers.

I assume that cleaning up dead containers reduced the size of the message and restored connectivity between the kubelet and crio.
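
For context, the limit in question is applied as gRPC call options when the kubelet dials the container runtime socket. Below is a minimal, hedged Go sketch of that pattern using google.golang.org/grpc; maxMsgSize and the socket path are illustrative assumptions, not the values used by any particular kubelet build (the error above shows an effective cap of 8388608 bytes, i.e. 8 MiB).

~~~
// Illustrative sketch (not the kubelet's actual code) of raising the gRPC
// message size limits on the kubelet<->runtime connection.
package main

import (
	"google.golang.org/grpc"
)

// maxMsgSize is an assumed value for illustration only.
const maxMsgSize = 16 * 1024 * 1024 // 16 MiB

// dialRuntime opens a connection to the container runtime socket
// (e.g. "unix:///var/run/dockershim.sock") with larger send/receive caps,
// so large ListPodSandbox/ListContainers responses do not fail with
// ResourceExhausted.
func dialRuntime(target string) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgSize),
			grpc.MaxCallSendMsgSize(maxMsgSize),
		),
	)
}
~~~

Aside from raising the cap, the response size itself can be kept down with more aggressive container garbage collection (for example, the kubelet's --maximum-dead-containers and --minimum-container-ttl-duration settings), which is effectively what the manual cleanup in the description accomplished.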

Comment 6 Peter Hunt 2020-08-03 19:46:04 UTC
Given the above root cause (un-GC'd pods causing the gRPC message overflow), is there more action needed for this bug?

