Description of problem:

Nodes are intermittently going into NotReady state. PLEG errors are visible in the node service logs:

~~~
Jun 15 10:30:15 abcd atomic-openshift-node[27628]: E0615 10:30:15.865002 27628 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8396465 vs. 8388608)
~~~

The affected nodes were not able to resolve themselves via DNS, and all readiness and liveness probes on those nodes were failing:

~~~
Jun 15 08:13:45 abcd atomic-openshift-node[3036]: I0615 08:13:45.367667 3036 prober.go:111] Readiness probe for "profile-service-workplace-services:profile-service" failed (failure): Get http://10.0.0.x:8081/health: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
~~~

It also looks like the socket at /var/run/openshift-sdn/cni-server.sock could not connect to Open vSwitch during this time. The SDN pod logs show the same:

~~~
I0612 11:47:44.154711 2548 healthcheck.go:62] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I0612 11:47:44.254705 2548 healthcheck.go:62] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
~~~

However, the SDN and OVS pods were running fine on the affected nodes.

After manually removing the dead containers, images, and orphan volumes and restarting the docker and atomic-openshift-node services, the nodes are stable and under observation. The root cause is still not known.

Supporting logs: the SDN also frequently reports roundrobin LoadBalancer messages.

~~~
I0612 11:46:50.332004 2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8778-tcp
I0612 11:46:50.332012 2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8081-tcp
I0612 11:46:50.332020 2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8443-tcp
I0612 11:46:50.332027 2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:8080-tcp
I0612 11:46:50.332034 2548 roundrobin.go:338] LoadBalancerRR: Removing endpoints for xyz-public:9799-tcp
I0612 11:46:50.332055 2548 roundrobin.go:310] LoadBalancerRR: Setting endpoints for xyz: to [10.0.0.y:8080]
I0612 11:46:50.332063 2548 roundrobin.go:240] Delete endpoint 10.0.0.y:8080 for service "xyz:"
~~~

Bug Reference: 1468420
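As a way to confirm the suspected trigger discussed in the comments below (dead containers accumulating and inflating the runtime's pod/container listing), a node can be queried over the CRI socket directly. The following is only a minimal Go sketch, not part of any product tooling; the socket path, the 64 MiB receive cap, and the use of the k8s.io/cri-api v1alpha2 bindings are assumptions for this environment.

~~~
// A minimal check for dead-container buildup on a node, queried over the CRI socket.
package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1alpha2"
)

// Assumed socket path for this environment; CRI-O would be /var/run/crio/crio.sock.
const criSocket = "/var/run/dockershim.sock"

func main() {
	conn, err := grpc.Dial(criSocket,
		grpc.WithInsecure(),
		// Dial the path as a unix socket instead of treating it as a TCP address.
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, "unix", addr)
		}),
		// Use a receive cap well above the kubelet's 8 MiB so this check does not
		// fail with ResourceExhausted the same way the PLEG relist did.
		grpc.WithDefaultCallOptions(grpc.MaxCallRecvMsgSize(64*1024*1024)),
	)
	if err != nil {
		log.Fatalf("dial %s: %v", criSocket, err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	client := runtimeapi.NewRuntimeServiceClient(conn)

	// Ask only for exited containers; a very large count here means dead
	// containers are inflating the listing the kubelet fetches on every relist.
	resp, err := client.ListContainers(ctx, &runtimeapi.ListContainersRequest{
		Filter: &runtimeapi.ContainerFilter{
			State: &runtimeapi.ContainerStateValue{
				State: runtimeapi.ContainerState_CONTAINER_EXITED,
			},
		},
	})
	if err != nil {
		log.Fatalf("ListContainers: %v", err)
	}
	fmt.Printf("exited containers: %d\n", len(resp.Containers))
}
~~~

In practice the same information is available from the runtime's own CLI on the node; the sketch just makes explicit what is being counted.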
Setting the target release to the development branch so we can investigate and fix. Once we understand the issue we can consider a backport.
The network error messages aren't related; they're normal and uninteresting. The PLEG failure is what matters. Over to the Node team.
What size disks are you running? I suspect you are hitting an IOPS throttle. Do you see high I/O wait times? High memory utilization?
> GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8396465 vs. 8388608)

The customer is hitting the gRPC message size limit. We increased this limit a while back:

https://github.com/kubernetes/kubernetes/pull/63977
https://access.redhat.com/solutions/3803411

I'm not sure how to avoid hitting it unless we figure out some way to do more aggressive garbage collection. I assume that cleaning up the dead containers reduced the size of the message and restored connectivity between the kubelet and CRI-O.
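For reference, the cap in that log line is the maximum gRPC message size configured on the kubelet/runtime connection (set through the grpc.MaxCallRecvMsgSize / grpc.MaxCallSendMsgSize call options); anything over the cap is rejected by the gRPC layer with ResourceExhausted before the kubelet sees it. The sketch below is only illustrative, with a hypothetical helper name rather than the kubelet's actual code; it shows the arithmetic and how that failure mode can be recognized from the returned status code.

~~~
// A toy illustration of the size check behind the PLEG failure above.
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// The cap from the log line: 8 * 1024 * 1024 == 8388608 bytes. The listing the
// kubelet asked for was 8396465 bytes, just over the cap, so the gRPC layer
// refused to deliver it.
const maxMsgSize = 8 * 1024 * 1024

// isMessageTooLarge is a hypothetical helper: it reports whether a CRI call
// failed because a message exceeded the configured gRPC size limit, which is
// how the "GenericPLEG: Unable to retrieve pods" error surfaces.
func isMessageTooLarge(err error) bool {
	return status.Code(err) == codes.ResourceExhausted
}

func main() {
	fmt.Printf("listing fits under the cap: %v\n", 8396465 <= maxMsgSize) // false

	// Synthetic error matching the one in the node logs.
	err := status.Error(codes.ResourceExhausted,
		"grpc: trying to send message larger than max (8396465 vs. 8388608)")
	fmt.Println(isMessageTooLarge(err)) // true
}
~~~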
Given the above root cause (un-GC'd pods causing the gRPC message size overflow), is there any more action needed for this bug?