Description of problem:
I was running concurrent build tests. After running builds for some time, some of the build pods are stuck in the Unknown state and that node is stuck in the NotReady state.

  Normal  NodeNotReady  1h  kubelet, ip-172-31-6-181.us-west-2.compute.internal  Node ip-172-31-6-181.us-west-2.compute.internal status is now: NodeNotReady

Error log when the node became NotReady:

Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396014 12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396089 12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396110 12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.480665 12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m3.444073397s ago; threshold is 3m0s]
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512568 12747 kubelet_node_status.go:445] Recording NodeNotReady event message for node ip-172-31-6-181.us-west-2.compute.internal
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512651 12747 kubelet_node_status.go:834] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 LastTransitionTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m3.475963593s ago; threshold is 3m0s}
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198031 12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198131 12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198162 12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:54 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:54.680821 12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m6.644219983s ago; threshold is 3m0s]

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.22.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

How reproducible:
Running concurrent builds for some time.

Steps to Reproduce:
1. Start some concurrent builds on a cluster.
2. Keep running builds for some time.

Actual results:
Node NotReady.

Expected results:
Builds should finish.

Additional info:
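For context, the 4194304 in the log is the 4 MiB default maximum receive message size of a grpc-go client, which the ListContainers response only slightly exceeds. A tiny illustrative Go sketch of the numbers taken from the log above (not taken from the kubelet source):

    package main

    import "fmt"

    func main() {
        const defaultMaxRecvMsgSize = 4 * 1024 * 1024 // grpc-go default client receive limit: 4194304 bytes
        const observedResponse = 4229189              // ListContainers response size from the kubelet log

        fmt.Printf("limit:    %d bytes\n", defaultMaxRecvMsgSize)
        fmt.Printf("response: %d bytes (over by %d)\n",
            observedResponse, observedResponse-defaultMaxRecvMsgSize)
    }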
Created attachment 1424631 [details] node yaml
Created attachment 1424632 [details] node logs
Nodes do not become Ready even after restarting the node service and docker, or rebooting the node itself.
Cleaning everything under /var/lib/docker fixed the problem.
"grpc message size" is the ResourceExhausted. this happens when there are large numbers of built images in /var/lib/docker under ocp 3.10 and docker 1.13
This looks like the gRPC maximum message size is being exceeded, I would guess.
Antonio and Mrunal, WDYT?
This is probably because the number of containers/images on the node is overflowing the max response size of the gRPC client in the kubelet. This PR https://github.com/kubernetes/kubernetes/pull/63977 increases the limit and should fix this issue. I would like to hear more from the pod team though (and we'll need a backport anyway).
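A minimal sketch (not the kubelet's actual code) of how grpc-go per-call message limits can be raised on the connection a client uses to talk to the CRI runtime; the 16 MiB value and the socket path are illustrative assumptions, not what the PR chose:

    package main

    import (
        "log"

        "google.golang.org/grpc"
    )

    func main() {
        const maxMsgSize = 16 * 1024 * 1024 // illustrative limit, not the PR's value

        conn, err := grpc.Dial(
            "unix:///var/run/dockershim.sock", // CRI endpoint; dial details simplified
            grpc.WithInsecure(),
            grpc.WithDefaultCallOptions(
                grpc.MaxCallRecvMsgSize(maxMsgSize),
                grpc.MaxCallSendMsgSize(maxMsgSize),
            ),
        )
        if err != nil {
            log.Fatalf("dial CRI endpoint: %v", err)
        }
        defer conn.Close()
        _ = conn // the generated CRI RuntimeService client would be built on this connection
    }

With the larger per-call limits, big ListContainers/ListImages responses no longer fail with ResourceExhausted.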
https://github.com/openshift/origin/pull/19774
Hi Vikas, please check if it has been fixed.
Tried multiple runs on version 3.10.0-0.63.0; the issue did not happen again.