Bug 1570145

Summary: Build pod stuck in Unknown state and node stuck in NotReady
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Containers
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED CURRENTRELEASE
QA Contact: Vikas Laad <vlaad>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: amurdaca, aos-bugs, dwalsh, jhonce, jokerman, mifiedle, mmccomas, mpatel, vlaad, wsun
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-310
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The grpc client buffer in the kubelet was too small (4 MB).
Consequence: The kubelet errored out and could not handle responses from the remote runtime.
Fix: Increase the grpc client buffer.
Result: No more grpc errors between the kubelet and the runtime.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-11 18:34:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description    Flags
  node yaml      none
  node logs      none

Description Vikas Laad 2018-04-20 17:51:49 UTC
Description of problem:
I was running concurrent build tests. After running builds for some time, I see that some of the build pods are stuck in the Unknown state and the corresponding node is stuck in the NotReady state.

  Normal   NodeNotReady       1h                kubelet, ip-172-31-6-181.us-west-2.compute.internal  Node ip-172-31-6-181.us-west-2.compute.internal status is now: NodeNotReady

Error log when node became NotReady

Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396014   12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396089   12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396110   12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.480665   12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m3.444073397s ago; threshold is 3m0s]
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512568   12747 kubelet_node_status.go:445] Recording NodeNotReady event message for node ip-172-31-6-181.us-west-2.compute.internal
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512651   12747 kubelet_node_status.go:834] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 LastTransitionTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m3.475963593s ago; threshold is 3m0s}
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198031   12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198131   12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198162   12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:54 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:54.680821   12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m6.644219983s ago; threshold is 3m0s]
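
For context, 4194304 bytes is 4 MiB, the grpc-go client's default maximum receive message size, and the 4229189-byte ListContainers response quoted above is just over that limit. A minimal sketch of the client-side check that produces this error (only the two sizes come from the log above; the rest is illustrative):

package main

import "fmt"

func main() {
	// grpc-go's default client receive limit: 4 MiB.
	const defaultMaxRecvMsgSize = 4 * 1024 * 1024 // 4194304 bytes

	// Serialized ListContainers response size reported in the log above.
	respSize := 4229189

	if respSize > defaultMaxRecvMsgSize {
		// Mirrors the message the kubelet logs when the limit is hit.
		fmt.Printf("grpc: received message larger than max (%d vs. %d)\n",
			respSize, defaultMaxRecvMsgSize)
	}
}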

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.22.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

How reproducible:
Running concurrent builds for some time.

Steps to Reproduce:
1. Start some concurrent builds on a cluster.
2. Keep running builds for some time.

Actual results:
The node goes NotReady and build pods are stuck in the Unknown state.

Expected results:
Builds should finish.

Additional info:

Comment 1 Vikas Laad 2018-04-20 17:56:20 UTC
Created attachment 1424631 [details]
node yaml

Comment 2 Vikas Laad 2018-04-20 17:57:14 UTC
Created attachment 1424632 [details]
node logs

Comment 4 Vikas Laad 2018-04-23 16:22:08 UTC
Nodes do not become Ready even after restarting the node service, restarting docker, and rebooting the node itself.

Comment 5 Vikas Laad 2018-04-23 17:57:05 UTC
Cleaning everything under /var/lib/docker fixed the problem.

Comment 7 Mike Fiedler 2018-05-02 17:35:36 UTC
"grpc message size" is the ResourceExhausted.  this happens when there are large numbers of built images in /var/lib/docker under ocp 3.10 and docker 1.13

Comment 8 Daniel Walsh 2018-05-03 19:30:55 UTC
This looks like the grpc message size limit is being exceeded, I would guess.

Comment 9 Daniel Walsh 2018-05-03 19:31:27 UTC
Antonio and Mrunal, WDYT?

Comment 10 Antonio Murdaca 2018-05-17 15:10:44 UTC
This is probably because the number of containers/images on the node is overflowing the max response size of the grpc client in the kubelet.

This PR https://github.com/kubernetes/kubernetes/pull/63977 increases the size and should fix this issue.

I would like to hear more from the pod team though (and we'll need a backport anyway).
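
A minimal sketch of what raising that limit looks like on the grpc client side, assuming the kubelet dials the runtime over a local unix socket; the 16 MiB value, the dialRuntime helper, and the socket path are illustrative, not the exact code from the PR:

package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// Raise the client limit above grpc's 4 MiB default; 16 MiB is an
// illustrative value, not necessarily the one the upstream PR settled on.
const maxMsgSize = 16 * 1024 * 1024

// dialRuntime opens a grpc connection to a CRI runtime socket with larger
// send/receive limits, so big ListContainers responses are no longer
// rejected with ResourceExhausted. Endpoint handling is simplified
// compared to the real kubelet code.
func dialRuntime(endpoint string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	return grpc.DialContext(ctx, endpoint,
		grpc.WithInsecure(), // runtime sockets are local unix sockets
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgSize),
			grpc.MaxCallSendMsgSize(maxMsgSize),
		),
	)
}

func main() {
	// Hypothetical endpoint; on this cluster the runtime is docker.
	conn, err := dialRuntime("unix:///var/run/dockershim.sock")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}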

Comment 11 Seth Jennings 2018-05-23 17:45:26 UTC
https://github.com/openshift/origin/pull/19774

Comment 13 Wei Sun 2018-06-12 06:01:26 UTC
Hi Vikas, please check if it has been fixed.

Comment 14 Vikas Laad 2018-06-12 13:24:52 UTC
Tried multiple runs on version 3.10.0-0.63.0; the issue did not happen again.