Bug 1570145 - Build pod stuck in Unknown state and node stuck in NotReady
Summary: Build pod stuck in Unknown state and node stuck in NotReady
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Antonio Murdaca
QA Contact: Vikas Laad
URL:
Whiteboard: aos-scalability-310
Depends On:
Blocks:
 
Reported: 2018-04-20 17:51 UTC by Vikas Laad
Modified: 2020-09-20 13:02 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The gRPC client message size limit in the kubelet was too low (4 MB).
Consequence: The kubelet errored out and could not handle responses from the remote runtime.
Fix: Increase the client message size limit.
Result: No more gRPC errors from the kubelet to the runtime.
Clone Of:
Environment:
Last Closed: 2018-09-11 18:34:13 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
node yaml (12.34 KB, text/plain), 2018-04-20 17:56 UTC, Vikas Laad
node logs (3.51 MB, application/zip), 2018-04-20 17:57 UTC, Vikas Laad

Description Vikas Laad 2018-04-20 17:51:49 UTC
Description of problem:
I was running concurrent build tests. After running builds for some time, I see that some of the build pods are stuck in the Unknown state and the node they are on is stuck in the NotReady state.

  Normal   NodeNotReady       1h                kubelet, ip-172-31-6-181.us-west-2.compute.internal  Node ip-172-31-6-181.us-west-2.compute.internal status is now: NodeNotReady

Error log when node became NotReady

Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396014   12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396089   12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396110   12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.480665   12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m3.444073397s ago; threshold is 3m0s]
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512568   12747 kubelet_node_status.go:445] Recording NodeNotReady event message for node ip-172-31-6-181.us-west-2.compute.internal
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512651   12747 kubelet_node_status.go:834] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 LastTransitionTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m3.475963593s ago; threshold is 3m0s}
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198031   12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198131   12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198162   12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:54 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:54.680821   12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m6.644219983s ago; threshold is 3m0s]
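
For context on the errors above (this is an illustrative sketch, not the kubelet's actual code): the gRPC client default for the maximum received message size is 4 MiB (4194304 bytes), so once the serialized ListContainersResponse from the runtime grows past that, every ListContainers call fails with ResourceExhausted, PLEG stops seeing pod state, and the node flips to NotReady. A minimal Go client showing that failure mode, assuming the cri-api import path and dockershim socket path used here:

// Illustrative sketch only: a CRI client dialed with gRPC defaults, so the
// 4 MiB (4194304 byte) receive limit applies. A ListContainers response of
// 4229189 bytes (as in the logs above) is rejected with ResourceExhausted.
package main

import (
        "context"
        "fmt"
        "net"
        "time"

        "google.golang.org/grpc"
        runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1" // assumed import path
)

func main() {
        // Dial the container runtime over its unix socket (path is illustrative).
        conn, err := grpc.Dial("/var/run/dockershim.sock",
                grpc.WithInsecure(),
                grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
                        return (&net.Dialer{}).DialContext(ctx, "unix", addr)
                }),
                // No MaxCallRecvMsgSize override, so the 4 MiB gRPC default applies.
        )
        if err != nil {
                fmt.Println("dial failed:", err)
                return
        }
        defer conn.Close()

        client := runtimeapi.NewRuntimeServiceClient(conn)
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // With enough containers on the node, the serialized response exceeds
        // 4194304 bytes and this call fails with:
        //   rpc error: code = ResourceExhausted desc = grpc: received message larger than max
        resp, err := client.ListContainers(ctx, &runtimeapi.ListContainersRequest{})
        if err != nil {
                fmt.Println("ListContainers failed:", err)
                return
        }
        fmt.Println("containers:", len(resp.Containers))
}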

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.22.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

How reproducible:
Running concurrent builds for some time.

Steps to Reproduce:
1. Start some concurrent builds on a cluster
2. Keep running builds for some time.

Actual results:
The node goes NotReady and build pods are stuck in the Unknown state.

Expected results:
Builds should finish.

Additional info:

Comment 1 Vikas Laad 2018-04-20 17:56:20 UTC
Created attachment 1424631 [details]
node yaml

Comment 2 Vikas Laad 2018-04-20 17:57:14 UTC
Created attachment 1424632 [details]
node logs

Comment 4 Vikas Laad 2018-04-23 16:22:08 UTC
Nodes do not become Ready even after restarting the node service, restarting docker, and rebooting the node itself.

Comment 5 Vikas Laad 2018-04-23 17:57:05 UTC
Cleaning everything under /var/lib/docker fixed the problem.

Comment 7 Mike Fiedler 2018-05-02 17:35:36 UTC
"grpc message size" is the ResourceExhausted.  this happens when there are large numbers of built images in /var/lib/docker under ocp 3.10 and docker 1.13

Comment 8 Daniel Walsh 2018-05-03 19:30:55 UTC
This looks like the gRPC maximum message size being exceeded, I would guess.

Comment 9 Daniel Walsh 2018-05-03 19:31:27 UTC
Antonio and Mrunal, WDYT?

Comment 10 Antonio Murdaca 2018-05-17 15:10:44 UTC
This is probably because the number of containers/images on the node is overflowing the maximum response size of the gRPC client in the kubelet.

This PR https://github.com/kubernetes/kubernetes/pull/63977 increases the size and should fix this issue.

I would like to hear more from the pod team, though (and we'll need a backport anyway).
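
For reference, the shape of the client-side change looks like the sketch below (the 16 MiB value and socket path are assumptions here; see the PR above for the actual patch): the kubelet's gRPC connection to the remote runtime is dialed with a larger MaxCallRecvMsgSize so that large ListContainers responses are accepted instead of failing with ResourceExhausted.

// Sketch of the kind of fix described above (values are assumptions, not the
// exact upstream patch): raise the gRPC receive/send limits when dialing the
// CRI runtime socket so large responses no longer trip ResourceExhausted.
package main

import (
        "context"
        "fmt"
        "net"

        "google.golang.org/grpc"
)

const maxMsgSize = 16 * 1024 * 1024 // 16 MiB instead of the 4 MiB default

func dialRuntime(socket string) (*grpc.ClientConn, error) {
        return grpc.Dial(socket,
                grpc.WithInsecure(),
                grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
                        return (&net.Dialer{}).DialContext(ctx, "unix", addr)
                }),
                // The important part: allow messages larger than the 4 MiB default.
                grpc.WithDefaultCallOptions(
                        grpc.MaxCallRecvMsgSize(maxMsgSize),
                        grpc.MaxCallSendMsgSize(maxMsgSize),
                ),
        )
}

func main() {
        conn, err := dialRuntime("/var/run/dockershim.sock")
        if err != nil {
                fmt.Println("dial failed:", err)
                return
        }
        defer conn.Close()
        fmt.Println("dialed runtime with a 16 MiB message size limit")
}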

Comment 11 Seth Jennings 2018-05-23 17:45:26 UTC
https://github.com/openshift/origin/pull/19774

Comment 13 Wei Sun 2018-06-12 06:01:26 UTC
Hi Vikas, please check whether it has been fixed.

Comment 14 Vikas Laad 2018-06-12 13:24:52 UTC
Tried multiple runs on version 3.10.0-0.63.0; the issue did not happen again.

