Bug 1570145 - Build pod stuck in Unknown state and node stuck in NotReady
Summary: Build pod stuck in Unknown state and node stuck in NotReady
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Antonio Murdaca
QA Contact: Vikas Laad
URL:
Whiteboard: aos-scalability-310
Depends On:
Blocks:
 
Reported: 2018-04-20 17:51 UTC by Vikas Laad
Modified: 2020-09-20 13:02 UTC
CC List: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The gRPC client message size limit in the kubelet was too low (4 MB).
Consequence: The kubelet errored out and could not handle responses from the remote runtime.
Fix: Increase the client message size limit.
Result: No more gRPC errors from the kubelet to the runtime.
Clone Of:
Environment:
Last Closed: 2018-09-11 18:34:13 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
node yaml (12.34 KB, text/plain), 2018-04-20 17:56 UTC, Vikas Laad
node logs (3.51 MB, application/zip), 2018-04-20 17:57 UTC, Vikas Laad

Description Vikas Laad 2018-04-20 17:51:49 UTC
Description of problem:
I was running concurrent build tests. After running builds for some time, I see that some of the build pods are stuck in the Unknown state and the node they are on is stuck in the NotReady state.

  Normal   NodeNotReady       1h                kubelet, ip-172-31-6-181.us-west-2.compute.internal  Node ip-172-31-6-181.us-west-2.compute.internal status is now: NodeNotReady

Error log when node became NotReady

Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396014   12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396089   12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396110   12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.480665   12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m3.444073397s ago; threshold is 3m0s]
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512568   12747 kubelet_node_status.go:445] Recording NodeNotReady event message for node ip-172-31-6-181.us-west-2.compute.internal
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512651   12747 kubelet_node_status.go:834] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 LastTransitionTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m3.475963593s ago; threshold is 3m0s}
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198031   12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198131   12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198162   12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:54 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:54.680821   12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m6.644219983s ago; threshold is 3m0s]
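
For context on the errors above (this is an illustrative sketch, not the kubelet's actual code): the gRPC client default for the maximum received message size is 4 MiB (4194304 bytes), so once the serialized ListContainersResponse from the runtime grows past that, every ListContainers call fails with ResourceExhausted, PLEG stops seeing pod state, and the node flips to NotReady. A minimal Go client showing that failure mode, assuming the cri-api import path and dockershim socket path used here:

// Illustrative sketch only: a CRI client dialed with gRPC defaults, so the
// 4 MiB (4194304 byte) receive limit applies. A ListContainers response of
// 4229189 bytes (as in the logs above) is rejected with ResourceExhausted.
package main

import (
        "context"
        "fmt"
        "net"
        "time"

        "google.golang.org/grpc"
        runtimeapi "k8s.io/cri-api/pkg/apis/runtime/v1" // assumed import path
)

func main() {
        // Dial the container runtime over its unix socket (path is illustrative).
        conn, err := grpc.Dial("/var/run/dockershim.sock",
                grpc.WithInsecure(),
                grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
                        return (&net.Dialer{}).DialContext(ctx, "unix", addr)
                }),
                // No MaxCallRecvMsgSize override, so the 4 MiB gRPC default applies.
        )
        if err != nil {
                fmt.Println("dial failed:", err)
                return
        }
        defer conn.Close()

        client := runtimeapi.NewRuntimeServiceClient(conn)
        ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
        defer cancel()

        // With enough containers on the node, the serialized response exceeds
        // 4194304 bytes and this call fails with:
        //   rpc error: code = ResourceExhausted desc = grpc: received message larger than max
        resp, err := client.ListContainers(ctx, &runtimeapi.ListContainersRequest{})
        if err != nil {
                fmt.Println("ListContainers failed:", err)
                return
        }
        fmt.Println("containers:", len(resp.Containers))
}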

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.22.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

How reproducible:
Running concurrent builds for some time.

Steps to Reproduce:
1. Start some concurrent builds on a cluster
2. Keep running builds for some time.

Actual results:
The node goes NotReady and build pods are stuck in the Unknown state.

Expected results:
Builds should finish.

Additional info:

Comment 1 Vikas Laad 2018-04-20 17:56:20 UTC
Created attachment 1424631 [details]
node yaml

Comment 2 Vikas Laad 2018-04-20 17:57:14 UTC
Created attachment 1424632 [details]
node logs

Comment 4 Vikas Laad 2018-04-23 16:22:08 UTC
Nodes do not become Ready even after restarting the node service, restarting docker, and rebooting the node itself.

Comment 5 Vikas Laad 2018-04-23 17:57:05 UTC
Cleaning everything under /var/lib/docker fixed the problem.

Comment 7 Mike Fiedler 2018-05-02 17:35:36 UTC
"grpc message size" is the ResourceExhausted.  this happens when there are large numbers of built images in /var/lib/docker under ocp 3.10 and docker 1.13

Comment 8 Daniel Walsh 2018-05-03 19:30:55 UTC
This looks like the gRPC maximum message size being exceeded, I would guess.

Comment 9 Daniel Walsh 2018-05-03 19:31:27 UTC
Antonio and Mrunal, WDYT?

Comment 10 Antonio Murdaca 2018-05-17 15:10:44 UTC
This is probably because the number of containers/images on the node is overflowing the maximum response size of the gRPC client in the kubelet.

This PR https://github.com/kubernetes/kubernetes/pull/63977 increases the size and should fix this issue.

I would like to hear more from the pod team, though (and we'll need a backport anyway).
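
For reference, the shape of the client-side change looks like the sketch below (the 16 MiB value and socket path are assumptions here; see the PR above for the actual patch): the kubelet's gRPC connection to the remote runtime is dialed with a larger MaxCallRecvMsgSize so that large ListContainers responses are accepted instead of failing with ResourceExhausted.

// Sketch of the kind of fix described above (values are assumptions, not the
// exact upstream patch): raise the gRPC receive/send limits when dialing the
// CRI runtime socket so large responses no longer trip ResourceExhausted.
package main

import (
        "context"
        "fmt"
        "net"

        "google.golang.org/grpc"
)

const maxMsgSize = 16 * 1024 * 1024 // 16 MiB instead of the 4 MiB default

func dialRuntime(socket string) (*grpc.ClientConn, error) {
        return grpc.Dial(socket,
                grpc.WithInsecure(),
                grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
                        return (&net.Dialer{}).DialContext(ctx, "unix", addr)
                }),
                // The important part: allow messages larger than the 4 MiB default.
                grpc.WithDefaultCallOptions(
                        grpc.MaxCallRecvMsgSize(maxMsgSize),
                        grpc.MaxCallSendMsgSize(maxMsgSize),
                ),
        )
}

func main() {
        conn, err := dialRuntime("/var/run/dockershim.sock")
        if err != nil {
                fmt.Println("dial failed:", err)
                return
        }
        defer conn.Close()
        fmt.Println("dialed runtime with a 16 MiB message size limit")
}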

Comment 11 Seth Jennings 2018-05-23 17:45:26 UTC
https://github.com/openshift/origin/pull/19774

Comment 13 Wei Sun 2018-06-12 06:01:26 UTC
Hi Vikas, please check whether it has been fixed.

Comment 14 Vikas Laad 2018-06-12 13:24:52 UTC
Tried multiple runs on version 3.10.0-0.63.0; the issue did not happen again.

