Bug 1570145

Summary: Build pod stuck in Unknown state and node stuck in NotReady
Product: OpenShift Container Platform
Reporter: Vikas Laad <vlaad>
Component: Containers
Assignee: Antonio Murdaca <amurdaca>
Status: CLOSED CURRENTRELEASE
QA Contact: Vikas Laad <vlaad>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.10.0
CC: amurdaca, aos-bugs, dwalsh, jhonce, jokerman, mifiedle, mmccomas, mpatel, vlaad, wsun
Target Milestone: ---
Target Release: 3.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: aos-scalability-310
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The grpc client buffer in the kubelet was too small (4 MB).
Consequence: The kubelet errored out and could not handle responses from the remote runtime.
Fix: Increase the grpc client buffer.
Result: No more grpc errors between the kubelet and the runtime.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-09-11 18:34:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description    Flags
  node yaml      none
  node logs      none

Description Vikas Laad 2018-04-20 17:51:49 UTC
Description of problem:
I was running concurrent build tests. After running builds for some time, I see that some of the build pods are stuck in the Unknown state and the corresponding node is stuck in the NotReady state.

  Normal   NodeNotReady       1h                kubelet, ip-172-31-6-181.us-west-2.compute.internal  Node ip-172-31-6-181.us-west-2.compute.internal status is now: NodeNotReady

Error log when node became NotReady

Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396014   12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396089   12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:51.396110   12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.480665   12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m3.444073397s ago; threshold is 3m0s]
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512568   12747 kubelet_node_status.go:445] Recording NodeNotReady event message for node ip-172-31-6-181.us-west-2.compute.internal
Apr 20 16:14:51 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:51.512651   12747 kubelet_node_status.go:834] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 LastTransitionTime:2018-04-20 16:14:51.512530609 +0000 UTC m=+87026.483617476 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m3.475963593s ago; threshold is 3m0s}
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198031   12747 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198131   12747 kuberuntime_container.go:323] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:53 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: E0420 16:14:53.198162   12747 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4229189 vs. 4194304)
Apr 20 16:14:54 ip-172-31-6-181.us-west-2.compute.internal atomic-openshift-node[12747]: I0420 16:14:54.680821   12747 kubelet.go:1794] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m6.644219983s ago; threshold is 3m0s]
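
For context, 4194304 bytes is 4 MiB, the grpc-go client's default maximum receive message size, and the 4229189-byte ListContainers response quoted above is just over that limit. A minimal sketch of the client-side check that produces this error (only the two sizes come from the log above; the rest is illustrative):

package main

import "fmt"

func main() {
	// grpc-go's default client receive limit: 4 MiB.
	const defaultMaxRecvMsgSize = 4 * 1024 * 1024 // 4194304 bytes

	// Serialized ListContainers response size reported in the log above.
	respSize := 4229189

	if respSize > defaultMaxRecvMsgSize {
		// Mirrors the message the kubelet logs when the limit is hit.
		fmt.Printf("grpc: received message larger than max (%d vs. %d)\n",
			respSize, defaultMaxRecvMsgSize)
	}
}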

Version-Release number of selected component (if applicable):
openshift v3.10.0-0.22.0
kubernetes v1.10.0+b81c8f8
etcd 3.2.16

How reproducible:
Running concurrent builds for some time.

Steps to Reproduce:
1. Start some concurrent builds on a cluster.
2. Keep running builds for some time.

Actual results:
The node goes NotReady and build pods are stuck in the Unknown state.

Expected results:
Builds should finish.

Additional info:

Comment 1 Vikas Laad 2018-04-20 17:56:20 UTC
Created attachment 1424631 [details]
node yaml

Comment 2 Vikas Laad 2018-04-20 17:57:14 UTC
Created attachment 1424632 [details]
node logs

Comment 4 Vikas Laad 2018-04-23 16:22:08 UTC
Nodes do not become Ready even after restarting the node service, restarting docker, and rebooting the node itself.

Comment 5 Vikas Laad 2018-04-23 17:57:05 UTC
Cleaning everything under /var/lib/docker fixed the problem.

Comment 7 Mike Fiedler 2018-05-02 17:35:36 UTC
"grpc message size" is the ResourceExhausted.  this happens when there are large numbers of built images in /var/lib/docker under ocp 3.10 and docker 1.13

Comment 8 Daniel Walsh 2018-05-03 19:30:55 UTC
This looks like the grpc message size limit is being exceeded, I would guess.

Comment 9 Daniel Walsh 2018-05-03 19:31:27 UTC
Antonio and Mrunal, WDYT?

Comment 10 Antonio Murdaca 2018-05-17 15:10:44 UTC
This is probably because the number of containers/images on the node is overflowing the max response size of the grpc client in the kubelet.

This PR https://github.com/kubernetes/kubernetes/pull/63977 increases the size and should fix this issue.

I would like to hear more from the pod team though (and we'll need a backport anyway).
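
A minimal sketch of what raising that limit looks like on the grpc client side, assuming the kubelet dials the runtime over a local unix socket; the 16 MiB value, the dialRuntime helper, and the socket path are illustrative, not the exact code from the PR:

package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// Raise the client limit above grpc's 4 MiB default; 16 MiB is an
// illustrative value, not necessarily the one the upstream PR settled on.
const maxMsgSize = 16 * 1024 * 1024

// dialRuntime opens a grpc connection to a CRI runtime socket with larger
// send/receive limits, so big ListContainers responses are no longer
// rejected with ResourceExhausted. Endpoint handling is simplified
// compared to the real kubelet code.
func dialRuntime(endpoint string) (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()
	return grpc.DialContext(ctx, endpoint,
		grpc.WithInsecure(), // runtime sockets are local unix sockets
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgSize),
			grpc.MaxCallSendMsgSize(maxMsgSize),
		),
	)
}

func main() {
	// Hypothetical endpoint; on this cluster the runtime is docker.
	conn, err := dialRuntime("unix:///var/run/dockershim.sock")
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}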

Comment 11 Seth Jennings 2018-05-23 17:45:26 UTC
https://github.com/openshift/origin/pull/19774

Comment 13 Wei Sun 2018-06-12 06:01:26 UTC
Hi Vikas, please check if it has been fixed.

Comment 14 Vikas Laad 2018-06-12 13:24:52 UTC
Tried multiple runs on version 3.10.0-0.63.0; the issue did not happen again.