Description of problem:
We have received reports of users having issues on the starter-us-east-1 cluster. Upon investigation, we observed an infra node stuck in a NotReady state.

Version-Release number of selected component (if applicable):
oc v3.11.82
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.starter-us-east-1.openshift.com:443
openshift v3.11.82
kubernetes v1.11.0+d4cacc0

How reproducible:
Currently, the infra node (ip-172-31-51-95.ec2.internal) is stuck in NotReady.

Expected results:
The infra node should be (manually) recoverable to a "Ready" state.

Additional info:
The docker unit appears to be blocked by: https://bugzilla.redhat.com/show_bug.cgi?id=1692401
Example log output from atomic-openshift-node:

Apr 08 15:25:51 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:51.851436 2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:51 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:51.851477 2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:53 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:53.711036 2733 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:53 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:53.711094 2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:53 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:53.711111 2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:55.561315 2733 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:55.561372 2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:55.561389 2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: I0408 15:25:55.969629 2733 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ip-172-31-51-95.ec2.internal"
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: I0408 15:25:55.983783 2733 cloud_request_manager.go:108] Node addresses from cloud provider for node "ip-172-31-51-95.ec2.internal" collected
Apr 08 15:25:56 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: I0408 15:25:56.238751 2733 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 2562047h47m16.854775807s ago; threshold is 3m0s]
Apr 08 15:25:57 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:57.417444 2733 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:57 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:57.417601 2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:57 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:57.417618 2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
....
Sending to Containers. We bumped this limit before, but it seems we've hit the gRPC message size limit again :-/
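For context on where the 8388608-byte ceiling in the log comes from: both ends of the CRI connection (CRI-O as the gRPC server, the kubelet's remote runtime client) enforce per-message size limits, and a ListContainers response larger than the configured limit is rejected with codes.ResourceExhausted. Below is a minimal Go sketch of how such limits are set with grpc-go. This is not CRI-O's or the kubelet's actual code; the socket path and the maxMsgSize constant are illustrative only.

// Illustrative sketch of gRPC message size limits; not the real kubelet/CRI-O wiring.
package main

import (
	"fmt"

	"google.golang.org/grpc"
)

func main() {
	// 8 MiB = 8388608 bytes, the limit seen in the log above.
	const maxMsgSize = 8 * 1024 * 1024

	// Server side: cap the size of messages the server may send and receive.
	// A response larger than MaxSendMsgSize fails with ResourceExhausted.
	srv := grpc.NewServer(
		grpc.MaxSendMsgSize(maxMsgSize),
		grpc.MaxRecvMsgSize(maxMsgSize),
	)
	_ = srv

	// Client side: apply the same ceilings to every call made over the
	// (assumed) CRI socket. grpc.Dial is lazy, so no live endpoint is needed.
	conn, err := grpc.Dial(
		"unix:///var/run/crio/crio.sock",
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgSize),
			grpc.MaxCallSendMsgSize(maxMsgSize),
		),
	)
	if err != nil {
		fmt.Println("dial error:", err)
		return
	}
	defer conn.Close()

	fmt.Println("configured gRPC message size limit:", maxMsgSize)
}

Raising the limit only postpones the failure if the container list keeps growing (e.g. from leaked exited containers), which is why simply bumping it again is not a complete fix.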
What version of crio is installed on the node?
# crio --version
crio version 1.11.11-1.rhaos3.11.git474f73d.el7
A patch was committed to the 3.11 tree to fix the kubepods.slice memory cgroup. It should help with this resource issue.
https://github.com/openshift/origin/pull/24895

*** This bug has been marked as a duplicate of bug 1825989 ***
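For anyone verifying the kubepods.slice side of this on an affected node, here is a small diagnostic sketch (not part of the patch above) that reads the effective memory limit of the kubepods.slice cgroup. It assumes a cgroup v1 host with the systemd cgroup driver; the path is an assumption about that layout, not something taken from this bug.

// Hypothetical diagnostic; reads the kubepods.slice memory limit on cgroup v1.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Assumed cgroup v1 path for the kubelet's top-level pods slice.
	const limitFile = "/sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes"

	data, err := os.ReadFile(limitFile)
	if err != nil {
		fmt.Fprintln(os.Stderr, "unable to read cgroup memory limit:", err)
		os.Exit(1)
	}

	limit, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, "unexpected limit value:", err)
		os.Exit(1)
	}

	// A value near 2^63 effectively means "no limit" on cgroup v1.
	fmt.Printf("kubepods.slice memory.limit_in_bytes = %d (%.2f GiB)\n",
		limit, float64(limit)/(1<<30))
}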