Bug 1697525

Summary:	[starter-us-east-1] Infra node stuck in NotReady (rpc error: code = ResourceExhausted)
Product:	OpenShift Container Platform	Reporter:	brad.williams
Component:	Node	Assignee:	Urvashi Mohnani <umohnani>
Status:	CLOSED DUPLICATE	QA Contact:	Sunil Choudhary <schoudha>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	3.11.0	CC:	aos-bugs, dwalsh, jokerman, jupierce, mmccomas, mpatel, nagrawal, rphillips
Target Milestone:	---
Target Release:	3.11.z
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-05-13 22:21:36 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description brad.williams 2019-04-08 15:09:48 UTC

Description of problem:

We have received reports of users having issues on the starter-us-east-1 cluster.  Upon investigation, we have observed an Infra node that is stuck in a NotReady state.

Version-Release number of selected component (if applicable):

oc v3.11.82
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.starter-us-east-1.openshift.com:443
openshift v3.11.82
kubernetes v1.11.0+d4cacc0


How reproducible:
Currently, the infra node (ip-172-31-51-95.ec2.internal) is stuck in NotReady


Expected results:
It should be possible to (manually) recover the infra node to a "Ready" state.


Additional info:
The docker unit appears to be blocked by:
     https://bugzilla.redhat.com/show_bug.cgi?id=1692401

Comment 2 Justin Pierce 2019-04-08 15:26:42 UTC

Example log output from atomic-openshift-node: 

Apr 08 15:25:51 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:51.851436    2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:51 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:51.851477    2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:53 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:53.711036    2733 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:53 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:53.711094    2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:53 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:53.711111    2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:55.561315    2733 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:55.561372    2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:55.561389    2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: I0408 15:25:55.969629    2733 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ip-172-31-51-95.ec2.internal"
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: I0408 15:25:55.983783    2733 cloud_request_manager.go:108] Node addresses from cloud provider for node "ip-172-31-51-95.ec2.internal" collected
Apr 08 15:25:56 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: I0408 15:25:56.238751    2733 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 2562047h47m16.854775807s ago; threshold is 3m0s]
Apr 08 15:25:57 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:57.417444    2733 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:57 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:57.417601    2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:57 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:57.417618    2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
....
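
For reference, the two numbers in the ResourceExhausted message can be checked by hand: 8388608 bytes is exactly 8 MiB, and the rejected message exceeds it by 4131 bytes. The PLEG "last seen active" value also appears to be the largest duration a signed 64-bit nanosecond counter can hold, which would be consistent with the kubelet never having recorded a successful relist (that interpretation is an assumption, not something the log states). A minimal Python sketch of both checks follows; the helper names parse_grpc_overflow and duration_to_ns are hypothetical and exist only for this illustration.

```python
import re

# One of the error lines from the log above (timestamp prefix dropped).
LINE = (
    "getKubeletContainers failed: rpc error: code = ResourceExhausted "
    "desc = grpc: trying to send message larger than max (8392739 vs. 8388608)"
)


def parse_grpc_overflow(line):
    """Return (sent, limit, excess) in bytes, or None if the line does not match."""
    m = re.search(r"larger than max \((\d+) vs\. (\d+)\)", line)
    if m is None:
        return None
    sent, limit = int(m.group(1)), int(m.group(2))
    return sent, limit, sent - limit


def duration_to_ns(text):
    """Convert a duration of the form '<h>h<m>m<s>.<9 digits>s' to nanoseconds."""
    m = re.fullmatch(r"(\d+)h(\d+)m(\d+)\.(\d{9})s", text)
    if m is None:
        raise ValueError("unsupported duration format: %r" % text)
    hours, minutes, seconds, nanos = (int(g) for g in m.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 10**9 + nanos


sent, limit, excess = parse_grpc_overflow(LINE)
print("limit is 8 MiB:", limit == 8 * 1024 * 1024)
print("message exceeds limit by", excess, "bytes")

pleg_ns = duration_to_ns("2562047h47m16.854775807s")
print("PLEG age equals max int64 nanoseconds:", pleg_ns == 2**63 - 1)
```

Since the message is only about 4 KiB over the limit, the container list on this node was presumably growing slowly past the threshold rather than being far beyond it.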

Comment 3 Seth Jennings 2019-04-08 17:08:33 UTC

Sending to Containers.  We bumped this before but it seems we've hit the grpc msg size limit again :-/

Comment 4 Mrunal Patel 2019-04-10 21:34:17 UTC

What version of crio is installed on the node?

Comment 5 brad.williams 2019-04-11 16:36:10 UTC

# crio --version
crio version 1.11.11-1.rhaos3.11.git474f73d.el7

Comment 6 Ryan Phillips 2020-05-13 22:21:36 UTC

A patch was committed to the 3.11 tree to fix the kubepods.slice memory cgroup. It should help with this resource issue.

https://github.com/openshift/origin/pull/24895

*** This bug has been marked as a duplicate of bug 1825989 ***