Description of problem:
We have received reports of users having issues on the starter-us-east-1 cluster. Upon investigation, we observed an infra node stuck in a NotReady state.

Version-Release number of selected component (if applicable):
oc v3.11.82
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://internal.api.starter-us-east-1.openshift.com:443
openshift v3.11.82
kubernetes v1.11.0+d4cacc0

How reproducible:
Currently, the infra node (ip-172-31-51-95.ec2.internal) is stuck in NotReady.

Expected results:
The infra node should be (manually) recoverable to a "Ready" state.

Additional info:
The docker unit appears to be blocked by: https://bugzilla.redhat.com/show_bug.cgi?id=1692401
Example log output from atomic-openshift-node:

Apr 08 15:25:51 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:51.851436 2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:51 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:51.851477 2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:53 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:53.711036 2733 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:53 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:53.711094 2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:53 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:53.711111 2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:55.561315 2733 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:55.561372 2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:55.561389 2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: I0408 15:25:55.969629 2733 cloud_request_manager.go:89] Requesting node addresses from cloud provider for node "ip-172-31-51-95.ec2.internal"
Apr 08 15:25:55 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: I0408 15:25:55.983783 2733 cloud_request_manager.go:108] Node addresses from cloud provider for node "ip-172-31-51-95.ec2.internal" collected
Apr 08 15:25:56 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: I0408 15:25:56.238751 2733 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 2562047h47m16.854775807s ago; threshold is 3m0s]
Apr 08 15:25:57 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:57.417444 2733 remote_runtime.go:262] ListContainers with filter &ContainerFilter{Id:,State:nil,PodSandboxId:,LabelSelector:map[string]string{},} from runtime service failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:57 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:57.417601 2733 kuberuntime_container.go:329] getKubeletContainers failed: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
Apr 08 15:25:57 ip-172-31-51-95.ec2.internal atomic-openshift-node[2733]: E0408 15:25:57.417618 2733 generic.go:197] GenericPLEG: Unable to retrieve pods: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (8392739 vs. 8388608)
....
Sending to Containers. We bumped this limit before, but it seems we've hit the gRPC message size limit again :-/
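For context on where the 8388608-byte ceiling in the log comes from: both ends of the CRI connection (CRI-O as the gRPC server, the kubelet's remote runtime client) enforce per-message size limits, and a ListContainers response larger than the configured limit is rejected with codes.ResourceExhausted. Below is a minimal Go sketch of how such limits are set with grpc-go. This is not CRI-O's or the kubelet's actual code; the socket path and the maxMsgSize constant are illustrative only.

// Illustrative sketch of gRPC message size limits; not the real kubelet/CRI-O wiring.
package main

import (
	"fmt"

	"google.golang.org/grpc"
)

func main() {
	// 8 MiB = 8388608 bytes, the limit seen in the log above.
	const maxMsgSize = 8 * 1024 * 1024

	// Server side: cap the size of messages the server may send and receive.
	// A response larger than MaxSendMsgSize fails with ResourceExhausted.
	srv := grpc.NewServer(
		grpc.MaxSendMsgSize(maxMsgSize),
		grpc.MaxRecvMsgSize(maxMsgSize),
	)
	_ = srv

	// Client side: apply the same ceilings to every call made over the
	// (assumed) CRI socket. grpc.Dial is lazy, so no live endpoint is needed.
	conn, err := grpc.Dial(
		"unix:///var/run/crio/crio.sock",
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(maxMsgSize),
			grpc.MaxCallSendMsgSize(maxMsgSize),
		),
	)
	if err != nil {
		fmt.Println("dial error:", err)
		return
	}
	defer conn.Close()

	fmt.Println("configured gRPC message size limit:", maxMsgSize)
}

Raising the limit only postpones the failure if the container list keeps growing (e.g. from leaked exited containers), which is why simply bumping it again is not a complete fix.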
What version of crio is installed on the node?
# crio --version
crio version 1.11.11-1.rhaos3.11.git474f73d.el7
A patch was committed to the 3.11 tree to fix the kubepods.slice memory cgroup. It should help with this resource issue.
https://github.com/openshift/origin/pull/24895

*** This bug has been marked as a duplicate of bug 1825989 ***
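For anyone verifying the kubepods.slice side of this on an affected node, here is a small diagnostic sketch (not part of the patch above) that reads the effective memory limit of the kubepods.slice cgroup. It assumes a cgroup v1 host with the systemd cgroup driver; the path is an assumption about that layout, not something taken from this bug.

// Hypothetical diagnostic; reads the kubepods.slice memory limit on cgroup v1.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Assumed cgroup v1 path for the kubelet's top-level pods slice.
	const limitFile = "/sys/fs/cgroup/memory/kubepods.slice/memory.limit_in_bytes"

	data, err := os.ReadFile(limitFile)
	if err != nil {
		fmt.Fprintln(os.Stderr, "unable to read cgroup memory limit:", err)
		os.Exit(1)
	}

	limit, err := strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
	if err != nil {
		fmt.Fprintln(os.Stderr, "unexpected limit value:", err)
		os.Exit(1)
	}

	// A value near 2^63 effectively means "no limit" on cgroup v1.
	fmt.Printf("kubepods.slice memory.limit_in_bytes = %d (%.2f GiB)\n",
		limit, float64(limit)/(1<<30))
}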