Description of problem:
One of the OpenShift nodes reports KubeletNotReady with the error "PLEG is not healthy", and as a result the glusterfs pod running on that node went into NodeLost state.

[root@dhcp46-132 ~]# oc get nodes
NAME                                STATUS     ROLES     AGE       VERSION
dhcp46-132.lab.eng.blr.redhat.com   NotReady   master    4d        v1.11.0+d4cacc0
dhcp46-2.lab.eng.blr.redhat.com     Ready      infra     4d        v1.11.0+d4cacc0
dhcp46-202.lab.eng.blr.redhat.com   NotReady   compute   4d        v1.11.0+d4cacc0
dhcp46-225.lab.eng.blr.redhat.com   Ready      compute   4d        v1.11.0+d4cacc0
dhcp47-15.lab.eng.blr.redhat.com    Ready      compute   4d        v1.11.0+d4cacc0

Version-Release number of selected component (if applicable):
[root@dhcp46-132 ~]# oc version
oc v3.11.16
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp46-132.lab.eng.blr.redhat.com:8443
openshift v3.11.17
kubernetes v1.11.0+d4cacc0

[root@dhcp46-132 ~]# rpm -qa | grep docker
docker-rhel-push-plugin-1.13.1-75.git8633870.el7_5.x86_64
docker-1.13.1-75.git8633870.el7_5.x86_64
docker-client-1.13.1-75.git8633870.el7_5.x86_64
python-docker-pycreds-1.10.6-4.el7.noarch
python-docker-2.4.2-1.3.el7.noarch
atomic-openshift-docker-excluder-3.11.16-1.git.0.b48b8f8.el7.noarch
cockpit-docker-176-2.el7.x86_64
docker-common-1.13.1-75.git8633870.el7_5.x86_64

How reproducible:
Hit once.

Steps to Reproduce:
1. Create a setup with one master and four worker nodes.
2. Install RHOCS by running the deploy_cluster.yaml playbook.
3. Run the attached script to test the heketi OOM-kill issue.

Actual results:
Once the number of Bound PVCs reaches 3, provisioning gets stuck. The logs show that heketi is unable to reach the kube API master, and one of the nodes went to NotReady state.

Expected results:
The node should not go to NotReady state and no PLEG errors should be seen.
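For anyone retrying this, a minimal way to watch for the failure while the test script runs is sketched below; these are standard oc commands, and the node name is just the one from this report:

  # Watch PVC binding progress while the heketi test script runs
  oc get pvc --all-namespaces -w

  # In a second terminal, watch for a node flipping to NotReady,
  # then check whether PLEG is the stated reason
  oc get nodes -w
  oc describe node dhcp46-202.lab.eng.blr.redhat.com | grep -i pleg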
Additional info:

[root@dhcp46-132 ~]# oc describe node/dhcp46-202.lab.eng.blr.redhat.com
Name:               dhcp46-202.lab.eng.blr.redhat.com
Roles:              compute
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    glusterfs=storage-host
                    kubernetes.io/hostname=dhcp46-202.lab.eng.blr.redhat.com
                    node-role.kubernetes.io/compute=true
Annotations:        node.openshift.io/md5sum=87a96161baed65840e0b353532346bd2
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Sat, 29 Sep 2018 22:41:50 +0530
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Thu, 04 Oct 2018 16:20:55 +0530   Sat, 29 Sep 2018 22:41:50 +0530   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Thu, 04 Oct 2018 16:20:55 +0530   Sat, 29 Sep 2018 22:41:50 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 04 Oct 2018 16:20:55 +0530   Sat, 29 Sep 2018 22:41:50 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 04 Oct 2018 16:20:55 +0530   Sat, 29 Sep 2018 22:41:50 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Thu, 04 Oct 2018 16:20:55 +0530   Tue, 02 Oct 2018 11:48:52 +0530   KubeletNotReady              PLEG is not healthy: pleg was last seen active 52h35m10.665630586s ago; threshold is 3m0s
Addresses:
  InternalIP:  10.70.46.202
  Hostname:    dhcp46-202.lab.eng.blr.redhat.com
Capacity:
  cpu:            4
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         32781380Ki
  pods:           250
Allocatable:
  cpu:            4
  hugepages-1Gi:  0
  hugepages-2Mi:  0
  memory:         32678980Ki
  pods:           250
System Info:
  Machine ID:                 6b88f33955cf44b0a9f302bc3f9819d5
  System UUID:                42250B7C-6492-0E77-8DC7-5BF3DAF0FD38
  Boot ID:                    34cd0b9d-afe6-416e-a6cf-06eb973cc52d
  Kernel Version:             3.10.0-862.14.4.el7.x86_64
  OS Image:                   Employee SKU
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://1.13.1
  Kubelet Version:            v1.11.0+d4cacc0
  Kube-Proxy Version:         v1.11.0+d4cacc0
Non-terminated Pods:          (5 in total)
  Namespace             Name                     CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------             ----                     ------------  ----------  ---------------  -------------
  glusterfs             glusterfs-storage-pwbmk  100m (2%)     0 (0%)      100Mi (0%)       0 (0%)
  openshift-monitoring  node-exporter-5rsfx      10m (0%)      20m (0%)    20Mi (0%)        40Mi (0%)
  openshift-node        sync-5t627               0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-sdn         ovs-zplh7                100m (2%)     200m (5%)   300Mi (0%)       400Mi (1%)
  openshift-sdn         sdn-z2vn5                100m (2%)     0 (0%)      200Mi (0%)       0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests    Limits
  --------  --------    ------
  cpu       310m (7%)   220m (5%)
  memory    620Mi (1%)  440Mi (1%)
Events:
  Type     Reason             Age                 From                                        Message
  ----     ------             ----                ----                                        -------
  Warning  ContainerGCFailed  5m (x1051 over 2d)  kubelet, dhcp46-202.lab.eng.blr.redhat.com  rpc error: code = DeadlineExceeded desc = context deadline exceeded
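If this recurs, the kubelet logs on the affected node should show the PLEG relist failures. A rough way to pull them on an OCP 3.11 node (the unit name assumes a standard 3.11 RPM install, where the kubelet runs under atomic-openshift-node):

  # On the NotReady node
  journalctl -u atomic-openshift-node --since "2 days ago" | grep -i pleg

  # ContainerGCFailed with "context deadline exceeded" points at the docker
  # daemon itself not answering; timing a trivial API call confirms that
  time docker ps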
sosreports from all the nodes are present in the link below. http://rhsqe-repo.lab.eng.blr.redhat.com/cns/bugs/BZ-1636053
@seth @rama - Is this a blocker for 3.11? We need to close out 3.11 ASAP, so I'm checking on this. Can you update the defect and send me a note?
"PLEG is not healthy: pleg was last seen active 52h35m10.665630586s ago; threshold is 3m0s" indicates that the runtime is either down or has been non-responsive for a long time. I'm on a bandwidth-restricted connection at the moment. Can you confirm that a "docker run" from the command line succeeds? If this can't be done, I'll send it to Containers.
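A minimal version of that check, with a timeout so it fails fast rather than hanging if the daemon is wedged (the image is just an example; any image already present on the node works):

  # On the NotReady node
  timeout 60 docker run --rm registry.access.redhat.com/rhel7 /bin/true \
    && echo "runtime OK" || echo "runtime hung or run failed"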
Hello, I currently do not have the setup available to try "docker run". However, after restarting docker on the particular node that was down, the node came back up and everything started working fine in my case. I have hit this issue only once; yesterday I re-ran the same test that originally caused it, but did not see it again. Thanks, kasturi
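For the record, the recovery described above amounts to restarting the runtime and letting the kubelet reconnect; roughly as follows (the node unit name again assumes a standard 3.11 install):

  # On the affected node
  systemctl restart docker
  systemctl restart atomic-openshift-node   # kubelet; node should return to Ready shortly

  # Then confirm from the master
  oc get nodes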
Seems like this was a runtime issue (either not started or locked up). I'll move this to Containers and close it, since it appears to have been an isolated occurrence. If you reopen it, it will be in the correct component.
Our client will migrate a very critical app to OpenShift 3.11 in the middle of this month, and we encounter this issue often. Can we fix it as soon as possible? Is there another way to resolve it without a service restart? Please refer to https://access.redhat.com/solutions/3258011. I have also found the issue reported in the Kubernetes community: https://github.com/kubernetes/kubernetes/issues/45419.

Environment:
docker version 1.13.1-103
ocp v3.11.141
RHEL 7.6, kernel 3.10-1062.1.1
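Before resorting to a restart, it may be worth checking whether a container process is stuck in uninterruptible sleep, which is one of the patterns discussed in the upstream issue linked above; a quick, non-disruptive check (standard procps and coreutils, nothing OpenShift-specific):

  # Processes stuck in D state often pin the docker daemon and,
  # through it, the kubelet's PLEG relist
  ps -eo pid,stat,wchan:32,args | awk '$2 ~ /D/'

  # If the daemon itself is hung, this stalls instead of returning
  timeout 30 docker info >/dev/null && echo "daemon responsive" || echo "daemon hung"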
No root cause or reproducer determined.