Bug 1636053 - "PLEG is not healthy" errors on OpenShift nodes and the node state is seen as NotReady
Summary: "PLEG is not healthy" errors on OpenShift nodes and the node state is seen as...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Brent Baude
QA Contact: weiwei jiang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-04 10:54 UTC by RamaKasturi
Modified: 2020-05-20 14:21 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-20 14:21:41 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Red Hat Knowledge Base (Solution) 3258011 (last updated 2019-12-04 19:25:16 UTC)

Description RamaKasturi 2018-10-04 10:54:48 UTC
Description of problem:
One of the OpenShift nodes reports KubeletNotReady with the error "PLEG is not healthy", and because of this the glusterfs pod running on that node went into the NodeLost state.

[root@dhcp46-132 ~]# oc get nodes
NAME                                STATUS     ROLES     AGE       VERSION
dhcp46-132.lab.eng.blr.redhat.com   NotReady   master    4d        v1.11.0+d4cacc0
dhcp46-2.lab.eng.blr.redhat.com     Ready      infra     4d        v1.11.0+d4cacc0
dhcp46-202.lab.eng.blr.redhat.com   NotReady   compute   4d        v1.11.0+d4cacc0
dhcp46-225.lab.eng.blr.redhat.com   Ready      compute   4d        v1.11.0+d4cacc0
dhcp47-15.lab.eng.blr.redhat.com    Ready      compute   4d        v1.11.0+d4cacc0


Version-Release number of selected component (if applicable):
[root@dhcp46-132 ~]# oc version
oc v3.11.16
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://dhcp46-132.lab.eng.blr.redhat.com:8443
openshift v3.11.17
kubernetes v1.11.0+d4cacc0

[root@dhcp46-132 ~]# rpm -qa | grep docker
docker-rhel-push-plugin-1.13.1-75.git8633870.el7_5.x86_64
docker-1.13.1-75.git8633870.el7_5.x86_64
docker-client-1.13.1-75.git8633870.el7_5.x86_64
python-docker-pycreds-1.10.6-4.el7.noarch
python-docker-2.4.2-1.3.el7.noarch
atomic-openshift-docker-excluder-3.11.16-1.git.0.b48b8f8.el7.noarch
cockpit-docker-176-2.el7.x86_64
docker-common-1.13.1-75.git8633870.el7_5.x86_64


How reproducible:
Hit it once

Steps to Reproduce:
1. Create a setup with one master and four worker nodes.
2. Install RHOCS by running the deploy_cluster.yaml playbook.
3. Run the attached script to test the heketi OOM-kill issue.

Actual results:
When the number of PVCs reaching the Bound state gets to 3, provisioning gets stuck. Looking at the logs, I figured out that heketi is not able to reach the kube API master and one of the nodes went into the NotReady state.

Expected results:
The node should not go into the NotReady state, and PLEG errors should not be seen.

Additional info:

[root@dhcp46-132 ~]# oc describe node/dhcp46-202.lab.eng.blr.redhat.com
Name:               dhcp46-202.lab.eng.blr.redhat.com
Roles:              compute
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    glusterfs=storage-host
                    kubernetes.io/hostname=dhcp46-202.lab.eng.blr.redhat.com
                    node-role.kubernetes.io/compute=true
Annotations:        node.openshift.io/md5sum=87a96161baed65840e0b353532346bd2
                    volumes.kubernetes.io/controller-managed-attach-detach=true
CreationTimestamp:  Sat, 29 Sep 2018 22:41:50 +0530
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  OutOfDisk        False   Thu, 04 Oct 2018 16:20:55 +0530   Sat, 29 Sep 2018 22:41:50 +0530   KubeletHasSufficientDisk     kubelet has sufficient disk space available
  MemoryPressure   False   Thu, 04 Oct 2018 16:20:55 +0530   Sat, 29 Sep 2018 22:41:50 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Thu, 04 Oct 2018 16:20:55 +0530   Sat, 29 Sep 2018 22:41:50 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Thu, 04 Oct 2018 16:20:55 +0530   Sat, 29 Sep 2018 22:41:50 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            False   Thu, 04 Oct 2018 16:20:55 +0530   Tue, 02 Oct 2018 11:48:52 +0530   KubeletNotReady              PLEG is not healthy: pleg was last seen active 52h35m10.665630586s ago; threshold is 3m0s
Addresses:
  InternalIP:  10.70.46.202
  Hostname:    dhcp46-202.lab.eng.blr.redhat.com
Capacity:
 cpu:            4
 hugepages-1Gi:  0
 hugepages-2Mi:  0
 memory:         32781380Ki
 pods:           250
Allocatable:
 cpu:            4
 hugepages-1Gi:  0
 hugepages-2Mi:  0
 memory:         32678980Ki
 pods:           250
System Info:
 Machine ID:                 6b88f33955cf44b0a9f302bc3f9819d5
 System UUID:                42250B7C-6492-0E77-8DC7-5BF3DAF0FD38
 Boot ID:                    34cd0b9d-afe6-416e-a6cf-06eb973cc52d
 Kernel Version:             3.10.0-862.14.4.el7.x86_64
 OS Image:                   Employee SKU
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://1.13.1
 Kubelet Version:            v1.11.0+d4cacc0
 Kube-Proxy Version:         v1.11.0+d4cacc0
Non-terminated Pods:         (5 in total)
  Namespace                  Name                       CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                       ------------  ----------  ---------------  -------------
  glusterfs                  glusterfs-storage-pwbmk    100m (2%)     0 (0%)      100Mi (0%)       0 (0%)
  openshift-monitoring       node-exporter-5rsfx        10m (0%)      20m (0%)    20Mi (0%)        40Mi (0%)
  openshift-node             sync-5t627                 0 (0%)        0 (0%)      0 (0%)           0 (0%)
  openshift-sdn              ovs-zplh7                  100m (2%)     200m (5%)   300Mi (0%)       400Mi (1%)
  openshift-sdn              sdn-z2vn5                  100m (2%)     0 (0%)      200Mi (0%)       0 (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests    Limits
  --------  --------    ------
  cpu       310m (7%)   220m (5%)
  memory    620Mi (1%)  440Mi (1%)
Events:
  Type     Reason             Age                 From                                        Message
  ----     ------             ----                ----                                        -------
  Warning  ContainerGCFailed  5m (x1051 over 2d)  kubelet, dhcp46-202.lab.eng.blr.redhat.com  rpc error: code = DeadlineExceeded desc = context deadline exceeded
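
For log inspection on the affected node, something along these lines should surface the PLEG and runtime errors (a rough sketch; the unit names are the OCP 3.11 defaults and are not confirmed against this setup):

# on dhcp46-202, around the time the node transitioned to NotReady (2018-10-02)
journalctl -u atomic-openshift-node --since "2018-10-02" --no-pager | grep -i pleg
journalctl -u docker --since "2018-10-02" --no-pager | grep -iE "error|timeout|deadline"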

Comment 1 RamaKasturi 2018-10-04 13:06:24 UTC
sosreports from all the nodes are present in the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/cns/bugs/BZ-1636053

Comment 2 Sudha Ponnaganti 2018-10-08 17:40:11 UTC
@seth @rama - Is this a blocker for 3.11? We need to close 3.11 ASAP, so I am checking on this. Can you update the defect and send me a note?

Comment 3 Seth Jennings 2018-10-08 17:49:04 UTC
PLEG is not healthy: pleg was last seen active 52h35m10.665630586s ago; threshold is 3m0s

indicates that the runtime is either down or has been non-responsive for a long time.

I'm on a bandwidth-restricted connection at the moment. Can you confirm that a "docker run" from the command line is successful? If this can't be done, I'll send it to Containers.
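
A quick check like the following on the affected node would show whether the runtime responds at all (sketch only; any small image already present locally can stand in for busybox):

docker info
docker run --rm busybox echo ok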

Comment 4 RamaKasturi 2018-10-09 05:07:54 UTC
Hello,

   I currently do not have the setup to run "docker run". But after restarting docker on the particular node that was down, the node came up and everything started working fine in my case.
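
   For reference, the recovery amounted to roughly the following (a sketch; unit and node names as on this setup):

   # on the affected node
   systemctl restart docker && systemctl is-active docker

   # from the master, confirm the node returns to Ready
   oc get node dhcp46-202.lab.eng.blr.redhat.com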

  I have hit this issue only once. Yesterday I ran the same test that caused it, but did not hit it again.

Thanks
kasturi

Comment 5 Seth Jennings 2018-10-09 21:20:11 UTC
Seems like this was a runtime issue (either not started or locked up).

I'll move this to Containers and close it, since it seems to have been an isolated occurrence. If you reopen, it'll be in the correct component.

Comment 7 Jimmy Zhang 2019-10-11 13:08:38 UTC
Our client will migrate a very critical app to OpenShift 3.11 in the middle of this month, and we often encounter this issue. Can we fix it as soon as possible?

Is there another way to fix this issue without a service restart? Please refer to https://access.redhat.com/solutions/3258011.

I have also found this issue reported in the Kubernetes community: https://github.com/kubernetes/kubernetes/issues/45419.



docker version 1.13.1-103 
ocp v3.11.141 
RHEL7.6 
kernel: 3.10-1062.1.1

Comment 8 Brent Baude 2020-05-20 14:21:41 UTC
No root cause or reproducer determined.

