Bug 1895445 - [CRI-O] PLEG not healthy due to an unresponsive CRI-O when terminationGracePeriod is greater than the PLEG timeout
Summary: [CRI-O] PLEG not healthy due to an unresponsive CRI-O when terminationGracePeri...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Peter Hunt
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-06 17:03 UTC by Mohamed Tleilia
Modified: 2024-03-25 16:59 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-16 12:35:06 UTC
Target Upstream Version:
Embargoed:




Links
Github cri-o/cri-o pull 4352 (closed): [3.11] container status: check whether the container is stopped (last updated 2021-01-05 06:39:11 UTC)
Red Hat Product Errata RHSA-2020:5363 (last updated 2020-12-16 12:35:50 UTC)

Description Mohamed Tleilia 2020-11-06 17:03:36 UTC
Description of problem:
On OCP 3.11, nodes turn NotReady because the PLEG is not healthy with CRI-O. When a pod is deleted, CRI-O stops reporting the container status to the kubelet. The issue becomes visible when the terminationGracePeriod defined in the DeploymentConfig exceeds the PLEG timeout defined here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pleg/generic.go#L83

The PLEG tries to update the status of the pod, but the request for the status of the still-running container hits a lock in CRI-O and gets no reply until the end of the grace period. The PLEG is therefore blocked for the pod's grace period, which triggers the NotReady status.
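
For illustration, one way to observe the blocked status query from the affected node while such a pod is terminating (this is an assumption about tooling, not part of the original report: it presumes crictl is installed alongside CRI-O, and <container-id> is a placeholder):

~~~
# List containers and pick the one that is being stopped.
crictl ps -a | grep rootcontainer1

# While CRI-O is still waiting out the grace period in StopContainer,
# a status query for the same container hangs instead of returning promptly.
time crictl inspect <container-id>

# The kubelet's PLEG relist hits the same hang, which eventually shows up as:
journalctl -u atomic-openshift-node | grep -i "PLEG is not healthy"
~~~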

We see this issue in multiple deployments whose pods contain JBoss containers. JBoss containers take a long time to stop, so we need to set a terminationGracePeriod that allows the JBoss pods to terminate correctly.



Version-Release number of selected component (if applicable):
- openshift v3.11.248 / kubernetes v1.11.0+d4cacc0
- crio version 1.11.16-0.11.dev.rhaos3.11.gitd6a416d.el7-dev

How reproducible:


Steps to Reproduce:
1. Create a Deployment whose pods have 2 containers:
~~~
cat pod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy
  labels:
    app: dummy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dummy
  template:
    metadata:
      labels:
        app: dummy
    spec:
      terminationGracePeriodSeconds: 300
      containers:
        - name: rootcontainer1
          image: rhel:latest
          command: ["sleep"]
          args: ["30000"]
          imagePullPolicy: "IfNotPresent"
        - name: rootcontainer3
          image: rhel:latest
          command: ["sleep"]
          args: ["60"]
          imagePullPolicy: "IfNotPresent"
~~~

2. Kill the containers, e.g. by deleting the pods (see the example commands after this list).
3. PLEG turns unhealthy and the node goes NotReady.
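
For step 2, the commands below are one hedged way to kill the containers and watch the effect (the app=dummy label comes from the manifest above; the procedure is illustrative, not the reporter's exact steps):

~~~
# Delete the pods; the kubelet then issues StopContainer with the 300 s grace period.
oc delete pod -l app=dummy

# Watch the node condition flip to NotReady once the 3m PLEG threshold is exceeded.
oc get nodes -w

# On the node, the PLEG complaints appear in the kubelet (atomic-openshift-node) journal.
journalctl -u atomic-openshift-node -f | grep -i pleg
~~~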

Actual results:

Node not ready

Expected results:

Node stays Ready and the pods terminate correctly

Additional info:

Additional logs from one affected cluster:

Killing containers with a 300 second grace period:

~~~
Nov 03 14:02:52 node-xxxxxxxxxx atomic-openshift-node[7317]: I1103 14:02:52.482765    7317 kubelet.go:1836] SyncLoop (DELETE, "api"): "dummy-6f94547d76-4g9kt_redisfs(3a2f932c-1ddd-11eb-b4d6-fa163e23b0c6)"
Nov 03 14:02:52 node-xxxxxxxxxx atomic-openshift-node[7317]: I1103 14:02:52.483067    7317 kuberuntime_container.go:553] Killing container "cri-o://b5fdf91787be8993dae2a5aff90d078f10265a4d60667f50529e52f16f617d74" with 300 second grace period
Nov 03 14:02:52 node-xxxxxxxxxx atomic-openshift-node[7317]: I1103 14:02:52.483161    7317 kuberuntime_container.go:553] Killing container "cri-o://146ad0d01f73dc9764db5c50c7eae3f32957884a11ac0fd99b9299bf1c451e72" with 300 second grace period
~~~

The kubelet is trying to get the container status from CRI-O but gets no response:

~~~
Nov 03 14:03:08 node-xxxxxxxxxxxx atomic-openshift-node[7317]: E1103 14:03:08.659867    7317 summary.go:102] Failed to get system container stats for "/system.slice/atomic-openshift-node.service": failed to get cgroup stats for "/system.slice/atomic-openshift-node.service": failed to get container info for "/system.slice/atomic-openshift-node.service": unknown container "/system.slice/atomic-openshift-node.service"
~~~

After the 3-minute PLEG timeout, the node becomes NotReady:
 
~~~
Nov 03 14:06:41 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:41.056518    7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.740511634s ago; threshold is 3m0s]
Nov 03 14:06:41 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:41.156738    7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.84075748s ago; threshold is 3m0s]
Nov 03 14:06:41 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:41.356947    7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m1.040968257s ago; threshold is 3m0s]
Nov 03 14:06:41 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:41.757147    7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m1.441163584s ago; threshold is 3m0s]
Nov 03 14:06:42 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:42.557356    7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m2.241378099s ago; threshold is 3m0s]
Nov 03 14:06:42 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:42.718523    7317 kubelet_node_status.go:441] Recording NodeNotReady event message for node  node-xxxxxxxxxxxx
Nov 03 14:06:42  node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:42.718557    7317 kubelet_node_status.go:834] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-11-03 14:06:42.718506339 +0000 UTC m=+1046674.088487382 LastTransitionTime:2020-11-03 14:06:42.718506339 +0000 UTC m=+1046674.088487382 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m2.402555237s ago; threshold is 3m0s}
~~~

Please let me know if you need more details, logs ...

Comment 2 Peter Hunt 2020-11-06 17:37:07 UTC
this is fixed in https://github.com/cri-o/cri-o/pull/3881

which should be in the latest 3.11 release (and which I see is newer than the version you have)
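
For reference, the installed CRI-O version can be checked directly on a node, for example (illustrative commands, not from the original comment):

~~~
# Package version as shipped by the errata.
rpm -q cri-o

# Version reported by the runtime itself.
crio --version
~~~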

Comment 3 Mohamed Tleilia 2020-11-06 17:45:00 UTC
Peter, the latest crio version is cri-o-1.11.16-0.13, available in the latest OCP 3.11 release, "OpenShift Container Platform 3.11.306":

https://access.redhat.com/errata/RHBA-2020:4170

Due to our heavy workload, an OCP upgrade will be difficult right now and needs to be planned.

Can you please confirm that we can upgrade only crio from the current version cri-o-1.11.16-0.11 to the latest version cri-o-1.11.16-0.13, which is supposed to contain the fix?

Thanks for your support

Comment 4 Peter Hunt 2020-11-06 18:32:26 UTC
yes that should be all that's necessary
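
For illustration, a per-node update limited to cri-o might look roughly like the following (a sketch only; the drain options, repository setup, and service names assume a standard OCP 3.11 node and are not a documented procedure):

~~~
# Move workloads off the node before touching the runtime.
oc adm drain <node> --ignore-daemonsets --delete-local-data

# Update only the cri-o package, then restart the runtime and the kubelet.
yum update -y cri-o
systemctl restart crio atomic-openshift-node

# Allow workloads back onto the node.
oc adm uncordon <node>
~~~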

Comment 10 errata-xmlrpc 2020-12-16 12:35:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 3.11.343 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5363

