Description of problem:

On OCP 3.11 with CRI-O, nodes turn NotReady because the PLEG is reported as not healthy. When a pod is deleted, CRI-O stops reporting container status to the kubelet. The issue becomes visible when the deploymentconfig defines a terminationGracePeriodSeconds that exceeds the PLEG threshold defined here:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/pleg/generic.go#L83

The PLEG tries to update the status of the pod, but the status request for the not-yet-stopped container hits a lock in CRI-O and gets no reply before the end of the grace period. The PLEG is therefore blocked for the pod's grace period, which trips the NotReady status (a simplified sketch of the health check involved is included after the logs at the end of this report).

We see this on multiple deployments whose pods contain JBoss containers: the JBoss containers take too long to stop, so we have to set the terminationGracePeriodSeconds to allow them to terminate correctly.

Version-Release number of selected component (if applicable):
- openshift v3.11.248 / kubernetes v1.11.0+d4cacc0
- crio version 1.11.16-0.11.dev.rhaos3.11.gitd6a416d.el7-dev

How reproducible:

Steps to Reproduce:
1. Create a Deployment whose pods have two containers and a 300-second termination grace period:
~~~
cat pod.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy
  labels:
    app: dummy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: dummy
  template:
    metadata:
      labels:
        app: dummy
    spec:
      terminationGracePeriodSeconds: 300
      containers:
      - name: rootcontainer1
        image: rhel:latest
        command: ["sleep"]
        args: ["30000"]
        imagePullPolicy: "IfNotPresent"
      - name: rootcontainer3
        image: rhel:latest
        command: ["sleep"]
        args: ["60"]
        imagePullPolicy: "IfNotPresent"
~~~
2. Kill the containers.
3. The PLEG turns unhealthy and the node goes into the NotReady state.

Actual results:
Node NotReady.

Expected results:
Node stays Ready and the pods terminate correctly.

Additional info:

Additional logs from one affected cluster.

Killing containers with a 300 second grace period:
~~~
Nov 03 14:02:52 node-xxxxxxxxxx atomic-openshift-node[7317]: I1103 14:02:52.482765 7317 kubelet.go:1836] SyncLoop (DELETE, "api"): "dummy-6f94547d76-4g9kt_redisfs(3a2f932c-1ddd-11eb-b4d6-fa163e23b0c6)"
Nov 03 14:02:52 node-xxxxxxxxxx atomic-openshift-node[7317]: I1103 14:02:52.483067 7317 kuberuntime_container.go:553] Killing container "cri-o://b5fdf91787be8993dae2a5aff90d078f10265a4d60667f50529e52f16f617d74" with 300 second grace period
Nov 03 14:02:52 node-xxxxxxxxxx atomic-openshift-node[7317]: I1103 14:02:52.483161 7317 kuberuntime_container.go:553] Killing container "cri-o://146ad0d01f73dc9764db5c50c7eae3f32957884a11ac0fd99b9299bf1c451e72" with 300 second grace period
~~~

The kubelet is trying to get container status from CRI-O but gets no response:
~~~
Nov 03 14:03:08 node-xxxxxxxxxxxx atomic-openshift-node[7317]: E1103 14:03:08.659867 7317 summary.go:102] Failed to get system container stats for "/system.slice/atomic-openshift-node.service": failed to get cgroup stats for "/system.slice/atomic-openshift-node.service": failed to get container info for "/system.slice/atomic-openshift-node.service": unknown container "/system.slice/atomic-openshift-node.service"
~~~

After the 3-minute PLEG threshold elapses, the node becomes NotReady:
~~~
Nov 03 14:06:41 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:41.056518 7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.740511634s ago; threshold is 3m0s]
Nov 03 14:06:41 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:41.156738 7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m0.84075748s ago; threshold is 3m0s]
Nov 03 14:06:41 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:41.356947 7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m1.040968257s ago; threshold is 3m0s]
Nov 03 14:06:41 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:41.757147 7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m1.441163584s ago; threshold is 3m0s]
Nov 03 14:06:42 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:42.557356 7317 kubelet.go:1758] skipping pod synchronization - [PLEG is not healthy: pleg was last seen active 3m2.241378099s ago; threshold is 3m0s]
Nov 03 14:06:42 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:42.718523 7317 kubelet_node_status.go:441] Recording NodeNotReady event message for node node-xxxxxxxxxxxx
Nov 03 14:06:42 node-xxxxxxxxxxxx atomic-openshift-node[7317]: I1103 14:06:42.718557 7317 kubelet_node_status.go:834] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2020-11-03 14:06:42.718506339 +0000 UTC m=+1046674.088487382 LastTransitionTime:2020-11-03 14:06:42.718506339 +0000 UTC m=+1046674.088487382 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m2.402555237s ago; threshold is 3m0s}
~~~

Please let me know if you need more details, logs, etc.
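For reference, the check that trips here looks roughly like the following. This is a minimal, self-contained Go sketch assuming the 3-minute relist threshold from the linked generic.go; the `pleg` type, its `lastRelist` field, and the `main` simulation are illustrative stand-ins, not the actual kubelet code.

~~~
package main

import (
	"fmt"
	"time"
)

// relistThreshold mirrors the 3-minute limit referenced in the description.
const relistThreshold = 3 * time.Minute

type pleg struct {
	lastRelist time.Time // updated each time a relist completes
}

// healthy reports false once a relist has not completed within the threshold,
// which is what drives the "PLEG is not healthy" NotReady condition in the logs.
func (p *pleg) healthy() (bool, error) {
	elapsed := time.Since(p.lastRelist)
	if elapsed > relistThreshold {
		return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
	}
	return true, nil
}

func main() {
	// Simulate a relist that has been stuck for a 300s grace period: the
	// container status request to the runtime never returns, so lastRelist
	// goes stale and the health check fails.
	p := &pleg{lastRelist: time.Now().Add(-300 * time.Second)}
	if ok, err := p.healthy(); !ok {
		fmt.Println("node would be marked NotReady:", err)
	}
}
~~~

In the failing case the relist never completes because the container status request is stuck behind the lock CRI-O holds while stopping the container, so once the grace period exceeds the threshold the check above fires and the node is marked NotReady.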
This is fixed in https://github.com/cri-o/cri-o/pull/3881, which should be in the latest 3.11 release (which, I see, is newer than the version you have).
Peter, the latest crio version is cri-o-1.11.16-0.13, available in the latest OCP 3.11 release, "OpenShift Container Platform 3.11.306": https://access.redhat.com/errata/RHBA-2020:4170

Due to the heavy workload we have, an OCP upgrade would be difficult right now and needs to be planned. Can you confirm that we can upgrade only crio, from the current version cri-o-1.11.16-0.11 to the latest version cri-o-1.11.16-0.13, which is supposed to contain the fix?

Thanks for your support
Yes, that should be all that's necessary.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 3.11.343 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5363