Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2116461

Summary: Stale pod sandbox remains on the node due to "Kill container failed"
Product: OpenShift Container Platform
Reporter: Chen <cchen>
Component: Node
Assignee: Sascha Grunert <sgrunert>
Node sub component: CRI-O
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DEFERRED
Docs Contact:
Severity: medium
Priority: medium
CC: amulmule, assingh, cgaynor, dgupte, helwazer, jhonce, sgrunert, yasingh
Version: 4.10
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-01-24 08:29:41 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Chen 2022-08-08 14:41:20 UTC
Description of problem:

Removing and creating pods continuously can leave stale pod sandboxes on the node indefinitely:

$ crictl pods | grep NotReady

40f3504e1fe08       4 days ago          NotReady            hello-openshift-6-7d59599c8b-wpf5f                               test-vdu                                           0                   (default)
41a9c8e1cf681       4 days ago          NotReady            hello-openshift-7-7d59599c8b-dqcxz                               test-vdu                                           0                   (default)
4d4697e8c54d1       4 days ago          NotReady            hello-openshift-9-7d59599c8b-52xw6                               test-vdu                                           0                   (default)
700c81bf5ca49       4 days ago          NotReady            hello-openshift-2-7d59599c8b-hfcwb                               test-vdu                                           0                   (default)

It seems that, because of the "failed to unmount container" error, the kubelet never sends RemovePodSandbox to CRI-O?

$ grep 'Kill container failed' /tmp/journal.log | grep pod=
Aug 04 04:06:42 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 04:06:42.396552    5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container 558ca630c2f246ee9e4bcbfb0d23e1ab371c4564f0707101a3c547965baeb8a1: layer not known" pod="test-vdu/hello-openshift-2-7d59599c8b-hfcwb" podUID=76663117-5e77-4162-a50e-dc0c8b23457f containerName="hello-openshift-2" containerID={Type:cri-o ID:558ca630c2f246ee9e4bcbfb0d23e1ab371c4564f0707101a3c547965baeb8a1}
Aug 04 06:39:34 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 06:39:34.456189    5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container a0871468b3a6fb6cce3f9e05b3970c28d301467fd68d69c5377d26b211977ee6: layer not known" pod="test-vdu/hello-openshift-9-7d59599c8b-52xw6" podUID=2b89475a-1eb6-485b-ba88-90ed269019e2 containerName="hello-openshift-1" containerID={Type:cri-o ID:a0871468b3a6fb6cce3f9e05b3970c28d301467fd68d69c5377d26b211977ee6}
Aug 04 10:16:48 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 10:16:48.369613    5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container ffc760c4d866de3a13d8234788338589a4ec53c98407ec69cc90fae7803d8036: layer not known" pod="test-vdu/hello-openshift-7-7d59599c8b-dqcxz" podUID=e6f839f5-e5ce-42b0-bec7-7003a17fe209 containerName="hello-openshift-1" containerID={Type:cri-o ID:ffc760c4d866de3a13d8234788338589a4ec53c98407ec69cc90fae7803d8036}
Aug 04 13:29:54 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 13:29:54.367788    5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container fc89cc22821c511013b99e9c3ce9a97a4aced693454cd911df3b22a3654361f5: layer not known" pod="test-vdu/hello-openshift-6-7d59599c8b-wpf5f" podUID=0d327307-c76b-4cb5-b53a-230fc0b1fc89 containerName="hello-openshift-1" containerID={Type:cri-o ID:fc89cc22821c511013b99e9c3ce9a97a4aced693454cd911df3b22a3654361f5}
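One rough way to check whether CRI-O ever received a removal request for these sandboxes is to correlate each stale sandbox ID with the CRI-O journal (this assumes CRI-O logs to journald as crio.service and that crictl points at the CRI-O socket; it is an investigation sketch, not part of the original report):

$ for p in $(crictl pods | grep NotReady | awk '{print $1}'); do
    echo "== sandbox $p =="
    crictl inspectp "$p" | grep -E '"name"|"uid"|"state"'   # sandbox state and owning pod UID
    journalctl -u crio | grep "$p" | tail -n 5              # last CRI-O log lines mentioning this sandbox
  done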

Version-Release number of selected component (if applicable):

4.10.10
With the CRI-O scratch build from http://brew-task-repos.usersys.redhat.com/repos/scratch/pehunt/cri-o/1.23.3/12.rhaos4.10.gitddf4b1a.1.el8/x86_64/ applied

How reproducible:

Quite often

Steps to Reproduce:

1. 

$ ssh core.48.25 (password: redhatgss)
$ sudo su -
$ cd ~/helm/test-mychart
$ oc new-project <your project>
$ for i in `crictl pods | grep NotReady | awk '{ print $1}'`; do crictl rmp $i; done
$ ./script.sh

2. Run the script for some time; NotReady pod sandboxes will remain on the node (see the sketch of a hypothetical churn loop below).
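The contents of script.sh are not included here; the following is only a hypothetical sketch of a churn loop that exercises the same continuous create/delete cycle (the release name, namespace, and sleep intervals are placeholders, not the actual script):

#!/bin/bash
# Hypothetical churn loop: repeatedly install and uninstall a helm release so
# that pods in the target namespace are created and torn down continuously.
NS=test-vdu                    # placeholder namespace
CHART=~/helm/test-mychart      # chart referenced in the reproduction steps

while true; do
  helm install hello-churn "$CHART" -n "$NS"
  sleep 30                     # give the pods time to start
  helm uninstall hello-churn -n "$NS"
  sleep 10                     # give the kubelet time to tear the pods down
done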

Actual results:

Stale NotReady pod sandboxes remain on the node and are never removed.
Expected results:

Pod sandboxes for deleted pods are cleaned up; no NotReady sandboxes remain on the node.
Additional info:

Comment 5 yadvendra singh 2022-10-21 05:37:24 UTC
 @colum

The issue reproduced again after the upgrade from 4.10.10 to 4.10.32.

 Per our findings, a pod (statefulset-under-monitor-2-0 in the cran2 namespace) is still listed by crictl on the node,
    while oc shows no such pod.

  -> $ crictl pods | grep cran
     571cbec3f1d68       2 hours ago         Ready               statefulset-under-monitor-2-0                                cran2                                              0                   (default)
   
  -> and the pod's containers are exited
  
     $ sudo crictl ps -a | grep 571cbec3f1d68
       bafaec077424f       1559fd7e3ba0cc0c54242d429d65d1b722977d9800ee6a4427b93abdcab86c4a       2 hours ago         Exited              container-g-non-critical                      0                   571cbec3f1d68
       4e07d496eb695       0488dfea855f84ab3564b6ce047e0e0011570f648a09ff9a172725b6ad9964d6       2 hours ago         Exited              haagent                                       0                   571cbec3f1d68
  -> no output from the oc command:

    $oc get pod -n cran2
       No resources were found in the cran2 namespace.
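A quick way to spot such orphaned sandboxes (known to crictl but not to the API server) is to cross-check every CRI-O sandbox against oc; this is a rough sketch, assuming jq is available on the node and oc has a valid kubeconfig:

$ for p in $(crictl pods -q); do
    ns=$(crictl inspectp "$p" | jq -r '.status.metadata.namespace')
    name=$(crictl inspectp "$p" | jq -r '.status.metadata.name')
    oc get pod -n "$ns" "$name" >/dev/null 2>&1 || echo "orphaned sandbox: $p ($ns/$name)"
  done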

Comment 8 yadvendra singh 2022-11-09 10:17:00 UTC
Hey @sgrunert, you asked:

1) Is the node under some CPU/memory pressure by any chance?
   They cannot confirm, since they do not know the exact conditions on the node when the issue happens. Regarding performance, they shared another case: https://access.redhat.com/support/cases/#/case/03352031
   According to that case:
                          1) Environment: OCP 4.10.32; host with 48 cores (10 cores reserved for the system), 192 GB memory; five tenants (one namespace per tenant).
                          2) 115 pods on the host at the same time; 23 pods per tenant.
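   For reference, one hedged way to check for pressure on such a node after the fact (the node name is a placeholder, and the second command needs the cluster metrics API to be available):

   $ oc describe node <node-name> | grep -A 10 'Conditions:'   # look for MemoryPressure / DiskPressure / PIDPressure set to True
   $ oc adm top node <node-name>                               # current CPU/memory usage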
   

2) Do we still have no reproducer for 4.11 yet?

They do not use OCP 4.11.1, as they stated this version has several serious issues, and the latest version, OCP 4.11.9, has only been under verification in recent days, so almost no OCP 4.11 clusters are in use except for verification.

Please let me know if you need further info.



Regards,
Yadvendra Singh
Red Hat