Bug 2116461 - Stale pod sandbox remains on the node due to "Kill container failed"
Summary: Stale pod sandbox remains on the node due to "Kill container failed"
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Sascha Grunert
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-08 14:41 UTC by Chen
Modified: 2023-08-25 16:25 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-24 08:29:41 UTC
Target Upstream Version:
Embargoed:



Description Chen 2022-08-08 14:41:20 UTC
Description of problem:

Removing and creating Pods continuously can leave stale pod sandboxes on the node forever:

$ crictl pods | grep NotReady

40f3504e1fe08       4 days ago          NotReady            hello-openshift-6-7d59599c8b-wpf5f                               test-vdu                                           0                   (default)
41a9c8e1cf681       4 days ago          NotReady            hello-openshift-7-7d59599c8b-dqcxz                               test-vdu                                           0                   (default)
4d4697e8c54d1       4 days ago          NotReady            hello-openshift-9-7d59599c8b-52xw6                               test-vdu                                           0                   (default)
700c81bf5ca49       4 days ago          NotReady            hello-openshift-2-7d59599c8b-hfcwb                               test-vdu                                           0                   (default)

It seems that, because of the "failed to unmount container" error, the kubelet never sends RemovePodSandbox to CRI-O?

$ grep 'Kill container failed' /tmp/journal.log | grep pod=
Aug 04 04:06:42 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 04:06:42.396552    5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container 558ca630c2f246ee9e4bcbfb0d23e1ab371c4564f0707101a3c547965baeb8a1: layer not known" pod="test-vdu/hello-openshift-2-7d59599c8b-hfcwb" podUID=76663117-5e77-4162-a50e-dc0c8b23457f containerName="hello-openshift-2" containerID={Type:cri-o ID:558ca630c2f246ee9e4bcbfb0d23e1ab371c4564f0707101a3c547965baeb8a1}
Aug 04 06:39:34 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 06:39:34.456189    5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container a0871468b3a6fb6cce3f9e05b3970c28d301467fd68d69c5377d26b211977ee6: layer not known" pod="test-vdu/hello-openshift-9-7d59599c8b-52xw6" podUID=2b89475a-1eb6-485b-ba88-90ed269019e2 containerName="hello-openshift-1" containerID={Type:cri-o ID:a0871468b3a6fb6cce3f9e05b3970c28d301467fd68d69c5377d26b211977ee6}
Aug 04 10:16:48 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 10:16:48.369613    5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container ffc760c4d866de3a13d8234788338589a4ec53c98407ec69cc90fae7803d8036: layer not known" pod="test-vdu/hello-openshift-7-7d59599c8b-dqcxz" podUID=e6f839f5-e5ce-42b0-bec7-7003a17fe209 containerName="hello-openshift-1" containerID={Type:cri-o ID:ffc760c4d866de3a13d8234788338589a4ec53c98407ec69cc90fae7803d8036}
Aug 04 13:29:54 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 13:29:54.367788    5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container fc89cc22821c511013b99e9c3ce9a97a4aced693454cd911df3b22a3654361f5: layer not known" pod="test-vdu/hello-openshift-6-7d59599c8b-wpf5f" podUID=0d327307-c76b-4cb5-b53a-230fc0b1fc89 containerName="hello-openshift-1" containerID={Type:cri-o ID:fc89cc22821c511013b99e9c3ce9a97a4aced693454cd911df3b22a3654361f5}
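
For illustration only, here is a minimal Go sketch (not the actual kubelet source; the types and function names are invented) of the flow suggested by the kuberuntime_container.go errors above: if killing any container keeps failing, the error is propagated and the sandbox-removal step is never reached, which would leave the NotReady sandbox on the node indefinitely.

package main

import (
	"errors"
	"fmt"
)

// mockRuntime stands in for the CRI runtime (CRI-O) and always fails the
// unmount, mimicking the "layer not known" error from the journal above.
type mockRuntime struct{}

func (mockRuntime) StopContainer(id string) error {
	return fmt.Errorf("failed to unmount container %s: layer not known", id)
}

func (mockRuntime) RemovePodSandbox(id string) error {
	fmt.Println("RemovePodSandbox", id)
	return nil
}

// killPod mirrors the hypothesized kubelet flow: every container must be
// killed successfully before the sandbox is removed, so a persistent
// StopContainer error means RemovePodSandbox is never called and the
// NotReady sandbox stays on the node indefinitely.
func killPod(rt mockRuntime, sandboxID string, containerIDs []string) error {
	var errs []error
	for _, id := range containerIDs {
		if err := rt.StopContainer(id); err != nil {
			fmt.Println("Kill container failed:", err) // matches the journal events
			errs = append(errs, err)
		}
	}
	if len(errs) > 0 {
		return errors.Join(errs...) // sandbox removal is skipped
	}
	return rt.RemovePodSandbox(sandboxID)
}

func main() {
	err := killPod(mockRuntime{}, "700c81bf5ca49", []string{"558ca630c2f24"})
	fmt.Println("killPod returned:", err)
}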

Version-Release number of selected component (if applicable):

4.10.10
Patched with the CRI-O scratch build from http://brew-task-repos.usersys.redhat.com/repos/scratch/pehunt/cri-o/1.23.3/12.rhaos4.10.gitddf4b1a.1.el8/x86_64/

How reproducible:

Quite often

Steps to Reproduce:

1. Log in to the node, clean up any existing NotReady sandboxes, and start the test script:

$ ssh core.48.25 (password: redhatgss)
$ sudo su -
$ cd ~/helm/test-mychart
$ oc new-project <your project>
$ for i in `crictl pods | grep NotReady | awk '{ print $1}'`; do crictl rmp $i; done
$ ./script.sh

2. Run the script for some time and NotReady pod sandboxes will remain on the node (a hypothetical sketch of such a churn loop follows below).
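
script.sh itself is not attached to this bug, so the following is only a hypothetical sketch of the kind of create/delete churn it performs, shelling out to oc; the deployment names, image, and sleep interval are invented for illustration.

package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runOC shells out to oc and prints the result; errors are only logged so the
// churn keeps going.
func runOC(args ...string) {
	out, err := exec.Command("oc", args...).CombinedOutput()
	fmt.Printf("oc %v: %s (err=%v)\n", args, out, err)
}

func main() {
	const namespace = "test-vdu" // namespace used in the description above
	for i := 0; ; i++ {
		name := fmt.Sprintf("hello-openshift-%d", i)
		// Create a throwaway Deployment, give its pod a moment to start,
		// then delete it again so pods are constantly created and removed.
		runOC("create", "deployment", name,
			"--image=quay.io/openshift/origin-hello-openshift", "-n", namespace)
		time.Sleep(30 * time.Second)
		runOC("delete", "deployment", name, "-n", namespace)
	}
}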

Actual results:


Expected results:


Additional info:

Comment 5 yadvendra singh 2022-10-21 05:37:24 UTC
 @colum

The issue has been reproduced again after the upgrade from 4.10.10 to 4.10.32.

As per the findings, there is a pod listed by crictl on the node (statefulset-under-monitor-2-0 in the cran2 namespace) while oc shows no such pod:

  -> $ crictl pods | grep cran
     571cbec3f1d68       2 hours ago         Ready               statefulset-under-monitor-2-0                                cran2                                              0                   (default)
   
  -> and the pod's containers are exited:
  
     $ sudo crictl ps -a | grep 571cbec3f1d68
       bafaec077424f       1559fd7e3ba0cc0c54242d429d65d1b722977d9800ee6a4427b93abdcab86c4a   2 hours ago   Exited   container-g-non-critical   0   571cbec3f1d68
       4e07d496eb695       0488dfea855f84ab3564b6ce047e0e0011570f648a09ff9a172725b6ad9964d6   2 hours ago   Exited   haagent                    0   571cbec3f1d68
  -> no output from the oc command:

    $ oc get pod -n cran2
       No resources were found in the cran2 namespace.
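
To make the above cross-check repeatable, here is a rough helper; it is only a sketch, not something attached to this bug, and it assumes crictl's JSON output uses the field names below and that oc on the node has a working kubeconfig. It flags sandboxes whose pod no longer exists in the API server.

package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// Minimal subset of the `crictl pods -o json` output; the field names follow
// what current crictl versions emit and may need adjusting.
type sandboxList struct {
	Items []struct {
		ID       string `json:"id"`
		State    string `json:"state"`
		Metadata struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"metadata"`
	} `json:"items"`
}

func main() {
	out, err := exec.Command("crictl", "pods", "-o", "json").Output()
	if err != nil {
		panic(err)
	}
	var list sandboxList
	if err := json.Unmarshal(out, &list); err != nil {
		panic(err)
	}
	for _, sb := range list.Items {
		// If `oc get pod` fails, the API server no longer knows about this
		// pod, so the sandbox on the node is an orphan.
		check := exec.Command("oc", "get", "pod", sb.Metadata.Name, "-n", sb.Metadata.Namespace)
		if check.Run() != nil {
			fmt.Printf("orphaned sandbox %s (%s/%s, state %s)\n",
				sb.ID, sb.Metadata.Namespace, sb.Metadata.Name, sb.State)
		}
	}
}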

Comment 8 yadvendra singh 2022-11-09 10:17:00 UTC
Hey @sgrunert, you asked:

1) Is the node under some CPU/memory pressure by any chance?
   They cannot confirm, as they don't know what the situation was while the issue happened. Regarding performance, they shared another case: https://access.redhat.com/support/cases/#/case/03352031
   According to that case:
      1) Environment: OCP 4.10.32; host: 48 cores (10 cores for the system), 192 GB memory, five tenants (one namespace per tenant)
      2) At the same time, 115 pods on the host; 23 pods per tenant
   

2) Do we still have no reproducer for 4.11 yet?

They don't use OCP 4.11.1, as they stated that version has several serious issues, and the latest version, OCP 4.11.9, was still being verified in recent days, so almost no 4.11 clusters are in use except for verification.

Please let me know if you need further info.



Regards,
Yadvendra Singh
Red Hat

