Bug 2116461
| Summary: | Stale pod sandbox remains on the node due to "Kill container failed" | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Chen <cchen> |
| Component: | Node | Assignee: | Sascha Grunert <sgrunert> |
| Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | amulmule, assingh, cgaynor, dgupte, helwazer, jhonce, sgrunert, yasingh |
| Version: | 4.10 | Keywords: | Reopened |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2023-01-24 08:29:41 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
The issue reproduced again after the upgrade from 4.10.10 to 4.10.32.
As per our findings, there is a pod listed by crictl on the node: pod statefulset-under-monitor-2-0 in the cran2 namespace.
oc shows no such pod.
-> $ crictl pods | grep cran
571cbec3f1d68 2 hours ago Ready statefulset-under-monitor-2-0 cran2 0 (default)
-> and the pod's containers are Exited
$ sudo crictl ps -a | grep 571cbec3f1d68
bafaec077424f 1559fd7e3ba0cc0c54242d429d65d1b722977d9800ee6a4427b93abdcab86c4a 2 hours ago Exited container-g-non-critical 0 571cbec3f1d68
4e07d496eb695 0488dfea855f84ab3564b6ce047e0e0011570f648a09ff9a172725b6ad9964d6 2 hours ago Exited haagent 0 571cbec3f1d68
-> no matching pod in oc output
$ oc get pod -n cran2
No resources were found in the cran2 namespace.
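The mismatch above (a sandbox CRI-O still lists while the API server reports no pods) can be checked mechanically by diffing the two listings. Below is a minimal sketch, assuming the default `crictl pods` column layout shown above (POD ID, CREATED as three words, STATE, NAME, NAMESPACE, ATTEMPT, RUNTIME); the `stale_candidates` helper and the captured-output approach are illustrative, not part of any existing tooling:

```shell
#!/bin/sh
# stale_candidates prints "NAMESPACE/NAME" for every sandbox present in a
# captured `crictl pods` listing but absent from a captured
# `oc get pods -A --no-headers` listing (NAMESPACE NAME ... per row).
stale_candidates() {
    crio_list=$1
    api_list=$2
    t1=$(mktemp); t2=$(mktemp)
    # crictl row: ID(1) CREATED(2-4) STATE(5) NAME(6) NAMESPACE(7) ...
    echo "$crio_list" | awk 'NF { print $7 "/" $6 }' | sort > "$t1"
    echo "$api_list"  | awk 'NF { print $1 "/" $2 }' | sort > "$t2"
    comm -23 "$t1" "$t2"    # lines only in the CRI-O listing
    rm -f "$t1" "$t2"
}

# Example with the rows from this bug: CRI-O lists the sandbox, but the API
# server has no pods in the cran2 namespace.
crio_rows='571cbec3f1d68 2 hours ago Ready statefulset-under-monitor-2-0 cran2 0 (default)'
api_rows=''
stale_candidates "$crio_rows" "$api_rows"
# prints: cran2/statefulset-under-monitor-2-0
```

On a live node the two arguments would come from `"$(crictl pods | tail -n +2)"` and `"$(oc get pods -A --no-headers)"` instead of captured strings.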
Hey @sgrunert, answers to your questions:

1) Is the node under some CPU/memory pressure by any chance?
They cannot confirm, as they don't know what the situation was when the issue happened. Regarding performance, they shared another case: https://access.redhat.com/support/cases/#/case/03352031. According to that case:
- Environment: test environment, OCP 4.10.32; host: 48 cores (10 cores for the system), 192 GB memory; five tenants (one namespace per tenant).
- Load: 115 pods on the host at the same time; 23 pods per tenant.

2) Do we still have no reproducer for 4.11 yet?
They don't use OCP 4.11.1, as they stated this version has several serious issues, and the latest version, OCP 4.11.9, was being verified in recent days, so almost no OCP 4.11 is in use except for verification.

Please let me know if further info is needed.

Regards,
Yadvendra Singh
Red Hat
Description of problem:

Removing and creating pods continuously can leave stale pod sandboxes on the node forever:

$ crictl pods | grep NotReady
40f3504e1fe08 4 days ago NotReady hello-openshift-6-7d59599c8b-wpf5f test-vdu 0 (default)
41a9c8e1cf681 4 days ago NotReady hello-openshift-7-7d59599c8b-dqcxz test-vdu 0 (default)
4d4697e8c54d1 4 days ago NotReady hello-openshift-9-7d59599c8b-52xw6 test-vdu 0 (default)
700c81bf5ca49 4 days ago NotReady hello-openshift-2-7d59599c8b-hfcwb test-vdu 0 (default)

It seems that, due to the "failed to unmount container" error, the kubelet doesn't send RemovePodSandbox to CRI-O:

$ grep 'Kill container failed' /tmp/journal.log | grep pod=
Aug 04 04:06:42 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 04:06:42.396552 5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container 558ca630c2f246ee9e4bcbfb0d23e1ab371c4564f0707101a3c547965baeb8a1: layer not known" pod="test-vdu/hello-openshift-2-7d59599c8b-hfcwb" podUID=76663117-5e77-4162-a50e-dc0c8b23457f containerName="hello-openshift-2" containerID={Type:cri-o ID:558ca630c2f246ee9e4bcbfb0d23e1ab371c4564f0707101a3c547965baeb8a1}
Aug 04 06:39:34 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 06:39:34.456189 5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container a0871468b3a6fb6cce3f9e05b3970c28d301467fd68d69c5377d26b211977ee6: layer not known" pod="test-vdu/hello-openshift-9-7d59599c8b-52xw6" podUID=2b89475a-1eb6-485b-ba88-90ed269019e2 containerName="hello-openshift-1" containerID={Type:cri-o ID:a0871468b3a6fb6cce3f9e05b3970c28d301467fd68d69c5377d26b211977ee6}
Aug 04 10:16:48 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 10:16:48.369613 5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container ffc760c4d866de3a13d8234788338589a4ec53c98407ec69cc90fae7803d8036: layer not known" pod="test-vdu/hello-openshift-7-7d59599c8b-dqcxz" podUID=e6f839f5-e5ce-42b0-bec7-7003a17fe209 containerName="hello-openshift-1" containerID={Type:cri-o ID:ffc760c4d866de3a13d8234788338589a4ec53c98407ec69cc90fae7803d8036}
Aug 04 13:29:54 dell-per730-08.gsslab.pek2.redhat.com hyperkube[5705]: E0804 13:29:54.367788 5705 kuberuntime_container.go:762] "Kill container failed" err="rpc error: code = Unknown desc = failed to unmount container fc89cc22821c511013b99e9c3ce9a97a4aced693454cd911df3b22a3654361f5: layer not known" pod="test-vdu/hello-openshift-6-7d59599c8b-wpf5f" podUID=0d327307-c76b-4cb5-b53a-230fc0b1fc89 containerName="hello-openshift-1" containerID={Type:cri-o ID:fc89cc22821c511013b99e9c3ce9a97a4aced693454cd911df3b22a3654361f5}

Version-Release number of selected component (if applicable):
4.10.10, with http://brew-task-repos.usersys.redhat.com/repos/scratch/pehunt/cri-o/1.23.3/12.rhaos4.10.gitddf4b1a.1.el8/x86_64/ patched

How reproducible:
Quite often

Steps to Reproduce:
1. $ ssh core.48.25 (password: redhatgss)
   $ sudo su -
   $ cd ~/helm/test-mychart
   $ oc new-project <your project>
   $ for i in `crictl pods | grep NotReady | awk '{ print $1 }'`; do crictl rmp $i; done
   $ ./script.sh
2. Run the script for some time; NotReady pod sandboxes will remain on the node.

Actual results:
NotReady pod sandboxes remain on the node indefinitely.

Expected results:
Pod sandboxes are removed once their pods are deleted.

Additional info:
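The cleanup one-liner in the reproduce steps (`for i in ...; do crictl rmp $i; done`) can be split so the ID extraction is checkable against captured output. A sketch, assuming the default `crictl pods` table layout (POD ID in column 1, STATE in column 5); the `notready_ids` helper name is illustrative:

```shell
#!/bin/sh
# notready_ids prints the sandbox ID of every NotReady row in a captured
# `crictl pods` listing; any header line is skipped because its STATE column
# does not equal "NotReady".
notready_ids() {
    echo "$1" | awk '$5 == "NotReady" { print $1 }'
}

# On a live node this would feed crictl directly:
#   for id in $(notready_ids "$(crictl pods)"); do crictl rmp "$id"; done

# Exercised here on two rows modeled on the listing in this bug:
sample='40f3504e1fe08 4 days ago NotReady hello-openshift-6-7d59599c8b-wpf5f test-vdu 0 (default)
700c81bf5ca49 4 days ago Ready hello-openshift-2-7d59599c8b-hfcwb test-vdu 0 (default)'
notready_ids "$sample"
# prints: 40f3504e1fe08
```

Matching on the STATE column instead of `grep NotReady` avoids accidentally removing a Ready sandbox whose pod name happens to contain the string.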