Created attachment 1728048 [details]
virt-luncher-pod-get.yaml

Description of problem:
After running the kubevirt test suite we sometimes see a virt-launcher pod stuck in Terminating state.

[cloud-user@ocp-psi-executor ~]$ oc get all -n kubevirt-test-default1
NAME                                                                  READY   STATUS        RESTARTS   AGE
pod/virt-launcher-testvmitsz6cldfpkfdf8zph5ztk7vlkcg7c6lf8clhsvblqh   0/2     Terminating   0          8h

NAME                                                              AGE
vmimportconfig.v2v.kubevirt.io/vmimport-kubevirt-hyperconverged   7h38m

All nodes are Ready:

[cloud-user@ocp-psi-executor ~]$ oc get nodes
NAME                                 STATUS   ROLES    AGE   VERSION
kvirt-night46-flmfr-master-0         Ready    master   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-master-1         Ready    master   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-master-2         Ready    master   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-worker-0-kwj79   Ready    worker   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-worker-0-q5sfz   Ready    worker   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-worker-0-v9hnq   Ready    worker   15h   v1.19.0+d59ce34

Please see the attached files for the pod details.

Version-Release number of selected component (if applicable):
HCO-v2.5.0-432
OCP-4.6.3

How reproducible:
Not always reproducible.

Steps to Reproduce:
1. Run the kubevirt test suite
2. Observe stuck virt-launcher pods

Actual results:
Stuck pod.

Expected results:
The pod is deleted successfully.

Additional info:
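A quick way to spot the leftovers is to list virt-launcher pods whose deletionTimestamp is already set but which are still present. This is only a minimal sketch, not part of the original report; the `kubevirt.io=virt-launcher` label selector is the one used later in this bug:

```
# List virt-launcher pods that already have a deletionTimestamp but still exist
# (these are the ones `oc get pods` shows as "Terminating").
oc get pods --all-namespaces -l kubevirt.io=virt-launcher \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DELETION:.metadata.deletionTimestamp \
  | grep -v '<none>'
```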
Created attachment 1728049 [details] virt-luncher-pod-describe.log
Happening again, this time there were two leftovers.

kubevirt-test-default1   virt-launcher-testvmihlxwx7ntbcb97c6tcztjd42k28qx8k6dv6j8vghk4b   0/2   Terminating   0   162m
kubevirt-test-default1   virt-launcher-testvmik2xjd2pt8tpmxrsbtjkxng5b7gwrp8v9tfsq7c7mxz   0/3   Terminating   0   3h12m

It is related to the migration tests, which perform pod eviction (a rough sketch of such an eviction is below). Attaching the test console log.
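For context, the migration tests trigger evictions through the standard Eviction subresource. A rough sketch of such an API-initiated eviction (my own illustration, not taken from the test code; the pod and namespace names are placeholders):

```
# eviction.json: Eviction object for the pod to evict (names are placeholders)
cat > eviction.json <<'EOF'
{
  "apiVersion": "policy/v1beta1",
  "kind": "Eviction",
  "metadata": {
    "name": "virt-launcher-example",
    "namespace": "kubevirt-test-default1"
  }
}
EOF

# POST it to the pod's eviction subresource
oc create -f eviction.json --raw /api/v1/namespaces/kubevirt-test-default1/pods/virt-launcher-example/eviction
```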
Created attachment 1728305 [details] test-console.log
We have

```
deletionGracePeriodSeconds: 30
deletionTimestamp: "2020-11-03T04:54:46Z"
```

and there are no finalizers left on the pod.

It looks like the pod hangs indefinitely on the init containers. Even if our init container binary did not cooperate with the kubelet, it would be forcefully killed after 30 seconds.

We recently experienced this in kubevirt CI as well, after containerd/runc were updated to resolve https://github.com/kubernetes/kubernetes/issues/95296. That CI is now on runc `1.0.0-rc92`, and since the update we see the issue there too.

Lukas, can we get the kubelet logs? There could be a CNV issue, but it sounds less likely than a kubelet/runc issue.
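To double-check those fields on a stuck pod, something like the following works (a minimal sketch; the pod name is the one from comment 0, and the force-delete at the end is only a manual cleanup workaround, not a fix for the root cause):

```
NS=kubevirt-test-default1
POD=virt-launcher-testvmitsz6cldfpkfdf8zph5ztk7vlkcg7c6lf8clhsvblqh

# Show deletion timestamp, grace period and any remaining finalizers
oc get pod "$POD" -n "$NS" \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.deletionGracePeriodSeconds}{"\n"}{.metadata.finalizers}{"\n"}'

# Manual workaround only: remove the stuck pod object from the API server
# (this does not clean up anything left behind on the node).
oc delete pod "$POD" -n "$NS" --grace-period=0 --force
```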
Created attachment 1728348 [details] kubelet.service.log.gz
I described the symptoms to Peter Hunt and he pointed me to https://bugzilla.redhat.com/show_bug.cgi?id=1883991. Indeed, in our kubelet logs I can find the error mentioned there:

```
445337 Nov 11 11:16:50 kvirt-night46-m8pm7-worker-0-fc2w9 hyperkube[1976]: I1111 11:16:50.330294 1976 event.go:291] "Event occurred" object="kubevirt-test-default1/virt-launcher-testvmi5jmvx2shpmd5tdgsnkzmpqxbljmmv89gxx8hwwlnsd" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodSandBox" message="Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF"
```

Stu, it may make sense to assign this issue to OpenShift Platform/Node.
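For reference, the kubelet logs can also be pulled straight from the node and filtered for that error; a sketch (the node name is taken from the log line above):

```
# Fetch the kubelet journal from the affected node and search for the sandbox error
oc adm node-logs kvirt-night46-m8pm7-worker-0-fc2w9 -u kubelet \
  | grep -E 'FailedCreatePodSandBox|error reading container'
```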
@lbednar are we still seeing this bug today?
On a cluster which ran Tier2 4.8 tests:

[kbidarka@localhost auth]$ oc get csv -n openshift-cnv
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.2   Succeeded

[kbidarka@localhost auth]$ oc get vmi --all-namespaces
No resources found

[kbidarka@localhost auth]$ oc get pods --all-namespaces -l "kubevirt.io=virt-launcher"
No resources found
After running Tier1 KubeVirt tests:

]$ oc get pods --all-namespaces -l "kubevirt.io=virt-launcher"
No resources found

Summary:
1) Had to take help from automation to check whether any pods are still left in Terminating state (a sketch of the check is below).
2) With the above data, no pods are found in Terminating state.
3) Checked after running both Tier1 and Tier2 tests on the clusters.
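The check boils down to something like the following (my paraphrase, not the actual automation code); after a clean test run it should print nothing:

```
# Any pod stuck in deletion still shows a "Terminating" status here;
# "No resources found" goes to stderr, so a clean run prints nothing.
oc get pods --all-namespaces -l kubevirt.io=virt-launcher --no-headers 2>/dev/null | grep Terminating
```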
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2920
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days