Bug 1896387

Summary: [CNV-2.5] virt-launcher pod being stuck in termination state
Product: Container Native Virtualization (CNV)
Reporter: Lukas Bednar <lbednar>
Component: Virtualization
Assignee: sgott
Status: CLOSED ERRATA
QA Contact: Kedar Bidarkar <kbidarka>
Severity: high
Docs Contact:
Priority: urgent
Version: 2.5.0
CC: cnv-qe-bugs, fdeutsch, kbidarka, lbednar, pehunt, rmohr, rphillips, sgott
Target Milestone: ---
Keywords: TestOnly
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-07-27 14:21:17 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1883991, 1914022, 1915085    
Bug Blocks:    
Attachments:
virt-luncher-pod-get.yaml (flags: none)
virt-luncher-pod-describe.log (flags: none)
test-console.log (flags: none)
kubelet.service.log.gz (flags: none)

Description Lukas Bednar 2020-11-10 13:02:32 UTC
Created attachment 1728048 [details]
virt-luncher-pod-get.yaml

Description of problem:

After running the kubevirt test suite, we sometimes see a virt-launcher pod stuck in the Terminating state.

[cloud-user@ocp-psi-executor ~]$ oc get all -n kubevirt-test-default1
NAME                                                                  READY   STATUS        RESTARTS   AGE
pod/virt-launcher-testvmitsz6cldfpkfdf8zph5ztk7vlkcg7c6lf8clhsvblqh   0/2     Terminating   0          8h
NAME                                                              AGE
vmimportconfig.v2v.kubevirt.io/vmimport-kubevirt-hyperconverged   7h38m
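
For reference, one way to spot these leftovers without scanning the full `oc get all` output is to list only virt-launcher pods that already carry a deletion timestamp. This is just a sketch; the label selector is the same one used elsewhere in this bug for virt-launcher pods, and the custom columns only print metadata already present on the pod:

```
# Print namespace, name, and deletion timestamp of all virt-launcher pods.
# Pods shown as "Terminating" are exactly the ones with a non-empty DELETION column.
oc get pods --all-namespaces -l kubevirt.io=virt-launcher \
  -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DELETION:.metadata.deletionTimestamp
```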


All nodes are Ready:
[cloud-user@ocp-psi-executor ~]$ oc get nodes
NAME                                 STATUS   ROLES    AGE   VERSION
kvirt-night46-flmfr-master-0         Ready    master   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-master-1         Ready    master   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-master-2         Ready    master   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-worker-0-kwj79   Ready    worker   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-worker-0-q5sfz   Ready    worker   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-worker-0-v9hnq   Ready    worker   15h   v1.19.0+d59ce34

Please see the attached files for details about the pod.

Version-Release number of selected component (if applicable):
HCO-v2.5.0-432
OCP-4.6.3

How reproducible: Not always reproducible (intermittent)


Steps to Reproduce:
1. Run the kubevirt test suite.
2. Observe stuck virt-launcher pods.

Actual results: The pod is stuck in the Terminating state.


Expected results: The pod is deleted successfully.


Additional info:

Comment 1 Lukas Bednar 2020-11-10 13:03:01 UTC
Created attachment 1728049 [details]
virt-luncher-pod-describe.log

Comment 3 Lukas Bednar 2020-11-11 12:19:39 UTC
This happened again; this time there were two leftover pods.

kubevirt-test-default1                             virt-launcher-testvmihlxwx7ntbcb97c6tcztjd42k28qx8k6dv6j8vghk4b   0/2     Terminating   0          162m
kubevirt-test-default1                             virt-launcher-testvmik2xjd2pt8tpmxrsbtjkxng5b7gwrp8v9tfsq7c7mxz   0/3     Terminating   0          3h12m


It is related to the migration tests, which perform pod eviction.
Attaching the test console log.

Comment 4 Lukas Bednar 2020-11-11 12:20:10 UTC
Created attachment 1728305 [details]
test-console.log

Comment 6 Roman Mohr 2020-11-11 13:21:41 UTC
We have 

```
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2020-11-03T04:54:46Z"
```

and there are no finalizers left on the pod. It looks like the pod hangs indefinitely in its init containers. Even if our init container binary did not cooperate with the kubelet, it would be killed forcefully after 30 seconds.
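
To double-check those fields on a live cluster, a minimal query like the one below should do (pod name taken from the description above; it only reads standard pod metadata):

```
# Show deletion timestamp, grace period, and any remaining finalizers on the stuck pod.
oc get pod -n kubevirt-test-default1 \
  virt-launcher-testvmitsz6cldfpkfdf8zph5ztk7vlkcg7c6lf8clhsvblqh \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.deletionGracePeriodSeconds}{"\n"}{.metadata.finalizers}{"\n"}'
```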

We recently experienced this in kubevirt CI as well, after we updated containerd/runc to resolve https://github.com/kubernetes/kubernetes/issues/95296.

We are now on runc version `1.0.0-rc92` there, and since then we have been hitting this issue too.

Lukas, can we get the kubelet logs? There could be a CNV issue, but that sounds less likely than a kubelet/runc issue.
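
(For collecting them, something along these lines should work without SSH access to the nodes; the node name is only an example taken from the listing above:)

```
# Dump the kubelet journal unit from one worker node to a local file.
oc adm node-logs kvirt-night46-flmfr-worker-0-kwj79 -u kubelet > kubelet.service.log
```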

Comment 7 Lukas Bednar 2020-11-11 14:51:37 UTC
Created attachment 1728348 [details]
kubelet.service.log.gz

Comment 8 Roman Mohr 2020-11-11 15:33:53 UTC
I described the symptoms to Peter Hunt and he pointed me to https://bugzilla.redhat.com/show_bug.cgi?id=1883991.

Indeed, in our kubelet logs I can find the error mentioned there:


```
445337 Nov 11 11:16:50 kvirt-night46-m8pm7-worker-0-fc2w9 hyperkube[1976]: I1111 11:16:50.330294    1976 event.go:291] "Event occurred" object="kubevirt-test-default1/virt-launcher-testvmi5jmvx2shpmd5tdgsnkzmpqxbljmmv89gxx8hwwlnsd" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodSandBox" message="Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF"
```
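
A quick way to confirm the same failure on the other nodes is to count the sandbox errors in each kubelet journal; this is only a sketch built on the standard `oc adm node-logs` command:

```
# Count FailedCreatePodSandBox events in the kubelet journal of every node.
for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
  echo "== ${node}"
  oc adm node-logs "${node}" -u kubelet | grep -c 'FailedCreatePodSandBox' || true
done
```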

Stu, it may make sense to assign this issue to OpenShift Platform/Node.

Comment 9 Fabian Deutsch 2021-04-26 09:26:08 UTC
@lbednar are we still seeing this bug today?

Comment 13 Kedar Bidarkar 2021-06-23 10:05:22 UTC
On a cluster that ran the Tier2 4.8 tests:


Output from the following commands:
[kbidarka@localhost auth]$ oc get csv -n openshift-cnv 
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.2   Succeeded

[kbidarka@localhost auth]$ oc get vmi --all-namespaces
No resources found
[kbidarka@localhost auth]$ oc get pods --all-namespaces -l "kubevirt.io=virt-launcher"
No resources found

Comment 17 Kedar Bidarkar 2021-06-23 10:40:13 UTC
After running the Tier1 KubeVirt tests:

]$ oc get pods --all-namespaces -l "kubevirt.io=virt-launcher"
No resources found


Summary:
1) Had to take automation help to check whether we still find any pods in the Terminating state (see the check below).
2) With the data provided above, no pods are found in the Terminating state.
3) Checked after running both the Tier1 and Tier2 tests on the clusters.
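
A check along these lines would surface any leftover pods after a test run (a sketch, not necessarily the exact query the automation uses; it just filters on the STATUS column):

```
# Report any pod, in any namespace, whose STATUS column shows Terminating.
oc get pods --all-namespaces --no-headers | awk '$4 == "Terminating"'
```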

Comment 20 errata-xmlrpc 2021-07-27 14:21:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

Comment 21 Red Hat Bugzilla 2023-09-15 00:50:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days