Bug 1896387 - [CNV-2.5] virt-launcher pod being stuck in termination state
Summary: [CNV-2.5] virt-launcher pod being stuck in termination state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.5.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: sgott
QA Contact: Kedar Bidarkar
URL:
Whiteboard:
Depends On: 1883991 1914022 1915085
Blocks:
 
Reported: 2020-11-10 13:02 UTC by Lukas Bednar
Modified: 2023-09-15 00:50 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 14:21:17 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
virt-luncher-pod-get.yaml (15.54 KB, text/plain)
2020-11-10 13:02 UTC, Lukas Bednar
no flags Details
virt-luncher-pod-describe.log (6.77 KB, text/plain)
2020-11-10 13:03 UTC, Lukas Bednar
no flags Details
test-console.log (204.78 KB, text/plain)
2020-11-11 12:20 UTC, Lukas Bednar
no flags Details
kubelet.service.log.gz (18.79 MB, application/gzip)
2020-11-11 14:51 UTC, Lukas Bednar
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2920 0 None None None 2021-07-27 14:22:27 UTC

Description Lukas Bednar 2020-11-10 13:02:32 UTC
Created attachment 1728048 [details]
virt-luncher-pod-get.yaml

Description of problem:

After running the kubevirt test suite, we sometimes see a virt-launcher pod stuck in the Terminating state.

[cloud-user@ocp-psi-executor ~]$ oc get all -n kubevirt-test-default1
NAME                                                                  READY   STATUS        RESTARTS   AGE
pod/virt-launcher-testvmitsz6cldfpkfdf8zph5ztk7vlkcg7c6lf8clhsvblqh   0/2     Terminating   0          8h
NAME                                                              AGE
vmimportconfig.v2v.kubevirt.io/vmimport-kubevirt-hyperconverged   7h38m


All nodes ready
[cloud-user@ocp-psi-executor ~]$ oc get nodes
NAME                                 STATUS   ROLES    AGE   VERSION
kvirt-night46-flmfr-master-0         Ready    master   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-master-1         Ready    master   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-master-2         Ready    master   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-worker-0-kwj79   Ready    worker   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-worker-0-q5sfz   Ready    worker   15h   v1.19.0+d59ce34
kvirt-night46-flmfr-worker-0-v9hnq   Ready    worker   15h   v1.19.0+d59ce34

Please see the attached files for details about the pod.

Version-Release number of selected component (if applicable):
HCO-v2.5.0-432
OCP-4.6.3

How reproducible: Intermittent; not always reproducible.


Steps to Reproduce:
1. Run the kubevirt test suite
2. Observe virt-launcher pods stuck in Terminating

Actual results: the virt-launcher pod remains stuck in Terminating.


Expected results: the pod is deleted successfully.


Additional info:
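
A minimal triage sketch (not part of the original report; it assumes jq is available and reuses the `kubevirt.io=virt-launcher` label selector that appears later in this bug): list the virt-launcher pods that already carry a deletionTimestamp, i.e. pods the API server has marked for deletion but which are still present (Terminating).

```
# List virt-launcher pods that have a deletionTimestamp set, i.e. pods stuck
# in Terminating. Requires jq.
oc get pods --all-namespaces -l "kubevirt.io=virt-launcher" -o json \
  | jq -r '.items[]
           | select(.metadata.deletionTimestamp != null)
           | "\(.metadata.namespace)/\(.metadata.name) marked for deletion at \(.metadata.deletionTimestamp)"'
```

If a stuck pod object has to be cleared manually, `oc delete pod <name> -n <namespace> --grace-period=0 --force` removes it from the API server, although that does not address whatever is holding it in the container runtime.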

Comment 1 Lukas Bednar 2020-11-10 13:03:01 UTC
Created attachment 1728049 [details]
virt-luncher-pod-describe.log

Comment 3 Lukas Bednar 2020-11-11 12:19:39 UTC
It happened again; this time there were two leftover pods.

kubevirt-test-default1                             virt-launcher-testvmihlxwx7ntbcb97c6tcztjd42k28qx8k6dv6j8vghk4b   0/2     Terminating   0          162m
kubevirt-test-default1                             virt-launcher-testvmik2xjd2pt8tpmxrsbtjkxng5b7gwrp8v9tfsq7c7mxz   0/3     Terminating   0          3h12m


It is related to the migration tests, which perform pod eviction.
Attaching test console log.
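
A hedged sketch (not part of the original comment): one way to correlate the stuck launcher pods with the migration tests is to list the VirtualMachineInstanceMigration objects and any eviction-related events in the test namespace seen above.

```
# Correlate stuck launcher pods with in-flight migrations and evictions
# in the test namespace from the output above.
oc get virtualmachineinstancemigrations -n kubevirt-test-default1
oc get events -n kubevirt-test-default1 --sort-by=.lastTimestamp | grep -i evict
```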

Comment 4 Lukas Bednar 2020-11-11 12:20:10 UTC
Created attachment 1728305 [details]
test-console.log

Comment 6 Roman Mohr 2020-11-11 13:21:41 UTC
We have 

```
  deletionGracePeriodSeconds: 30
  deletionTimestamp: "2020-11-03T04:54:46Z"
```

and there are no finalizers left on the pod. It looks like the pod hangs indefinitely on the init containers. Even if our init container binary did not cooperate with the kubelet, it would be killed forcefully after 30 seconds.

We recently hit this in kubevirt CI as well, after updating containerd/runc to resolve https://github.com/kubernetes/kubernetes/issues/95296.

We are now on runc version `1.0.0-rc92`, and since that update we have been seeing the same issue.

Lukas, can we get the kubelet logs? This could be a CNV issue, but a kubelet/runc issue seems more likely.
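
As a hedged aside (not from the original comment; the pod and namespace names are placeholders), the fields discussed above can be read directly off a stuck pod:

```
# Print the deletion metadata of a stuck launcher pod. An old deletionTimestamp
# combined with an empty finalizer list points at the kubelet/runtime rather
# than a finalizer holding the pod.
oc get pod <virt-launcher-pod> -n <test-namespace> -o jsonpath='{.metadata.deletionTimestamp}{"  grace="}{.metadata.deletionGracePeriodSeconds}{"  finalizers="}{.metadata.finalizers}{"\n"}'
```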

Comment 7 Lukas Bednar 2020-11-11 14:51:37 UTC
Created attachment 1728348 [details]
kubelet.service.log.gz

Comment 8 Roman Mohr 2020-11-11 15:33:53 UTC
I described the symptoms to Peter Hunt and he pointed me to https://bugzilla.redhat.com/show_bug.cgi?id=1883991.

Indeed, in our kubelet logs I can find the error mentioned there:


```
445337 Nov 11 11:16:50 kvirt-night46-m8pm7-worker-0-fc2w9 hyperkube[1976]: I1111 11:16:50.330294    1976 event.go:291] "Event occurred" object="kubevirt-test-default1/virt-launcher-testvmi5jmvx2shpmd5tdgsnkzmpqxbljmmv89gxx8hwwlnsd" kind="Pod" apiVersion="v1" type="Warning" reason="FailedCreatePodSandBox" message="Failed to create pod sandbox: rpc error: code = Unknown desc = error reading container (probably exited) json message: EOF"
```

Stu, it may make sense to assign this issue to OpenShift Platform/Node.
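
A hedged sketch (not from the original comment) of one way to search a node's kubelet journal for that error without SSH access; the node name is a placeholder:

```
# Pull the kubelet unit journal from a worker node and grep for the
# sandbox-creation failure quoted above.
oc adm node-logs <worker-node> -u kubelet | grep "FailedCreatePodSandBox"
```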

Comment 9 Fabian Deutsch 2021-04-26 09:26:08 UTC
@lbednar are we still seeing this bug today?

Comment 13 Kedar Bidarkar 2021-06-23 10:05:22 UTC
On a cluster which ran Tier2 4.8 tests:


Output from the cluster:
[kbidarka@localhost auth]$ oc get csv -n openshift-cnv 
NAME                                      DISPLAY                    VERSION   REPLACES                                  PHASE
kubevirt-hyperconverged-operator.v4.8.0   OpenShift Virtualization   4.8.0     kubevirt-hyperconverged-operator.v2.6.2   Succeeded

[kbidarka@localhost auth]$ oc get vmi --all-namespaces
No resources found
[kbidarka@localhost auth]$ oc get pods --all-namespaces -l "kubevirt.io=virt-launcher"
No resources found

Comment 17 Kedar Bidarkar 2021-06-23 10:40:13 UTC
After running the Tier1 KubeVirt tests:

]$ oc get pods --all-namespaces -l "kubevirt.io=virt-launcher"
No resources found


Summary: 
1) Used automation to check whether any pods were still in Terminating state.
2) Based on the data above, no pods were found in Terminating state.
3) Checked after running both Tier1 and Tier2 tests on the clusters.

Comment 20 errata-xmlrpc 2021-07-27 14:21:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.8.0 Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2920

Comment 21 Red Hat Bugzilla 2023-09-15 00:50:55 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

