Bug 2017255 - Migration of VM doesn't clean up the target pod in time in case of failed migration
Summary: Migration of VM doesn't clean up the target pod in time in case of failed migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 4.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.10.0
Assignee: Radim Hrazdil
QA Contact: Yossi Segev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-10-26 07:20 UTC by lpivarc
Modified: 2022-03-16 15:56 UTC
3 users

Fixed In Version: v4.10.0-172
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-16 15:56:33 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub: kubevirt/kubevirt pull 6714 (open) - "virt-launcher: decrease timeout for http get when detecting Istio" - last updated 2021-11-02 15:43:46 UTC
- Red Hat Product Errata: RHSA-2022:0947 - last updated 2022-03-16 15:56:49 UTC

Description lpivarc 2021-10-26 07:20:01 UTC
Description of problem:
Migration of VM doesn't clean up the target pod in time in case of failed migration.

This is caused by our detection of whether the istio proxy is present: the HTTP GET against the proxy readiness endpoint fails and is retried repeatedly, as shown in these virt-launcher logs:
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown: exit-code 2","pos":"virt-launcher.go:567","timestamp":"2021-10-14T22:04:06.674177Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused","timestamp":"2021-10-14T22:04:10.781733Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused","timestamp":"2021-10-14T22:04:13.853706Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused","timestamp":"2021-10-14T22:04:16.925749Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused","timestamp":"2021-10-14T22:04:19.998665Z"}


As a result, it takes more than 10 seconds to clean up the target pod.
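
For illustration only, here is a minimal sketch (not the actual virt-launcher code; the function name and the 1-second timeout are assumptions) of the kind of probe described above, with a short client timeout so a missing istio-proxy is detected quickly instead of blocking cleanup:

package main

import (
	"fmt"
	"net/http"
	"time"
)

// istioProxyPresent probes the standard Envoy readiness endpoint seen in
// the logs above. The 1-second timeout is an illustrative value; the point
// of the fix (kubevirt/kubevirt pull 6714) is to keep this probe short so
// a failed migration's target pod can be torn down promptly.
func istioProxyPresent() bool {
	client := http.Client{Timeout: 1 * time.Second}
	resp, err := client.Get("http://localhost:15021/healthz/ready")
	if err != nil {
		// Connection refused or timeout: assume no istio-proxy sidecar.
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	fmt.Println("istio-proxy present:", istioProxyPresent())
}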

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Start a migration and make it fail.
2. Observe whether the target pod gets cleaned up in less than 10 seconds (see the watch sketch below).
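
For step 2, a hedged sketch of one way to observe the cleanup with timestamps, using client-go to watch the virt-launcher pods (the namespace, kubeconfig path, and label selector are assumptions; `oc get pods -w` works just as well):

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the local kubeconfig (~/.kube/config); the path is an assumption.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch virt-launcher pods; the "default" namespace is an assumption.
	w, err := clientset.CoreV1().Pods("default").Watch(context.TODO(), metav1.ListOptions{
		LabelSelector: "kubevirt.io=virt-launcher",
	})
	if err != nil {
		panic(err)
	}

	// Print a timestamped line per event so the time between the target pod
	// entering Terminating and its final deletion can be read off directly.
	for event := range w.ResultChan() {
		pod, ok := event.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		fmt.Printf("%s  %-8s %s  phase=%s\n",
			time.Now().Format(time.RFC3339), event.Type, pod.Name, pod.Status.Phase)
	}
}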

Actual results:


Expected results:


Additional info:

Comment 1 Yossi Segev 2022-01-13 14:23:50 UTC
Verified with the following scenario:
1. Create a simple fedora VM.

2. View the virt-launcher pod
$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-fedora-bdmqg   2/2     Running   0          100s

3. Start watching the virt-launcher pods (with `oc get pods -w`); initially only the source virt-launcher pod exists.

4. In a different shell, start a simple migration of the VM.

5. As soon as the migration is running, delete the source virt-launcher pod and follow the pod statuses in the first terminal:
$ oc delete pod virt-launcher-vm-fedora-bdmqg

[cnv-qe-jenkins@n-yoss-410-sm-chjhf-executor ~]$ oc get pods -w
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-fedora-bdmqg   2/2     Running   0          100s
virt-launcher-vm-fedora-x6c6r   0/2     Pending   0          0s
virt-launcher-vm-fedora-x6c6r   0/2     Pending   0          1s
virt-launcher-vm-fedora-x6c6r   0/2     Init:0/2   0          1s
virt-launcher-vm-fedora-bdmqg   2/2     Terminating   0          2m11s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          3s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          4s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          5s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          6s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          6s
virt-launcher-vm-fedora-bdmqg   0/2     Terminating   0          2m16s
virt-launcher-vm-fedora-bdmqg   0/2     Terminating   0          2m16s
virt-launcher-vm-fedora-bdmqg   0/2     Terminating   0          2m17s
virt-launcher-vm-fedora-cc4d4   0/2     Pending       0          0s
virt-launcher-vm-fedora-cc4d4   0/2     Pending       0          0s
virt-launcher-vm-fedora-cc4d4   0/2     Pending       0          0s
virt-launcher-vm-fedora-cc4d4   0/2     Init:0/2      0          1s
virt-launcher-vm-fedora-cc4d4   0/2     Init:0/2      0          3s
virt-launcher-vm-fedora-cc4d4   0/2     Init:1/2      0          4s
virt-launcher-vm-fedora-cc4d4   0/2     PodInitializing   0          5s
virt-launcher-vm-fedora-cc4d4   2/2     Running           0          7s

As can be seen, the first target pod is the one with the "x6c6r" suffix.
It starts its expected initialization, and then, when the source pod "bdmqg" is deleted, the target pod also starts terminating.
Eventually a new target pod, cc4d4, is initialized and ends up in the Running state.
I can't say exactly how long it took, but it was a matter of a few seconds (less than 10 seconds for sure).
I repeated the scenario twice, and in both cases I observed the same outcome.

OCP version 4.10.0-fc.0
CNV version 4.10.0

Comment 6 errata-xmlrpc 2022-03-16 15:56:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947

