Bug 2017255

Summary:	Migration of VM doesn't clean up the target pod in time in case of failed migration
Product:	Container Native Virtualization (CNV)	Reporter:	lpivarc
Component:	Networking	Assignee:	Radim Hrazdil <rhrazdil>
Status:	CLOSED ERRATA	QA Contact:	Yossi Segev <ysegev>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	4.9.0	CC:	cnv-qe-bugs, phoracek, ysegev
Target Milestone:	---
Target Release:	4.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	v4.10.0-172	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-03-16 15:56:33 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description lpivarc 2021-10-26 07:20:01 UTC

Description of problem:
Migration of VM doesn't clean up the target pod in time in case of failed migration.

This is caused by our check for the presence of istio-proxy:
{"component":"virt-launcher","level":"error","msg":"dirty virt-launcher shutdown: exit-code 2","pos":"virt-launcher.go:567","timestamp":"2021-10-14T22:04:06.674177Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused","timestamp":"2021-10-14T22:04:10.781733Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused","timestamp":"2021-10-14T22:04:13.853706Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused","timestamp":"2021-10-14T22:04:16.925749Z"}
{"component":"virt-launcher","level":"error","msg":"error when checking for istio-proxy presence","pos":"virt-launcher.go:657","reason":"Get \"http://localhost:15021/healthz/ready\": dial tcp [::1]:15021: connect: connection refused","timestamp":"2021-10-14T22:04:19.998665Z"}


It takes more than 10 seconds to clean up.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Start migration and make it fail.
2. Observe whether the target pods get cleaned up in less than 10 seconds.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Yossi Segev 2022-01-13 14:23:50 UTC

Verified with the following scenario:
1. Create a simple fedora VM.

2. View the virt-launcher pod
$ oc get pods
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-fedora-bdmqg   2/2     Running   0          100s

3. Start watching the virt-launcher pods (with `oc get pods -w`) while initially there is only the source virt-launcher.

4. In a different shell, start a simple migration of the VM.

5. As soon as the migration is running, delete the source virt-launcher pod, and follow the pod statuses in the first terminal:
$ oc delete pod virt-launcher-vm-fedora-bdmqg

[cnv-qe-jenkins@n-yoss-410-sm-chjhf-executor ~]$ oc get pods -w
NAME                            READY   STATUS    RESTARTS   AGE
virt-launcher-vm-fedora-bdmqg   2/2     Running   0          100s
virt-launcher-vm-fedora-x6c6r   0/2     Pending   0          0s
virt-launcher-vm-fedora-x6c6r   0/2     Pending   0          1s
virt-launcher-vm-fedora-x6c6r   0/2     Init:0/2   0          1s
virt-launcher-vm-fedora-bdmqg   2/2     Terminating   0          2m11s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          3s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          4s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          5s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          6s
virt-launcher-vm-fedora-x6c6r   0/2     Terminating   0          6s
virt-launcher-vm-fedora-bdmqg   0/2     Terminating   0          2m16s
virt-launcher-vm-fedora-bdmqg   0/2     Terminating   0          2m16s
virt-launcher-vm-fedora-bdmqg   0/2     Terminating   0          2m17s
virt-launcher-vm-fedora-cc4d4   0/2     Pending       0          0s
virt-launcher-vm-fedora-cc4d4   0/2     Pending       0          0s
virt-launcher-vm-fedora-cc4d4   0/2     Pending       0          0s
virt-launcher-vm-fedora-cc4d4   0/2     Init:0/2      0          1s
virt-launcher-vm-fedora-cc4d4   0/2     Init:0/2      0          3s
virt-launcher-vm-fedora-cc4d4   0/2     Init:1/2      0          4s
virt-launcher-vm-fedora-cc4d4   0/2     PodInitializing   0          5s
virt-launcher-vm-fedora-cc4d4   2/2     Running           0          7s

As can be seen, the first target pod is the one with the "x6c6r" suffix.
It starts its expected initialization, and then, when the source pod "bdmqg" is deleted, the target pod also starts terminating.
Eventually a new target pod "cc4d4" is initialized, and ends up in the Running state.
I can't say exactly how long it took, but it was a matter of a few seconds (less than 10 seconds for sure).
I repeated the scenario twice, and in both cases I observed the same outcome.

OCP version 4.10.0-fc.0
CNV version 4.10.0

Comment 6 errata-xmlrpc 2022-03-16 15:56:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Virtualization 4.10.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0947