Bug 2161184

Summary: Target pod waits for "qemu-timeout" to cleanup after cancelling the VM live migration
Product: Container Native Virtualization (CNV) Reporter: nijin ashok <nashok>
Component: VirtualizationAssignee: Itamar Holder <iholder>
Status: CLOSED ERRATA QA Contact: Denys Shchedrivyi <dshchedr>
Severity: high Docs Contact:
Priority: medium    
Version: 4.11.2CC: acardace, rgertzbe, sgott, usurse
Target Milestone: ---   
Target Release: 4.14.0   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: v4.14.0.rhel9-1469 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-11-08 14:05:03 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description nijin ashok 2023-01-16 09:10:07 UTC
Description of problem:

If the virtual machine migration is canceled before the virt-launcher detects the qemu-kvm process pid, the target virt-launcher is not cleaned up immediately and waits for the qemu-timeout.


It will wait in the refresh monitor here https://github.com/kubevirt/kubevirt/blob/f77d50591ddd0f74c0c876e38fdf14ca3fe54be8/pkg/virt-launcher/monitor.go#L126.

Since the virt-launcher didn't find the pid yet, mon.pid will be always 0, and mon.isDone will be false.


Migration was canceled here:

~~~
{"component":"virt-launcher","kind":"","level":"info","msg":"Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup","name":"rhel7-quick-halibut","namespace":"default","pos":"server.go:151","timestamp":"2023-01-16T08:31:05.730454Z","uid":"64f6bc95-0b0d-4cb2-b954-69318cc409a3"}
{"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 76 with status 0","pos":"virt-launcher-monitor.go:125","timestamp":"2023-01-16T08:31:05.983324Z"}
{"component":"virt-launcher","level":"error","msg":"migration successfully aborted","pos":"qemuMigrationDstFinish:5894","subcomponent":"libvirt","thread":"26","timestamp":"2023-01-16T08:31:06.070000Z"}
~~~

Then it waits for the qemu pid and finally timeout after qemu-timeout which here is 5m11s:

~~~
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:06.420909Z"}

{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:07.421195Z"}

{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:29.420918Z"}
.....
.....
.....
{"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:36:16.421068Z"}

{"component":"virt-launcher","level":"info","msg":"default_rhel7-quick-halibut not found after timeout","pos":"monitor.go:129","timestamp":"2023-01-16T08:36:16.421119Z"}
{"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:270","timestamp":"2023-01-16T08:36:16.421153Z"}
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:501","timestamp":"2023-01-16T08:36:16.422034Z"}
~~~

Although I can also see the message "Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup", it doesn't seem to have any effect here since it is setting  receivedEarlyExitSignalEnvVar and is only queried in waitForDomainUUID which is before the refresh monitor.


Version-Release number of selected component (if applicable):

OpenShift Virtualization 4.11.2

How reproducible:

100 %

Steps to Reproduce:

1. Start a virtual machine migration.
2. Cancel the VM migration. We have to cancel before the virt-launcher detects qemu pid. I was able to reproduce this easily when I cancel the migration immediately after the target pod was scheduled.


Actual results:

Target pod waits for "qemu-timeout" to cleanup after cancelling the VM live migration

Expected results:

Since the user is canceling the migration, it is expected to immediately terminate the resources created for the migration instead of waiting for a timeout to hit.

Additional info:

Comment 1 Kedar Bidarkar 2023-01-18 13:24:03 UTC
As this requires some design work to happen, we are targeting this bug to CNV 4.14.

Comment 3 Itamar Holder 2023-07-13 14:05:44 UTC
Fix PR: https://github.com/kubevirt/kubevirt/pull/10099

Comment 8 Denys Shchedrivyi 2023-08-24 20:48:04 UTC
verified on CNV-v4.14.0.rhel9-1711
target pod does not wait for timeout and exits in less than a second after message `migration successfully aborted` appeared in logs:

{"component":"virt-launcher","level":"error","msg":"migration successfully aborted","pos":"qemuMigrationDstFinish:6832","subcomponent":"libvirt","thread":"28","timestamp":"2023-08-24T20:40:14.424000Z"}
.
{"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:497","timestamp":"2023-08-24T20:40:14.429811Z"}

Comment 11 errata-xmlrpc 2023-11-08 14:05:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817