Description of problem: If the virtual machine migration is canceled before the virt-launcher detects the qemu-kvm process pid, the target virt-launcher is not cleaned up immediately and waits for the qemu-timeout. It will wait in the refresh monitor here https://github.com/kubevirt/kubevirt/blob/f77d50591ddd0f74c0c876e38fdf14ca3fe54be8/pkg/virt-launcher/monitor.go#L126. Since the virt-launcher didn't find the pid yet, mon.pid will be always 0, and mon.isDone will be false. Migration was canceled here: ~~~ {"component":"virt-launcher","kind":"","level":"info","msg":"Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup","name":"rhel7-quick-halibut","namespace":"default","pos":"server.go:151","timestamp":"2023-01-16T08:31:05.730454Z","uid":"64f6bc95-0b0d-4cb2-b954-69318cc409a3"} {"component":"virt-launcher-monitor","level":"info","msg":"Reaped pid 76 with status 0","pos":"virt-launcher-monitor.go:125","timestamp":"2023-01-16T08:31:05.983324Z"} {"component":"virt-launcher","level":"error","msg":"migration successfully aborted","pos":"qemuMigrationDstFinish:5894","subcomponent":"libvirt","thread":"26","timestamp":"2023-01-16T08:31:06.070000Z"} ~~~ Then it waits for the qemu pid and finally timeout after qemu-timeout which here is 5m11s: ~~~ {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:06.420909Z"} {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:07.421195Z"} {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:31:29.420918Z"} ..... ..... ..... {"component":"virt-launcher","level":"info","msg":"Still missing PID for default_rhel7-quick-halibut, open /run/libvirt/qemu/run/default_rhel7-quick-halibut.pid: no such file or directory","pos":"monitor.go:125","timestamp":"2023-01-16T08:36:16.421068Z"} {"component":"virt-launcher","level":"info","msg":"default_rhel7-quick-halibut not found after timeout","pos":"monitor.go:129","timestamp":"2023-01-16T08:36:16.421119Z"} {"component":"virt-launcher","level":"info","msg":"Waiting on final notifications to be sent to virt-handler.","pos":"virt-launcher.go:270","timestamp":"2023-01-16T08:36:16.421153Z"} {"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:501","timestamp":"2023-01-16T08:36:16.422034Z"} ~~~ Although I can also see the message "Signaled target pod virt-launcher-rhel7-quick-halibut-g7kzc to cleanup", it doesn't seem to have any effect here since it is setting receivedEarlyExitSignalEnvVar and is only queried in waitForDomainUUID which is before the refresh monitor. Version-Release number of selected component (if applicable): OpenShift Virtualization 4.11.2 How reproducible: 100 % Steps to Reproduce: 1. Start a virtual machine migration. 2. Cancel the VM migration. We have to cancel before the virt-launcher detects qemu pid. I was able to reproduce this easily when I cancel the migration immediately after the target pod was scheduled. Actual results: Target pod waits for "qemu-timeout" to cleanup after cancelling the VM live migration Expected results: Since the user is canceling the migration, it is expected to immediately terminate the resources created for the migration instead of waiting for a timeout to hit. Additional info:
As this requires some design work to happen, we are targeting this bug to CNV 4.14.
Fix PR: https://github.com/kubevirt/kubevirt/pull/10099
verified on CNV-v4.14.0.rhel9-1711 target pod does not wait for timeout and exits in less than a second after message `migration successfully aborted` appeared in logs: {"component":"virt-launcher","level":"error","msg":"migration successfully aborted","pos":"qemuMigrationDstFinish:6832","subcomponent":"libvirt","thread":"28","timestamp":"2023-08-24T20:40:14.424000Z"} . {"component":"virt-launcher","level":"info","msg":"Exiting...","pos":"virt-launcher.go:497","timestamp":"2023-08-24T20:40:14.429811Z"}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2023:6817