Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1959436

Summary: VMs stuck in "migrating" status because they are unable to acquire the migration semaphore
Product: Red Hat Enterprise Virtualization Manager
Reporter: nijin ashok <nashok>
Component: vdsm
Assignee: Milan Zamazal <mzamazal>
Status: CLOSED ERRATA
QA Contact: Qin Yuan <qiyuan>
Severity: high
Priority: unspecified
Version: 4.3.7
CC: ahadas, emarcus, gveitmic, lsurette, mzamazal, srevivo, ycui
Target Milestone: ovirt-4.4.8
Keywords: ZStream
Hardware: All
OS: Linux
Fixed In Version: vdsm-4.40.80.2
Doc Type: Bug Fix
Doc Text:
Previously, when a virtual machine was powered off on the source host of a live migration and the migration finished successfully at the same time, the two events interfered with each other and sometimes prevented migration cleanup, leaving additional migrations from that host blocked. In this release, additional migrations are no longer blocked.
Last Closed: 2021-09-08 14:11:50 UTC
Type: Bug
oVirt Team: Virt
Bug Depends On: 1966121    

Description nijin ashok 2021-05-11 14:10:02 UTC
Description of problem:

If a VM is destroyed during migration and the destroy request arrives at almost the same moment libvirtd suspends the CPU, the migration source thread fails to exit and keeps holding the semaphore, blocking subsequent migrations.

Migration attempt.

===
2021-05-03 17:36:29,699+0200 DEBUG (jsonrpc/0) [vds] {u'incomingLimit': 2, u'src': u'source.example.com', u'dstqemu': u'destination.example.com', u'autoConverge': u'false', u'tunneled': u'false', u'enableGuestEvents': False, u'dst': u'destination.example.com:54321', u'vmId': u'6d69ba9d-5616-4d35-8411-33522337a547', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 1250, u'method': u'online', 'mode': 'remote'} (API:493)
===

Got VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED from libvirtd.

===
2021-05-03 17:36:39,896+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') event Suspended detail 1 opaque None (vm:5460)
2021-05-03 17:36:39,897+0200 INFO  (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') CPU stopped: onSuspend (vm:6062)
===

The user asked to stop the VM.

engine log
===
2021-05-03 17:36:39,030+02 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (default task-49631) [f4cf8fc2-555a-40ce-8341-46e41b524f04] Running command: StopVmCommand internal: false. Entities affected :  ID: 6d69ba9d-5616-4d35-8411-33522337a547 Type: VMAction group STOP_VM with role type USER
===

vdsm log
===
2021-05-03 17:36:40,471+0200 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Calling 'VM.destroy' in bridge with {u'vmID': u'6d69ba9d-5616-4d35-8411-33522337a547'} (__init__:329)
2021-05-03 17:36:40,472+0200 DEBUG (jsonrpc/1) [vds] About to destroy VM 6d69ba9d-5616-4d35-8411-33522337a547 (API:330)
2021-05-03 17:36:41,502+0200 INFO  (jsonrpc/1) [api.virt] FINISH destroy return={'status': {'message': 'Virtual machine destroy error', 'code': 42}} from=::ffff:, flow_id=f4cf8fc2-555a-40ce-8341-46e41b524f04, vmId=6d69ba9d-5616-4d35-8411-33522337a547 (api:54)
===

Although the destroy failed, the VM status apparently changed anyway.

At the same time, vdsm got VIR_DOMAIN_EVENT_STOPPED_MIGRATED from libvirtd.

===
2021-05-03 17:36:41,501+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') event Stopped detail 3 opaque None (vm:5460)
===

It appears the VM status was _not_ MIGRATION_SOURCE at that point, so self._onQemuDeath was called from _handle_libvirt_domain_stopped.

===
5373     def _handle_libvirt_domain_stopped(self, detail):
5374         if (detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_MIGRATED and
5375                 self.lastStatus == vmstatus.MIGRATION_SOURCE):
5376             try:
5377                 hooks.after_vm_migrate_source(
5378                     self._domain.xml, self._custom)
5379                 for dev in self._customDevices():
5380                     hooks.after_device_migrate_source(
5381                         dev._deviceXML, self._custom, dev.custom)
5382             finally:
5383                 self.stopped_migrated_event_processed.set()
5384         elif (detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_SAVED and
5385               self.lastStatus == vmstatus.SAVING_STATE):
5386             hooks.after_vm_hibernate(self._domain.xml, self._custom)
5387         else:
5388             exit_code, reason = self._getShutdownReason()
5389             self._onQemuDeath(exit_code, reason)
===

The logs also suggest it went into _onQemuDeath.

===
2021-05-03 17:36:41,501+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') shutdown reason: 6 (vm:1073)
2021-05-03 17:36:41,501+0200 INFO  (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') underlying process disconnected (vm:1077)
2021-05-03 17:36:41,851+0200 INFO  (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') Changed state to Down: Admin shut down from the engine (code=6) (vm:1701)
===

So "stopped_migrated_event_processed" was never set, and the migration source thread waits forever while holding the semaphore.

===
322     def _finishSuccessfully(self, machineParams):
323         with self._lock:
324             self._progress = 100
325         if not self.hibernating:
326             # TODO: We could use a timeout on the wait to be more robust
327             # against "impossible" failures. But we don't have a good value to
328             # use here now.
329             self._vm.stopped_migrated_event_processed.wait()
===

If there are two such incidents (the outgoing migration limit here is 2), subsequent migrations hang because they cannot acquire the semaphore.

===
2021-05-03 21:53:52,150+0200 DEBUG (migsrc/07020194) [virt.vm] (vmId='07020194-bad7-4566-be19-f295516923b5') Migration started (migration:276)
2021-05-03 21:53:52,150+0200 INFO  (migsrc/07020194) [virt.vm] (vmId='07020194-bad7-4566-be19-f295516923b5') Migration semaphore: acquiring (migration:407)
===

The only way to release the migration semaphore is to restart the vdsm service.
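
The semaphore leak can be sketched in a few lines of self-contained Python (this is an illustrative model, not vdsm code; a timeout is used only so the sketch terminates, whereas vdsm's wait has none):

```python
import threading
import time

# Each stuck migration source thread acquires a semaphore slot and then
# waits on an event that is never set, so the slot is never released.
migration_semaphore = threading.Semaphore(2)  # matches outgoingLimit: 2

def stuck_migration_source_thread():
    migration_semaphore.acquire()
    event = threading.Event()           # stands in for stopped_migrated_event_processed
    finished = event.wait(timeout=0.2)  # never set -> times out here; blocks forever in vdsm
    if finished:
        migration_semaphore.release()   # cleanup never happens

# Two such incidents exhaust both slots.
for _ in range(2):
    threading.Thread(target=stuck_migration_source_thread, daemon=True).start()

time.sleep(0.5)  # let both threads end up stuck while holding their slots
third_can_start = migration_semaphore.acquire(timeout=0.2)
print("third migration can acquire the semaphore:", third_can_start)  # → False
```

This mirrors the observed behavior: after two leaked slots, every further migration blocks in "Migration semaphore: acquiring".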

Version-Release number of selected component (if applicable):

vdsm-4.30.38-1.el7ev.x86_64


How reproducible:

Observed twice in a customer's environment.

Steps to Reproduce:

I have not been able to reproduce this in 4.4 yet by stopping the VM in the middle of a migration; the timing presumably has to be exactly right to hit the issue.

Actual results:

VMs stuck in "migrating" status because they are unable to acquire the migration semaphore.

Expected results:

Migration should work.

Additional info:

Comment 2 nijin ashok 2021-05-13 09:26:42 UTC
(In reply to nijin ashok from comment #0)
> Although the destroy failed, it looks like somehow the vmstatus got changed.
 
The lastStatus is actually set to POWERING_DOWN before _destroyVm during the shutdown process.

===
5017     def releaseVm(self, gracefulAttempts=1):
5018         """
5019         Stop VM and release all resources
5020         """
.........
........
5042             self.log.info('Release VM resources')
........
.......
5047             self.lastStatus = vmstatus.POWERING_DOWN
5048             # Terminate the VM's creation thread.
5049             self._incoming_migration_vm_running.set()
5050             self.guestAgent.stop()
5051             if self._dom.connected:
5052                 result = self._destroyVm(gracefulAttempts)
===

So if lastStatus is set to POWERING_DOWN just before the libvirt event VIR_DOMAIN_EVENT_STOPPED_MIGRATED arrives, stopped_migrated_event_processed is never set.

===
5373     def _handle_libvirt_domain_stopped(self, detail):
5374         if (detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_MIGRATED and
5375                 self.lastStatus == vmstatus.MIGRATION_SOURCE):
5376             try:
5377                 hooks.after_vm_migrate_source(
5378                     self._domain.xml, self._custom)
5379                 for dev in self._customDevices():
5380                     hooks.after_device_migrate_source(
5381                         dev._deviceXML, self._custom, dev.custom)
5382             finally:
5383                 self.stopped_migrated_event_processed.set()
5384         elif (detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_SAVED and
5385               self.lastStatus == vmstatus.SAVING_STATE):
5386             hooks.after_vm_hibernate(self._domain.xml, self._custom)
5387         else:
5388             exit_code, reason = self._getShutdownReason()
5389             self._onQemuDeath(exit_code, reason)
===

Logs.
---

Event just before "self.lastStatus = vmstatus.POWERING_DOWN".

2021-05-03 17:36:40,473+0200 INFO  (jsonrpc/1) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') Release VM resources (vm:5182)

VIR_DOMAIN_EVENT_STOPPED_MIGRATED
----

2021-05-03 17:36:41,501+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') event Stopped detail 3 opaque None (vm:5460)
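
The ordering above can be replayed in a condensed, hypothetical model (illustrative names, not the actual vdsm classes):

```python
import threading

# releaseVm() flips lastStatus to POWERING_DOWN just before the
# Stopped(MIGRATED) event arrives, so the handler falls through to the
# _onQemuDeath branch and the event the migration thread waits on is
# never set.
MIGRATION_SOURCE = "Migration Source"
POWERING_DOWN = "Powering down"
STOPPED_MIGRATED = 3  # libvirt event Stopped detail 3, as in the logs

class Vm:
    def __init__(self):
        self.lastStatus = MIGRATION_SOURCE
        self.stopped_migrated_event_processed = threading.Event()

    def releaseVm(self):
        self.lastStatus = POWERING_DOWN  # the racing status change

    def handle_stopped(self, detail):
        if detail == STOPPED_MIGRATED and self.lastStatus == MIGRATION_SOURCE:
            self.stopped_migrated_event_processed.set()
        # else: _onQemuDeath path; the event is never set

vm = Vm()
vm.releaseVm()                       # 17:36:40 Release VM resources
vm.handle_stopped(STOPPED_MIGRATED)  # 17:36:41 event Stopped detail 3
print(vm.stopped_migrated_event_processed.is_set())  # → False
```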

Comment 3 nijin ashok 2021-05-13 09:40:23 UTC
(In reply to nijin ashok from comment #0)
 
> I am unable to reproduce this yet in 4.4 by stopping the VM in the middle of
> migration. But I guess the timing should be correct to hit the issue.
> 

I was able to reproduce the issue on 4.4 with vdsm-4.40.50.10-1.el8ev.x86_64. I shut down the VM while it was migrating, and it took around 30+ attempts to hit the race condition described above.


Logs from my test environment.
----

Logs of VMs that don't release the semaphore.
=====

2021-05-13 14:04:31,748+0530 DEBUG (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') event Suspended detail 1 opaque None (vm:5341)
2021-05-13 14:04:31,749+0530 INFO  (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') CPU stopped: onSuspend (vm:5742)

2021-05-13 14:04:32,229+0530 INFO  (jsonrpc/3) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') Release VM resources (vm:5042)

2021-05-13 14:04:32,922+0530 DEBUG (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') event Stopped detail 3 opaque None (vm:5341)
2021-05-13 14:04:32,922+0530 DEBUG (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') shutdown reason: 6 (vm:1118)
2021-05-13 14:04:32,922+0530 INFO  (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') underlying process disconnected (vm:1122)


2021-05-13 14:24:40,456+0530 DEBUG (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') event Suspended detail 1 opaque None (vm:5341)
2021-05-13 14:24:40,457+0530 INFO  (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') CPU stopped: onSuspend (vm:5742)

2021-05-13 14:24:41,070+0530 INFO  (jsonrpc/5) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') Release VM resources (vm:5042)

2021-05-13 14:24:41,624+0530 INFO  (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') underlying process disconnected (vm:1122)

2021-05-13 14:24:41,624+0530 DEBUG (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') event Stopped detail 3 opaque None (vm:5341)
2021-05-13 14:24:41,624+0530 DEBUG (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') shutdown reason: 6 (vm:1118)
2021-05-13 14:24:41,624+0530 INFO  (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') underlying process disconnected (vm:1122)
====

All future migrations from this host will be stuck on acquiring semaphore.

===
2021-05-13 14:25:39,152+0530 INFO  (migsrc/b9b65ab5) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') Migration semaphore: acquiring (migration:424)
===

Attaching the vdsm log.

Comment 6 Milan Zamazal 2021-05-25 17:14:58 UTC
Good analysis, thank you. Indeed, the migration check based on the VM status is not reliable and should be fixed. Additionally, I think it makes no sense to continue a migration when the VM is powering down, so the migration should be canceled, if possible, when shutdown is initiated; that should also avoid some related problems. I'll prepare patches for both.
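
The first direction could, for instance, key the Stopped(MIGRATED) handling on an explicit in-flight flag rather than on lastStatus. A hedged sketch of that idea (hypothetical names; the actual patch may take a different approach):

```python
import threading

# The migration source thread records that an outgoing migration is in
# flight in a dedicated flag. A concurrent destroy can still flip
# lastStatus to POWERING_DOWN, but it no longer misroutes the event.
class Vm:
    def __init__(self):
        self.lastStatus = "Migration Source"
        self.outgoing_migration_in_progress = True  # set when migration starts
        self.stopped_migrated_event_processed = threading.Event()

    def releaseVm(self):
        self.lastStatus = "Powering down"  # the racing status change

    def handle_stopped_migrated(self):
        # Robust against the race: the flag is still True even though
        # lastStatus has already been changed by releaseVm().
        if self.outgoing_migration_in_progress:
            self.stopped_migrated_event_processed.set()

vm = Vm()
vm.releaseVm()                # destroy wins the race, as in the logs
vm.handle_stopped_migrated()  # the waiting migration thread is released
print(vm.stopped_migrated_event_processed.is_set())  # → True
```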

Comment 8 Milan Zamazal 2021-06-03 15:45:35 UTC
I could reproduce the issue, but only by modifying the Vdsm sources and inserting an artificial delay there, so it can be difficult to verify the fix in QE. When shutdown is requested from Engine, Engine calls VM.destroy on both ends of the migration. To trigger the issue, the following conditions must be met:

- The migration succeeds; this implies the migration must complete before the destroy call on the destination cancels it.
- VM.destroy must be called on the source while the VM is still migrating.
- The corresponding power off (i.e. killing the QEMU process) must not bring the VM down before the migration finishes.

This window is very tight, and I can think of the following arrangements to increase the chances of reproducing the issue:

- Calling VM.destroy via vdsm-client on the source, to avoid the race with VM.destroy on the destination.
- Making the QEMU process slow to react to SIGTERM or, preferably, making it not react to it at all, and increasing the gracefulAttempts argument value in the VM.destroy call. I don't know how to arrange such a thing in QEMU, though.

Any better or additional ideas?
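
For what it's worth, the "not react to TERM at all" part of the second arrangement can at least be demonstrated with a stand-in process (a sketch only; how to get QEMU itself into this state remains open):

```python
import subprocess
import sys
import time

# The child installs SIG_IGN for SIGTERM, so graceful destroy attempts
# (terminate) keep failing and only SIGKILL brings it down.
child = subprocess.Popen([
    sys.executable, "-c",
    "import signal, time; "
    "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
    "print('ready', flush=True); "
    "time.sleep(60)",
], stdout=subprocess.PIPE)
child.stdout.readline()      # wait until the SIGTERM handler is installed

child.terminate()            # SIGTERM is ignored by the child
time.sleep(0.5)
survived = child.poll() is None
print("survived SIGTERM:", survived)  # → True

child.kill()                 # SIGKILL cannot be ignored
child.wait(timeout=5)
```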

Comment 9 Milan Zamazal 2021-06-03 15:55:09 UTC
Exploring the bug also revealed two additional issues:

- When a VM is destroyed on the source during a migration, libvirt becomes unresponsive until it is restarted. The symptoms are very similar to BZ 1966121.

- When a VM is migrating and shutdown is requested from the webadmin UI, Engine calls power off instead, because the VM is not in UP state. I'm not sure whether it's a bug or a feature.

Comment 10 Arik 2021-06-06 13:32:42 UTC
(In reply to Milan Zamazal from comment #9)
> - When a VM is migrating and shutdown is requested from the webadmin UI,
> Engine calls power off instead, because the VM is not in UP state. I'm not
> sure whether it's a bug or a feature.

Yeah, it doesn't sound right.
When the VM is up and has the guest agent installed, it can react to the shutdown call.
When the VM is up without the guest agent but configured with ACPI, it should also be able to react to the shutdown call (assuming the guest completed booting).
I believe that's why ShutdownVm checks whether the VM is up and calls power-off otherwise - in that case it's less likely that the VM could react to the shutdown call.

If we see the VM in migrating status, we don't really know whether the guest has completed booting, so we don't know whether we can expect it to react to the shutdown call.
Maybe we can improve this by checking whether guest agent data has been received since the VM booted, as that's an indication the guest finished booting.

Comment 11 Arik 2021-06-06 21:12:49 UTC
(In reply to Milan Zamazal from comment #8)
> This is very tight and I can think about the following arrangements to
> increase the chances to reproduce the issue:
> 
> - Calling VM.destroy using vdsm-client on the source, to avoid the race with
> VM.destroy on the destination.
> - Making the QEMU process slow to react on TERM signal or, preferably,
> making it not to react on it at all and increasing gracefulAttempts argument
> value in VM.destroy call. I don't know how to arrange such a thing in QEMU
> though.
> 
> Any better or additional ideas?

It's one of those timing issues that would be difficult for QE to reproduce with "conventional" tools.
Since you managed to reproduce it with code changes that simulate the problematic interactions, and assuming you won't be able to reproduce it with the fix, I think it would be enough for QE to automate the procedure described in comment 3 (shut down a migrating VM 30+ times; we can make it 50). If that passes and the regression tests pass, then we're good.

Comment 17 Milan Zamazal 2021-06-22 11:43:55 UTC
Verification requires the fix for BZ 1967715, and the same migration-semaphore issue can still be hit, although for different reasons, until the libvirt bug is fixed. For those reasons, I'm moving the bug back to MODIFIED and adding a dependency on the libvirt bug.

Comment 22 Qin Yuan 2021-08-15 23:52:04 UTC
Verified with:
vdsm-4.40.80.5-1.el8ev.x86_64
libvirt-7.0.0-14.3.module+el8.4.0+11878+84e54169.x86_64
ovirt-engine-4.4.8.4-0.7.el8ev.noarch

Steps:
The same as steps in comment #15

Result:
Shut down a migrating VM 50 times; the "unable to acquire the migration semaphore" issue was not reproduced.

In addition, there were no regressions in the automation tests, so moving this bug to VERIFIED.

Comment 29 errata-xmlrpc 2021-09-08 14:11:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Virtualization Host security and bug fix update [ovirt-4.4.8]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3459