Bug 1959436
| Summary: | VMs stuck in "migrating" status since it's unable to acquire the migration semaphore | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | nijin ashok <nashok> |
| Component: | vdsm | Assignee: | Milan Zamazal <mzamazal> |
| Status: | CLOSED ERRATA | QA Contact: | Qin Yuan <qiyuan> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.7 | CC: | ahadas, emarcus, gveitmic, lsurette, mzamazal, srevivo, ycui |
| Target Milestone: | ovirt-4.4.8 | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | vdsm-4.40.80.2 | Doc Type: | Bug Fix |
| Doc Text: | Previously, when a virtual machine was powered off on the source host of a live migration and the migration finished successfully at the same time, the two events interfered with each other, sometimes preventing migration cleanup and resulting in additional migrations from the host being blocked. In this release, additional migrations are not blocked. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-08 14:11:50 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1966121 | | |
| Bug Blocks: | | | |
(In reply to nijin ashok from comment #0)
> Although the destroy failed, it looks like somehow the vmstatus got changed.

The lastStatus is actually set to POWERING_DOWN before _destroyVm during the shutdown process.

===
5017     def releaseVm(self, gracefulAttempts=1):
5018         """
5019         Stop VM and release all resources
5020         """
.........
........
5042         self.log.info('Release VM resources')
........
.......
5047         self.lastStatus = vmstatus.POWERING_DOWN
5048         # Terminate the VM's creation thread.
5049         self._incoming_migration_vm_running.set()
5050         self.guestAgent.stop()
5051         if self._dom.connected:
5052             result = self._destroyVm(gracefulAttempts)
===

So if lastStatus = POWERING_DOWN is set just before the libvirt event VIR_DOMAIN_EVENT_STOPPED_MIGRATED arrives, stopped_migrated_event_processed.set() is never called.

===
5373     def _handle_libvirt_domain_stopped(self, detail):
5374         if (detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_MIGRATED and
5375                 self.lastStatus == vmstatus.MIGRATION_SOURCE):
5376             try:
5377                 hooks.after_vm_migrate_source(
5378                     self._domain.xml, self._custom)
5379                 for dev in self._customDevices():
5380                     hooks.after_device_migrate_source(
5381                         dev._deviceXML, self._custom, dev.custom)
5382             finally:
5383                 self.stopped_migrated_event_processed.set()
5384         elif (detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_SAVED and
5385                 self.lastStatus == vmstatus.SAVING_STATE):
5386             hooks.after_vm_hibernate(self._domain.xml, self._custom)
5387         else:
5388             exit_code, reason = self._getShutdownReason()
5389             self._onQemuDeath(exit_code, reason)
===

Logs.

Event just before "self.lastStatus = vmstatus.POWERING_DOWN":
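The interaction can be condensed into a toy model. This is a minimal sketch in plain Python, not vdsm code; the class and names only mirror the snippets quoted above:

```python
import threading

# Hypothetical status constants standing in for vdsm's vmstatus module.
MIGRATION_SOURCE = "Migration Source"
POWERING_DOWN = "Powering down"

class MiniVm:
    """Minimal model of the two interacting code paths."""

    def __init__(self):
        self.lastStatus = MIGRATION_SOURCE
        self.stopped_migrated_event_processed = threading.Event()

    def releaseVm(self):
        # Mirrors releaseVm(): the status flips to POWERING_DOWN before
        # the libvirt Stopped event handler has run.
        self.lastStatus = POWERING_DOWN

    def handle_stopped_migrated(self):
        # Mirrors _handle_libvirt_domain_stopped(): the event is set only
        # when the status still says MIGRATION_SOURCE.
        if self.lastStatus == MIGRATION_SOURCE:
            self.stopped_migrated_event_processed.set()

vm = MiniVm()
vm.releaseVm()                # the destroy path wins the race
vm.handle_stopped_migrated()  # status check fails, so the event is never set
# The migration thread's wait() would now block forever; a timeout lets us
# observe the stuck state instead of hanging:
print(vm.stopped_migrated_event_processed.wait(timeout=0.1))  # False
```

If the two calls happen in the opposite order, the status check passes and the event is set, which is why the bug needs such tight timing to trigger.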
2021-05-03 17:36:40,473+0200 INFO (jsonrpc/1) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') Release VM resources (vm:5182)

VIR_DOMAIN_EVENT_STOPPED_MIGRATED:

----
2021-05-03 17:36:41,501+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') event Stopped detail 3 opaque None (vm:5460)
----

(In reply to nijin ashok from comment #0)
> I am unable to reproduce this yet in 4.4 by stopping the VM in the middle of
> migration. But I guess the timing should be correct to hit the issue.

I was able to reproduce the issue on 4.4 with vdsm-4.40.50.10-1.el8ev.x86_64. I shut down the VM while it was migrating, and it took me around 30+ attempts to hit the above race condition. Logs from my test environment.

Logs of the VMs which don't release the semaphore:

=====
2021-05-13 14:04:31,748+0530 DEBUG (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') event Suspended detail 1 opaque None (vm:5341)
2021-05-13 14:04:31,749+0530 INFO (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') CPU stopped: onSuspend (vm:5742)
2021-05-13 14:04:32,229+0530 INFO (jsonrpc/3) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') Release VM resources (vm:5042)
2021-05-13 14:04:32,922+0530 DEBUG (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') event Stopped detail 3 opaque None (vm:5341)
2021-05-13 14:04:32,922+0530 DEBUG (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') shutdown reason: 6 (vm:1118)
2021-05-13 14:04:32,922+0530 INFO (libvirt/events) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') underlying process disconnected (vm:1122)

2021-05-13 14:24:40,456+0530 DEBUG (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') event Suspended detail 1 opaque None (vm:5341)
2021-05-13 14:24:40,457+0530 INFO (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') CPU stopped: onSuspend (vm:5742)
2021-05-13 14:24:41,070+0530 INFO (jsonrpc/5) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') Release VM resources (vm:5042)
2021-05-13 14:24:41,624+0530 INFO (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') underlying process disconnected (vm:1122)
2021-05-13 14:24:41,624+0530 DEBUG (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') event Stopped detail 3 opaque None (vm:5341)
2021-05-13 14:24:41,624+0530 DEBUG (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') shutdown reason: 6 (vm:1118)
2021-05-13 14:24:41,624+0530 INFO (libvirt/events) [virt.vm] (vmId='e13ede82-b270-4b03-ba75-ed019f9af46c') underlying process disconnected (vm:1122)
====

All future migrations from this host will be stuck on acquiring the semaphore:

===
2021-05-13 14:25:39,152+0530 INFO (migsrc/b9b65ab5) [virt.vm] (vmId='b9b65ab5-f440-4cf9-9ee4-4a0bafb65200') Migration semaphore: acquiring (migration:424)
===

Attaching the vdsm log.

Good analysis, thank you. Indeed, the check for migration based on the status is not reliable and should be fixed. Additionally, I think it makes no sense to continue with a migration when the VM is powering down, so the migration should be canceled, if possible, when initiating shutdown, which should also avoid some other possible problems. I'll prepare patches for both.

I could reproduce the issue, but only by modifying the Vdsm sources and inserting an artificial delay there. It can be difficult to verify the fix in QE. When shutdown from Engine is requested, Engine calls VM.destroy on both ends of the migration. In order to trigger the issue, the following conditions must be met:

- The migration succeeds. This implies the migration must be completed before the destroy call on the destination cancels it.
- VM.destroy must be called on the source while the VM is still migrating.
- The corresponding power off (i.e. killing the QEMU process) must not bring the VM down before the migration finishes.
This is very tight, and I can think of the following arrangements to increase the chances of reproducing the issue:

- Calling VM.destroy using vdsm-client on the source, to avoid the race with VM.destroy on the destination.
- Making the QEMU process slow to react to the TERM signal or, preferably, making it not react to it at all, and increasing the gracefulAttempts argument value in the VM.destroy call. I don't know how to arrange such a thing in QEMU, though.

Any better or additional ideas?

Exploring the bug also revealed two additional issues:

- When a VM is destroyed on the source during a migration, libvirt becomes unresponsive until it is restarted. The symptoms are very similar to BZ 1966121.
- When a VM is migrating and shutdown is requested from the webadmin UI, Engine calls power off instead, because the VM is not in the UP state. I'm not sure whether it's a bug or a feature.

(In reply to Milan Zamazal from comment #9)
> - When a VM is migrating and shutdown is requested from the webadmin UI,
> Engine calls power off instead, because the VM is not in UP state. I'm not
> sure whether it's a bug or a feature.

Yeah, it doesn't sound right. When the VM is up and has the guest agent installed, it can react to the shutdown call. When the VM is up without the guest agent but is configured with ACPI, it should also be able to react to the shutdown call (assuming the guest completed booting). I believe that's why we check in ShutdownVm whether the VM is up and call power off otherwise, since it's less likely that the VM could react to the shutdown call then. If we see the VM in migrating status, we don't really know whether the guest completed booting, so we don't know if we can expect it to react to the shutdown call. Maybe we can improve this by checking whether guest agent data was received since booting the VM, as that's an indication that the guest finished booting.
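The second arrangement proposed above, making the process not react to TERM, can be emulated outside QEMU with a signal-ignoring stand-in process. This is a hypothetical sketch in plain Python (not tied to QEMU or vdsm) showing that repeated graceful terminate() attempts fail while SIGKILL still works:

```python
import subprocess
import sys
import time

# Spawn a child that ignores SIGTERM, emulating a QEMU process that does
# not react to graceful termination requests.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import signal, time; "
     "signal.signal(signal.SIGTERM, signal.SIG_IGN); "
     "print('ready', flush=True); "
     "time.sleep(60)"],
    stdout=subprocess.PIPE)
child.stdout.readline()        # wait until the child has installed its handler

graceful_attempts = 3          # mirrors the gracefulAttempts idea above
for _ in range(graceful_attempts):
    child.terminate()          # sends SIGTERM, which the child ignores
    time.sleep(0.2)

still_running = child.poll() is None
print(still_running)           # True: every graceful attempt failed
child.kill()                   # SIGKILL cannot be ignored
child.wait()
```

With such a stand-in, the destroy path is guaranteed to exhaust its graceful attempts, widening the window in which the migration can finish first.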
(In reply to Milan Zamazal from comment #8)
> This is very tight and I can think about the following arrangements to
> increase the chances to reproduce the issue:
>
> - Calling VM.destroy using vdsm-client on the source, to avoid the race with
> VM.destroy on the destination.
> - Making the QEMU process slow to react on TERM signal or, preferably,
> making it not to react on it at all and increasing gracefulAttempts argument
> value in VM.destroy call. I don't know how to arrange such a thing in QEMU
> though.
>
> Any better or additional ideas?

It's one of those timing issues that would be difficult for QE to reproduce with "conventional" tools. As you managed to reproduce it with code changes that enable simulating the problematic interactions, and assuming you won't be able to reproduce it with the fix, I think it would be enough for QE to automate the procedure described in comment 3 (shut down a migrating VM 30+ times; we can make it 50). If that passes and the regression tests pass, then we're good.

Verification requires the fix of BZ 1967715, and the same issue with migration semaphores can be hit, although for different reasons, until the libvirt bug is fixed. For those reasons, moving the bug back to MODIFIED and adding a dependency on the libvirt bug.

Verified with:
vdsm-4.40.80.5-1.el8ev.x86_64
libvirt-7.0.0-14.3.module+el8.4.0+11878+84e54169.x86_64
ovirt-engine-4.4.8.4-0.7.el8ev.noarch

Steps:
The same as the steps in comment #15.

Result:
Shut down a migrating VM 50 times; the "unable to acquire the migration semaphore" issue wasn't reproduced. Besides, there is no regression in the automation tests. So moving this bug to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Virtualization Host security and bug fix update [ovirt-4.4.8]), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3459
Description of problem:

If a VM is destroyed during migration, and the destroy request arrives at almost the same time as libvirtd suspends the CPUs, the migration source thread fails to exit and holds the semaphore, blocking subsequent migrations.

Migration attempt:

===
2021-05-03 17:36:29,699+0200 DEBUG (jsonrpc/0) [vds] {u'incomingLimit': 2, u'src': u'source.example.com', u'dstqemu': u'destination.example.com', u'autoConverge': u'false', u'tunneled': u'false', u'enableGuestEvents': False, u'dst': u'destination.example.com:54321', u'vmId': u'6d69ba9d-5616-4d35-8411-33522337a547', u'abortOnError': u'true', u'outgoingLimit': 2, u'compressed': u'false', u'maxBandwidth': 1250, u'method': u'online', 'mode': 'remote'} (API:493)
===

Got VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED from libvirtd:

===
2021-05-03 17:36:39,896+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') event Suspended detail 1 opaque None (vm:5460)
2021-05-03 17:36:39,897+0200 INFO (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') CPU stopped: onSuspend (vm:6062)
===

The user asked to stop the VM.

engine log:

===
2021-05-03 17:36:39,030+02 INFO [org.ovirt.engine.core.bll.StopVmCommand] (default task-49631) [f4cf8fc2-555a-40ce-8341-46e41b524f04] Running command: StopVmCommand internal: false. Entities affected : ID: 6d69ba9d-5616-4d35-8411-33522337a547 Type: VMAction group STOP_VM with role type USER
===

vdsm log:

===
2021-05-03 17:36:40,471+0200 DEBUG (jsonrpc/1) [jsonrpc.JsonRpcServer] Calling 'VM.destroy' in bridge with {u'vmID': u'6d69ba9d-5616-4d35-8411-33522337a547'} (__init__:329)
2021-05-03 17:36:40,472+0200 DEBUG (jsonrpc/1) [vds] About to destroy VM 6d69ba9d-5616-4d35-8411-33522337a547 (API:330)
2021-05-03 17:36:41,502+0200 INFO (jsonrpc/1) [api.virt] FINISH destroy return={'status': {'message': 'Virtual machine destroy error', 'code': 42}} from=::ffff:, flow_id=f4cf8fc2-555a-40ce-8341-46e41b524f04, vmId=6d69ba9d-5616-4d35-8411-33522337a547 (api:54)
===

Although the destroy failed, it looks like the vmstatus somehow got changed. At the same time, vdsm got VIR_DOMAIN_EVENT_STOPPED_MIGRATED from libvirtd:

===
2021-05-03 17:36:41,501+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') event Stopped detail 3 opaque None (vm:5460)
====

It seems the vmstatus was _not_ MIGRATION_SOURCE, and hence self._onQemuDeath was called from _handle_libvirt_domain_stopped:

===
5373     def _handle_libvirt_domain_stopped(self, detail):
5374         if (detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_MIGRATED and
5375                 self.lastStatus == vmstatus.MIGRATION_SOURCE):
5376             try:
5377                 hooks.after_vm_migrate_source(
5378                     self._domain.xml, self._custom)
5379                 for dev in self._customDevices():
5380                     hooks.after_device_migrate_source(
5381                         dev._deviceXML, self._custom, dev.custom)
5382             finally:
5383                 self.stopped_migrated_event_processed.set()
5384         elif (detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_SAVED and
5385                 self.lastStatus == vmstatus.SAVING_STATE):
5386             hooks.after_vm_hibernate(self._domain.xml, self._custom)
5387         else:
5388             exit_code, reason = self._getShutdownReason()
5389             self._onQemuDeath(exit_code, reason)
===

The logs also suggest it went into _onQemuDeath.
===
2021-05-03 17:36:41,501+0200 DEBUG (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') shutdown reason: 6 (vm:1073)
2021-05-03 17:36:41,501+0200 INFO (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') underlying process disconnected (vm:1077)
2021-05-03 17:36:41,851+0200 INFO (libvirt/events) [virt.vm] (vmId='6d69ba9d-5616-4d35-8411-33522337a547') Changed state to Down: Admin shut down from the engine (code=6) (vm:1701)
===

So "stopped_migrated_event_processed" was not set, and the migration source thread will wait forever, holding the semaphore.

===
322     def _finishSuccessfully(self, machineParams):
323         with self._lock:
324             self._progress = 100
325         if not self.hibernating:
326             # TODO: We could use a timeout on the wait to be more robust
327             # against "impossible" failures. But we don't have a good value to
328             # use here now.
329             self._vm.stopped_migrated_event_processed.wait()
===

If there are two such incidents, a subsequent migration will hang, since it won't be able to acquire the semaphore:

===
2021-05-03 21:53:52,150+0200 DEBUG (migsrc/07020194) [virt.vm] (vmId='07020194-bad7-4566-be19-f295516923b5') Migration started (migration:276)
2021-05-03 21:53:52,150+0200 INFO (migsrc/07020194) [virt.vm] (vmId='07020194-bad7-4566-be19-f295516923b5') Migration semaphore: acquiring (migration:407)
===

The only way to release the migration semaphore is to restart the vdsm service.

Version-Release number of selected component (if applicable):
vdsm-4.30.38-1.el7ev.x86_64

How reproducible:
Observed 2 times in a customer's environment.

Steps to Reproduce:
I am unable to reproduce this yet in 4.4 by stopping the VM in the middle of migration. But I guess the timing must be right to hit the issue.

Actual results:
VMs stuck in "migrating" status since they are unable to acquire the migration semaphore.

Expected results:
Migration should work.

Additional info:
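The end state can be sketched as a toy model in plain Python. The semaphore limit of 2 matches the 'outgoingLimit': 2 seen in the migration parameters above; the names are illustrative, not vdsm's actual code:

```python
import threading
import time

# Two migration threads that never see the Stopped-event flag leak both
# semaphore slots, and every later migration blocks in acquire().
migration_semaphore = threading.BoundedSemaphore(2)
stopped_migrated_event_processed = threading.Event()  # never set in the bug

def stuck_migration_thread():
    migration_semaphore.acquire()
    # Mirrors _finishSuccessfully(): the release would happen only after
    # this wait returns, which it never does in the buggy scenario.
    stopped_migrated_event_processed.wait()

for _ in range(2):
    threading.Thread(target=stuck_migration_thread, daemon=True).start()

time.sleep(0.3)   # let both threads take a slot
# The third migration parks at "Migration semaphore: acquiring"; with a
# timeout we can observe the failure instead of hanging:
acquired = migration_semaphore.acquire(timeout=0.2)
print(acquired)   # False
```

This also illustrates why restarting vdsm is the only recovery: the leaked slots live in the process's in-memory semaphore, so nothing short of a restart returns them.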