Bug 1028917
| Summary: | Resource lock split brain causes VM to get paused after migration | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Tomas Dosek <tdosek> |
| Component: | vdsm | Assignee: | Vinzenz Feenstra [evilissimo] <vfeenstr> |
| Status: | CLOSED ERRATA | QA Contact: | Pavel Novotny <pnovotny> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.2.0 | CC: | bazulay, danken, iheim, lbopf, lpeer, lyarwood, mavital, michal.skrivanek, pablo.iranzo, pep, sherold, tdosek, yeylon |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 3.4.0 | | |
| Hardware: | All | | |
| OS: | All | | |
| Whiteboard: | virt | | |
| Fixed In Version: | av1 | Doc Type: | Bug Fix |
| Doc Text: | Virtual machines are no longer paused after migrations; hosts now correctly acquire resource locks for recently migrated virtual machines. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1059129 (view as bug list) | Environment: | |
| Last Closed: | 2014-06-09 13:26:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1059129, 1078909, 1142926 | | |
Description: Tomas Dosek, 2013-11-11 08:37:46 UTC
Michal, Lee, would you explain why this could be a Vdsm networking issue? It's hard to guess. Vdsm must handle communication failures during migration. The logs suggest there's a bug in how the destination vdsm treats the timeout. _waitForIncomingMigrationFinish claims that

```python
# Would fail if migration isn't successful,
# or restart vdsm if connection to libvirt was lost
self._dom = NotifyingVirDomain(
    self._connection.lookupByUUIDString(self.id),
    self._timeoutExperienced)
```

however the logs show that the migration actually succeeded, despite the timeout. _waitForIncomingMigrationFinish() should not depend on the failure to initialize _dom. It should probably check that _dom has finished migrating, and raise an exception otherwise.

The question is what to do in this case. If the incoming migration doesn't finish we should probably try to cancel it on the destination first, before moving the VM to Down. This should happen only after the source tries to do that first, as it should be initiating the abort from the src side in the first place. Then this dst check would be just a safety net in case of communication issues. We can still end up with split brain when there is a libvirt communication problem and it refuses to cancel (from the dst point of view); then we should probably try to cancel and, if that raises an error other than "job not running", check again whether the VM is Up and, if not, forcefully destroy it. Otherwise we're risking setting it to Down on the dst (current code), which would in turn trigger HA in the engine to start the VM once again, and we end up with a split brain.

Looks quite convoluted… I wonder if the current actual behavior (i.e., proceeding with initialization of the paused VM) is problematic at all. What if we focus on destroying the src VM properly? Right now, if it fails to destroy it (which is the case in this bug), we do nothing. This may be even worse though, so perhaps it is better to leave the dst paused VM there for manual resolution? Ideas?

Michal, you know my opinion here: only Engine can tell which of the qemu processes should die. The code I referred to should spring into action only when Engine is unavailable (if at all; I do not mind dropping the vdsm-side timeout). Do note that until Vdsm receives a notification from libvirt about the successful finish of migration, the VM must not leave its "migration destination" state. It is unusable until we have confirmation from the source that all information has arrived. If there's communication between the source and destination Vdsms, we do not have a problem here. The interesting case is when the destination Vdsm cannot connect to the source. In that case, lacking a confirmation from libvirt, I do not see anything we can do beyond canceling the migration explicitly or by destroying qemu at the destination, either by Vdsm's own decision or by Engine's explicit request.

At this point there's little we can do. The only thing we should address right now is the timeout situation at the destination (not moving on expecting success). Other than that we can only make things worse from the vdsm point of view, so it should indeed be resolved from the engine side, at least manually, which should be good enough for now.

Decreasing Severity: until the dst VM is manually run there's no split brain. It actually wouldn't run at all until the migration finishes.

Decreasing Priority as the bug is targeted to 3.3.z.

Merged u/s to master (and included in the ovirt-3.4 branch) as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=69cb5099ac7f395010e40179a2e48ecc0e3b1f24

Verified upstream in vdsm-4.14.1-2.el6.x86_64.
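To illustrate the destination-side handling discussed above (check with libvirt whether the incoming migration actually completed before treating the timeout as a failure, and only then cancel or destroy the half-migrated domain), here is a minimal sketch. It is not the actual vdsm patch: the helper name check_incoming_migration and the decision policy are illustrative only, and it assumes a plain libvirt-python connection rather than vdsm's internal wrappers.

```python
# Illustrative sketch, not the actual vdsm fix: when the destination-side
# timeout fires, ask libvirt whether the domain really finished migrating
# before deciding to give up on it.
import libvirt


def check_incoming_migration(conn, vm_uuid):
    """Return True if the migrated domain is usable on the destination host."""
    try:
        dom = conn.lookupByUUIDString(vm_uuid)
    except libvirt.libvirtError:
        # The domain never appeared here; the incoming migration clearly failed.
        return False

    state, reason = dom.state()
    if state == libvirt.VIR_DOMAIN_RUNNING:
        # Migration finished despite the timeout; keep the VM, do not report Down.
        return True

    if (state == libvirt.VIR_DOMAIN_PAUSED and
            reason == libvirt.VIR_DOMAIN_PAUSED_MIGRATION):
        # Still (or stuck) receiving migration data: try to cancel the job,
        # re-check the state, and destroy the leftover domain if it never came up.
        try:
            dom.abortJob()
        except libvirt.libvirtError:
            pass  # e.g. no job is active; fall through to the re-check below
        if dom.state()[0] == libvirt.VIR_DOMAIN_RUNNING:
            return True
        dom.destroy()
    return False
```

In this sketch, the destination would report the VM as Down to the engine only when the check returns False, which is what avoids the HA restart that leads to the split brain described above.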
Verification steps:

1. Preparation: On the destination migration host, set 'migration_destination_timeout' to '120' in the VDSM config.py (located at /usr/lib64/python2.6/site-packages/vdsm/config.py). This reduces the verification time; otherwise the default is 6 hours. (A quick way to confirm the effective value is shown in the sketch after this comment.)
2. Have a running VM (F19 in my case) with some ongoing memory-stressing operation (I used the `memtester` utility). This should make the migration process long enough to give us time in step 4 to simulate the error-prone environment.
3. Migrate the VM from source host1 to destination host2.
4. Immediately after the migration starts, block on the source host1:
   - the connection to the destination host VDSM (simulating connection loss to the destination VDSM):
     `iptables -I OUTPUT 1 -p tcp -d <host2> --dport 54321 -j DROP`
   - the connection to the storage (simulating a migration error):
     `iptables -I OUTPUT 1 -d <storage> -j DROP`
5. Wait `migration_destination_timeout` seconds (120).

Results: The migration fails (due to our blocking of the storage) and is aborted. On the destination host, the migrating VM is destroyed (the host shows 0 running VMs and no VM migrating). The VM stays on the source host (paused due to inaccessible storage; after unblocking the storage the VM should run as if nothing happened). The source host shows 1 running VM and no VM migrating.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html
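For reference, the timeout configured in step 1 of the verification above can be confirmed on the destination host with a short check. This is a sketch that assumes vdsm's ConfigParser-style vdsm.config module and that the option is read from the 'vars' section:

```python
# Sketch: print the migration_destination_timeout value vdsm will actually use
# (assumes the option lives in the 'vars' section of vdsm's configuration).
from vdsm.config import config

print(config.getint('vars', 'migration_destination_timeout'))  # expect 120 after the edit
```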