Bug 1028917 - Resource lock split brain causes VM to get paused after migration
Summary: Resource lock split brain causes VM to get paused after migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.2.0
Hardware: All
OS: All
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.4.0
Assignee: Vinzenz Feenstra [evilissimo]
QA Contact: Pavel Novotny
URL:
Whiteboard: virt
Depends On:
Blocks: 1059129 rhev3.4beta 1142926
 
Reported: 2013-11-11 08:37 UTC by Tomas Dosek
Modified: 2018-12-09 17:17 UTC
CC List: 13 users

Fixed In Version: av1
Doc Type: Bug Fix
Doc Text:
Virtual machines are no longer paused after migrations; hosts now correctly acquire resource locks for recently migrated virtual machines.
Clone Of:
Cloned to: 1059129
Environment:
Last Closed: 2014-06-09 13:26:27 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:




Links:
  System                  ID              Private  Priority  Status        Summary                                                           Last Updated
  Red Hat Product Errata  RHBA-2014:0504  0        normal    SHIPPED_LIVE  vdsm 3.4.0 bug fix and enhancement update                         2014-06-09 17:21:35 UTC
  oVirt gerrit            21963           0        None      MERGED        Avoid going into 'Paused' status during long lasting migrations  2020-12-11 14:12:16 UTC

Description Tomas Dosek 2013-11-11 08:37:46 UTC
Description of problem:
In one customer's scenario, VMs get paused after migration; this appears to be the result of the destination host being unable to acquire the resource lock for the VM's storage.

On the source host, some time after the migration starts:
SRC host:
-----------------
Nov  4 15:46:48 clx29bb13-00 vdsm vm.Vm WARNING vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Migration stalling: dataRemaining (700MiB) > smallest_dataRemaining (700MiB). Refer to RHBZ#919201.
Nov  4 15:46:58 clx29bb13-00 vdsm vm.Vm WARNING vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Migration stalling: dataRemaining (700MiB) > smallest_dataRemaining (700MiB). Refer to RHBZ#919201.
-------------------------------------------> The migration stalled (reason unclear, probably network load between the hosts). In that case we usually kill the VM on the destination and rerun it on the source with a 'migration failed' error message.

Nov  4 15:46:59 clx29bb13-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::operation failed: Lost connection to destination host
Nov  4 15:48:02 clx29bb13-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Failed to destroy remote VM#012Traceback (most recent call last):#012  File "/usr/share/vdsm/vm.py", line 200, in _recover#012    self.destServer.destroy(self._vm.id)#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__#012    return self.__send(self.__name, args)#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request#012    verbose=self.__verbose#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1235, in request#012    self.send_content(h, request_body)#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content#012    connection.endheaders()#012  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders#012    self._send_output()#012  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output#012    self.send(msg)#012  File "/usr/lib64/python2.6/httplib.py", line 739, in send#012    self.connect()#012  File "/usr/share/vdsm/kaxmlrpclib.py", line 151, in connect#012    SecureXMLRPCServer.VerifyingHTTPSConnection.connect(self)#012  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 165, in connect#012    sock = socket.create_connection((self.host, self.port), self.timeout)#012  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection#012    raise error, msg#012error: [Errno 110] Connection timed out
Nov  4 15:50:17 clx29bb13-00 vdsm vm.Vm ERROR vmId=`e8acff0c-50d2-49d1-b150-bd51bbe835c7`::[Errno 110] Connection timed out
Nov  4 15:51:20 clx29bb13-00 vdsm vm.Vm ERROR vmId=`e8acff0c-50d2-49d1-b150-bd51bbe835c7`::Failed to destroy remote VM#012Traceback (most recent call last):#012  File "/usr/share/vdsm/vm.py", line 200, in _recover#012    self.destServer.destroy(self._vm.id)#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__#012    return self.__send(self.__name, args)#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request#012    verbose=self.__verbose#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1235, in request#012    self.send_content(h, request_body)#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content#012    connection.endheaders()#012  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders#012    self._send_output()#012  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output#012    self.send(msg)#012  File "/usr/lib64/python2.6/httplib.py", line 739, in send#012    self.connect()#012  File "/usr/share/vdsm/kaxmlrpclib.py", line 151, in connect#012    SecureXMLRPCServer.VerifyingHTTPSConnection.connect(self)#012  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 165, in connect#012    sock = socket.create_connection((self.host, self.port), self.timeout)#012  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection#012    raise error, msg#012error: [Errno 110] Connection timed out
----------------------------------------------> Here is the failure to destroy the machine on the destination.

Nov  4 15:51:20 clx29bb13-00 vdsm vm.Vm ERROR vmId=`e8acff0c-50d2-49d1-b150-bd51bbe835c7`::Failed to migrate#012Traceback (most recent call last):#012  File "/usr/share/vdsm/vm.py", line 270, in run#012    self._startUnderlyingMigration()#012  File "/usr/share/vdsm/libvirtvm.py", line 459, in _startUnderlyingMigration#012    response = self.destServer.migrationCreate(self._machineParams)#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__#012    return self.__send(self.__name, args)#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request#012    verbose=self.__verbose#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1235, in request#012    self.send_content(h, request_body)#012  File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content#012    connection.endheaders()#012  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders#012    self._send_output()#012  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output#012    self.send(msg)#012  File "/usr/lib64/python2.6/httplib.py", line 739, in send#012    self.connect()#012  File "/usr/share/vdsm/kaxmlrpclib.py", line 151, in connect#012    SecureXMLRPCServer.VerifyingHTTPSConnection.connect(self)#012  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 165, in connect#012    sock = socket.create_connection((self.host, self.port), self.timeout)#012  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection#012    raise error, msg#012error: [Errno 110] Connection timed out

On destination host:
Nov  4 15:51:38 clx29bb06-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Stats function failed: <AdvancedStatsFunction _sampleDisk at 0x2da4ea0>#012Traceback (most recent call last):#012  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect#012    statsFunction()#012  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__#012    retValue = self._function(*args, **kwargs)#012  File "/usr/share/vdsm/libvirtvm.py", line 164, in _sampleDisk#012    diskSamples[vmDrive.name] = self._vm._dom.blockStats(vmDrive.name)#012  File "/usr/share/vdsm/libvirtvm.py", line 539, in f#012    raise toe#012TimeoutError: Timed out during operation: cannot acquire state change lock
Nov  4 15:52:08 clx29bb06-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Stats function failed: <AdvancedStatsFunction _sampleDiskLatency at 0x2da1570>#012Traceback (most recent call last):#012  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect#012    statsFunction()#012  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__#012    retValue = self._function(*args, **kwargs)#012  File "/usr/share/vdsm/libvirtvm.py", line 179, in _sampleDiskLatency#012    vmDrive.name, flags=libvirt.VIR_TYPED_PARAM_STRING_OKAY)#012  File "/usr/share/vdsm/libvirtvm.py", line 539, in f#012    raise toe#012TimeoutError: Timed out during operation: cannot acquire state change lock
Nov  4 15:53:38 clx29bb06-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Stats function failed: <AdvancedStatsFunction _sampleDisk at 0x2da4ea0>#012Traceback (most recent call last):#012  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect#012    statsFunction()#012  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__#012    retValue = self._function(*args, **kwargs)#012  File "/usr/share/vdsm/libvirtvm.py", line 164, in _sampleDisk#012    diskSamples[vmDrive.name] = self._vm._dom.blockStats(vmDrive.name)#012  File "/usr/share/vdsm/libvirtvm.py", line 539, in f#012    raise toe#012TimeoutError: Timed out during operation: cannot acquire state change lock
Nov  4 15:54:08 clx29bb06-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Stats function failed: <AdvancedStatsFunction _sampleDiskLatency at 0x2da1570>#012Traceback (most recent call last):#012  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect#012    statsFunction()#012  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__#012    retValue = self._function(*args, **kwargs)#012  File "/usr/share/vdsm/libvirtvm.py", line 179, in _sampleDiskLatency#012    vmDrive.name, flags=libvirt.VIR_TYPED_PARAM_STRING_OKAY)#012  File "/usr/share/vdsm/libvirtvm.py", line 539, in f#012    raise toe#012TimeoutError: Timed out during operation: cannot acquire state change lock
-----------------------------------------------------------> This is why the VM pauses on the destination (an I/O error rooted in the inability to acquire the resource lock).


From the above description, the scenario seems to be:
Migration starts - a duplicate of the VM is created on the DST host and data starts migrating - the migration stalls - we try to destroy the VM on the destination - *this fails* - the machine on the DST host fails to acquire the storage resource lock (as it is still held by the SRC host) - the VM on the destination gets paused on an I/O error.

Version-Release number of selected component (if applicable):
SRC host: vdsm-4.10.2-25.1.el6ev.x86_64
DST host: vdsm-4.10.2-25.1.el6ev.x86_64

How reproducible:
Very unreliable reproducer; the customer hit this accidentally in a very large-scale environment.

Steps to Reproduce:
Basically described by the scenario itself.

Actual results:
The VM is paused on the DST host; the SRC host fails to destroy the DST VM and, after a timeout, kills the SRC VM and releases its resource locks.

Expected results:
Not sure here, but the current behavior does not seem sane.

Comment 4 Dan Kenigsberg 2013-11-20 23:39:32 UTC
Michal, Lee, would you explain why this could be a Vdsm networking issue? It's hard to guess. Vdsm must handle communication failures during migration.

The logs suggest there's a bug in how dst vdsm treats the timeout. _waitForIncomingMigrationFinish claims that

                # Would fail if migration isn't successful,
                # or restart vdsm if connection to libvirt was lost
                self._dom = NotifyingVirDomain(
                                self._connection.lookupByUUIDString(self.id),
                                self._timeoutExperienced)

However, the logs show that it has actually succeeded, despite the timeout. _waitForIncomingMigrationFinish() should not depend on the failure to initialize _dom. It should probably check that _dom has finished migrating, and raise an exception otherwise.
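
A minimal sketch of the check being suggested here, written as a standalone function rather than the actual vdsm code; the helper names (wait_for_incoming_migration, IncomingMigrationTimeout, finished_event) are illustrative assumptions:

    import libvirt

    class IncomingMigrationTimeout(Exception):
        """Illustrative error type; vdsm's real error handling differs."""

    def wait_for_incoming_migration(conn, vm_uuid, finished_event, timeout):
        # `finished_event` stands in for whatever signal vdsm uses to learn
        # from libvirt that the incoming migration completed (an assumption).
        if not finished_event.wait(timeout):
            dom = conn.lookupByUUIDString(vm_uuid)
            state, _reason = dom.state()
            if state != libvirt.VIR_DOMAIN_RUNNING:
                # The domain exists but never reached the running state, so
                # the migration stalled or failed; report that instead of
                # proceeding as if it had succeeded.
                raise IncomingMigrationTimeout(vm_uuid)
        # Either the migration finished normally or it completed despite the
        # timeout (as the logs above show); only then hand back the domain.
        return conn.lookupByUUIDString(vm_uuid)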

Comment 5 Michal Skrivanek 2013-11-26 13:54:02 UTC
The question is what to do in this case. If the incoming migration doesn't finish, we should probably try to cancel it on the destination before moving the VM to Down. This should happen only after the source has tried to do that first, as it should be initiating the abort from the src side in the first place; the dst-side check would then be just a safety net in case of communication issues. We can still end up with a split brain when there is a libvirt communication problem and it refuses to cancel (from the dst point of view) - then we should probably try to cancel and, if that raises an error other than "job not running", check again whether the VM is Up and, if not, forcefully destroy it. Otherwise we risk setting it to Down on the dst (current code), which would in turn trigger HA in the engine to start the VM once again, and we end up with a split brain.

looks quite convoluted…
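
A rough sketch of the destination-side safety net described above, with the caveat that the exact libvirt error code for "job not running" is an assumption here and this is not the logic that was eventually merged:

    import libvirt

    def cancel_incoming_migration(dom):
        # Best-effort cancel of an incoming migration on the destination host.
        try:
            dom.abortJob()  # ask libvirt to cancel the active migration job
        except libvirt.libvirtError as e:
            # Assumption: "job not running" surfaces as VIR_ERR_OPERATION_INVALID;
            # in that case there is nothing left to cancel.
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID:
                return
            state, _reason = dom.state()
            if state != libvirt.VIR_DOMAIN_RUNNING:
                # Cannot cancel cleanly and the VM is not Up: destroy it rather
                # than report it Down and let engine HA restart it elsewhere,
                # which is the split-brain scenario described above.
                dom.destroy()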

I wonder if the current actual behavior (i.e., proceeding with initialization of the paused VM) is problematic at all. What if we focus on destroying the src properly? Right now, if it fails to destroy it (which is the case in this bug), we do nothing.
That may be even worse, though, so perhaps it is better to leave the paused dst VM there for manual resolution?

ideas?

Comment 6 Dan Kenigsberg 2013-11-26 15:30:40 UTC
Michal, you know my opinion here: only Engine can tell which of the qemu processes should die. The code I referred to should spring into action only when Engine is unavailable (if at all; I do not mind dropping the vdsm-side timeout).

Do note that until Vdsm receives a notification from libvirt about the successful finish of the migration, the VM must not leave its "migration destination" state. It is unusable until we have confirmation from the source that all information has arrived.

If there's communication between the source and destination Vdsms, we do not have a problem here. The interesting case is when the destination Vdsm cannot connect to the source. In that case, lacking a confirmation from libvirt, I do not see anything we can do beyond canceling the migration explicitly or by means of destroying qemu at the destination; either by Vdsm's own decision or by Engine's explicit request.

Comment 7 Michal Skrivanek 2013-12-03 14:14:24 UTC
At this point there's little we can do. The only thing we should address right now is the timeout situation at the destination (not moving on assuming success).

Other than that, we can only make things worse from the vdsm point of view, so indeed it should be resolved from the engine side. At least manually, which should be good enough for now.

Comment 11 Michal Skrivanek 2013-12-25 12:06:53 UTC
Decreasing Sev: until the dst VM is manually run, there is no split brain. It actually wouldn't run at all until the migration finishes.
Decreasing Prio as the bug is targeted to 3.3.z.

Comment 12 Vinzenz Feenstra [evilissimo] 2014-01-22 14:51:08 UTC
Merged u/s to master (and included in ovirt-3.4 branch) as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=69cb5099ac7f395010e40179a2e48ecc0e3b1f24

Comment 14 Pavel Novotny 2014-02-18 20:29:40 UTC
Verified upstream in vdsm-4.14.1-2.el6.x86_64.

Verification steps:
1. Preparation: On the destination migration host, set 'migration_destination_timeout' to '120' in the VDSM config.py (located at /usr/lib64/python2.6/site-packages/vdsm/config.py).
   This reduces the verification time; the default is 6 hours.
2. Have a running VM (F19 in my case) with some ongoing memory-stressing operation (I used the `memtester` utility). This should make the migration process long enough to give us time in step 4 to simulate the error-prone environment.
3. Migrate the VM from source host1 to destination host2.
4. Immediately after the migration starts, block on the source host1:
  - connection to the destination host's VDSM (simulating connection loss to dest. VDSM)
  `iptables -I OUTPUT 1 -p tcp -d <host2> --dport 54321 -j DROP`
  - connection to the storage (simulating a migration error)
  `iptables -I OUTPUT 1 -d <storage> -j DROP`
5. Wait `migration_destination_timeout` seconds (120).

Results:
The migration fails (due to our blocking of storage) and is aborted.
On destination host, the migrating VM is destroyed (the host shows 0 running VMs and no VM migrating).
The VM stays on the source host (paused due to inaccessible storage; after unblocking the storage the VM should run as if nothing happened). 
The source host shows 1 running VM and no VM migrating.

Comment 17 errata-xmlrpc 2014-06-09 13:26:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0504.html

