Description of problem:

In one customer's scenario, VMs get paused after migration, apparently because the destination host cannot acquire the resource lock for the VM's storage.

On the source host, shortly after the migration starts:

SRC host:
-----------------
Nov 4 15:46:48 clx29bb13-00 vdsm vm.Vm WARNING vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Migration stalling: dataRemaining (700MiB) > smallest_dataRemaining (700MiB). Refer to RHBZ#919201.
Nov 4 15:46:58 clx29bb13-00 vdsm vm.Vm WARNING vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Migration stalling: dataRemaining (700MiB) > smallest_dataRemaining (700MiB). Refer to RHBZ#919201.
-----------------

-------------------------------------------> The migration stalled (reason unclear; probably network load between the hosts). In that case we usually kill the VM on the destination and rerun it on the source with a "migration failed" error message.

Nov 4 15:46:59 clx29bb13-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::operation failed: Lost connection to destination host

Nov 4 15:48:02 clx29bb13-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Failed to destroy remote VM
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 200, in _recover
    self.destServer.destroy(self._vm.id)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1199, in __call__
    return self.__send(self.__name, args)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1489, in __request
    verbose=self.__verbose
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1235, in request
    self.send_content(h, request_body)
  File "/usr/lib64/python2.6/xmlrpclib.py", line 1349, in send_content
    connection.endheaders()
  File "/usr/lib64/python2.6/httplib.py", line 908, in endheaders
    self._send_output()
  File "/usr/lib64/python2.6/httplib.py", line 780, in _send_output
    self.send(msg)
  File "/usr/lib64/python2.6/httplib.py", line 739, in send
    self.connect()
  File "/usr/share/vdsm/kaxmlrpclib.py", line 151, in connect
    SecureXMLRPCServer.VerifyingHTTPSConnection.connect(self)
  File "/usr/lib64/python2.6/site-packages/vdsm/SecureXMLRPCServer.py", line 165, in connect
    sock = socket.create_connection((self.host, self.port), self.timeout)
  File "/usr/lib64/python2.6/socket.py", line 567, in create_connection
    raise error, msg
error: [Errno 110] Connection timed out

Nov 4 15:50:17 clx29bb13-00 vdsm vm.Vm ERROR vmId=`e8acff0c-50d2-49d1-b150-bd51bbe835c7`::[Errno 110] Connection timed out

Nov 4 15:51:20 clx29bb13-00 vdsm vm.Vm ERROR vmId=`e8acff0c-50d2-49d1-b150-bd51bbe835c7`::Failed to destroy remote VM
(identical "Failed to destroy remote VM" traceback as above, ending in "error: [Errno 110] Connection timed out")

----------------------------------------------> Here's a failure to destroy the machine on the destination.

Nov 4 15:51:20 clx29bb13-00 vdsm vm.Vm ERROR vmId=`e8acff0c-50d2-49d1-b150-bd51bbe835c7`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 270, in run
    self._startUnderlyingMigration()
  File "/usr/share/vdsm/libvirtvm.py", line 459, in _startUnderlyingMigration
    response = self.destServer.migrationCreate(self._machineParams)
  (same xmlrpclib/httplib connection chain as above)
error: [Errno 110] Connection timed out

On the destination host:

DST host:
-----------------
Nov 4 15:51:38 clx29bb06-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Stats function failed: <AdvancedStatsFunction _sampleDisk at 0x2da4ea0>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 164, in _sampleDisk
    diskSamples[vmDrive.name] = self._vm._dom.blockStats(vmDrive.name)
  File "/usr/share/vdsm/libvirtvm.py", line 539, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock

Nov 4 15:52:08 clx29bb06-00 vdsm vm.Vm ERROR vmId=`7d6d4bb9-06c1-4bd2-bdda-15fac52a699b`::Stats function failed: <AdvancedStatsFunction _sampleDiskLatency at 0x2da1570>
Traceback (most recent call last):
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 412, in collect
    statsFunction()
  File "/usr/lib64/python2.6/site-packages/vdsm/utils.py", line 287, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/libvirtvm.py", line 179, in _sampleDiskLatency
    vmDrive.name, flags=libvirt.VIR_TYPED_PARAM_STRING_OKAY)
  File "/usr/share/vdsm/libvirtvm.py", line 539, in f
    raise toe
TimeoutError: Timed out during operation: cannot acquire state change lock

(At 15:53:38 and 15:54:08 the same _sampleDisk and _sampleDiskLatency tracebacks repeat, each ending in "TimeoutError: Timed out during operation: cannot acquire state change lock".)
-----------------

-----------------------------------------------------------> This is the reason the VM pauses on the destination: an I/O error rooted in the inability to acquire the lock for the storage resource.

From the above, the scenario appears to be: migration starts and a duplicate of the VM is created on the DST host; data is migrating; the migration stalls; we try to destroy the VM on the destination and *this fails*; the machine on DST fails to acquire the storage resource lock (as it is still held by the SRC host); the VM on the destination gets paused on an I/O error.

Version-Release number of selected component (if applicable):
SRC host: vdsm-4.10.2-25.1.el6ev.x86_64
DST host: vdsm-4.10.2-25.1.el6ev.x86_64

How reproducible:
Very unstable reproducer; the customer hit this accidentally in a very large-scale environment.

Steps to Reproduce:
Basically described by the scenario itself.

Actual results:
The VM is paused on the DST host; the SRC host fails to destroy the DST VM and, after a timeout, kills the SRC VM and releases the resource locks.

Expected results:
Not sure here, but the current behavior does not seem sane.
Michal, Lee, would you explain why this could be a Vdsm networking issue? It's hard to guess. Vdsm must handle communication failures during migration.

The logs suggest there's a bug in how the destination vdsm treats the timeout. _waitForIncomingMigrationFinish claims that

    # Would fail if migration isn't successful,
    # or restart vdsm if connection to libvirt was lost
    self._dom = NotifyingVirDomain(
        self._connection.lookupByUUIDString(self.id),
        self._timeoutExperienced)

however, the logs show that the lookup actually succeeded, despite the timeout. _waitForIncomingMigrationFinish() should not depend on the failure to initialize _dom. It should probably check that _dom has finished migrating, and raise an exception otherwise.
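The proposed check could be sketched as follows. This is only an illustration with a stand-in domain object, not vdsm's actual API; MigrationError, FakeDomain, and job_active are hypothetical names:

```python
class MigrationError(Exception):
    """The incoming migration did not actually complete."""


class FakeDomain:
    """Stand-in for a libvirt domain; job_active mimics an
    incoming migration job that is still running."""
    def __init__(self, job_active):
        self.job_active = job_active


def wait_for_incoming_migration_finish(dom):
    """Sketch of the proposed fix: successfully looking up the domain
    (dom is not None) only proves that libvirt knows about it, so we
    additionally verify that no migration job is still active, and
    raise instead of silently proceeding as if migration succeeded."""
    if dom is None:
        raise MigrationError("domain not found on destination")
    if dom.job_active:
        raise MigrationError("incoming migration has not finished")
    return dom
```

The point of the sketch is that the success path requires positive confirmation that the job is over, rather than inferring success from the absence of a lookup failure.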
The question is what to do in this case. If an incoming migration doesn't finish, we should probably try to cancel it on the destination first, before moving the VM to Down. This should happen only after the source has tried to do so, since the source should be initiating the abort in the first place; the destination-side check would then be just a safety net for communication issues.

We can still end up with a split brain when there is a libvirt communication problem and libvirt refuses to cancel the migration (from the destination's point of view). In that case we should probably try to cancel anyway and, if that raises an error other than "job not running", check again whether the VM is Up; if it is not, forcefully destroy it. Otherwise we risk setting it Down on the destination (the current code), which would in turn trigger HA in the engine to start the VM once again, and we end up with a split brain.

Looks quite convoluted... I wonder whether the current actual behavior (i.e., proceeding with initialization of the paused VM) is problematic at all. What if we focus on destroying the src properly? Right now, if we fail to destroy it (which is the case in this bug), we do nothing. That may be even worse, though, so perhaps it is better to leave the paused VM on the destination for manual resolution? Ideas?
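The destination-side safety net described above could be outlined roughly like this. Again a hypothetical sketch with stand-in names (JobNotRunning, abort_job, is_up, destroy), not vdsm code:

```python
class JobNotRunning(Exception):
    """Stand-in for libvirt's 'job is not running' error on abort."""


def cancel_incoming_migration(dom):
    """Safety-net outline: try to cancel the incoming migration job;
    if cancellation fails for any reason other than 'job not running',
    re-check whether the VM is actually Up and, if not, destroy it
    forcefully. The VM is never marked Down while a qemu process may
    still be alive, so the engine's HA cannot start a second copy and
    cause a split brain."""
    try:
        dom.abort_job()           # ask libvirt to cancel the incoming job
    except JobNotRunning:
        return                    # nothing left to cancel
    except Exception:
        # Cancellation failed for another reason: re-check the state.
        if not dom.is_up():
            dom.destroy()         # forcefully get rid of the stuck qemu
        # If the VM is Up, leave it alone; marking it Down here would
        # risk the engine starting it again elsewhere (split brain).
```

The key design choice is the asymmetry: a stuck, non-running qemu is destroyed, but a VM that turns out to be Up is left untouched for the engine (or an operator) to resolve.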
Michal, you know my opinion here: only Engine can tell which of the qemu processes should die. The code I referred to should spring into action only when Engine is unavailable (if at all; I would not mind dropping the vdsm-side timeout).

Do note that until Vdsm receives a notification from libvirt about the successful finish of the migration, the VM must not leave its "migration destination" state. It is unusable until we have confirmation from the source that we have all of its information.

If there is communication between the source and destination Vdsms, we do not have a problem here. The interesting case is when the destination Vdsm cannot connect to the source. In that case, lacking a confirmation from libvirt, I do not see anything we can do beyond canceling the migration explicitly, or by destroying qemu at the destination; either by Vdsm's own decision or by Engine's explicit request.
At this point there's little we can do. The only thing we should address right now is the timeout situation at the destination (not moving on expecting success). Other than that we can only make things worse from the vdsm point of view, so it should indeed be resolved from the engine side, at least manually, which should be good enough for now.
Decreasing Severity: until the dst VM is manually run there is no split brain; it actually wouldn't run at all until the migration finishes. Decreasing Priority as the bug is targeted to 3.3.z.
Merged u/s to master (and included in ovirt-3.4 branch) as http://gerrit.ovirt.org/gitweb?p=vdsm.git;a=commit;h=69cb5099ac7f395010e40179a2e48ecc0e3b1f24
Verified upstream in vdsm-4.14.1-2.el6.x86_64.

Verification steps:

1. Preparation: on the destination migration host, set 'migration_destination_timeout' to '120' in the VDSM config.py (located at /usr/lib64/python2.6/site-packages/vdsm/config.py). This reduces the verification time; the default is otherwise 6 hours.

2. Have a running VM (F19 in my case) with some ongoing memory-stressing operation (I used the `memtester` utility). This should make the migration process long enough to give us time in step 4 to simulate the error-prone environment.

3. Migrate the VM from source host1 to destination host2.

4. Immediately after the migration starts, block on the source host1:
   - the connection to the destination host's VDSM (simulating connection loss to the destination VDSM):
     iptables -I OUTPUT 1 -p tcp -d <host2> --dport 54321 -j DROP
   - the connection to the storage (simulating a migration error):
     iptables -I OUTPUT 1 -d <storage> -j DROP

5. Wait `migration_destination_timeout` seconds (120).

Results:

The migration fails (due to our blocking of the storage) and is aborted. On the destination host, the migrating VM is destroyed (the host shows 0 running VMs and no VM migrating). The VM stays on the source host, paused due to inaccessible storage; after unblocking the storage the VM should run as if nothing had happened. The source host shows 1 running VM and no VM migrating.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0504.html