Description of problem:

With the introduction of 'migration_max_time_per_gib_mem' in vdsm-4.13.2-0.13, we have noticed some migrations failing with an error similar to the following:

  "The migration took 133 seconds which is exceeding the configured maximum time for migrations of 128 seconds. The migration will be aborted."

On closer inspection, it appears that the start time for this parameter is captured too soon. In this specific case, the actual data transfer had only been underway for 10 seconds when 'migration_max_time_per_gib_mem' was exceeded. The reason is that multiple migrations had been started simultaneously, and the default value of 3 for 'max_outgoing_migrations' limited concurrent migrations to the point that some VMs took around 120 seconds just to acquire the migration semaphore. Details are provided below in a separate comment.

Version-Release number of selected component (if applicable):
RHEV 3.2 (in this specific customer case)
RHEV-H 6.5 (20140407.0) with vdsm-4.13.2-0.13

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The timer for 'migration_max_time_per_gib_mem' starts when the migration request is initiated, so time spent waiting for the migration semaphore counts against the limit.

Expected results:
The timer should probably be started once the semaphore has been acquired, as opposed to when the migration request was initiated.

Additional info:
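For reference, the 128-second figure above is presumably derived from the guest's memory size. Below is a minimal sketch of that computation, assuming the vdsm default of 64 seconds per GiB for 'migration_max_time_per_gib_mem'; the function and constant names are illustrative, not vdsm's actual identifiers:

    MIGRATION_MAX_TIME_PER_GIB_MEM = 64  # assumed default: seconds per GiB of RAM

    def migration_max_time(mem_size_mib):
        # A configured value of 0 disables the check entirely
        # (as exercised in a later comment).
        return MIGRATION_MAX_TIME_PER_GIB_MEM * mem_size_mib // 1024

    print(migration_max_time(2048))  # -> 128, matching the abort message above

Under that assumption, 128 seconds corresponds to a 2 GiB guest, and the 256-second limit seen later in this bug to a 4 GiB guest.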
The following shows extracts from 'vm.py' annotated relative to vdsm log messages. Notice the times of each.

- extracts from 'vm.py':

305     def run(self):
306         try:
307             startTime = time.time()
308             mstate = ''
309             self._setupVdsConnection()                              [1]
310             self._setupRemoteMachineParams()
311             self._prepareGuest()                                    [2]
312             MigrationSourceThread._ongoingMigrations.acquire()
313             try:
314                 if self._migrationCanceledEvt:
315                     self._raiseAbortError()
316                 self.log.debug("migration semaphore acquired")      [3]
317                 if not mstate:
318                     self._vm.conf['_migrationParams'] = {
319                         'dst': self._dst,
320                         'mode': self._mode,
321                         'method': self._method,
322                         'dstparams': self._dstparams,
323                         'dstqemu': self._dstqemu}
324                     self._vm.saveState()
325                     self._startUnderlyingMigration(startTime)
326                     self._finishSuccessfully()

341     def _startUnderlyingMigration(self, startTime):

376             self._vm.log.debug('starting migration to %s '         [4]
377                                'with miguri %s', duri, muri)

799             if 0 < migrationMaxTime < now - self._startTime:       [5]
800                 self._vm.log.warn('The migration took %d seconds which is '
801                                   'exceeding the configured maximum time '
802                                   'for migrations of %d seconds. The '
803                                   'migration will be aborted.',
804                                   now - self._startTime,
805                                   migrationMaxTime)
806                 abort = True

- corresponding vdsm log messages:

[1] Thread-1424::DEBUG::2014-04-14 18:50:22,115::vm::183::vm.Vm::(_setupVdsConnection) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::Initiating connection with destination
[2] Thread-1424::DEBUG::2014-04-14 18:50:23,043::vm::233::vm.Vm::(_prepareGuest) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::Migration started
[3] Thread-1424::DEBUG::2014-04-14 18:52:21,438::vm::316::vm.Vm::(run) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::migration semaphore acquired
[4] Thread-1424::DEBUG::2014-04-14 18:52:25,947::vm::377::vm.Vm::(_startUnderlyingMigration) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::starting migration to qemu+tls://10.10.x.y/system with miguri tcp://10.10.x.y
[5] Thread-1517::WARNING::2014-04-14 18:52:35,952::vm::805::vm.Vm::(run) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::The migration took 133 seconds which is exceeding the configured maximum time for migrations of 128 seconds. The migration will be aborted.

So the migration stalled for two minutes waiting for the migration semaphore. It then started to actually send data, but after 10 seconds it was aborted because 'migration_max_time_per_gib_mem' had been exceeded. To me, the timer shouldn't start until the actual data transfer starts, as the intent of this parameter is presumably to ensure that data is being transferred at a reasonable rate.
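Working the timeline out from the timestamps above: the migration request started at 18:50:22 and the semaphore was only acquired at 18:52:21, i.e. roughly 119 seconds were spent queued behind 'max_outgoing_migrations'. The actual transfer began at 18:52:25 and was aborted at 18:52:35, so of the 133 seconds counted against the 128-second limit, only about 10 seconds were spent transferring data.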
(In reply to Gordon Watson from comment #1) Indeed, this timer should be relative to the monitor thread's start, e.g. by passing the current time to _startUnderlyingMigration() as its startTime argument.
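A minimal sketch of that change (my reading of the fix; it is consistent with the traceback in the verification comment below, which shows vm.py line 313 calling self._startUnderlyingMigration(time.time())). Only the call site in run() changes, so time spent blocked on the semaphore no longer counts against the deadline:

    MigrationSourceThread._ongoingMigrations.acquire()
    try:
        if self._migrationCanceledEvt:
            self._raiseAbortError()
        self.log.debug("migration semaphore acquired")
        ...
        self._vm.saveState()
        # Take the timestamp only now, after the semaphore is held, so the
        # migration_max_time_per_gib_mem deadline is measured from the
        # start of the actual data transfer:
        self._startUnderlyingMigration(time.time())
        self._finishSuccessfully()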
By the way, this issue was probably introduced by the fix for bug 1069220, since the message "The migration took xxx seconds which is exceeding the configured maximum time for migrations of yyy seconds. The migration will be aborted." is not present in vdsm-4.13.2-0.11.el6ev.
Hi, fix applied on vdsm. I set up the system with two bare-metal hosts and 5 VMs, and also used a stress tool to simulate 90% memory allocation. While the 5 VMs migrate in parallel, the semaphore now looks to be acquired immediately, but the migration timeout is still exceeded. See the log results for the VM that fails to migrate:

1. 12:25:04,816::vm::223::vm.Vm::(_prepareGuest) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::Migration started
2. 12:25:04,842::vm::305::vm.Vm::(run) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::migration semaphore acquired
3. 12:25:04,984::vm::365::vm.Vm::(_startUnderlyingMigration) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::starting migration to qemu+t
4. 12:29:25,060::vm::798::vm.Vm::(run) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::The migration took 260 seconds which is exceeding the configured maximum time for migrations of 256 seconds. The migration will be aborted.

There is also a traceback for the failed VM, and logs are attached (this seems to be the normal traceback for an aborted migration):

Thread-47::ERROR::2014-05-15 12:29:26,034::vm::327::vm.Vm::(run) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 313, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/vm.py", line 395, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/vm.py", line 859, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1178, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
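Worth noting from the timestamps above: the semaphore was acquired within ~30 ms of "Migration started", and the traceback confirms run() now calls self._startUnderlyingMigration(time.time()), so the fix itself is in effect. The abort at 12:29:25 therefore reflects roughly 260 seconds of actual transfer time against a 256-second cap (consistent, assuming the default 64 s/GiB, with a 4 GiB guest); this VM genuinely failed to migrate within the limit rather than being penalised for queueing time.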
Created attachment 895912: logs
When I set 'migration_max_time_per_gib_mem' to '0', the migration failed due to 'Migration is stuck':

Thread-15::DEBUG::2014-05-15 14:34:06,387::fileSD::259::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_BV_real/0c50b717-7b6c-4bdb-aaf0-920bebc2544b/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-15::DEBUG::2014-05-15 14:34:06,415::fileSD::259::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n495 bytes (495 B) copied, 0.00444558 s, 111 kB/s\n'; <rc> = 0
Thread-60::WARNING::2014-05-15 14:34:08,073::vm::807::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::Migration is stuck: Hasn't progressed in 150.043483019 seconds. Aborting.
Thread-60::DEBUG::2014-05-15 14:34:08,090::vm::836::vm.Vm::(stop) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::stopping migration monitor thread
Dummy-31::DEBUG::2014-05-15 14:34:08,207::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) 'dd if=/rhev/data-center/bdda5805-6d65-4b16-aca7-952f8b4fa2f5/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=2457600' (cwd None)
Dummy-31::DEBUG::2014-05-15 14:34:08,539::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n2457600 bytes (2.5 MB) copied, 0.124957 s, 19.7 MB/s\n'; <rc> = 0
Thread-62::DEBUG::2014-05-15 14:34:08,697::vm::742::vm.Vm::(run) vmId=`f34beb28-7122-42a4-aa08-c281072f4197`::setting migration downtime to 50
libvirtEventLoop::DEBUG::2014-05-15 14:34:08,715::vm::5138::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::event Resumed detail 1 opaque None
Thread-63::INFO::2014-05-15 14:34:08,720::vm::833::vm.Vm::(run) vmId=`f34beb28-7122-42a4-aa08-c281072f4197`::Migration Progress: 60 seconds elapsed, 25% of data processed, 25% of mem processed
Thread-52::DEBUG::2014-05-15 14:34:08,720::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 78 edom: 10 level: 2 message: operation aborted: migration job: canceled by client
Thread-52::DEBUG::2014-05-15 14:34:08,724::vm::748::vm.Vm::(cancel) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::canceling migration downtime thread
Thread-59::DEBUG::2014-05-15 14:34:08,725::vm::745::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::migration downtime thread exiting
Thread-52::DEBUG::2014-05-15 14:34:08,725::vm::836::vm.Vm::(stop) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::stopping migration monitor thread
Thread-52::ERROR::2014-05-15 14:34:08,729::vm::229::vm.Vm::(_recover) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::operation aborted: migration job: canceled by client
Thread-65::INFO::2014-05-15 14:34:08,774::vm::833::vm.Vm::(run) vmId=`dad43c2b-721b-4fcb-8393-6de91d23775c`::Migration Progress: 60 seconds elapsed, 13% of data processed, 13% of mem processed
Thread-52::ERROR::2014-05-15 14:34:08,832::vm::327::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 313, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/vm.py", line 395, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/vm.py", line 859, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1178, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
Also, the migration failed in the same way when 'migration_max_time_per_gib_mem' in /usr/lib64/python2.6/site-packages/vdsm/config.py was changed to 1024 seconds, which should be enough for 16 GB of RAM. Logs attached; same exception:

Thread-50::WARNING::2014-05-15 14:53:56,761::vm::807::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::Migration is stuck: Hasn't progressed in 150.047041893 seconds. Aborting.
Thread-50::DEBUG::2014-05-15 14:53:56,779::vm::836::vm.Vm::(stop) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::stopping migration monitor thread
Thread-44::DEBUG::2014-05-15 14:53:57,471::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 78 edom: 10 level: 2 message: operation aborted: migration job: canceled by client
Thread-44::DEBUG::2014-05-15 14:53:57,472::vm::748::vm.Vm::(cancel) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::canceling migration downtime thread
Thread-44::DEBUG::2014-05-15 14:53:57,472::vm::836::vm.Vm::(stop) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::stopping migration monitor thread
Thread-49::DEBUG::2014-05-15 14:53:57,473::vm::745::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::migration downtime thread exiting
Thread-44::ERROR::2014-05-15 14:53:57,473::vm::229::vm.Vm::(_recover) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::operation aborted: migration job: canceled by client

    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/vm.py", line 395, in _startUnderlyingMigration
libvirtError: operation aborted: migration job: canceled by client
Created attachment 895971: vdsm.log
That's exactly how it should work, considering the other migration parameters. Your migration got past the migration_max_time_per_gib_mem check, but was then cancelled for exceeding migration_progress_timeout: due to excessive memory thrashing it had not converged after ~3 minutes. You can see in the log that the amount of memory left to transfer grew to twice its original size, while the link, which is limited by migration_max_bandwidth to 32 MBps, was completely saturated for more than 60 seconds.
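For reference, a minimal sketch of the progress check behind the "Migration is stuck" abort seen above, assuming the monitor tracks a low-water mark of the data remaining; the class and attribute names are illustrative, and the real monitor thread in vm.py differs in detail:

    import time

    MIGRATION_PROGRESS_TIMEOUT = 150  # seconds, per the log message above

    class StallDetector(object):
        """Flag a stall when the data left to transfer has not dropped
        below its previous low-water mark for too long."""
        def __init__(self):
            self.lowmark = None
            self.last_progress = time.time()

        def check(self, data_remaining):
            now = time.time()
            if self.lowmark is None or data_remaining < self.lowmark:
                # New low-water mark: the migration made progress.
                self.lowmark = data_remaining
                self.last_progress = now
            elif now - self.last_progress > MIGRATION_PROGRESS_TIMEOUT:
                # Memory is being dirtied at least as fast as it is sent;
                # cf. "Migration is stuck: Hasn't progressed in %s
                # seconds. Aborting."
                return False  # caller should abort the migration
            return True

With migration_max_bandwidth capping the link at 32 MBps and the guest dirtying memory faster than that, data_remaining never makes a new low, so the 150-second timeout fires no matter how migration_max_time_per_gib_mem is set.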
Verified on VT7.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html