Description of problem:

With the introduction of 'migration_max_time_per_gib_mem' in vdsm-4.13.2-0.13, we have noticed some migrations failing with an error similar to the following:

  "The migration took 133 seconds which is exceeding the configured maximum time for migrations of 128 seconds. The migration will be aborted."

On closer inspection, it appears that the start time for this parameter is captured too soon. In this specific case, the actual data transfer had only been underway for 10 seconds when 'migration_max_time_per_gib_mem' was exceeded. The reason is that multiple migrations had been started simultaneously, and the default value of 3 for 'max_outgoing_migrations' limited concurrent migrations to the point that some VMs took around 120 seconds just to acquire the migration semaphore. Details are provided below in a separate comment.

Version-Release number of selected component (if applicable):
RHEV 3.2 (in this specific customer case)
RHEV-H 6.5 (20140407.0) with vdsm-4.13.2-0.13

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
The timer for 'migration_max_time_per_gib_mem' starts when the migration request is initiated, so time spent waiting for the migration semaphore counts against the limit.

Expected results:
The timer should probably be started once the semaphore has been acquired, as opposed to when the migration request was initiated.

Additional info:
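For reference, the 128-second figure above is presumably derived from the guest's memory size. Below is a minimal sketch of that computation, assuming the vdsm default of 64 seconds per GiB for 'migration_max_time_per_gib_mem'; the function and constant names are illustrative, not vdsm's actual identifiers:

    MIGRATION_MAX_TIME_PER_GIB_MEM = 64  # assumed default: seconds per GiB of RAM

    def migration_max_time(mem_size_mib):
        # A configured value of 0 disables the check entirely
        # (as exercised in a later comment).
        return MIGRATION_MAX_TIME_PER_GIB_MEM * mem_size_mib // 1024

    print(migration_max_time(2048))  # -> 128, matching the abort message above

Under that assumption, 128 seconds corresponds to a 2 GiB guest, and the 256-second limit seen later in this bug to a 4 GiB guest.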
The following shows extracts from 'vm.py' annotated relative to vdsm log messages. Notice the times of each.

- extracts from 'vm.py':

305     def run(self):
306         try:
307             startTime = time.time()
308             mstate = ''
309             self._setupVdsConnection()                              [1]
310             self._setupRemoteMachineParams()
311             self._prepareGuest()                                    [2]
312             MigrationSourceThread._ongoingMigrations.acquire()
313             try:
314                 if self._migrationCanceledEvt:
315                     self._raiseAbortError()
316                 self.log.debug("migration semaphore acquired")      [3]
317                 if not mstate:
318                     self._vm.conf['_migrationParams'] = {
319                         'dst': self._dst,
320                         'mode': self._mode,
321                         'method': self._method,
322                         'dstparams': self._dstparams,
323                         'dstqemu': self._dstqemu}
324                     self._vm.saveState()
325                     self._startUnderlyingMigration(startTime)
326                     self._finishSuccessfully()

341     def _startUnderlyingMigration(self, startTime):

376             self._vm.log.debug('starting migration to %s '         [4]
377                                'with miguri %s', duri, muri)

799             if 0 < migrationMaxTime < now - self._startTime:       [5]
800                 self._vm.log.warn('The migration took %d seconds which is '
801                                   'exceeding the configured maximum time '
802                                   'for migrations of %d seconds. The '
803                                   'migration will be aborted.',
804                                   now - self._startTime,
805                                   migrationMaxTime)
806                 abort = True

- corresponding vdsm log messages:

[1] Thread-1424::DEBUG::2014-04-14 18:50:22,115::vm::183::vm.Vm::(_setupVdsConnection) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::Initiating connection with destination
[2] Thread-1424::DEBUG::2014-04-14 18:50:23,043::vm::233::vm.Vm::(_prepareGuest) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::Migration started
[3] Thread-1424::DEBUG::2014-04-14 18:52:21,438::vm::316::vm.Vm::(run) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::migration semaphore acquired
[4] Thread-1424::DEBUG::2014-04-14 18:52:25,947::vm::377::vm.Vm::(_startUnderlyingMigration) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::starting migration to qemu+tls://10.10.x.y/system with miguri tcp://10.10.x.y
[5] Thread-1517::WARNING::2014-04-14 18:52:35,952::vm::805::vm.Vm::(run) vmId=`09ec5d77-82b2-40ad-8d84-72fa6aa7ed27`::The migration took 133 seconds which is exceeding the configured maximum time for migrations of 128 seconds. The migration will be aborted.

So the migration stalled for two minutes waiting for the migration semaphore. It then started to actually send data, but after 10 seconds it was aborted because 'migration_max_time_per_gib_mem' had been exceeded. To me, the timer shouldn't start until the actual data transfer starts, as the intent of this parameter is presumably to ensure that data is being transferred at a reasonable rate.
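Working the timeline out from the timestamps above: the migration request started at 18:50:22 and the semaphore was only acquired at 18:52:21, i.e. roughly 119 seconds were spent queued behind 'max_outgoing_migrations'. The actual transfer began at 18:52:25 and was aborted at 18:52:35, so of the 133 seconds counted against the 128-second limit, only about 10 seconds were spent transferring data.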
(In reply to Gordon Watson from comment #1) Indeed, this timer should be relative to the monitor thread's start, e.g. by passing the current time to _startUnderlyingMigration() as its startTime argument.
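A minimal sketch of that change (my reading of the fix; it is consistent with the traceback in the verification comment below, which shows vm.py line 313 calling self._startUnderlyingMigration(time.time())). Only the call site in run() changes, so time spent blocked on the semaphore no longer counts against the deadline:

    MigrationSourceThread._ongoingMigrations.acquire()
    try:
        if self._migrationCanceledEvt:
            self._raiseAbortError()
        self.log.debug("migration semaphore acquired")
        ...
        self._vm.saveState()
        # Take the timestamp only now, after the semaphore is held, so the
        # migration_max_time_per_gib_mem deadline is measured from the
        # start of the actual data transfer:
        self._startUnderlyingMigration(time.time())
        self._finishSuccessfully()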
By the way, this issue was probably introduced by the fix for bug 1069220, since the message "The migration took xxx seconds which is exceeding the configured maximum time for migrations of yyy seconds. The migration will be aborted." is not present in vdsm-4.13.2-0.11.el6ev.
Hi, fix applied on vdsm. I set up the system with two bare-metal hosts and 5 VMs, and also used a stress tool to simulate 90% memory allocation. While the 5 VMs migrate in parallel, the semaphore now looks to be acquired immediately, but the migration timeout is still exceeded. See the log results for the VM that fails to migrate:

1. 12:25:04,816::vm::223::vm.Vm::(_prepareGuest) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::Migration started
2. 12:25:04,842::vm::305::vm.Vm::(run) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::migration semaphore acquired
3. 12:25:04,984::vm::365::vm.Vm::(_startUnderlyingMigration) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::starting migration to qemu+t
4. 12:29:25,060::vm::798::vm.Vm::(run) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::The migration took 260 seconds which is exceeding the configured maximum time for migrations of 256 seconds. The migration will be aborted.

There is also a traceback for the failed VM, and logs are attached (this seems to be the normal traceback for an aborted migration):

Thread-47::ERROR::2014-05-15 12:29:26,034::vm::327::vm.Vm::(run) vmId=`240ccbaa-713e-454a-9ffc-371a59b4893b`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 313, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/vm.py", line 395, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/vm.py", line 859, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1178, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
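Worth noting from the timestamps above: the semaphore was acquired within ~30 ms of "Migration started", and the traceback confirms run() now calls self._startUnderlyingMigration(time.time()), so the fix itself is in effect. The abort at 12:29:25 therefore reflects roughly 260 seconds of actual transfer time against a 256-second cap (consistent, assuming the default 64 s/GiB, with a 4 GiB guest); this VM genuinely failed to migrate within the limit rather than being penalised for queueing time.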
Created attachment 895912: logs
When I set 'migration_max_time_per_gib_mem' to '0', the migration failed due to 'Migration is stuck':

Thread-15::DEBUG::2014-05-15 14:34:06,387::fileSD::259::Storage.Misc.excCmd::(getReadDelay) '/bin/dd iflag=direct if=/rhev/data-center/mnt/cmplnt-nas01.scale.openstack.engineering.redhat.com:_BV_real/0c50b717-7b6c-4bdb-aaf0-920bebc2544b/dom_md/metadata bs=4096 count=1' (cwd None)
Thread-15::DEBUG::2014-05-15 14:34:06,415::fileSD::259::Storage.Misc.excCmd::(getReadDelay) SUCCESS: <err> = '0+1 records in\n0+1 records out\n495 bytes (495 B) copied, 0.00444558 s, 111 kB/s\n'; <rc> = 0
Thread-60::WARNING::2014-05-15 14:34:08,073::vm::807::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::Migration is stuck: Hasn't progressed in 150.043483019 seconds. Aborting.
Thread-60::DEBUG::2014-05-15 14:34:08,090::vm::836::vm.Vm::(stop) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::stopping migration monitor thread
Dummy-31::DEBUG::2014-05-15 14:34:08,207::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) 'dd if=/rhev/data-center/bdda5805-6d65-4b16-aca7-952f8b4fa2f5/mastersd/dom_md/inbox iflag=direct,fullblock count=1 bs=2457600' (cwd None)
Dummy-31::DEBUG::2014-05-15 14:34:08,539::storage_mailbox::731::Storage.Misc.excCmd::(_checkForMail) SUCCESS: <err> = '1+0 records in\n1+0 records out\n2457600 bytes (2.5 MB) copied, 0.124957 s, 19.7 MB/s\n'; <rc> = 0
Thread-62::DEBUG::2014-05-15 14:34:08,697::vm::742::vm.Vm::(run) vmId=`f34beb28-7122-42a4-aa08-c281072f4197`::setting migration downtime to 50
libvirtEventLoop::DEBUG::2014-05-15 14:34:08,715::vm::5138::vm.Vm::(_onLibvirtLifecycleEvent) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::event Resumed detail 1 opaque None
Thread-63::INFO::2014-05-15 14:34:08,720::vm::833::vm.Vm::(run) vmId=`f34beb28-7122-42a4-aa08-c281072f4197`::Migration Progress: 60 seconds elapsed, 25% of data processed, 25% of mem processed
Thread-52::DEBUG::2014-05-15 14:34:08,720::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 78 edom: 10 level: 2 message: operation aborted: migration job: canceled by client
Thread-52::DEBUG::2014-05-15 14:34:08,724::vm::748::vm.Vm::(cancel) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::canceling migration downtime thread
Thread-59::DEBUG::2014-05-15 14:34:08,725::vm::745::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::migration downtime thread exiting
Thread-52::DEBUG::2014-05-15 14:34:08,725::vm::836::vm.Vm::(stop) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::stopping migration monitor thread
Thread-52::ERROR::2014-05-15 14:34:08,729::vm::229::vm.Vm::(_recover) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::operation aborted: migration job: canceled by client
Thread-65::INFO::2014-05-15 14:34:08,774::vm::833::vm.Vm::(run) vmId=`dad43c2b-721b-4fcb-8393-6de91d23775c`::Migration Progress: 60 seconds elapsed, 13% of data processed, 13% of mem processed
Thread-52::ERROR::2014-05-15 14:34:08,832::vm::327::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/vm.py", line 313, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/vm.py", line 395, in _startUnderlyingMigration
    None, maxBandwidth)
  File "/usr/share/vdsm/vm.py", line 859, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/vdsm/libvirtconnection.py", line 92, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 1178, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
Also, the migration failed in the same way when 'migration_max_time_per_gib_mem' in /usr/lib64/python2.6/site-packages/vdsm/config.py was changed to 1024 seconds, which should be enough for 16 GB of RAM. Logs attached; same exception:

Thread-50::WARNING::2014-05-15 14:53:56,761::vm::807::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::Migration is stuck: Hasn't progressed in 150.047041893 seconds. Aborting.
Thread-50::DEBUG::2014-05-15 14:53:56,779::vm::836::vm.Vm::(stop) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::stopping migration monitor thread
Thread-44::DEBUG::2014-05-15 14:53:57,471::libvirtconnection::124::root::(wrapper) Unknown libvirterror: ecode: 78 edom: 10 level: 2 message: operation aborted: migration job: canceled by client
Thread-44::DEBUG::2014-05-15 14:53:57,472::vm::748::vm.Vm::(cancel) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::canceling migration downtime thread
Thread-44::DEBUG::2014-05-15 14:53:57,472::vm::836::vm.Vm::(stop) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::stopping migration monitor thread
Thread-49::DEBUG::2014-05-15 14:53:57,473::vm::745::vm.Vm::(run) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::migration downtime thread exiting
Thread-44::ERROR::2014-05-15 14:53:57,473::vm::229::vm.Vm::(_recover) vmId=`67d847f9-93d0-46b0-888a-2a6e8f62d1b3`::operation aborted: migration job: canceled by client

    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/vm.py", line 395, in _startUnderlyingMigration
libvirtError: operation aborted: migration job: canceled by client
Created attachment 895971: vdsm.log
That's exactly how it should work, considering the other migration parameters. Your migration got past the migration_max_time_per_gib_mem check, but was then cancelled for exceeding migration_progress_timeout: due to excessive memory thrashing it had not converged after ~3 minutes. You can see in the log that the amount of memory left to transfer grew to twice its original size, while the link, which is limited by migration_max_bandwidth to 32 MBps, was completely saturated for more than 60 seconds.
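For reference, a minimal sketch of the progress check behind the "Migration is stuck" abort seen above, assuming the monitor tracks a low-water mark of the data remaining; the class and attribute names are illustrative, and the real monitor thread in vm.py differs in detail:

    import time

    MIGRATION_PROGRESS_TIMEOUT = 150  # seconds, per the log message above

    class StallDetector(object):
        """Flag a stall when the data left to transfer has not dropped
        below its previous low-water mark for too long."""
        def __init__(self):
            self.lowmark = None
            self.last_progress = time.time()

        def check(self, data_remaining):
            now = time.time()
            if self.lowmark is None or data_remaining < self.lowmark:
                # New low-water mark: the migration made progress.
                self.lowmark = data_remaining
                self.last_progress = now
            elif now - self.last_progress > MIGRATION_PROGRESS_TIMEOUT:
                # Memory is being dirtied at least as fast as it is sent;
                # cf. "Migration is stuck: Hasn't progressed in %s
                # seconds. Aborting."
                return False  # caller should abort the migration
            return True

With migration_max_bandwidth capping the link at 32 MBps and the guest dirtying memory faster than that, data_remaining never makes a new low, so the 150-second timeout fires no matter how migration_max_time_per_gib_mem is set.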
Verified on VT7.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0159.html