Bug 1188543 - Host fails to flip to maintenance mode due to failed live migrations.
Summary: Host fails to flip to maintenance mode due to failed live migrations.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.4.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3
Assignee: Piotr Kliczewski
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On: 1274670
Blocks: 1221597
 
Reported: 2015-02-03 08:09 UTC by Roman Hodain
Modified: 2019-10-10 09:40 UTC
CC: 22 users

Fixed In Version: ovirt-3-6-0-2
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1221597 (view as bug list)
Environment:
Last Closed: 2016-03-09 19:31:20 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
engine_log (153.54 KB, application/zip) - 2015-12-01 11:37 UTC, Israel Pinto
host_22_vdsm_log (908.90 KB, application/x-xz) - 2015-12-01 11:38 UTC, Israel Pinto
Host_22_vdsm_log_1 (1.32 MB, application/zip) - 2015-12-01 11:39 UTC, Israel Pinto
host_23_vdsm_log (332.96 KB, application/zip) - 2015-12-01 11:39 UTC, Israel Pinto
host_23_vdsm_log_1 (1.79 MB, application/zip) - 2015-12-01 11:40 UTC, Israel Pinto


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0362 0 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC
oVirt gerrit 38397 0 master MERGED virt: Do not count time spent on migrationCreate against timeout 2020-02-19 22:20:31 UTC
oVirt gerrit 51786 0 master MERGED migration: set timeout to 10 mins 2020-02-19 22:20:31 UTC
oVirt gerrit 52548 0 ovirt-3.6 MERGED migration: set timeout to 10 mins 2020-02-19 22:20:31 UTC

Description Roman Hodain 2015-02-03 08:09:25 UTC
Description of problem:
	Live migration fails because of the migration_max_time_per_gib_mem timeout
	before the actual memory transfer starts.

Version-Release number of selected component (if applicable):
	vdsm-4.14.13-3

How reproducible:
	100% in specific environments.

Steps to Reproduce:
	Not clear, but it is most probably related to a very high number of VMs
	running on the source and destination hypervisors (100 and more on each).

Actual results:
	Migration fails with "Fatal error during migration" while the host is in
	the 'preparing for maintenance' state, and consequently the host fails to
	switch to maintenance mode ("Failed to switch Host ..... to Maintenance
	mode.").

Expected results:
	All migrations finish successfully and the host is set into maintenance
	mode.

Additional info:

All of the failed migrations are interrupted because of:

	117888 Thread-482::WARNING::2015-01-14 08:55:34,450::vm::830::vm.Vm::(run) vmId=`2bf4ac79-89c4-4434-baf4-c0d54e62d610`::The migration took 79 seconds which is exceeding the configured maximum time for migrations of 64 seconds. The migration will be aborted.
	
This is sometimes reported even before the memory transfer starts.
We start measuring the time when we acquire the migration semaphore:

	116712 Thread-366::DEBUG::2015-01-14 08:54:13,347::vm::306::vm.Vm::(run) vmId=`2bf4ac79-89c4-4434-baf4-c0d54e62d610`::migration semaphore acquired after 265 seconds
	...
	117765 Thread-366::DEBUG::2015-01-14 08:55:24,442::vm::366::vm.Vm::(_startUnderlyingMigration)	vmId=`2bf4ac79-89c4-4434-baf4-c0d54e62d610`::starting migration to	qemu+tls://x.x.x.x/system with miguri tcp://x.x.x.x
	...
	117888 Thread-482::WARNING::2015-01-14 08:55:34,450::vm::830::vm.Vm::(run) vmId=`2bf4ac79-89c4-4434-baf4-c0d54e62d610`::The migration took 79 seconds which is exceeding the configured maximum time for migrations of 64 seconds. The migration will be aborted.

The measured time includes the time needed to start the VM on the destination
host, including path preparation and so on. This can take a significant amount
of time in busy environments.
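
For illustration only, here is a minimal sketch of the behaviour described above, assuming the abort deadline is simply migration_max_time_per_gib_mem multiplied by the guest memory in GiB and that the clock starts at semaphore acquisition (the helper names are hypothetical, not vdsm's actual code):

import time

MIGRATION_MAX_TIME_PER_GIB_MEM = 64  # seconds per GiB, matching the log above


def migration_deadline(mem_mib):
    """Upper bound for the whole migration, derived from guest memory size."""
    return MIGRATION_MAX_TIME_PER_GIB_MEM * max(1, mem_mib // 1024)


def should_abort(semaphore_acquired_at, mem_mib):
    """The clock starts when the migration semaphore is acquired, so time
    spent in migrationCreate on the destination (path preparation, VM
    creation) counts against the deadline even though no memory has been
    transferred yet."""
    return time.time() - semaphore_acquired_at > migration_deadline(mem_mib)


# A 1 GiB guest gets a 64s deadline; if destination-side setup alone takes
# 70s, the monitor thread aborts the migration before any RAM has moved.
print(should_abort(time.time() - 70, mem_mib=1024))  # True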

There are two issues which cause the problem:

  1) The first is related to LVM operations which take a long time. In this
  case we run _reloadlvs twice, which takes about 20 seconds. See attachment
  TEST_S5a-36.

  2) The second issue is related to starting the VM. There is a locking
  mechanism which prevents multiple VMs from being started at the same time.
  This lock is taken whenever a createVM or destroyVM is triggered. If one
  operation holds the lock for a long time, all the other operations are
  delayed. In this case the operation was waiting for almost 2 minutes. See
  TEST_S1-16.

I will focus on [2].
We call v.destroy() while we are holding the lock. This is supposed to handle
all the hooks and the actual VM destruction. If we get stuck here, we prevent
all the other VMs from being started or stopped.
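
As an illustration of why this hurts, a minimal standalone sketch (hypothetical names, not vdsm's code): one container-wide lock serializes create and destroy, so a single slow destroy delays every incoming migrationCreate queued behind it:

import threading
import time

vm_container_lock = threading.Lock()   # stands in for vdsm's vmContainerLock


def destroy_vm(delay):
    with vm_container_lock:
        # teardown, including after_vm_destroy hooks, runs under the lock
        time.sleep(delay)


def create_vm():
    start = time.time()
    with vm_container_lock:
        waited = time.time() - start
    print("vmContainerLock acquired after %.1f seconds" % waited)


slow_destroy = threading.Thread(target=destroy_vm, args=(3,))
slow_destroy.start()
time.sleep(0.1)    # let the destroy grab the lock first
create_vm()        # blocks ~3s, mirroring the 13-16s delays seen in the logs
slow_destroy.join()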

The logs show that the first delay was experienced by the following migration:

    26183 Thread-332::DEBUG::2015-01-27 08:27:14,805: ... client [x.x.x.x]::call vmMigrationCreate with ... 'vmId': '141ddf1d-23ee-4d70-9f00-26216605e608'
    26315 Thread-332::INFO::2015-01-27 08:27:27,967::clientIF::364::vds::(createVm) vmContainerLock acquired by vm 141ddf1d-23ee-4d70-9f00-26216605e608

It is not that much, but still 13 seconds. This delay grows later as the
operations stack on each other.

I can see that the lock is held by a destroy operation:

    26111 Thread-331::INFO::2015-01-27 08:27:11,896::API::326::vds::(destroy) vmContainerLock acquired by vm 7b81e4d8-9535-428a-9a86-edfd41710460

Thread-331 is the following:

      Thread-331::DEBUG::2015-01-27 08:27:11,890::BindingXMLRPC::1067::vds::(wrapper) client [x.x.x.x]::call vmDestroy with ('7b81e4d8-9535-428a-9a86-edfd41710460',) {}
      Thread-331::INFO::2015-01-27 08:27:11,896::API::326::vds::(destroy) vmContainerLock acquired by vm 7b81e4d8-9535-428a-9a86-edfd41710460
      Thread-331::DEBUG::2015-01-27 08:27:11,903::vm::4689::vm.Vm::(destroy) vmId=`7b81e4d8-9535-428a-9a86-edfd41710460`::destroy Called
      Thread-331::DEBUG::2015-01-27 08:27:27,965::vm::4683::vm.Vm::(deleteVm) vmId=`7b81e4d8-9535-428a-9a86-edfd41710460`::Total desktops after destroy of 7b81e4d8-9535-428a-9a86-edfd41710460 is 66
      Thread-331::DEBUG::2015-01-27 08:27:27,966::BindingXMLRPC::1074::vds::(wrapper) return vmDestroy with {'status': {'message': 'Machine destroyed', 'code': 0}}

The destroy operation took about 16 seconds, and that is the delay we were
experiencing.

So why did this operation take so long? We destroy the VM in the method
Vm.releaseVm:

4613     def releaseVm(self):                                                        
4614         """                                                                     
4615         Stop VM and release all resources                                       
4616         """                                                                     
...
4630                                                                                 
4631         with self._releaseLock:                                                 
4632             if self._released:                                                  
4633                 return {'status': doneCode}                                     
4634                                                                                 
4635             self.log.info('Release VM resources')                               
4636             self.lastStatus = 'Powering down'       
...

If we were going to release the resources we should have logged "Release VM
resources", but we did not. The only possible explanation is that we were
waiting on the self._releaseLock lock and then returned {'status': doneCode}.
It means that something else took the lock before this thread did.

The logs show:

     26109 libvirtEventLoop::INFO::2015-01-27 08:27:11,739::vm::4635::vm.Vm::(releaseVm) vmId=`7b81e4d8-9535-428a-9a86-edfd41710460`::Release VM resources

This is logged just before we triggered the destroy, and it is called from
_onQemuDeath. It was most probably caused by one of the previously interrupted
migrations on the source host (most probably failed because of [1]). The
vmDestroy was called by the same host just to make sure that the VM is really
down.

Further checking shows that the libvirtEventLoop got stuck on

    libvirtEventLoop::DEBUG::2015-01-27 08:27:17,755::utils::642::root::(execCmd) '/usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd' (cwd None)
    libvirtEventLoop::DEBUG::2015-01-27 08:27:27,954::utils::662::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0

This operation takes a long time in other cases as well. I suppose it is
related to the very high number of VMs in the environment.

The workaround should be to increase migration_max_time_per_gib_mem, but there
are a couple of questions:

   - Should we start measuring the time later, when we actually start
	 transferring the data? There was already a related bug which changed this
	 behaviour:
	    https://bugzilla.redhat.com/show_bug.cgi?id=1090109

   - The destroy operation performed under the lock does quite a lot of work
	 and can be delayed for many reasons. Would it be possible to move some of
	 the operations out of the lock (see the sketch after this list)? This may
	 cause many issues in environments with huge numbers of small VMs running
	 on one hypervisor.
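
Regarding the second question, a purely illustrative sketch of what "moving operations out of the lock" could look like; this is not the change that was eventually made, and the method names on the VM object are hypothetical:

import threading

vm_container_lock = threading.Lock()
vm_container = {}


def destroy_vm(vm_id):
    # Hold the container lock only long enough to remove the entry...
    with vm_container_lock:
        vm = vm_container.pop(vm_id, None)
    if vm is None:
        return
    # ...then run hooks and the actual teardown without blocking create or
    # destroy of other VMs. Error handling and re-insertion on failure are
    # deliberately omitted from this sketch.
    vm.run_after_destroy_hooks()
    vm.undefine()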

Comment 8 Eyal Edri 2015-02-25 08:39:54 UTC
3.5.1 is already full of bugs (over 80), and since none of these bugs were added as urgent for the 3.5.1 release in the tracker bug, moving to 3.5.2.

Comment 25 Ilanit Stein 2015-09-03 18:39:08 UTC
Vinzenz,

How should we test the fix of this bug/RFE please?

Thanks,
Ilanit.

Comment 26 Ilanit Stein 2015-09-03 18:41:30 UTC
Found here verification steps already: bug 1221597, comment 4.

Comment 27 Israel Pinto 2015-12-01 11:37:02 UTC
Created attachment 1100847 [details]
engine_log

Comment 28 Israel Pinto 2015-12-01 11:38:13 UTC
Created attachment 1100849 [details]
host_22_vdsm_log

Comment 29 Israel Pinto 2015-12-01 11:39:08 UTC
Created attachment 1100850 [details]
Host_22_vdsm_log_1

Comment 30 Israel Pinto 2015-12-01 11:39:55 UTC
Created attachment 1100851 [details]
host_23_vdsm_log

Comment 31 Israel Pinto 2015-12-01 11:40:39 UTC
Created attachment 1100852 [details]
host_23_vdsm_log_1

Comment 32 Israel Pinto 2015-12-01 11:41:18 UTC
Reproduce the problem,
Setup:
RHEVM: 3.6.0.3-0.1
vdsm: vdsm-cli-4.17.2-1.el7ev.noarch
libvirt: libvirt-1.2.17-4.el7

Cluster with 2 hosts, 76 VMs on one host.
Scenario:
Set the host to maintenance.
The host became non-responsive and failed to switch to maintenance;
68 VMs migrated to the second host and 8 VMs stayed on host 1.

From engine log:
host status:
"Host became non responsive and was not restarted due to Fencing Policy: 50.0 percents of the Hosts in the Cluster have connectivity issues. "
migration status:
"Migration failed  while Host is in 'preparing for maintenance' state.
  Consider manual intervention: stopping/migrating Vms as Host's state will not
  turn to maintenance while VMs are still running on it.(VM: real_load_1184, Source: host23"

From vdsm.log:
Source host:
Thread-391580::ERROR::2015-12-01 08:49:37,277::migration::310::virt.vm::(run) vmId=`f10f030c-31a0-41a4-884c-49161ba1d8b9`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 331, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 76, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 61, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.

Destination host:
Thread-2001::ERROR::2015-12-01 09:04:21,442::vm::769::virt.vm::(_startUnderlyingVm) vmId=`f10f030c-31a0-41a4-884c-49161ba1d8b9`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 739, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2729, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2782, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid 'f10f030c-31a0-41a4-884c-49161ba1d8b9'


The maintenance operation failed; the problem is not solved.

Adding logs.

Comment 33 Francesco Romani 2015-12-01 12:46:27 UTC
I think you hit the same problem as https://gerrit.ovirt.org/#/c/48803/, which is an unrelated (but serious) issue.

Comment 34 Vinzenz Feenstra [evilissimo] 2015-12-01 13:25:17 UTC
Unrelated to this issue, putting back to ON_QA. However, this is blocked by BZ#1274670 for testing.

Comment 35 Michal Skrivanek 2015-12-07 12:00:44 UTC
(In reply to Israel Pinto from comment #32)
> Reproduce the problem,
> Setup:
> RHEVM: 3.6.0.3-0.1
> vdsm: vdsm-cli-4.17.2-1.el7ev.noarch

can you confirm the version? That's very old

Comment 36 Israel Pinto 2015-12-07 12:44:59 UTC
Yes, this is the version; it is a scale setup. I will check if it is possible to upgrade the hosts and recheck.

Comment 37 Israel Pinto 2015-12-07 15:49:35 UTC
retested with updated vdsm:
RHEVM: 3.6.0.3-0.1
VDSM: vdsm-4.17.12-0.el7ev.noarch

I see the same problem:

from event log:
Failed migration of vms:
Migration failed  while Host is in 'preparing for maintenance' state.
  Consider manual intervention: stopping/migrating Vms as Host's state will not
  turn to maintenance while VMs are still running on it.(VM: real_load_1118, Source: [host address], Destination: <UNKNOWN>)

vdsm log:

destination host: 
Thread-3879::ERROR::2015-12-07 15:35:08,114::vm::752::virt.vm::(_startUnderlyingVm) vmId=`94f70ca1-6747-4744-ba73-238e028283fd`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 721, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2777, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2830, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '94f70ca1-6747-4744-ba73-238e028283fd'


source host:
Thread-3712::DEBUG::2015-12-07 15:35:08,068::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"notify_time": 4839239840, "94f70ca1-6747-4744-ba73-238e028283fd": {"status": "Migration Source"}}, "jsonrpc": "2.0", "method": "|virt|VM_status|94f70ca1-6747-4744-ba73-238e028283fd"}
Thread-3712::ERROR::2015-12-07 15:35:08,069::migration::310::virt.vm::(run) vmId=`94f70ca1-6747-4744-ba73-238e028283fd`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 331, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 76, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 61, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.

Comment 40 Michal Skrivanek 2015-12-10 12:33:25 UTC
I checked the hosts (host22 and host23) briefly and it seems the last occurrences did happen on 4.17.12-0 on both sides. I see the migration request arrived at the destination and it started the operation, but the source side timed out with the mentioned JSON-RPC error.
Piotr, can you please take a look? If this is the same problem as we faced on master we need to address it in 3.6 ASAP

Comment 41 Piotr Kliczewski 2015-12-10 12:59:51 UTC
The issue that you mentioned was master only and the patch [1] causing it was never backported to 3.6. I specifically asked to check 3.6 and it was not affected.


[1] https://gerrit.ovirt.org/#/c/46911

Comment 42 Michal Skrivanek 2015-12-10 13:21:54 UTC
(In reply to Piotr Kliczewski from comment #41)
> The issue that you mentioned was master only and the patch [1] causing it
> was never backported to 3.6. I specifically asked to check 3.6 and it was
> not affected.
> 
> 
> [1] https://gerrit.ovirt.org/#/c/46911

Unfortunately this bug indicates/suggests the issue (or a similar one) exists in 3.6 code as well.
I don't see any issue with the migration flow; the JSON-RPC related lack of response seems to be the only "error" here. Hence moving to infra for further investigation.

Comment 43 Piotr Kliczewski 2015-12-10 15:10:21 UTC
Can we check whether the problem exists after applying master fix [1]? 

[1] https://gerrit.ovirt.org/#/c/50265/

Comment 44 Michal Skrivanek 2015-12-12 08:37:12 UTC
(In reply to Piotr Kliczewski from comment #43)
> Can we check whether the problem exists after applying master fix [1]? 
> 
> [1] https://gerrit.ovirt.org/#/c/50265/

I think Israel might be able to test it out

Comment 47 Piotr Kliczewski 2015-12-17 12:17:59 UTC
After the investigation, this is not the same issue as on master. It looks like there is an issue in propagating exceptions. During the tests there were many migrations, and I can see many failures like:

one host:
Thread-7270::ERROR::2015-12-17 11:27:10,749::vm::752::virt.vm::(_startUnderlyingVm) vmId=`4612800c-de1c-4cf1-bf8e-0b96e84826ce`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 721, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2777, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2830, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '4612800c-de1c-4cf1-bf8e-0b96e84826ce'

second host:
Thread-7458::ERROR::2015-12-17 11:27:10,869::migration::310::virt.vm::(run) vmId=`4612800c-de1c-4cf1-bf8e-0b96e84826ce`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 331, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 76, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 61, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.

Please note that the MigrationError occurred outside the context of the API invocation, so the exception was not handled by the communication layer; as a result, the client timed out a bit later.
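
A minimal standalone model of that situation (hypothetical code, not vdsm's): the failure happens in a worker thread spawned by the request handler, so no error response is ever produced, and the caller only notices when its own timeout fires:

from __future__ import print_function
import threading

response = {"value": None}
response_ready = threading.Event()


def do_migration_create():
    # stands in for the destination-side work that raised MigrationError
    raise RuntimeError("Domain not found: no domain with matching uuid")


def handle_request():
    """The RPC dispatch returns immediately; the real work runs elsewhere."""
    def worker():
        try:
            do_migration_create()
            response["value"] = "Done"
            response_ready.set()
        except RuntimeError:
            # The error happens outside the RPC dispatch path, so it is never
            # turned into an error response; the caller is left waiting.
            pass
    threading.Thread(target=worker).start()


def client_call(timeout=2):
    handle_request()
    if not response_ready.wait(timeout):
        raise RuntimeError(
            "No response for JSON-RPC VM.migrationCreate request.")
    return response["value"]


try:
    client_call()
except RuntimeError as exc:
    print(exc)   # the source only ever sees the timeout, not MigrationError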

Comment 48 Piotr Kliczewski 2015-12-17 12:20:03 UTC
Michal, it looks like a virt issue. Can you confirm?

Comment 49 Michal Skrivanek 2016-01-04 15:01:16 UTC
The first host's error is not important.
The second host (the source) is not receiving a response to migrationCreate()... and that one should return within several seconds (in up to "self._loadCorrectedTimeout(config.getint('vars', 'migration_listener_timeout'), doubler=5)" seconds).

Is there any timeout for jsonrpc calls? Shorter than, say, a minute?

Comment 50 Michal Skrivanek 2016-01-04 20:25:41 UTC
I hope you wanted to answer:)
To rephrase - xmlrpc code relied on a potentially long-running migrationCreate call to destination host. Does this work the same with jsonrpc?

While Piotr is out, Francesco, can you please try that with e.g. sleeping on destination side artificially for, say, 2 minutes?

Comment 54 Francesco Romani 2016-01-07 17:00:39 UTC
(In reply to Michal Skrivanek from comment #50)
> I hope you wanted to answer:)
> To rephrase - xmlrpc code relied on a potentially long-running
> migrationCreate call to destination host. Does this work the same with
> jsonrpc?
> 
> While Piotr is out, Francesco, can you please try that with e.g. sleeping on
> destination side artificially for, say, 2 minutes?

- Added a completely bogus delay of 123s before the completion of Vm.migrationCreate() in API.py.
- Tried migration. Dest side:

jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:55,361::API::578::vds::(migrationCreate) Migration create
jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:55,377::vm::4610::virt.vm::(waitForMigrationDestinationPrepare) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::migration destination: waiting for VM creation
Thread-113::DEBUG::2016-01-07 17:50:55,377::vm::674::virt.vm::(_startUnderlyingVm) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::Start
jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:55,378::vm::4615::virt.vm::(waitForMigrationDestinationPrepare) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::migration destination: waiting 36s for path preparation
(trimmed for brevity)

jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:58,143::API::596::vds::(migrationCreate) Destination VM creation succeeded, sleeping for 123s

(trimmed for brevity)
jsonrpc.Executor/4::DEBUG::2016-01-07 17:53:01,244::API::599::vds::(migrationCreate) Destination VM creation succeeded, waking up!


- Src side:
Thread-40::ERROR::2016-01-07 17:51:11,741::migration::311::virt.vm::(run) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 295, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 332, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 116, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 98, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.
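
To make the timing mismatch concrete, here is a self-contained sketch of the scenario reproduced above, with the delays scaled down (hypothetical code, not the jsonrpcvdscli implementation): the caller waits a fixed number of seconds for a response while the handler legitimately takes longer:

from __future__ import print_function
import threading
import time

CLIENT_RESPONSE_TIMEOUT = 3   # what the JSON-RPC client is willing to wait
HANDLER_DURATION = 5          # scaled-down stand-in for the 123s delay above


def slow_migration_create():
    # Destination-side VM creation: path preparation, hooks and so on.
    time.sleep(HANDLER_DURATION)
    return {"status": {"code": 0, "message": "Done"}}


def call_with_timeout(handler, timeout):
    done = threading.Event()
    result = {}

    def run():
        result["value"] = handler()
        done.set()

    threading.Thread(target=run).start()
    if not done.wait(timeout):
        raise RuntimeError(
            "No response for JSON-RPC VM.migrationCreate request.")
    return result["value"]


try:
    call_with_timeout(slow_migration_create, CLIENT_RESPONSE_TIMEOUT)
except RuntimeError as exc:
    print(exc)   # the source gives up even though the destination succeeds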

Comment 55 Michal Skrivanek 2016-01-11 14:05:54 UTC
Moving back to infra, postponing to 3.6.3.
json-rpc seems to assume a short enough response time... which was not the case for xml-rpc. Some of the calls take a bit longer to handle, like migrationCreate. destroy() comes to mind as well, as it can take >60s too.

Comment 56 Piotr Kliczewski 2016-01-11 14:42:26 UTC
We can extend the timeout. Is 60 seconds enough or should we have more?

Comment 57 Michal Skrivanek 2016-01-11 14:54:44 UTC
I'm afraid it's not enough. For the destroy path it used to be 72s at most; perhaps it's reduced after Francesco's changes in 3.6.
But the migrationCreate timeout is dynamic and potentially long. See prepareTimeout in waitForMigrationDestinationPrepare(). I think on a busy setup it can easily be minutes.

Comment 58 Piotr Kliczewski 2016-01-11 15:09:20 UTC
OK, I am going to set 90 seconds by default, with the ability to configure it. Once we perform verification we will check whether it is a good enough value.

Comment 59 Michal Skrivanek 2016-01-12 07:47:00 UTC
I don't think it's enough, with load higher than 15 (which is high, I admit, but on busy systems running real workloads it's not unusual, I've seen >70) the storage creation timeout would exceed 90s
Unfortunately we can't make it proportional since the timeout is on source side...we should probably change the logic to include that time in the overall migration timeout, but that's a bit risky for 3.6
Francesco, thoughts?
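
For a rough feel of the numbers, a worked sketch under explicit assumptions: suppose the load-corrected timeout scales as base * (doubler + running_vms) / doubler (this formula and the 30s base are assumptions for illustration, not necessarily what _loadCorrectedTimeout really does):

from __future__ import print_function


def load_corrected_timeout(base, running_vms, doubler=5):
    # Assumed scaling: the more VMs the host runs, the longer we allow.
    return base * (doubler + running_vms) // doubler


# With an assumed 30s base and doubler=5 (the value quoted in comment 49),
# 15 running VMs already push the destination-side wait to 120s, past a
# fixed 90s client timeout; with 70 VMs it would be 450s.
for vms in (0, 15, 70):
    print("%d running VMs -> %ds timeout" % (vms, load_corrected_timeout(30, vms)))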

Comment 60 Francesco Romani 2016-01-25 16:25:58 UTC
(In reply to Michal Skrivanek from comment #59)
> I don't think it's enough, with load higher than 15 (which is high, I admit,
> but on busy systems running real workloads it's not unusual, I've seen >70)
> the storage creation timeout would exceed 90s
> Unfortunately we can't make it proportional since the timeout is on source
> side...we should probably change the logic to include that time in the
> overall migration timeout, but that's a bit risky for 3.6
> Francesco, thoughts?

(late update)
We settled on 60 after all and I think it is a step in the right direction.
Another sensible approach could be to make the timeouts inter-dependent, e.g.
migrationCreate timeout = storage timeout + delta, but this could lead to an
overly complex and fragile system.

Comment 61 Francesco Romani 2016-01-25 16:48:29 UTC
(In reply to Francesco Romani from comment #60)
> (late update)
> We settled on 60 after all and I think it is a step in the right direction.
> Another sensible approach could be to make the timeouts inter-dependent, e.g.
> migrationCreate timeout = storage timeout + delta, but this could lead to an
> overly complex and fragile system.

600s*

Comment 62 Israel Pinto 2016-02-15 18:30:58 UTC
Verified with:
RHEVM: 3.6.2-0.1.el6
Hosts: 
OS Version: RHEL - 7.2 - 9.el7
Kernel Version: 3.10.0 - 327.4.4.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.7
LIBVIRT Version: libvirt-1.2.17-13.el7_2.2
VDSM Version:vdsm-4.17.19-0.el7ev


Scenario:
1. Cluster with 2 hosts
2. One host running 105 VMs
3. Switch the host with VMs to maintenance

Results:
All VMs migrated to the second host and the first host switched to maintenance
successfully.

Comment 64 errata-xmlrpc 2016-03-09 19:31:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

