Description of problem:
Live migration fails because of a migration_max_time_per_gib_mem timeout before the actual memory transfer starts.

Version-Release number of selected component (if applicable):
vdsm-4.14.13-3

How reproducible:
100% in specific environments.

Steps to Reproduce:
Not clear, but it is most probably related to a very high number of VMs running on the source and destination hypervisors (100 and more on each).

Actual results:
Migration failed due to "Error: Fatal error during migration" while the host is in 'preparing for maintenance' state, and consequently "Failed to switch Host ..... to Maintenance mode."

Expected results:
All migrations finish successfully and the host is set into maintenance mode.

Additional info:
All of the failed migrations are interrupted because of:

117888 Thread-482::WARNING::2015-01-14 08:55:34,450::vm::830::vm.Vm::(run) vmId=`2bf4ac79-89c4-4434-baf4-c0d54e62d610`::The migration took 79 seconds which is exceeding the configured maximum time for migrations of 64 seconds. The migration will be aborted.

This is sometimes reported even before the memory transfer starts. We start to measure the time when we acquire the semaphore for the migration:

116712 Thread-366::DEBUG::2015-01-14 08:54:13,347::vm::306::vm.Vm::(run) vmId=`2bf4ac79-89c4-4434-baf4-c0d54e62d610`::migration semaphore acquired after 265 seconds
...
117765 Thread-366::DEBUG::2015-01-14 08:55:24,442::vm::366::vm.Vm::(_startUnderlyingMigration) vmId=`2bf4ac79-89c4-4434-baf4-c0d54e62d610`::starting migration to qemu+tls://x.x.x.x/system with miguri tcp://x.x.x.x
...
117888 Thread-482::WARNING::2015-01-14 08:55:34,450::vm::830::vm.Vm::(run) vmId=`2bf4ac79-89c4-4434-baf4-c0d54e62d610`::The migration took 79 seconds which is exceeding the configured maximum time for migrations of 64 seconds. The migration will be aborted.

The measured time includes the time needed for the VM start on the destination host, including path preparation and so on.
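The abort threshold scales with guest memory. A minimal sketch of the semantics implied by the log message above (the function names and the rounding are my assumptions, not actual vdsm code):

```python
# Sketch of the migration time limit; names and rounding are assumptions
# based on the log message, not the real vdsm implementation.
def max_migration_time(mem_size_mib, max_time_per_gib=64):
    """Seconds allowed for the whole migration:
    migration_max_time_per_gib_mem per GiB of guest memory,
    with at least one GiB counted."""
    gibs = max(1, mem_size_mib // 1024)
    return gibs * max_time_per_gib

def should_abort(start_time, now, mem_size_mib):
    # The clock starts when the migration semaphore is acquired, i.e.
    # BEFORE the memory transfer begins; that is the root of this bug.
    return (now - start_time) > max_migration_time(mem_size_mib)
```

With a 1 GiB guest and the default of 64 s/GiB, the 79 seconds measured above already exceed the limit even though no memory was transferred yet.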
It can take a significant amount of time in some busy environments. There are two issues which cause the problem:

1) The first is related to LVM operations which take a long time. In this case we run _reloadlvs twice, which takes about 20s here. See attachment TEST_S5a-36.

2) The second issue is related to the start of the VM. There is a locking mechanism which prevents multiple VMs from being started at the same time. This lock is taken whenever a createVM or destroyVM is triggered. If an operation holds the lock for a long time, all the other operations are delayed. In this case the operation was waiting for almost 2 minutes. See TEST_S1-16.

I will focus on [2]. We call v.destroy() while we are holding the lock. This is supposed to handle all the hooks and the actual VM destruction. If we get stuck here, we prevent all the other VMs from being started or stopped. The logs show that the first delay was experienced by the following migration:

26183 Thread-332::DEBUG::2015-01-27 08:27:14,805: ... client [x.x.x.x]::call vmMigrationCreate with ... 'vmId': '141ddf1d-23ee-4d70-9f00-26216605e608'
26315 Thread-332::INFO::2015-01-27 08:27:27,967::clientIF::364::vds::(createVm) vmContainerLock acquired by vm 141ddf1d-23ee-4d70-9f00-26216605e608

It is not that much, but still 13s. This delay grows later as the operations stack on each other.
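The serialization effect can be demonstrated with a minimal sketch. The lock name mimics vdsm's vmContainerLock, but the code is illustrative, not the actual clientIF implementation: a destroy holding the lock while a slow hook runs delays every pending create by roughly the hook's duration.

```python
import threading
import time

vmContainerLock = threading.Lock()

def destroyVm(hook_duration):
    with vmContainerLock:
        # v.destroy() runs here while the lock is held; hooks such as
        # after_vm_destroy/50_vhostmd can block for a long time.
        time.sleep(hook_duration)

def createVm(delays):
    t0 = time.time()
    with vmContainerLock:
        # record how long this create had to wait behind the destroy
        delays.append(time.time() - t0)

delays = []
d = threading.Thread(target=destroyVm, args=(0.3,))
c = threading.Thread(target=createVm, args=(delays,))
d.start()
time.sleep(0.05)   # ensure the destroy grabs the lock first
c.start()
d.join()
c.join()
```

On the real hosts the "hook" ran for ~16 seconds, so every queued createVm/destroyVm was delayed by that much, and the delays compound as requests pile up.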
I can see that the lock is held by a destroy operation:

26111 Thread-331::INFO::2015-01-27 08:27:11,896::API::326::vds::(destroy) vmContainerLock acquired by vm 7b81e4d8-9535-428a-9a86-edfd41710460

Thread-331 is the following:

Thread-331::DEBUG::2015-01-27 08:27:11,890::BindingXMLRPC::1067::vds::(wrapper) client [x.x.x.x]::call vmDestroy with ('7b81e4d8-9535-428a-9a86-edfd41710460',) {}
Thread-331::INFO::2015-01-27 08:27:11,896::API::326::vds::(destroy) vmContainerLock acquired by vm 7b81e4d8-9535-428a-9a86-edfd41710460
Thread-331::DEBUG::2015-01-27 08:27:11,903::vm::4689::vm.Vm::(destroy) vmId=`7b81e4d8-9535-428a-9a86-edfd41710460`::destroy Called
Thread-331::DEBUG::2015-01-27 08:27:27,965::vm::4683::vm.Vm::(deleteVm) vmId=`7b81e4d8-9535-428a-9a86-edfd41710460`::Total desktops after destroy of 7b81e4d8-9535-428a-9a86-edfd41710460 is 66
Thread-331::DEBUG::2015-01-27 08:27:27,966::BindingXMLRPC::1074::vds::(wrapper) return vmDestroy with {'status': {'message': 'Machine destroyed', 'code': 0}}

The destroy operation took about 16s, and that is the delay we were experiencing. So why did this operation take so long? We destroy the VM in the method Vm.releaseVm:

4613     def releaseVm(self):
4614         """
4615         Stop VM and release all resources
4616         """
...
4630
4631         with self._releaseLock:
4632             if self._released:
4633                 return {'status': doneCode}
4634
4635             self.log.info('Release VM resources')
4636             self.lastStatus = 'Powering down'
...

If we were actually going to release the resources, we would log "Release VM resources", but we did not. The only possible explanation is that we were waiting on the lock self._releaseLock and then did return {'status': doneCode}. It means that something else had taken the lock before this thread. The logs show:

26109 libvirtEventLoop::INFO::2015-01-27 08:27:11,739::vm::4635::vm.Vm::(releaseVm) vmId=`7b81e4d8-9535-428a-9a86-edfd41710460`::Release VM resources

This was logged just before we triggered the destroy, and it is called from _onQemuDeath.
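The guard quoted above can be reduced to the following sketch (simplified, with doneCode replaced by plain strings): only the first caller performs the teardown; a second caller, here the API destroy thread, blocks on _releaseLock until the first teardown, including any slow hooks, finishes, and only then returns immediately. That blocking is exactly the ~16s delay observed.

```python
import threading

# Simplified sketch of the releaseVm double-entry guard quoted above;
# the real vdsm code runs hooks and undefines the libvirt domain here.
class Vm:
    def __init__(self):
        self._releaseLock = threading.Lock()
        self._released = False
        self.teardowns = 0

    def releaseVm(self):
        with self._releaseLock:
            if self._released:
                # a late caller waits on the lock for the whole duration
                # of the first teardown, then just returns
                return {'status': 'done (already released)'}
            # "Release VM resources" would be logged at this point
            self.teardowns += 1   # hooks, domain teardown, etc.
            self._released = True
            return {'status': 'done'}
```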
This was most probably caused by one of the previously interrupted migrations on the source host (most probably failed because of [1]). The vmDestroy was called by the same host just to make sure that the VM was really down. Further checking shows that the libvirtEventLoop got stuck on:

libvirtEventLoop::DEBUG::2015-01-27 08:27:17,755::utils::642::root::(execCmd) '/usr/libexec/vdsm/hooks/after_vm_destroy/50_vhostmd' (cwd None)
libvirtEventLoop::DEBUG::2015-01-27 08:27:27,954::utils::662::root::(execCmd) SUCCESS: <err> = ''; <rc> = 0

This operation takes a long time in other cases as well. I suppose it is related to the very high number of VMs in the environment.

The workaround should be to increase migration_max_time_per_gib_mem, but there are a couple of questions:
- Should we start measuring the time later, when we start transferring the data? There was already a related bug which changed this behaviour: https://bugzilla.redhat.com/show_bug.cgi?id=1090109
- The destroy operation inside the lock is rather heavyweight and can be delayed for many reasons. Would it be possible to move some of the operations out of the lock?

This may cause many issues in environments with huge numbers of small VMs running on one hypervisor.
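As a stopgap, the per-GiB limit can be raised in the vdsm configuration. The option name matches the one in the logs, but the section and a suitable value are assumptions and should be verified against the installed vdsm version:

```ini
# /etc/vdsm/vdsm.conf -- restart vdsmd after editing
[vars]
# Seconds allowed per GiB of guest memory before the migration is
# aborted (default 64 in this version); raising it masks, but does not
# fix, the slow lock handover described above.
migration_max_time_per_gib_mem = 256
```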
3.5.1 is already full of bugs (over 80), and since none of these bugs was marked as urgent for the 3.5.1 release in the tracker bug, moving to 3.5.2.
Vinzenz, how should we test the fix of this bug/RFE, please? Thanks, Ilanit.
Found here verification steps already: bug 1221597, comment 4.
Created attachment 1100847 [details] engine_log
Created attachment 1100849 [details] host_22_vdsm_log
Created attachment 1100850 [details] Host_22_vdsm_log_1
Created attachment 1100851 [details] host_23_vdsm_log
Created attachment 1100852 [details] host_23_vdsm_log_1
Reproduced the problem.

Setup:
RHEVM: 3.6.0.3-0.1
vdsm: vdsm-cli-4.17.2-1.el7ev.noarch
libvirt: libvirt-1.2.17-4.el7
Cluster with 2 hosts, 76 VMs on one host.

Scenario: Set the host to maintenance. The host became non-responsive and failed to switch to maintenance; 68 VMs migrated to the second host and 8 VMs stayed on Host1.

From engine log:
host status: "Host became non responsive and was not restarted due to Fencing Policy: 50.0 percents of the Hosts in the Cluster have connectivity issues. "
migration status: "Migration failed while Host is in 'preparing for maintenance' state. Consider manual intervention: stopping/migrating Vms as Host's state will not turn to maintenance while VMs are still running on it.(VM: real_load_1184, Source: host23"

From vdsm.log, source host:

Thread-391580::ERROR::2015-12-01 08:49:37,277::migration::310::virt.vm::(run) vmId=`f10f030c-31a0-41a4-884c-49161ba1d8b9`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 331, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 76, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 61, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.
Destination host:

Thread-2001::ERROR::2015-12-01 09:04:21,442::vm::769::virt.vm::(_startUnderlyingVm) vmId=`f10f030c-31a0-41a4-884c-49161ba1d8b9`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 739, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2729, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2782, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid 'f10f030c-31a0-41a4-884c-49161ba1d8b9'

The maintenance operation failed; the problem is not solved. Adding logs.
I think you hit the same problem as https://gerrit.ovirt.org/#/c/48803/, which is an unrelated (but serious) issue.
Unrelated to this issue, putting back to ON_QA. However, this is blocked by BZ#1274670 for testing.
(In reply to Israel Pinto from comment #32) > Reproduce the problem, > Setup: > RHEVM: 3.6.0.3-0.1 > vdsm: vdsm-cli-4.17.2-1.el7ev.noarch can you confirm the version? That's very old
Yes, this is the version; it is a scale setup. I will check whether it is possible to upgrade the hosts and recheck.
Retested with updated vdsm:
RHEVM: 3.6.0.3-0.1
VDSM: vdsm-4.17.12-0.el7ev.noarch

I see the same problem. From the event log:

Failed migration of vms: Migration failed while Host is in 'preparing for maintenance' state. Consider manual intervention: stopping/migrating Vms as Host's state will not turn to maintenance while VMs are still running on it.(VM: real_load_1118, Source: [host address], Destination: <UNKNOWN>)

vdsm log, destination host:

Thread-3879::ERROR::2015-12-07 15:35:08,114::vm::752::virt.vm::(_startUnderlyingVm) vmId=`94f70ca1-6747-4744-ba73-238e028283fd`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 721, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2777, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2830, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '94f70ca1-6747-4744-ba73-238e028283fd'

source host:

Thread-3712::DEBUG::2015-12-07 15:35:08,068::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"notify_time": 4839239840, "94f70ca1-6747-4744-ba73-238e028283fd": {"status": "Migration Source"}}, "jsonrpc": "2.0", "method": "|virt|VM_status|94f70ca1-6747-4744-ba73-238e028283fd"}
Thread-3712::ERROR::2015-12-07 15:35:08,069::migration::310::virt.vm::(run) vmId=`94f70ca1-6747-4744-ba73-238e028283fd`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 331, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 76, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 61, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.
I checked the hosts (host22 and host23) briefly and it seems the last occurrences did happen on 4.17.12-0 on both sides. I see the migration request arrived at the destination and it started the operation, but the source side timed out with the mentioned JSON-RPC error. Piotr, can you please take a look? If this is the same problem we faced on master, we need to address it in 3.6 ASAP.
The issue that you mentioned was master-only, and the patch [1] causing it was never backported to 3.6. I specifically asked to check 3.6, and it was not affected.

[1] https://gerrit.ovirt.org/#/c/46911
(In reply to Piotr Kliczewski from comment #41)
> The issue that you mentioned was master only and the patch [1] causing it
> was never backported to 3.6. I specifically asked to check 3.6 and it was
> not affected.
>
> [1] https://gerrit.ovirt.org/#/c/46911

Unfortunately this bug suggests that the issue (or a similar one) exists in the 3.6 code as well. I don't see any issue with the migration flow; the lack of a JSON-RPC response seems to be the only "error" here. Hence moving to infra for further investigation.
Can we check whether the problem exists after applying master fix [1]? [1] https://gerrit.ovirt.org/#/c/50265/
(In reply to Piotr Kliczewski from comment #43) > Can we check whether the problem exists after applying master fix [1]? > > [1] https://gerrit.ovirt.org/#/c/50265/ I think Israel might be able to test it out
After the investigation, it is not the same issue as on master. It looks like there is an issue in propagating exceptions. During the tests there were many migrations, and I can see many failures like:

one host:

Thread-7270::ERROR::2015-12-17 11:27:10,749::vm::752::virt.vm::(_startUnderlyingVm) vmId=`4612800c-de1c-4cf1-bf8e-0b96e84826ce`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 721, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2777, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2830, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '4612800c-de1c-4cf1-bf8e-0b96e84826ce'

second host:

Thread-7458::ERROR::2015-12-17 11:27:10,869::migration::310::virt.vm::(run) vmId=`4612800c-de1c-4cf1-bf8e-0b96e84826ce`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 331, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 76, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 61, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.

Please note that the MigrationError occurred outside the context of an API invocation, so the exception was not handled by the communication layer; as a result the client timed out a bit later.
Michal, it looks like a virt issue. Can you confirm?
The first host's error is not important. The second host (the source) is not receiving a response to migrationCreate()... and that call should return within several seconds (in up to "self._loadCorrectedTimeout(config.getint('vars', 'migration_listener_timeout'), doubler=5)" seconds).

Is there any timeout for jsonrpc calls? Shorter than, say, a minute?
I hope you wanted to answer :)

To rephrase: the xmlrpc code relied on a potentially long-running migrationCreate call to the destination host. Does this work the same with jsonrpc?

While Piotr is out, Francesco, can you please try that with e.g. an artificial sleep on the destination side for, say, 2 minutes?
(In reply to Michal Skrivanek from comment #50)
> I hope you wanted to answer:)
> To rephrase - xmlrpc code relied on a potentially long-running
> migrationCreate call to destination host. Does this work the same with
> jsonrpc?
>
> While Piotr is out, Francesco, can you please try that with e.g. sleeping on
> destination side artificially for, say, 2 minutes?

- Added a completely bogus delay of 123s before the completion of Vm.migrationCreate() in API.py.
- Tried migration. Dest side:

jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:55,361::API::578::vds::(migrationCreate) Migration create
jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:55,377::vm::4610::virt.vm::(waitForMigrationDestinationPrepare) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::migration destination: waiting for VM creation
Thread-113::DEBUG::2016-01-07 17:50:55,377::vm::674::virt.vm::(_startUnderlyingVm) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::Start
jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:55,378::vm::4615::virt.vm::(waitForMigrationDestinationPrepare) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::migration destination: waiting 36s for path preparation

(trimmed for brevity)

jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:58,143::API::596::vds::(migrationCreate) Destination VM creation succeeded, sleeping for 123s

(trimmed for brevity)

jsonrpc.Executor/4::DEBUG::2016-01-07 17:53:01,244::API::599::vds::(migrationCreate) Destination VM creation succeeded, waking up!
- Src side:

Thread-40::ERROR::2016-01-07 17:51:11,741::migration::311::virt.vm::(run) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 295, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 332, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 116, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 98, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.
Moving back to infra, postponing to 3.6.3.

json-rpc seems to assume a short enough response time, which was not the case for xml rpc. Some of the calls take a bit longer to handle, like migrationCreate. destroy() comes to mind as well, as it can take >60s too.
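The mismatch can be sketched as follows. This is a hypothetical model of the source-side wait, not the actual jsonrpcvdscli code: the client waits a fixed time for the response and gives up, even though the destination may still be legitimately working on migrationCreate.

```python
import queue

# Hypothetical sketch: a JSON-RPC client that enforces a fixed
# client-side wait on the response, regardless of how long the server
# side legitimately needs.
class JsonRpcNoResponseError(Exception):
    pass

def call_method(send_request, responses, method, timeout):
    send_request(method)                       # fire the request
    try:
        return responses.get(timeout=timeout)  # fixed client-side wait
    except queue.Empty:
        # the server may still be working; the client gives up anyway
        raise JsonRpcNoResponseError(method)
```

If the destination's migrationCreate takes longer than the client timeout (as in the artificial 123s delay experiment above), the source raises JsonRpcNoResponseError and the migration is reported as failed.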
We can extend the timeout. Is 60 seconds enough, or should we have more?
I'm afraid it's not enough. For the destroy path it used to be 72s at most; perhaps it's reduced after Francesco's changes in 3.6. But migrationCreate is dynamic and potentially long. See prepareTimeout in waitForMigrationDestinationPrepare(). I think on a busy setup it can easily be minutes.
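For illustration, a load-corrected timeout along these lines grows with the number of VMs on the host. This is a hypothetical sketch; the real _loadCorrectedTimeout in vdsm may differ in detail:

```python
# Hypothetical sketch of a load-corrected timeout: the base value is
# scaled up by the host's VM count, so a fixed source-side RPC timeout
# cannot match the destination's actual preparation time.
def load_corrected_timeout(base, load, doubler=10):
    """Scale `base` seconds by host load: the timeout doubles once the
    load reaches `doubler`, triples at 2*doubler, and so on."""
    return base * (doubler + load) // doubler
```

With doubler=5 (as in the migration_listener_timeout call quoted in comment 49) and ~100 VMs, a 30s base becomes 630s on the destination, while the source gives up after a fixed 60-90s.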
OK, I am going to set 90 secs by default, with the ability to configure it. Once we perform verification we will check whether it is a good enough value.
I don't think it's enough. With a load higher than 15 (which is high, I admit, but not unusual on busy systems running real workloads; I've seen >70) the storage creation timeout would exceed 90s. Unfortunately we can't make it proportional, since the timeout is on the source side. We should probably change the logic to include that time in the overall migration timeout, but that's a bit risky for 3.6. Francesco, thoughts?
(In reply to Michal Skrivanek from comment #59)
> I don't think it's enough, with load higher than 15 (which is high, I admit,
> but on busy systems running real workloads it's not unusual, I've seen >70)
> the storage creation timeout would exceed 90s
> Unfortunately we can't make it proportional since the timeout is on source
> side...we should probably change the logic to include that time in the
> overall migration timeout, but that's a bit risky for 3.6
> Francesco, thoughts?

(late update) We settled on 60 after all, and I think it is a step in the right direction. Another sensible approach could be to make the timeouts inter-dependent, e.g. migrationCreate timeout = storage timeout + delta, but this could lead to a too complex and fragile system.
(In reply to Francesco Romani from comment #60) > (late update) > We settled with 60 after all and I think is a step in the right direction. > Another sensible approach could be make timeouts inter-dependent, e.g. > migrationCreate timeout = storage timeout + delta, but this can lead to a > too complex and fragile system after all. 600s*
Verified with:
RHEVM: 3.6.2-0.1.el6
Hosts:
OS Version: RHEL - 7.2 - 9.el7
Kernel Version: 3.10.0 - 327.4.4.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.7
LIBVIRT Version: libvirt-1.2.17-13.el7_2.2
VDSM Version: vdsm-4.17.19-0.el7ev

Scenario:
1. Cluster with 2 hosts
2. One host running 105 VMs
3. Switch the host with the VMs to maintenance

Results: All VMs migrated to the second host and the first host switched to maintenance successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2016-0362.html