Bug 1188543
Summary: Host fails to flip to maintenance mode due to failed live migrations.

Product: Red Hat Enterprise Virtualization Manager
Component: vdsm
Version: 3.4.3
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Roman Hodain <rhodain>
Assignee: Piotr Kliczewski <pkliczew>
QA Contact: Israel Pinto <ipinto>
CC: amarchuk, amureini, bazulay, ederevea, fromani, gklein, iheim, ipinto, istein, lpeer, lsurette, mavital, mgoldboi, michal.skrivanek, oourfali, pkliczew, rhodain, sherold, vfeenstr, vicente.balaguer, yeylon, ykaul
Target Milestone: ovirt-3.6.3
Target Release: 3.6.3
Keywords: ZStream
Fixed In Version: ovirt-3-6-0-2
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-03-09 19:31:20 UTC
oVirt Team: Infra
Bug Depends On: 1274670
Bug Blocks: 1221597
: 1221597 (view as bug list)
Attachments: engine_log, host_22_vdsm_log, Host_22_vdsm_log_1, host_23_vdsm_log, host_23_vdsm_log_1
Description
Roman Hodain
2015-02-03 08:09:25 UTC
3.5.1 is already full with bugs (over 80), and since none of these bugs were added as urgent for the 3.5.1 release in the tracker bug, moving to 3.5.2.

Vinzenz, how should we test the fix of this bug/RFE, please? Thanks, Ilanit.

Verification steps were already found here: bug 1221597, comment 4.

Created attachment 1100847 [details]
engine_log
Created attachment 1100849 [details]
host_22_vdsm_log
Created attachment 1100850 [details]
Host_22_vdsm_log_1
Created attachment 1100851 [details]
host_23_vdsm_log
Created attachment 1100852 [details]
host_23_vdsm_log_1
Reproduced the problem.

Setup:
RHEVM: 3.6.0.3-0.1
vdsm: vdsm-cli-4.17.2-1.el7ev.noarch
libvirt: libvirt-1.2.17-4.el7
Cluster with 2 hosts; 76 VMs on one host.

Scenario: set the host to maintenance.

Result: the host became Non Responsive and failed to switch to maintenance; 68 VMs migrated to the second host and 8 VMs stayed on Host1.

From the engine log, host status:

"Host became non responsive and was not restarted due to Fencing Policy: 50.0 percents of the Hosts in the Cluster have connectivity issues."

Migration status:

"Migration failed while Host is in 'preparing for maintenance' state. Consider manual intervention: stopping/migrating Vms as Host's state will not turn to maintenance while VMs are still running on it.(VM: real_load_1184, Source: host23"

From vdsm.log on the source host:

```
Thread-391580::ERROR::2015-12-01 08:49:37,277::migration::310::virt.vm::(run) vmId=`f10f030c-31a0-41a4-884c-49161ba1d8b9`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 331, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 76, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 61, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.
```
Destination host:

```
Thread-2001::ERROR::2015-12-01 09:04:21,442::vm::769::virt.vm::(_startUnderlyingVm) vmId=`f10f030c-31a0-41a4-884c-49161ba1d8b9`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 739, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2729, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2782, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid 'f10f030c-31a0-41a4-884c-49161ba1d8b9'
```

The maintenance operation failed; the problem was not solved. Adding logs.

I think you hit the same problem as https://gerrit.ovirt.org/#/c/48803/, which is an unrelated (but serious) issue.

Unrelated to this issue; putting back on ON_QA. However, this is blocked by BZ#1274670 for testing.

(In reply to Israel Pinto from comment #32)
> Reproduce the problem,
> Setup:
> RHEVM: 3.6.0.3-0.1
> vdsm: vdsm-cli-4.17.2-1.el7ev.noarch

Can you confirm the version? That's very old.

Yes, this is the version. It is a scale setup; I will check if it is possible to upgrade the hosts and recheck.

Retested with updated vdsm:
RHEVM: 3.6.0.3-0.1
VDSM: vdsm-4.17.12-0.el7ev.noarch

I see the same problem. From the event log:

Failed migration of vms: Migration failed while Host is in 'preparing for maintenance' state. Consider manual intervention: stopping/migrating Vms as Host's state will not turn to maintenance while VMs are still running on it.(VM: real_load_1118, Source: [host address], Destination: <UNKNOWN>)

From the vdsm log on the destination host:

```
Thread-3879::ERROR::2015-12-07 15:35:08,114::vm::752::virt.vm::(_startUnderlyingVm) vmId=`94f70ca1-6747-4744-ba73-238e028283fd`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 721, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2777, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2830, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '94f70ca1-6747-4744-ba73-238e028283fd'
```

Source host:

```
Thread-3712::DEBUG::2015-12-07 15:35:08,068::__init__::206::jsonrpc.Notification::(emit) Sending event {"params": {"notify_time": 4839239840, "94f70ca1-6747-4744-ba73-238e028283fd": {"status": "Migration Source"}}, "jsonrpc": "2.0", "method": "|virt|VM_status|94f70ca1-6747-4744-ba73-238e028283fd"}
Thread-3712::ERROR::2015-12-07 15:35:08,069::migration::310::virt.vm::(run) vmId=`94f70ca1-6747-4744-ba73-238e028283fd`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 331, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 76, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 61, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.
```
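The repeating failure pattern above is a client-side timeout: the source host sends VM.migrationCreate and gives up before the destination replies. A minimal sketch of that pattern (hypothetical class and method names, not vdsm's actual jsonrpcvdscli code), assuming the client waits on an event that is set when a response arrives:

```python
import threading


class JsonRpcNoResponseError(Exception):
    """Mirrors the error seen in the logs: the request was sent but no
    response arrived before the client-side timeout expired."""
    def __init__(self, method):
        super().__init__(
            "[-32605] No response for JSON-RPC %s request." % method)


class ToyJsonRpcClient:
    """Hypothetical client: each call waits on an Event that a responder
    thread sets once the (simulated) server finishes the request."""

    def __init__(self, timeout):
        self._timeout = timeout  # seconds to wait for a response

    def call(self, method, server_delay):
        responded = threading.Event()
        # Simulate the server completing the request after server_delay
        # seconds; in this bug, migrationCreate could take minutes.
        threading.Timer(server_delay, responded.set).start()
        if not responded.wait(self._timeout):
            raise JsonRpcNoResponseError(method)
        return "done"
```

With a 0.2s client timeout, a fast "server" succeeds while a slower one raises, which is exactly the asymmetry the thread goes on to diagnose: the destination keeps working on the migration, but the source has already given up.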
I checked the hosts (host22 and host23) briefly, and it seems the last occurrences did happen on 4.17.12-0 on both sides. I see that the migration request arrived at the destination and it started the operation, but the source side timed out with the mentioned JSON-RPC error. Piotr, can you please take a look? If this is the same problem as we faced on master, we need to address it in 3.6 ASAP.

The issue that you mentioned was master-only, and the patch [1] causing it was never backported to 3.6. I specifically asked to check 3.6 and it was not affected.

[1] https://gerrit.ovirt.org/#/c/46911

(In reply to Piotr Kliczewski from comment #41)
> The issue that you mentioned was master only and the patch [1] causing it
> was never backported to 3.6. I specifically asked to check 3.6 and it was
> not affected.
>
> [1] https://gerrit.ovirt.org/#/c/46911

Unfortunately, this bug suggests the issue (or a similar one) exists in the 3.6 code as well. I don't see any issue with the migration flow; the JSON-RPC lack of response seems to be the only "error" here. Hence, moving to infra for further investigation.

Can we check whether the problem exists after applying the master fix [1]?

[1] https://gerrit.ovirt.org/#/c/50265/

(In reply to Piotr Kliczewski from comment #43)
> Can we check whether the problem exists after applying master fix [1]?
>
> [1] https://gerrit.ovirt.org/#/c/50265/

I think Israel might be able to test it out.

After the investigation, it is not the same issue as on master. It looks like there is an issue in propagating exceptions.
During the tests there were many migrations, and I can see many failures like this. On one host:

```
Thread-7270::ERROR::2015-12-17 11:27:10,749::vm::752::virt.vm::(_startUnderlyingVm) vmId=`4612800c-de1c-4cf1-bf8e-0b96e84826ce`::Failed to start a migration destination vm
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 721, in _startUnderlyingVm
    self._completeIncomingMigration()
  File "/usr/share/vdsm/virt/vm.py", line 2777, in _completeIncomingMigration
    self._incomingMigrationFinished.isSet(), usedTimeout)
  File "/usr/share/vdsm/virt/vm.py", line 2830, in _attachLibvirtDomainAfterMigration
    raise MigrationError(e.get_error_message())
MigrationError: Domain not found: no domain with matching uuid '4612800c-de1c-4cf1-bf8e-0b96e84826ce'
```

On the second host:

```
Thread-7458::ERROR::2015-12-17 11:27:10,869::migration::310::virt.vm::(run) vmId=`4612800c-de1c-4cf1-bf8e-0b96e84826ce`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 294, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 331, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 76, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 61, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.
```

Please note that the MigrationError occurred outside the context of an API invocation, so the exception was not handled by the communication layer; as a result, the client timed out a bit later. Michal, it looks like a virt issue. Can you confirm?
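The observation above — a MigrationError raised outside an API invocation never reaches the communication layer — can be illustrated with a toy dispatcher (hypothetical names, not vdsm's actual code). Exceptions that propagate out of the handler become error responses; a failure inside a thread the handler spawned is only logged, so the peer never receives an error and can only time out:

```python
import threading

log = []  # stand-in for the vdsm log file


def dispatch(handler):
    """Toy RPC dispatcher: it can only turn exceptions that propagate
    out of the handler itself into error responses for the caller."""
    try:
        return ("result", handler())
    except Exception as exc:
        return ("error", str(exc))


def failing_handler():
    # Raised in the request context: the dispatcher catches it and the
    # caller gets a proper error response.
    raise RuntimeError("boom")


def handler_with_background_failure():
    # The real work fails in a background thread, as with the migration
    # destination VM startup in this bug. The error is logged locally
    # but never becomes an RPC error response.
    def work():
        try:
            raise RuntimeError("Domain not found: no matching uuid")
        except RuntimeError as exc:
            log.append("ERROR: %s" % exc)  # logged, never sent to peer
    t = threading.Thread(target=work)
    t.start()
    t.join()
    return "Migration Destination"
```

The second handler reports success to the dispatcher even though its background work failed, matching the logs above: the destination logs a MigrationError while the source sees only silence.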
The first host's error is not important. The second host (the source) is not receiving a response to migrationCreate(), and that call should return within several seconds (in up to "self._loadCorrectedTimeout(config.getint('vars', 'migration_listener_timeout'), doubler=5)" seconds). Is there any timeout for jsonrpc calls shorter than, say, a minute?

I hope you wanted to answer :) To rephrase: the xmlrpc code relied on a potentially long-running migrationCreate call to the destination host. Does this work the same with jsonrpc?

While Piotr is out, Francesco, can you please try that with e.g. sleeping on the destination side artificially for, say, 2 minutes?

(In reply to Michal Skrivanek from comment #50)
> I hope you wanted to answer:)
> To rephrase - xmlrpc code relied on a potentially long-running
> migrationCreate call to destination host. Does this work the same with
> jsonrpc?
>
> While Piotr is out, Francesco, can you please try that with e.g. sleeping on
> destination side artificially for, say, 2 minutes?

- Added a completely bogus delay of 123s before the completion of Vm.migrationCreate() in API.py.
- Tried a migration.
Destination side:

```
jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:55,361::API::578::vds::(migrationCreate) Migration create
jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:55,377::vm::4610::virt.vm::(waitForMigrationDestinationPrepare) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::migration destination: waiting for VM creation
Thread-113::DEBUG::2016-01-07 17:50:55,377::vm::674::virt.vm::(_startUnderlyingVm) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::Start
jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:55,378::vm::4615::virt.vm::(waitForMigrationDestinationPrepare) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::migration destination: waiting 36s for path preparation
(trimmed for brevity)
jsonrpc.Executor/4::DEBUG::2016-01-07 17:50:58,143::API::596::vds::(migrationCreate) Destination VM creation succeeded, sleeping for 123s
(trimmed for brevity)
jsonrpc.Executor/4::DEBUG::2016-01-07 17:53:01,244::API::599::vds::(migrationCreate) Destination VM creation succeeded, waking up!
```

Source side:

```
Thread-40::ERROR::2016-01-07 17:51:11,741::migration::311::virt.vm::(run) vmId=`b29c12fe-d0ce-4bed-b2f5-184488854d37`::Failed to migrate
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/migration.py", line 295, in run
    self._startUnderlyingMigration(time.time())
  File "/usr/share/vdsm/virt/migration.py", line 332, in _startUnderlyingMigration
    result = self._destServer.migrationCreate(self._machineParams)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 116, in migrationCreate
    params)
  File "/usr/lib/python2.7/site-packages/vdsm/jsonrpcvdscli.py", line 98, in _callMethod
    raise JsonRpcNoResponseError(method)
JsonRpcNoResponseError: [-32605] No response for JSON-RPC VM.migrationCreate request.
```

Moving back to infra; postponing to 3.6.3. json-rpc seems to assume a short enough response time, which was not the case for xmlrpc. Some of the calls take a bit longer to handle, like migrationCreate. destroy() comes to mind as well, as it can take >60s too.

We can extend the timeout.
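Extending the timeout, as proposed here, amounts to making the client-side response deadline a configuration value with a sane default. A minimal sketch using `configparser` (the option name `migration_create_timeout` and the default of 90 are illustrative, not vdsm's actual config schema):

```python
from configparser import ConfigParser

DEFAULT_TIMEOUT = 90  # illustrative default; admins can override it


def migration_create_timeout(cfg):
    """Read the per-call timeout from the [vars] section, falling back
    to the default when the option (or the whole section) is absent."""
    return cfg.getint('vars', 'migration_create_timeout',
                      fallback=DEFAULT_TIMEOUT)
```

An empty config yields the default, while a deployed override (say, 600 seconds on a heavily loaded cluster) takes precedence, which is the "ability to configure it" discussed below.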
Is 60 seconds enough, or should we have more?

I'm afraid it's not enough. For the destroy path it used to be 72s at most; perhaps it's reduced after Francesco's changes in 3.6. But migrationCreate is dynamic and potentially long. See prepareTimeout in waitForMigrationDestinationPrepare(). I think on a busy setup it can easily be minutes.

OK, I am going to set 90 secs by default, with the ability to configure it. Once we perform verification, we will check whether that is a good enough value.

I don't think it's enough. With load higher than 15 (which is high, I admit, but on busy systems running real workloads it's not unusual; I've seen >70), the storage creation timeout would exceed 90s. Unfortunately, we can't make it proportional, since the timeout is on the source side. We should probably change the logic to include that time in the overall migration timeout, but that's a bit risky for 3.6. Francesco, thoughts?

(In reply to Michal Skrivanek from comment #59)
> I don't think it's enough, with load higher than 15 (which is high, I admit,
> but on busy systems running real workloads it's not unusual, I've seen >70)
> the storage creation timeout would exceed 90s
> Unfortunately we can't make it proportional since the timeout is on source
> side...we should probably change the logic to include that time in the
> overall migration timeout, but that's a bit risky for 3.6
> Francesco, thoughts?

(late update) We settled with 60 after all, and I think it is a step in the right direction. Another sensible approach could be to make the timeouts inter-dependent, e.g. migrationCreate timeout = storage timeout + delta, but this could lead to a too complex and fragile system after all.

(In reply to Francesco Romani from comment #60)
> (late update)
> We settled with 60 after all and I think is a step in the right direction.
> Another sensible approach could be make timeouts inter-dependent, e.g.
> migrationCreate timeout = storage timeout + delta, but this can lead to a
> too complex and fragile system after all.

600s*

Verified with:
RHEVM: 3.6.2-0.1.el6
Hosts:
OS Version: RHEL - 7.2 - 9.el7
Kernel Version: 3.10.0 - 327.4.4.el7.x86_64
KVM Version: 2.3.0 - 31.el7_2.7
LIBVIRT Version: libvirt-1.2.17-13.el7_2.2
VDSM Version: vdsm-4.17.19-0.el7ev

Scenario:
1. Cluster with 2 hosts
2. One host running 105 VMs
3. Switch the host with VMs to maintenance

Results: All VMs migrated to the second host, and the first host switched to maintenance successfully.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html
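The two alternatives debated in the thread — a load-corrected timeout (as in the quoted `_loadCorrectedTimeout` call) and inter-dependent timeouts (migrationCreate timeout = storage timeout + delta) — can both be sketched as simple scaling rules. The formulas here are purely illustrative and are not vdsm's actual implementations:

```python
def load_corrected_timeout(base, load, doubler=5):
    """Illustrative scaling: the timeout doubles for every `doubler`
    units of host load, so a busy host is given more time."""
    return base * (1 + load / doubler)


def dependent_migration_create_timeout(storage_timeout, delta=30):
    """Illustrative inter-dependent variant: give the source slightly
    more patience than the destination's storage preparation may take
    (`delta` is a hypothetical safety margin)."""
    return storage_timeout + delta
```

The load-corrected form illustrates why a fixed 90s can be too short: at load 15 with a 36s base, the corrected value is already 144s. The inter-dependent form avoids guessing but, as noted above, couples timeouts on two hosts, which the thread judged too fragile for 3.6.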