Bug 998239
| Field | Value |
|---|---|
| Summary | [SCALE][Live Storage Migration] [NFS] Failed Live Storage Migration on NFS Data Center under heavy load |
| Product | Red Hat Enterprise Virtualization Manager |
| Reporter | vvyazmin <vvyazmin> |
| Component | vdsm |
| Assignee | Daniel Erez <derez> |
| Status | CLOSED ERRATA |
| QA Contact | Eldad Marciano <emarcian> |
| Severity | high |
| Priority | unspecified |
| Version | 3.3.0 |
| CC | acanan, amureini, bazulay, fsimonce, gklein, lpeer, oourfali, scohen, yeylon, yobshans |
| Target Milestone | ovirt-3.6.0-rc |
| Target Release | 3.6.0 |
| Keywords | Reopened |
| Flags | scohen: Triaged+ |
| Hardware | x86_64 |
| OS | Linux |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2016-03-09 19:18:34 UTC |
| oVirt Team | Storage |
| Bug Depends On | 999896, 1040476 |
Given the heavy load on the VM disk (the OS was being installed), flushing the data to the (async) replica on the destination took longer than the engine timeout (3 minutes). The engine therefore sent diskReplicateFinish a second time, triggering a race on removal of the 'diskReplicate' key:

```
Thread-198282::DEBUG::2013-08-18 12:34:11,303::BindingXMLRPC::974::vds::(wrapper) client [10.35.161.29]::call vmDiskReplicateFinish with ('373c4014-d74d-4f6b-9750-729c6d3f847f', {'device': 'disk', 'domainID': 'e130e4dd-97ae-48e1-9210-136ac204d573', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'}, {'device': 'disk', 'domainID': 'f7cb7d3b-ac55-44eb-9afb-6c2439c521ab', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'}) {} flowID [35e5febd]

Thread-198756::DEBUG::2013-08-18 12:37:11,332::BindingXMLRPC::974::vds::(wrapper) client [10.35.161.29]::call vmDiskReplicateFinish with ('373c4014-d74d-4f6b-9750-729c6d3f847f', {'device': 'disk', 'domainID': 'e130e4dd-97ae-48e1-9210-136ac204d573', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'}, {'device': 'disk', 'domainID': 'e130e4dd-97ae-48e1-9210-136ac204d573', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'}) {} flowID [35e5febd]

Thread-198756::ERROR::2013-08-18 12:37:41,001::vm::3897::vm.Vm::(diskReplicateFinish) vmId=`373c4014-d74d-4f6b-9750-729c6d3f847f`::Replication job not found for disk vda ({'device': 'disk', 'domainID': 'e130e4dd-97ae-48e1-9210-136ac204d573', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'})

Thread-198756::DEBUG::2013-08-18 12:37:41,029::BindingXMLRPC::981::vds::(wrapper) return vmDiskReplicateFinish with {'status': {'message': 'Drive replication error', 'code': 55}}

Thread-198282::ERROR::2013-08-18 12:40:27,936::BindingXMLRPC::993::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 377, in vmDiskReplicateFinish
    return vm.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/share/vdsm/API.py", line 536, in diskReplicateFinish
    return v.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/share/vdsm/vm.py", line 3955, in diskReplicateFinish
    self._delDiskReplica(srcDrive)
  File "/usr/share/vdsm/vm.py", line 3826, in _delDiskReplica
    del device['diskReplicate']
KeyError: 'diskReplicate'
```

I don't think it's a regression; there's nothing new here that triggered this issue.

*** This bug has been marked as a duplicate of bug 999896 ***

Bug 999896 dealt with the 'blocker' part of this bug. What remains is to improve vdsm behaviour.

(In reply to vvyazmin from comment #0)
> Steps to Reproduce:
> Create NFS Data Center with two Storage Domain
> Create VM with one disk
> Start to install OS on this VM
> During installation OS, run Live Storage Migration (LSM)

During development of the 3.5 features we have done this multiple times and never hit the BZ. Moving to MODIFIED so QA can verify this for 3.6.0.

Bug verified on RHEVM 3.6.3-0.1:
KVM Version: 2.3.0-31.el7_2.7
LIBVIRT Version: libvirt-1.2.17-13.el7_2.3
VDSM Version: vdsm-4.17.19-0.el7ev

Performed live migration of a VM disk from one NFS storage domain to another during a RHEL 7.2 OS installation on the VM. The process passed; elapsed migration time was about 2 minutes.
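The KeyError in the traceback above happens because the engine's retried vmDiskReplicateFinish deletes a key that the first call has already removed. A minimal sketch of making that removal idempotent and thread-safe; the class, helper names, and device structure here are simplified stand-ins for illustration, not vdsm's actual code:

```python
import threading


class DriveReplicaState:
    """Tracks per-drive replication metadata, mirroring the
    'diskReplicate' key vdsm stores on the drive's device dict.
    Simplified illustration, not vdsm's real classes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._devices = {}  # drive name -> device dict

    def set_replica(self, name, replica_info):
        # Called when replication starts (diskReplicateStart).
        with self._lock:
            self._devices.setdefault(name, {})['diskReplicate'] = replica_info

    def del_replica(self, name):
        """Idempotent removal: a retried diskReplicateFinish arriving
        after the first call completed must not raise KeyError."""
        with self._lock:
            device = self._devices.get(name, {})
            # dict.pop with a default never raises; returns None
            # if the key was already removed by an earlier call.
            return device.pop('diskReplicate', None)


state = DriveReplicaState()
state.set_replica('vda', {'domainID': 'f7cb7d3b'})
print(state.del_replica('vda'))  # first call returns the replica info
print(state.del_replica('vda'))  # retried call is a harmless no-op (None)
```

With `del device['diskReplicate']` replaced by a guarded `pop`, the second vmDiskReplicateFinish can report "replication job not found" cleanly instead of crashing with an unexpected error.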
Messages from engine.log:

```
2016-02-24 13:08:49,326 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-6-thread-44) [3b3bbf2f] Lock Acquired to object 'EngineLock:{exclusiveLocks='null', sharedLocks='[ed07622e-ff5a-4a64-9adc-c60583af7400=<VM, ACTION_TYPE_FAILED_DISK_IS_BEING_MIGRATED$DiskName vm-os-1_Disk1>]'}'
.....
2016-02-24 13:08:50,529 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-44) [3b3bbf2f] Correlation ID: 3b3bbf2f, Job ID: 1718953b-7244-40e5-be0d-e2c7fe5fd52e, Call Stack: null, Custom Event ID: -1, Message: User admin@internal moving disk vm-os-1_Disk1 to domain NFS-SD-2.
.....
2016-02-24 13:10:19,202 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-6-thread-26) [3b3bbf2f] Ending command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' successfully.
...
2016-02-24 13:10:19,323 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-26) [3b3bbf2f] Correlation ID: 3b3bbf2f, Job ID: 1718953b-7244-40e5-be0d-e2c7fe5fd52e, Call Stack: null, Custom Event ID: -1, Message: User admin@internal finished moving disk vm-os-1_Disk1 to domain NFS-SD-2.
```

See details in the attached engine and vdsm log files. Bug can be closed.

Created attachment 1130182 [details]
3.6.3-0.1 engine log
Created attachment 1130183 [details]
3.6.3-0.1 vdsm log
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html
Created attachment 787791 [details]
Logs: rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
Failed Live Storage Migration on NFS Data Center

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS10 environment:
RHEVM: rhevm-3.3.0-0.15.master.el6ev.noarch
PythonSDK: rhevm-sdk-python-3.3.0.10-1.el6ev.noarch
VDSM: vdsm-4.12.0-61.git8178ec2.el6ev.x86_64
LIBVIRT: libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM: qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK: sanlock-2.8-1.el6.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Create an NFS Data Center with two Storage Domains
2. Create a VM with one disk
3. Start installing an OS on this VM
4. During the OS installation, run Live Storage Migration (LSM)

Actual results:
LSM fails.

Expected results:
LSM succeeds.

Impact on user: LSM fails.

Workaround: none

Additional info:
No problem was found on an FCP Data Center. No iptables rules were found.

/var/log/ovirt-engine/engine.log:

```
2013-08-18 14:37:09,422 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Command VmReplicateDiskFinishVDS execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2013-08-18 14:37:09,423 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] FINISH, VmReplicateDiskFinishVDSCommand, log id: 62a48070
2013-08-18 14:37:09,442 ERROR [org.ovirt.engine.core.bll.lsm.VmReplicateDiskFinishTaskHandler] (pool-5-thread-43) [35e5febd] Failed VmReplicateDiskFinish (Disk 12e58b89-25b1-465c-801f-b7ae61b21bed, VM 373c4014-d74d-4f6b-9750-729c6d3f847f)
2013-08-18 14:37:09,442 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-5-thread-43) [35e5febd] Command org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)
2013-08-18 14:37:09,447 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-5-thread-43) [35e5febd] Reverting task unknown, handler: org.ovirt.engine.core.bll.lsm.VmReplicateDiskStartTaskHandler
2013-08-18 14:37:09,448 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] START, VmReplicateDiskFinishVDSCommand(HostName = tigris01.scl.lab.tlv.redhat.com, HostId = 9576d8ca-4466-46e6-bebc-ccd922075ac6, vmId=373c4014-d74d-4f6b-9750-729c6d3f847f), log id: 22c8a416
2013-08-18 14:37:39,153 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Failed in VmReplicateDiskFinishVDS method
2013-08-18 14:37:39,154 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Error code replicaErr and error message VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error
2013-08-18 14:37:39,154 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Command org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand return value StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=55, mMessage=Drive replication error]]
2013-08-18 14:37:39,154 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] HostName = tigris01.scl.lab.tlv.redhat.com
2013-08-18 14:37:39,155 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Command VmReplicateDiskFinishVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error
2013-08-18 14:37:39,155 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] FINISH, VmReplicateDiskFinishVDSCommand, log id: 22c8a416
2013-08-18 14:37:39,161 INFO [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-43) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk completed, handling the result.
2013-08-18 14:37:39,161 INFO [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-43) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk hasnt succeeded, not clearing tasks, will attempt again next polling.
2013-08-18 14:37:39,348 INFO [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-11) BaseAsyncTask::OnTaskEndSuccess: Task 6b996cee-144b-4175-b975-7a062877c1b8 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
2013-08-18 14:37:39,348 INFO [org.ovirt.engine.core.bll.CommandAsyncTask] (DefaultQuartzScheduler_Worker-11) CommandAsyncTask::EndActionIfNecessary: All tasks of command 883610d9-a23c-4e10-a113-54e3beee9a09 has ended -> executing EndAction
2013-08-18 14:37:39,348 INFO [org.ovirt.engine.core.bll.CommandAsyncTask] (DefaultQuartzScheduler_Worker-11) CommandAsyncTask::EndAction: Ending action for 1 tasks (command ID: 883610d9-a23c-4e10-a113-54e3beee9a09): calling EndAction.
2013-08-18 14:37:39,349 INFO [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-43) CommandAsyncTask::EndCommandAction [within thread] context: Attempting to EndAction LiveMigrateDisk, executionIndex: 0
2013-08-18 14:37:39,413 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-5-thread-43) [35e5febd] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand
```

/var/log/vdsm/vdsm.log:

```
Thread-198282::ERROR::2013-08-18 12:40:27,936::BindingXMLRPC::993::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 377, in vmDiskReplicateFinish
    return vm.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/share/vdsm/API.py", line 536, in diskReplicateFinish
    return v.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/share/vdsm/vm.py", line 3955, in diskReplicateFinish
    self._delDiskReplica(srcDrive)
  File "/usr/share/vdsm/vm.py", line 3826, in _delDiskReplica
    del device['diskReplicate']
KeyError: 'diskReplicate'
```