Bug 998239 - [SCALE][Live Storage Migration] [NFS] Failed Live Storage Migration on NFS Data Center under heavy load
Summary: [SCALE][Live Storage Migration] [NFS] Failed Live Storage Migration on NFS Data Center under heavy load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Daniel Erez
QA Contact: Eldad Marciano
URL:
Whiteboard:
Depends On: 999896 1040476
Blocks:
 
Reported: 2013-08-18 14:59 UTC by vvyazmin@redhat.com
Modified: 2016-03-09 19:18 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-09 19:18:34 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:
scohen: Triaged+


Attachments
Logs rhevm, vdsm, libvirt, thread dump, superVdsm (5.36 MB, application/x-gzip) - 2013-08-18 14:59 UTC, vvyazmin@redhat.com
3.6.3-0.1 engine log (32.97 KB, application/x-gzip) - 2016-02-24 11:24 UTC, Yuri Obshansky
3.6.3-0.1 vdsm log (353.90 KB, application/x-gzip) - 2016-02-24 11:25 UTC, Yuri Obshansky


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2016:0362 0 normal SHIPPED_LIVE vdsm 3.6.0 bug fix and enhancement update 2016-03-09 23:49:32 UTC

Description vvyazmin@redhat.com 2013-08-18 14:59:26 UTC
Created attachment 787791 [details]
Logs rhevm, vdsm, libvirt, thread dump, superVdsm

Description of problem:
Failed Live Storage Migration on NFS Data Center

Version-Release number of selected component (if applicable):
RHEVM 3.3 - IS10 environment:

RHEVM:  rhevm-3.3.0-0.15.master.el6ev.noarch
PythonSDK:  rhevm-sdk-python-3.3.0.10-1.el6ev.noarch
VDSM:  vdsm-4.12.0-61.git8178ec2.el6ev.x86_64
LIBVIRT:  libvirt-0.10.2-18.el6_4.9.x86_64
QEMU & KVM:  qemu-kvm-rhev-0.12.1.2-2.355.el6_4.5.x86_64
SANLOCK:  sanlock-2.8-1.el6.x86_64

How reproducible:
100 %

Steps to Reproduce:
1. Create an NFS Data Center with two Storage Domains
2. Create a VM with one disk
3. Start installing an OS on the VM
4. During the OS installation, run Live Storage Migration (LSM)


Actual results:
LSM fails.

Expected results:
LSM succeeds.

Impact on user:
LSM fails.

Workaround:
none

Additional info:

No problems were found on an FCP Data Center.
No iptables rules were found.

/var/log/ovirt-engine/engine.log
2013-08-18 14:37:09,422 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Command VmReplicateDiskFinishVDS execution failed. Exception: VDSNetworkException: java.util.concurrent.TimeoutException
2013-08-18 14:37:09,423 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] FINISH, VmReplicateDiskFinishVDSCommand, log id: 62a48070
2013-08-18 14:37:09,442 ERROR [org.ovirt.engine.core.bll.lsm.VmReplicateDiskFinishTaskHandler] (pool-5-thread-43) [35e5febd] Failed VmReplicateDiskFinish (Disk 12e58b89-25b1-465c-801f-b7ae61b21bed , VM 373c4014-d74d-4f6b-9750-729c6d3f847f)
2013-08-18 14:37:09,442 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-5-thread-43) [35e5febd] Command org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand throw Vdc Bll exception. With error message VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.util.concurrent.TimeoutException (Failed with error VDS_NETWORK_ERROR and code 5022)
2013-08-18 14:37:09,447 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-5-thread-43) [35e5febd] Reverting task unknown, handler: org.ovirt.engine.core.bll.lsm.VmReplicateDiskStartTaskHandler
2013-08-18 14:37:09,448 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] START, VmReplicateDiskFinishVDSCommand(HostName = tigris01.scl.lab.tlv.redhat.com, HostId = 9576d8ca-4466-46e6-bebc-ccd922075ac6, vmId=373c4014-d74d-4f6b-9750-729c6d3f847f), log id: 22c8a416
2013-08-18 14:37:39,153 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Failed in VmReplicateDiskFinishVDS method
2013-08-18 14:37:39,154 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Error code replicaErr and error message VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error
2013-08-18 14:37:39,154 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Command org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand return value 
 StatusOnlyReturnForXmlRpc [mStatus=StatusForXmlRpc [mCode=55, mMessage=Drive replication error]]
2013-08-18 14:37:39,154 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] HostName = tigris01.scl.lab.tlv.redhat.com
2013-08-18 14:37:39,155 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] Command VmReplicateDiskFinishVDS execution failed. Exception: VDSErrorException: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error
2013-08-18 14:37:39,155 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (pool-5-thread-43) [35e5febd] FINISH, VmReplicateDiskFinishVDSCommand, log id: 22c8a416
2013-08-18 14:37:39,161 INFO  [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-43) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk completed, handling the result.
2013-08-18 14:37:39,161 INFO  [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-43) CommandAsyncTask::HandleEndActionResult [within thread]: EndAction for action type LiveMigrateDisk hasnt succeeded, not clearing tasks, will attempt again next polling.
2013-08-18 14:37:39,348 INFO  [org.ovirt.engine.core.bll.SPMAsyncTask] (DefaultQuartzScheduler_Worker-11) BaseAsyncTask::OnTaskEndSuccess: Task 6b996cee-144b-4175-b975-7a062877c1b8 (Parent Command LiveMigrateDisk, Parameters Type org.ovirt.engine.core.common.asynctasks.AsyncTaskParameters) ended successfully.
2013-08-18 14:37:39,348 INFO  [org.ovirt.engine.core.bll.CommandAsyncTask] (DefaultQuartzScheduler_Worker-11) CommandAsyncTask::EndActionIfNecessary: All tasks of command 883610d9-a23c-4e10-a113-54e3beee9a09 has ended -> executing EndAction
2013-08-18 14:37:39,348 INFO  [org.ovirt.engine.core.bll.CommandAsyncTask] (DefaultQuartzScheduler_Worker-11) CommandAsyncTask::EndAction: Ending action for 1 tasks (command ID: 883610d9-a23c-4e10-a113-54e3beee9a09): calling EndAction .
2013-08-18 14:37:39,349 INFO  [org.ovirt.engine.core.bll.CommandAsyncTask] (pool-5-thread-43) CommandAsyncTask::EndCommandAction [within thread] context: Attempting to EndAction LiveMigrateDisk, executionIndex: 0
2013-08-18 14:37:39,413 ERROR [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-5-thread-43) [35e5febd] Ending command with failure: org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand



/var/log/vdsm/vdsm.log

Thread-198282::ERROR::2013-08-18 12:40:27,936::BindingXMLRPC::993::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 377, in vmDiskReplicateFinish
    return vm.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/share/vdsm/API.py", line 536, in diskReplicateFinish
    return v.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/share/vdsm/vm.py", line 3955, in diskReplicateFinish
    self._delDiskReplica(srcDrive)
  File "/usr/share/vdsm/vm.py", line 3826, in _delDiskReplica
    del device['diskReplicate']
KeyError: 'diskReplicate'

Comment 2 Federico Simoncelli 2013-08-20 09:20:34 UTC
Given the heavy load on the VM disk (the OS was being installed), flushing the data to the (async) replica (destination) took longer than the engine timeout (3 minutes).
The engine then sent diskReplicateFinish a second time, triggering a race on the 'diskReplicate' key removal.


Thread-198282::DEBUG::2013-08-18 12:34:11,303::BindingXMLRPC::974::vds::(wrapper) client [10.35.161.29]::call vmDiskReplicateFinish with ('373c4014-d74d-4f6b-9750-729c6d3f847f', {'device': 'disk', 'domainID': 'e130e4dd-97ae-48e1-9210-136ac204d573', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'}, {'device': 'disk', 'domainID': 'f7cb7d3b-ac55-44eb-9afb-6c2439c521ab', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'}) {} flowID [35e5febd]

Thread-198756::DEBUG::2013-08-18 12:37:11,332::BindingXMLRPC::974::vds::(wrapper) client [10.35.161.29]::call vmDiskReplicateFinish with ('373c4014-d74d-4f6b-9750-729c6d3f847f', {'device': 'disk', 'domainID': 'e130e4dd-97ae-48e1-9210-136ac204d573', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'}, {'device': 'disk', 'domainID': 'e130e4dd-97ae-48e1-9210-136ac204d573', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'}) {} flowID [35e5febd]

Thread-198756::ERROR::2013-08-18 12:37:41,001::vm::3897::vm.Vm::(diskReplicateFinish) vmId=`373c4014-d74d-4f6b-9750-729c6d3f847f`::Replication job not found for disk vda ({'device': 'disk', 'domainID': 'e130e4dd-97ae-48e1-9210-136ac204d573', 'volumeID': 'c1d0988d-dd2f-4614-a9d7-ebfc854f7a3b', 'poolID': '18a4a18a-6b40-473c-a490-8096340c1054', 'imageID': '12e58b89-25b1-465c-801f-b7ae61b21bed'})

Thread-198756::DEBUG::2013-08-18 12:37:41,029::BindingXMLRPC::981::vds::(wrapper) return vmDiskReplicateFinish with {'status': {'message': 'Drive replication error', 'code': 55}}

Thread-198282::ERROR::2013-08-18 12:40:27,936::BindingXMLRPC::993::vds::(wrapper) unexpected error
Traceback (most recent call last):
  File "/usr/share/vdsm/BindingXMLRPC.py", line 979, in wrapper
    res = f(*args, **kwargs)
  File "/usr/share/vdsm/BindingXMLRPC.py", line 377, in vmDiskReplicateFinish
    return vm.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/share/vdsm/API.py", line 536, in diskReplicateFinish
    return v.diskReplicateFinish(srcDisk, dstDisk)
  File "/usr/share/vdsm/vm.py", line 3955, in diskReplicateFinish
    self._delDiskReplica(srcDrive)
  File "/usr/share/vdsm/vm.py", line 3826, in _delDiskReplica
    del device['diskReplicate']
KeyError: 'diskReplicate'


I don't think it's a regression, there's nothing new here that triggered this issue.
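
The race above boils down to a non-idempotent `del device['diskReplicate']`: the second (retried) diskReplicateFinish finds the key already gone and raises KeyError. A minimal sketch of an idempotent removal, assuming a hypothetical helper (this is an illustration, not the actual vdsm patch):

```python
def del_disk_replica(device):
    """Remove the replication marker from a drive's device dict.

    dict.pop() with a default keeps the call idempotent: a second
    diskReplicateFinish (e.g. an engine retry after a timeout) no
    longer raises KeyError when the key was already removed.
    """
    replica = device.pop('diskReplicate', None)
    # True only for the call that actually removed the key
    return replica is not None


# An engine retry after the first call succeeded is now harmless:
device = {'name': 'vda', 'diskReplicate': {'domainID': 'dst-domain'}}
assert del_disk_replica(device) is True   # first call removes the key
assert del_disk_replica(device) is False  # retry is a no-op, no KeyError
```

The caller can use the boolean to distinguish the first completion from a stale retry and report "replication job not found" instead of crashing the XML-RPC wrapper.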

Comment 4 Allon Mureinik 2013-09-01 08:35:03 UTC

*** This bug has been marked as a duplicate of bug 999896 ***

Comment 7 Ayal Baron 2013-09-13 09:11:53 UTC
Bug 999896 dealt with the 'blocker' part of this bug.
What remains is to improve vdsm behaviour

Comment 10 Allon Mureinik 2015-03-16 15:57:49 UTC
(In reply to vvyazmin from comment #0)
> Steps to Reproduce:
> Create NFS Data Center with two Storage Domain
> Create VM with one disk
> Start to install OS on this VM
> During installation OS, run Live Storage Migration (LSM)
During development of 3.5 features we've done this multiple times and never hit the BZ.

Moving to MODIFIED so QA can verify this for 3.6.0.

Comment 12 Yuri Obshansky 2016-02-24 11:24:15 UTC
Bug verified on 
RHEVM: 3.6.3-0.1
KVM Version: 2.3.0-31.el7_2.7
LIBVIRT Version: libvirt-1.2.17-13.el7_2.3
VDSM Version: vdsm-4.17.19-0.el7ev

Performed VM disk live migration from one NFS storage domain to another
during RHEL 7.2 OS installation on the VM.
Process passed. Elapsed time for migration: about 2 minutes.

Messages from engine.log:
2016-02-24 13:08:49,326 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-6-thread-44) [3b3bbf2f] Lock Acquired to object 'EngineLock:{exclusiveLocks='null', sharedLocks='[ed07622e-ff5a-4a64-9adc-c60583af7400=<VM, ACTION_TYPE_FAILED_DISK_IS_BEING_MIGRATED$DiskName vm-os-1_Disk1>]'}'
.....
2016-02-24 13:08:50,529 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-44) [3b3bbf2f] Correlation ID: 3b3bbf2f, Job ID: 1718953b-7244-40e5-be0d-e2c7fe5fd52e, Call Stack: null, Custom Event ID: -1, Message: User admin@internal moving disk vm-os-1_Disk1 to domain NFS-SD-2.
.....
2016-02-24 13:10:19,202 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (org.ovirt.thread.pool-6-thread-26) [3b3bbf2f] Ending command 'org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand' successfully.
...
2016-02-24 13:10:19,323 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-26) [3b3bbf2f] Correlation ID: 3b3bbf2f, Job ID: 1718953b-7244-40e5-be0d-e2c7fe5fd52e, Call Stack: null, Custom Event ID: -1, Message: User admin@internal finished moving disk vm-os-1_Disk1 to domain NFS-SD-2.

See details in the attached engine and vdsm log files.

Bug can be closed.

Comment 13 Yuri Obshansky 2016-02-24 11:24:46 UTC
Created attachment 1130182 [details]
3.6.3-0.1 engine log

Comment 14 Yuri Obshansky 2016-02-24 11:25:09 UTC
Created attachment 1130183 [details]
3.6.3-0.1 vdsm log

Comment 16 errata-xmlrpc 2016-03-09 19:18:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0362.html

