+++ This bug was initially created as a clone of Bug #1481688 +++

Description of problem:
Storage migration is failing in VmReplicateDiskFinishVDSCommand with the error "Failed to change disk image". I saw this error mainly in our automation tests and can't reproduce it locally, but the problem was found in several tests, not only one.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170801132145.git3e12cf0.el7.centos.noarch
vdsm-4.20.2-60.git06231e5.el7.centos.x86_64

How reproducible:
~50%, maybe more

Steps to Reproduce:
1. Not really sure; just migrate disks between storage domains (cold or live migration, it doesn't matter).

Actual results:
In some cases the migration fails with the error "Failed to change disk image".

Expected results:
Migration should end successfully.

Additional info:

engine.log

2017-08-12 10:39:26,349+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler9) [disks_syncAction_764a9c1c-f5bf-4ca5] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=41, message=Failed to change disk image]]'
2017-08-12 10:39:27,546+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [disks_syncAction_764a9c1c-f5bf-4ca5] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command VmReplicateDiskFinishVDS failed: Drive replication error
2017-08-12 10:39:27,546+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler1) [disks_syncAction_764a9c1c-f5bf-4ca5] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{hostId='824bbda7-cfbb-400b-a021-068dc060832f', vmId='f13ca114-d739-43b7-93ca-a97c4a951bc5', storagePoolId='19791d8f-814e-4241-a028-e103c310060a', srcStorageDomainId='e8045448-f2e3-421b-b92e-34f79f2a687d', targetStorageDomainId='e8045448-f2e3-421b-b92e-34f79f2a687d', imageGroupId='6d61ec9f-5a0a-4840-af54-04dd27762ad8', imageId='5a1fcef4-81ff-42e3-83be-33e013245673'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error, code = 55

--- Additional comment from Lilach Zitnitski on 2017-08-15 09:00 EDT ---

engine and vdsm logs

--- Additional comment from Allon Mureinik on 2017-08-17 08:17:53 EDT ---

From VDSM's log:

2017-08-12 10:39:26,109+0300 ERROR (jsonrpc/4) [virt.vm] (vmId='f13ca114-d739-43b7-93ca-a97c4a951bc5') Unable to stop the replication for the drive: vda (vm:4295)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4292, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 586, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 743, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)

Daniel/Nir, can you take a look?

--- Additional comment from Daniel Erez on 2017-08-17 08:31:46 EDT ---

@Lilach - can you please attach qemu and libvirt logs as well?
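For context on the traceback above: diskReplicateFinish ends a storage migration by asking libvirt to abort or pivot the block copy job. Below is a minimal standalone sketch of that call using libvirt-python; the connection URI, domain name, and drive name are placeholders, and this is not VDSM's actual code.

# Minimal sketch of ending a block copy via libvirt-python; placeholders
# throughout, not VDSM's actual code.
import libvirt

conn = libvirt.open('qemu:///system')    # placeholder URI
dom = conn.lookupByName('example-vm')    # placeholder domain name
try:
    # With VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT the VM switches over to the
    # copy destination; without it the copy is simply cancelled.
    dom.blockJobAbort('vda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
except libvirt.libvirtError as e:
    # This is the failure surfaced above as "Unable to stop the
    # replication for the drive: vda".
    print('virDomainBlockJobAbort() failed: %s' % e)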
--- Additional comment from Lilach Zitnitski on 2017-08-20 07:08 EDT ---

Since the first logs were from the automation tests and I couldn't get the libvirt and qemu logs, I reproduced this bug. The new zip also contains new engine and vdsm logs along with the libvirt and qemu logs.

correlationID - disks_syncAction_81a244b3-0f7c-4eeb
error in new engine log at 17:39:15

--- Additional comment from Daniel Erez on 2017-08-20 14:19:39 EDT ---

According to the log[1], there's a KeyError on a missing 'protocol' key during the replication start phase. The cause seems to be 'diskType': 'network', i.e. the disk has a path to a file domain, but is marked as a network disk.

@Sahina - Could it be related to https://gerrit.ovirt.org/#/c/79745/ ?
@Lilach - Do you use GlusterFS for any of the tests?

[1] 2017-08-17 17:39:15,611+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='d3f3c5c5-a337-45f6-b30c-26e2fc739951') Unable to start replication for vda to {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeInfo': {'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volType': 'path'}, 'format': 'cow', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'propagateErrors': 'off', 'diskType': 'network', 'cache': 'none', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867', 'volumeChain': [{'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 116391936, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 114294784, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/b8bd4c59-ac36-4847-82ce-d09fc3552146', 'volumeID': 'b8bd4c59-ac36-4847-82ce-d09fc3552146', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 109051904, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/d5508161-5232-4453-b81d-4a65f1240d38', 'volumeID': 'd5508161-5232-4453-b81d-4a65f1240d38', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}]} (vm:4166)
KeyError: 'protocol'

--- Additional comment from Lilach Zitnitski on 2017-08-21 02:44:05 EDT ---

Daniel, the first logs are not from tests using GlusterFS; the new logs are. I will try to re-run the tests on iSCSI and NFS only.

--- Additional comment from Allon Mureinik on 2017-08-21 06:52:52 EDT ---

(In reply to Daniel Erez from comment #5)
> According to the log[1], there's a KeyError on a missing 'protocol' key
> during the replication start phase. The cause seems to be 'diskType':
> 'network', i.e. the disk has a path to a file domain, but is marked as a
> network disk.
>
> @Sahina - Could it be related to https://gerrit.ovirt.org/#/c/79745/ ?

Denis, wasn't there a caveat about LSM and libgfapi?

--- Additional comment from Denis Chaplygin on 2017-08-21 06:58:11 EDT ---

There was, but on a lower level - libvirt still doesn't support block copy to network disks. This one seems to be a different issue.
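To make the failure mode from comment 5 concrete, here is a hypothetical minimal repro; the dict contents are illustrative placeholders patterned on the log in [1], not VDSM's actual code.

# Hypothetical repro of the KeyError from comment 5; not VDSM code.
# The volume info describes a path-based volume on a file/block domain...
volinfo = {'path': '/rhev/data-center/mnt/blockSD/<sd-uuid>/images/<img>/<vol>',
           'volType': 'path'}
drive = {'device': 'disk', 'diskType': 'network'}

# ...but diskType 'network' routes the drive through the network-disk
# preparation path, which expects gluster-style info with a 'protocol' key:
if drive['diskType'] == 'network':
    drive['protocol'] = volinfo['protocol']   # raises KeyError: 'protocol'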
--- Additional comment from Daniel Erez on 2017-08-21 07:31:05 EDT ---

(In reply to Denis Chaplygin from comment #8)
> There was, but on a lower level - libvirt still doesn't support block copy
> to network disks. This one seems to be a different issue.

@Denis - so why isn't it disabled in the engine for now? Currently, passing diskType as network for an image domain throws a KeyError in vdsm (as explained in comment 5). I.e., I'm not sure I follow the change introduced in https://gerrit.ovirt.org/#/c/79745/4/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/lsm/LiveMigrateDiskCommand.java

--- Additional comment from Denis Chaplygin on 2017-08-21 08:49:32 EDT ---

I'm afraid that issue was introduced later and is described in [1]. Anyway, libgfapi is disabled by default everywhere except 4.2.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165

--- Additional comment from Daniel Erez on 2017-08-21 09:00:49 EDT ---

(In reply to Denis Chaplygin from comment #10)
> I'm afraid that issue was introduced later and is described in [1].
> Anyway, libgfapi is disabled by default everywhere except 4.2.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165

Thanks Denis!

@Francesco - did you test the storage live migration flow as part of bug 1481165?

--- Additional comment from Francesco Romani on 2017-08-21 09:27:23 EDT ---

(In reply to Daniel Erez from comment #11)
> @Francesco - did you test the storage live migration flow as part of bug
> 1481165?

I am testing the LSM flow in general. We have one more fix in this area: https://gerrit.ovirt.org/#/c/80831/5

--- Additional comment from Francesco Romani on 2017-08-21 13:09:34 EDT ---

(In reply to Francesco Romani from comment #12)
> I am testing the LSM flow in general. We have one more fix in this area:
> https://gerrit.ovirt.org/#/c/80831/5

To be clear, I didn't test LSM on glusterfs, only VM start/stop (IIUC it isn't supported, and https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c7 seems to confirm that).

Please note that from what I've seen in the logs here, the fix I'm working on [1] should not affect this bz.

[1] 80831 is about recovery; if Vdsm doesn't crash, there is no issue.

--- Additional comment from Daniel Erez on 2017-08-21 13:44:56 EDT ---

(In reply to Francesco Romani from comment #13)
> To be clear, I didn't test LSM on glusterfs, only VM start/stop (IIUC it
> isn't supported, and https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c7
> seems to confirm that).

@Lilach - so we have two separate issues:
* LSM on glusterfs - IIUC, not supported yet, so it can be removed from the tests.
* LSM on nfs - can you please attach full logs of the tests?

--- Additional comment from Nir Soffer on 2017-08-21 14:02:59 EDT ---

(In reply to Daniel Erez from comment #14)
> @Lilach - so we have two separate issues:
> * LSM on glusterfs - IIUC, not supported yet, so it can be removed from the
> tests.

LSM on glusterfs is supported.

> * LSM on nfs - can you please attach full logs of the tests?

--- Additional comment from Daniel Erez on 2017-08-21 14:28:54 EDT ---

(In reply to Nir Soffer from comment #15)
> LSM on glusterfs is supported.

@Nir/Denis - I'm not following then:
1. According to https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c8 I understand that we don't support LSM yet?
2. If we do support it, why do we get a KeyError in start replication? And why do we set diskType to network? [1]

[1] 2017-08-17 17:39:15,611+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='d3f3c5c5-a337-45f6-b30c-26e2fc739951') Unable to start replication for vda ... KeyError: 'protocol' (the same log entry quoted in full in comment 5)

> * LSM on nfs - can you please attach full logs of the tests?
--- Additional comment from Nir Soffer on 2017-08-21 14:42:47 EDT ---

Regarding comment 7 - the error is:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4149, in diskReplicateStart
    replica['path'] = self.cif.prepareVolumePath(replica)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 345, in prepareVolumePath
    volPath = self._prepare_network_drive(drive, res)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 632, in _prepare_network_drive
    drive['protocol'] = volinfo['protocol']
KeyError: 'protocol'

_prepare_network_drive() was called with a non-network drive; the error is expected when sending invalid data to vdsm.

The bad request was:

# local_vdsm.log
2017-08-17 17:39:11,428+0300 INFO (jsonrpc/1) [vdsm.api] START diskReplicateStart(srcDisk={'device': 'disk', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'domainID': '25893777-25bb-4085-ac34-f0e88d4aa10f', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, dstDisk={'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'diskType': 'network', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}) from=::ffff:10.35.69.27,55492, flow_id=disks_syncAction_81a244b3-0f7c-4eeb (api:46)

The destination storage domain is a block storage domain, based on:

2017-08-17 17:04:56,686+0300 INFO (monitor/00d54fa) [storage.LVM] Refreshing lvs: vg=00d54fa8-ac31-4190-addb-0085335e7272 lvs=['metadata', 'leases', 'ids', 'inbox', 'outbox', 'xleases', 'master'] (lvm:1294)

The root cause is in the engine, which sends a network disk for a block-based disk.

A network disk should never be sent as the LSM destination disk, even for a gluster disk - not in 4.1 and not in 4.2 - since libvirt does not yet support LSM to a network disk.

Denis, can you check this?

--- Additional comment from Nir Soffer on 2017-08-21 14:47:03 EDT ---

(In reply to Daniel Erez from comment #16)
> @Nir/Denis - I'm not following then:
> 1. According to https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c8
> I understand that we don't support LSM yet?

We do support LSM to gluster, but not using a network disk. If the source disk is a gluster disk using libgfapi (networkDisk=True, protocol="gluster"), LSM will convert the disk to a file-based gluster disk, degrading performance - but the operation will complete. Until libvirt is fixed to do blockCopy to a network disk, we don't have a choice.

> 2. If we do support it, why do we get a KeyError in start replication? And
> why do we set diskType to network?

This is an engine bug.

--- Additional comment from Lilach Zitnitski on 2017-08-22 02:39:06 EDT ---

(In reply to Daniel Erez from comment #14)
> @Lilach - so we have two separate issues:
> * LSM on glusterfs - IIUC, not supported yet, so it can be removed from the
> tests.
> * LSM on nfs - can you please attach full logs of the tests?

For now I have failed to reproduce these errors with NFS/iSCSI storage domains, only with GlusterFS on my local environment. The errors I saw on NFS/iSCSI were from the automation tests, so if I see them again on the same environment I'll attach new logs.
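To illustrate the distinction drawn in comment 18, here is a hypothetical sketch of the two drive shapes for the same gluster volume; the field names are patterned on the logs in this bug, the values are placeholders, and this is not actual engine/vdsm output.

# Illustrative placeholders only - two ways a gluster volume can be handed
# to the host side.

# libgfapi access (network disk): qemu speaks the gluster protocol directly;
# this is the shape that requires the 'protocol' key.
libgfapi_drive = {
    'diskType': 'network',
    'protocol': 'gluster',
    'path': 'data/<sd-uuid>/images/<img-uuid>/<vol-uuid>',
}

# FUSE access (file disk): a plain path on the gluster mount; no 'protocol'
# key is involved, and libvirt can blockCopy to it.
fuse_drive = {
    'diskType': 'file',
    'path': '/rhev/data-center/mnt/glusterSD/<server>:_data/<sd-uuid>/images/<img-uuid>/<vol-uuid>',
}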
--- Additional comment from Daniel Erez on 2017-08-22 03:04:49 EDT ---

(In reply to Nir Soffer from comment #17)
> _prepare_network_drive() was called with a non-network drive; the error is
> expected when sending invalid data to vdsm.
> [...]
> The root cause is in the engine, which sends a network disk for a
> block-based disk.
>
> A network disk should never be sent as the LSM destination disk, even for a
> gluster disk - not in 4.1 and not in 4.2 - since libvirt does not yet
> support LSM to a network disk.
>
> Denis, can you check this?

--- Additional comment from Daniel Erez on 2017-08-27 05:16:07 EDT ---

As per comment 17, diskType 'network' should not be sent for the LSM destination disk. Moving to the Gluster oVirt team, as the issue affects only Gluster and was introduced by https://gerrit.ovirt.org/#/c/79745/

--- Additional comment from Sahina Bose on 2017-08-30 01:57:24 EDT ---

If this breaks live storage migration when libgfapi access is not enabled, I think we should target this for 4.1.6. Denis?
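A minimal sketch of the kind of host-side guard comment 17 implies - a hypothetical helper, not the actual fix that was merged:

# Hypothetical validation helper, not the actual fix: fail fast when the
# engine sends a network-typed destination for drive replication, since
# libvirt cannot yet blockCopy to a network disk.
def validate_replica_dst(dst_disk):
    if dst_disk.get('diskType') == 'network':
        raise ValueError(
            "LSM destination must be file- or block-based, got diskType=%r"
            % dst_disk.get('diskType'))

# Run at the top of diskReplicateStart handling, this would turn the opaque
# KeyError: 'protocol' into a clear validation error back to the engine.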
Hi,

I have tested LSM with both the source and target domains on glusterfs with libgfapi enabled, and I see that it fails with the following errors in the engine and vdsm logs.

engine.log:
==================================

2017-08-31 01:50:00,390-04 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-28) [b9830c2e-c16a-485f-b91b-45058698f0aa] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity '3c1e7646-5a69-49cb-862f-2ee431fd2961'
2017-08-31 01:50:01,879-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] FINISH, GlusterServersListVDSCommand, return: [10.70.36.78/23:CONNECTED, 10.70.36.76:CONNECTED, 10.70.36.77:CONNECTED], log id: 62ba0d5f
2017-08-31 01:50:01,891-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] START, GlusterVolumesListVDSCommand(HostName = yarrow.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='11972b68-a720-4236-bde0-3c066ceff3b6'}), log id: 2ddaf087
2017-08-31 01:50:02,095-04 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler4) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateVmDisks' (id: '7a857239-50a8-412a-922d-09228065bcda') waiting on child command id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' type:'LiveMigrateDisk' to complete
2017-08-31 01:50:03,116-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] START, VmReplicateDiskStartVDSCommand(HostName = zod.lab.eng.blr.redhat.com, VmReplicateDiskParameters:{runAsync='true', hostId='6a90a343-10e4-4c5c-afc7-e5477594d2b6', vmId='b6da275e-649b-470d-ac32-89738faa3be5', storagePoolId='59a5429a-0236-034c-00d4-000000000250', srcStorageDomainId='d58a1d51-da25-454e-9edd-9754e419d668', targetStorageDomainId='1e67ccf5-8331-4e69-a0f6-c5ed58381217', imageGroupId='908c0355-ceb8-41c8-a644-5a551040d914', imageId='209455eb-4b82-417e-97ea-d0bab2fce721'}), log id: 59c08196
2017-08-31 01:50:03,147-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] FINISH, GlusterVolumesListVDSCommand, return: {9eae9600-7185-4fbb-a42c-bbe9c6eb86ea=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@9669622b, 0763bd54-ea1c-4707-8abb-3d80d62df20f=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@7aaadd95, 0dccff81-2ffa-4876-b4c3-15930fab2c65=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@844a93e1}, log id: 2ddaf087
2017-08-31 01:50:04,367-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed in 'VmReplicateDiskStartVDS' method
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM zod.lab.eng.blr.redhat.com command VmReplicateDiskStartVDS failed: Drive replication error
2017-08-31 01:50:04,377-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand' return value 'StatusOnlyReturn [status=Status [code=55, message=Drive replication error]]'
2017-08-31 01:50:04,377-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] HostName = zod.lab.eng.blr.redhat.com
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'VmReplicateDiskStartVDSCommand(HostName = zod.lab.eng.blr.redhat.com, VmReplicateDiskParameters:{runAsync='true', hostId='6a90a343-10e4-4c5c-afc7-e5477594d2b6', vmId='b6da275e-649b-470d-ac32-89738faa3be5', storagePoolId='59a5429a-0236-034c-00d4-000000000250', srcStorageDomainId='d58a1d51-da25-454e-9edd-9754e419d668', targetStorageDomainId='1e67ccf5-8331-4e69-a0f6-c5ed58381217', imageGroupId='908c0355-ceb8-41c8-a644-5a551040d914', imageId='209455eb-4b82-417e-97ea-d0bab2fce721'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskStartVDS, error = Drive replication error, code = 55
2017-08-31 01:50:04,377-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] FINISH, VmReplicateDiskStartVDSCommand, log id: 59c08196
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed VmReplicateDiskStart (Disk '908c0355-ceb8-41c8-a644-5a551040d914', VM 'b6da275e-649b-470d-ac32-89738faa3be5')
2017-08-31 01:50:04,378-04 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateDisk' id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' failed when attempting to perform the next operation, marking as FAILED '[3c1e7646-5a69-49cb-862f-2ee431fd2961]'
2017-08-31 01:50:04,378-04 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateDisk' id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' child commands '[3c1e7646-5a69-49cb-862f-2ee431fd2961]' executions were completed, status 'FAILED'
2017-08-31 01:50:05,400-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler7) [b9830c2e-c16a-485f-b91b-45058698f0aa] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2017-08-31 01:50:05,400-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler7) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed durin

vdsm.log file:
==================================

fce721', 'volumeChain': [{'domainID': u'1e67ccf5-8331-4e69-a0f6-c5ed58381217', 'leaseOffset': 0, 'path': u'data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721', 'volumeID': u'209455eb-4b82-417e-97ea-d0bab2fce721', 'leasePath': u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721.lease', 'imageID': u'908c0355-ceb8-41c8-a644-5a551040d914'}, {'domainID': u'1e67ccf5-8331-4e69-a0f6-c5ed58381217', 'leaseOffset': 0, 'path': u'data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/f2380cbb-1e16-452e-b597-2066c979cbc2', 'volumeID': u'f2380cbb-1e16-452e-b597-2066c979cbc2', 'leasePath': u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/f2380cbb-1e16-452e-b597-2066c979cbc2.lease', 'imageID': u'908c0355-ceb8-41c8-a644-5a551040d914'}]} (vm:3739)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3727, in diskReplicateStart
    if utils.isBlockDevice(replica['path']):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 96, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)
OSError: [Errno 2] No such file or directory: '/data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721'
2017-08-31 11:20:03,486+0530 INFO (jsonrpc/2) [vdsm.api] FINISH diskReplicateStart return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.70.34.124,42020, flow_id=b9830c2e-c16a-485f-b91b-45058698f0aa (api:52)
2017-08-31 11:20:03,487+0530 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateStart failed (error 55) in 0.37 seconds (__init__:539)
2017-08-31 11:20:03,710+0530 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:53150 (protocoldetector:72)
2017-08-31 11:20:03,719+0530 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:53150 (protocoldetector:127)

Tested migrating a disk from one storage domain to another using glusterfs with libgfapi enabled while the VM was offline, and it works fine.
Engine and vdsm logs can be found at the link below:
============================================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1484070/
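The OSError in the vdsm traceback above comes from stating the replica path directly. Below is a standalone sketch of the failing check, assuming only what the traceback shows; this is not vdsm's actual utils module.

# Standalone sketch of the check that fails in the traceback above; not
# vdsm's actual utils module.
import os
import stat

def is_block_device(path):
    # os.stat() raises OSError (ENOENT) when the path does not exist on
    # the local filesystem.
    return stat.S_ISBLK(os.stat(path).st_mode)

# A libgfapi-style path such as
#   data/<sd-uuid>/images/<img-uuid>/<vol-uuid>
# names a volume inside a gluster volume, not a local file, so the stat
# fails with "No such file or directory", exactly as logged above.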
Sas, can you confirm whether live storage migration fails when gfapi access is not turned on? If so, we will need to target this for RHHI 1.1; if not, we can move it to the next release.
Hi Sahina,

When gfapi access is not turned on, i.e. with FUSE access, live storage migration works as expected. Clearing the needinfo on sas.

Thanks,
kasturi
Sahina, the dependent bug is CLOSED. We should also close this bug if the issue is resolved.
Moving to ON_QA, as the issue has been resolved.
Sahina,

RHHI 2.0 doesn't support the gfapi access mechanism. The availability of this bug in the errata would make users believe that a gfapi-related fix has been made.

Should we defer this bug from RHHI 2.0 and remove it from the errata?
(In reply to SATHEESARAN from comment #9)
> Should we defer this bug from RHHI 2.0 and remove it from the errata?

Yes, we can, though the dependent bug in oVirt is closed. I'll move this to 2.0.z when we qualify with gfapi access.
(In reply to Sahina Bose from comment #10)
> Yes, we can, though the dependent bug in oVirt is closed. I'll move this to
> 2.0.z when we qualify with gfapi access.

There is no near-term goal for RHHI to make use of gfapi. I'm closing this bug.