Description of problem:
Storage migration is failing in VmReplicateDiskFinishVDSCommand with the error "Failed to change disk image". I saw this error mainly in our automation tests and I can't reproduce it locally, but the problem was found in a few tests, not only one.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170801132145.git3e12cf0.el7.centos.noarch
vdsm-4.20.2-60.git06231e5.el7.centos.x86_64

How reproducible:
~50%, maybe more

Steps to Reproduce:
1. Not really sure; just migrate disks (cold or live migration, it doesn't matter) between storage domains.

Actual results:
In some cases the migration fails with the error "Failed to change disk image".

Expected results:
The migration should end successfully.

Additional info:
engine.log

2017-08-12 10:39:26,349+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler9) [disks_syncAction_764a9c1c-f5bf-4ca5] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=41, message=Failed to change disk image]]'
2017-08-12 10:39:27,546+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [disks_syncAction_764a9c1c-f5bf-4ca5] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command VmReplicateDiskFinishVDS failed: Drive replication error
2017-08-12 10:39:27,546+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler1) [disks_syncAction_764a9c1c-f5bf-4ca5] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{hostId='824bbda7-cfbb-400b-a021-068dc060832f', vmId='f13ca114-d739-43b7-93ca-a97c4a951bc5', storagePoolId='19791d8f-814e-4241-a028-e103c310060a', srcStorageDomainId='e8045448-f2e3-421b-b92e-34f79f2a687d', targetStorageDomainId='e8045448-f2e3-421b-b92e-34f79f2a687d', imageGroupId='6d61ec9f-5a0a-4840-af54-04dd27762ad8', imageId='5a1fcef4-81ff-42e3-83be-33e013245673'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error, code = 55
Created attachment 1313680 [details]
logs

engine and vdsm logs
From VDSM's log:

2017-08-12 10:39:26,109+0300 ERROR (jsonrpc/4) [virt.vm] (vmId='f13ca114-d739-43b7-93ca-a97c4a951bc5') Unable to stop the replication for the drive: vda (vm:4295)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4292, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 586, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 743, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)

Daniel/Nir, can you take a look?
@Lilach - can you please attach qemu and libvirt logs as well?
Created attachment 1315939 [details]
new run logs

Since the first logs were from the automation tests and I couldn't get the libvirt and qemu logs, I reproduced this bug. The new zip also contains new engine and vdsm logs, together with the libvirt and qemu logs.

correlationID - disks_syncAction_81a244b3-0f7c-4eeb
The error is in the new engine log at 17:39:15.
According to the log [1], there's a KeyError on a missing 'protocol' key during the replication start phase. The cause seems to be 'diskType': 'network', i.e. the disk has a path to a file domain, but is marked as a network disk.

@Sahina - Could it be related to https://gerrit.ovirt.org/#/c/79745/ ?

@Lilach - Do you use GlusterFS for any of the tests?

[1] 2017-08-17 17:39:15,611+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='d3f3c5c5-a337-45f6-b30c-26e2fc739951') Unable to start replication for vda to {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeInfo': {'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volType': 'path'}, 'format': 'cow', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'propagateErrors': 'off', 'diskType': 'network', 'cache': 'none', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867', 'volumeChain': [{'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 116391936, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 114294784, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/b8bd4c59-ac36-4847-82ce-d09fc3552146', 'volumeID': 'b8bd4c59-ac36-4847-82ce-d09fc3552146', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 109051904, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/d5508161-5232-4453-b81d-4a65f1240d38', 'volumeID': 'd5508161-5232-4453-b81d-4a65f1240d38', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}]} (vm:4166)
KeyError: 'protocol'
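To illustrate the failure mode - a minimal Python sketch using the function names from the traceback below (this is not the verbatim vdsm code, and the dispatch condition is an assumption):

def _prepare_network_drive(drive, volinfo):
    # A network (libgfapi) drive must carry a 'protocol' key such as
    # 'gluster'. A volume prepared from a file/block domain only
    # returns {'path': ..., 'volType': 'path'}, so this raises
    # KeyError: 'protocol' - exactly the error in the log above.
    drive['protocol'] = volinfo['protocol']
    return drive

def prepare_volume_path(drive, volinfo):
    if drive.get('diskType') == 'network':
        return _prepare_network_drive(drive, volinfo)
    return volinfo['path']

# Reproducing the bad combination: block-domain volume info tagged as
# a network disk (the ids below are placeholders, not real values).
replica = {'diskType': 'network', 'device': 'disk'}
volinfo = {'volType': 'path',
           'path': '/rhev/data-center/mnt/blockSD/<sd_id>/images/<img_id>/<vol_id>'}
prepare_volume_path(replica, volinfo)  # raises KeyError: 'protocol'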
Daniel, the first logs are not from tests using GlusterFS; the new logs are. I will try to re-run the tests on iscsi and nfs only.
(In reply to Daniel Erez from comment #5)
> According to the log [1], there's a KeyError on a missing 'protocol' key
> during the replication start phase. The cause seems to be 'diskType':
> 'network', i.e. the disk has a path to a file domain, but is marked as a
> network disk.
>
> @Sahina - Could it be related to https://gerrit.ovirt.org/#/c/79745/

Denis, wasn't there a caveat about LSM and libgfapi?
There was, but on a lower level - libvirt still doesn't support block copy to network disks. This one seems to be a different issue.
(In reply to Denis Chaplygin from comment #8)
> There was, but on a lower level - libvirt still doesn't support block copy
> to network disks. This one seems to be a different issue.

@Denis - so why isn't it disabled in the engine for now? Currently, passing diskType as network for an image domain throws a KeyError in vdsm (as explained in comment 5). I.e. I'm not sure I'm following the change introduced in https://gerrit.ovirt.org/#/c/79745/4/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/lsm/LiveMigrateDiskCommand.java
I'm afraid that issue was introduced later and is described in [1].

Anyway, libgfapi is disabled by default everywhere except 4.2.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165
(In reply to Denis Chaplygin from comment #10)
> I'm afraid that issue was introduced later and is described in [1].
>
> Anyway, libgfapi is disabled by default everywhere except 4.2.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165

Thanks Denis!

@Francesco - did you test the storage live migration flow as part of bug 1481165?
(In reply to Daniel Erez from comment #11)
> Thanks Denis!
>
> @Francesco - did you test the storage live migration flow as part of bug
> 1481165?

I am testing the LSM flow in general. We have one more fix in this area: https://gerrit.ovirt.org/#/c/80831/5
(In reply to Francesco Romani from comment #12)
> I am testing the LSM flow in general. We have one more fix in this area:
> https://gerrit.ovirt.org/#/c/80831/5

To be clear, I didn't test LSM on glusterfs, only VM start/stop (IIUC it isn't supported, and https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c7 seems to confirm that).

Please note that from what I've seen here in the logs, the fix I'm working on [1] should not affect this bz.

[1] 80831 is about recovery; if Vdsm doesn't crash, there is no issue.
(In reply to Francesco Romani from comment #13)
> To be clear, I didn't test LSM on glusterfs, only VM start/stop (IIUC it
> isn't supported, and https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c7
> seems to confirm that).
>
> Please note that from what I've seen here in the logs, the fix I'm working
> on [1] should not affect this bz.
>
> [1] 80831 is about recovery; if Vdsm doesn't crash, there is no issue.

@Lilach - so we have two separate issues:
* LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.
* LSM on nfs - can you please attach full logs of the tests?
(In reply to Daniel Erez from comment #14)
> @Lilach - so we have two separate issues:
> * LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.

LSM on glusterfs is supported.

> * LSM on nfs - can you please attach full logs of the tests?
(In reply to Nir Soffer from comment #15)
> LSM on glusterfs is supported.

@Nir/Denis - I'm not following then:
1. According to https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c8 I understand that we don't support LSM yet?
2. If we do support it, why do we get a KeyError on start replication? And why do we set diskType to network? [1]

[1] The same log excerpt quoted in comment 5, ending with:
KeyError: 'protocol'

> * LSM on nfs - can you please attach full logs of the tests?
Regarding comment 7 - the error is:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4149, in diskReplicateStart
    replica['path'] = self.cif.prepareVolumePath(replica)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 345, in prepareVolumePath
    volPath = self._prepare_network_drive(drive, res)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 632, in _prepare_network_drive
    drive['protocol'] = volinfo['protocol']
KeyError: 'protocol'

_prepare_network_drive() was called with a non-network drive; this error is expected when sending invalid data to vdsm.

The bad request was:

# local_vdsm.log
2017-08-17 17:39:11,428+0300 INFO (jsonrpc/1) [vdsm.api] START diskReplicateStart(srcDisk={'device': 'disk', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'domainID': '25893777-25bb-4085-ac34-f0e88d4aa10f', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, dstDisk={'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'diskType': 'network', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}) from=::ffff:10.35.69.27,55492, flow_id=disks_syncAction_81a244b3-0f7c-4eeb (api:46)

The destination storage domain is a block storage domain, based on:

2017-08-17 17:04:56,686+0300 INFO (monitor/00d54fa) [storage.LVM] Refreshing lvs: vg=00d54fa8-ac31-4190-addb-0085335e7272 lvs=['metadata', 'leases', 'ids', 'inbox', 'outbox', 'xleases', 'master'] (lvm:1294)

The root cause is the engine, sending a network disk for a block-based disk.

A network disk should never be sent as the LSM destination disk, even for a gluster disk - not in 4.1 and not in 4.2 - since libvirt does not yet support LSM to a network disk.

Denis, can you check this?
(In reply to Daniel Erez from comment #16)
> @Nir/Denis - I'm not following then:
> 1. According to https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c8 I
> understand that we don't support LSM yet?

We do support LSM to gluster, but not using a network disk.

If the source disk is a gluster disk using libgfapi (networkDisk=True, protocol="gluster"), LSM will convert the disk to a file-based gluster disk, degrading performance - but the operation will complete. Until libvirt is fixed to do blockCopy to a network disk, we don't have a choice.

> 2. If we do support it, why do we get a KeyError on start replication? And
> why do we set diskType to network?

This is an engine bug.
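To make the rule explicit - an illustrative sketch of the decision described above (Python pseudo-logic only; the real fix belongs in the engine's LiveMigrateDiskCommand, which is Java, and the function and parameter names here are assumptions):

def lsm_destination_disk_type(dest_domain_is_block):
    # Until libvirt supports blockCopy to network disks
    # (https://bugzilla.redhat.com/show_bug.cgi?id=1306562), the LSM
    # destination must always be addressed by path, even when the
    # source disk is accessed via libgfapi (diskType='network').
    return 'block' if dest_domain_is_block else 'file'

# The destination access type depends only on the destination domain,
# never on whether libgfapi is enabled for the source:
assert lsm_destination_disk_type(dest_domain_is_block=True) == 'block'
assert lsm_destination_disk_type(dest_domain_is_block=False) == 'file'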
(In reply to Daniel Erez from comment #14)
> @Lilach - so we have two separate issues:
> * LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.
> * LSM on nfs - can you please attach full logs of the tests?

For now I have failed to reproduce these errors with NFS/iSCSI storage domains on my local environment, only with GlusterFS. The errors I saw on NFS/iSCSI were from the automation tests, so if I see them again on the same environment I'll attach new logs.
As per comment 17, diskType 'network' should not be sent for the LSM destination disk. Moving to the Gluster oVirt team, as the issue affects only Gluster and was introduced by https://gerrit.ovirt.org/#/c/79745/
If this breaks live storage migration when libgfapi access is not enabled, I think we should target this for 4.1.6. Denis?
Hi,

I have tested LSM with both the source and target domains on glusterfs with libgfapi enabled, and I see that it fails with the following errors in the engine and vdsm logs.

engine.log:
==================================
2017-08-31 01:50:00,390-04 INFO [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-28) [b9830c2e-c16a-485f-b91b-45058698f0aa] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity '3c1e7646-5a69-49cb-862f-2ee431fd2961'
2017-08-31 01:50:01,879-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] FINISH, GlusterServersListVDSCommand, return: [10.70.36.78/23:CONNECTED, 10.70.36.76:CONNECTED, 10.70.36.77:CONNECTED], log id: 62ba0d5f
2017-08-31 01:50:01,891-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] START, GlusterVolumesListVDSCommand(HostName = yarrow.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='11972b68-a720-4236-bde0-3c066ceff3b6'}), log id: 2ddaf087
2017-08-31 01:50:02,095-04 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler4) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateVmDisks' (id: '7a857239-50a8-412a-922d-09228065bcda') waiting on child command id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' type:'LiveMigrateDisk' to complete
2017-08-31 01:50:03,116-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] START, VmReplicateDiskStartVDSCommand(HostName = zod.lab.eng.blr.redhat.com, VmReplicateDiskParameters:{runAsync='true', hostId='6a90a343-10e4-4c5c-afc7-e5477594d2b6', vmId='b6da275e-649b-470d-ac32-89738faa3be5', storagePoolId='59a5429a-0236-034c-00d4-000000000250', srcStorageDomainId='d58a1d51-da25-454e-9edd-9754e419d668', targetStorageDomainId='1e67ccf5-8331-4e69-a0f6-c5ed58381217', imageGroupId='908c0355-ceb8-41c8-a644-5a551040d914', imageId='209455eb-4b82-417e-97ea-d0bab2fce721'}), log id: 59c08196
2017-08-31 01:50:03,147-04 INFO [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] FINISH, GlusterVolumesListVDSCommand, return: {9eae9600-7185-4fbb-a42c-bbe9c6eb86ea=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@9669622b, 0763bd54-ea1c-4707-8abb-3d80d62df20f=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@7aaadd95, 0dccff81-2ffa-4876-b4c3-15930fab2c65=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@844a93e1}, log id: 2ddaf087
2017-08-31 01:50:04,367-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed in 'VmReplicateDiskStartVDS' method
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM zod.lab.eng.blr.redhat.com command VmReplicateDiskStartVDS failed: Drive replication error
2017-08-31 01:50:04,377-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand' return value 'StatusOnlyReturn [status=Status [code=55, message=Drive replication error]]'
2017-08-31 01:50:04,377-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] HostName = zod.lab.eng.blr.redhat.com
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'VmReplicateDiskStartVDSCommand(HostName = zod.lab.eng.blr.redhat.com, VmReplicateDiskParameters:{runAsync='true', hostId='6a90a343-10e4-4c5c-afc7-e5477594d2b6', vmId='b6da275e-649b-470d-ac32-89738faa3be5', storagePoolId='59a5429a-0236-034c-00d4-000000000250', srcStorageDomainId='d58a1d51-da25-454e-9edd-9754e419d668', targetStorageDomainId='1e67ccf5-8331-4e69-a0f6-c5ed58381217', imageGroupId='908c0355-ceb8-41c8-a644-5a551040d914', imageId='209455eb-4b82-417e-97ea-d0bab2fce721'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskStartVDS, error = Drive replication error, code = 55
2017-08-31 01:50:04,377-04 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] FINISH, VmReplicateDiskStartVDSCommand, log id: 59c08196
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed VmReplicateDiskStart (Disk '908c0355-ceb8-41c8-a644-5a551040d914', VM 'b6da275e-649b-470d-ac32-89738faa3be5')
2017-08-31 01:50:04,378-04 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateDisk' id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' failed when attempting to perform the next operation, marking as FAILED '[3c1e7646-5a69-49cb-862f-2ee431fd2961]'
2017-08-31 01:50:04,378-04 INFO [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateDisk' id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' child commands '[3c1e7646-5a69-49cb-862f-2ee431fd2961]' executions were completed, status 'FAILED'
2017-08-31 01:50:05,400-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler7) [b9830c2e-c16a-485f-b91b-45058698f0aa] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2017-08-31 01:50:05,400-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler7) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed durin

vdsm.log file:
==================================================
fce721', 'volumeChain': [{'domainID': u'1e67ccf5-8331-4e69-a0f6-c5ed58381217', 'leaseOffset': 0, 'path': u'data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721', 'volumeID': u'209455eb-4b82-417e-97ea-d0bab2fce721', 'leasePath': u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721.lease', 'imageID': u'908c0355-ceb8-41c8-a644-5a551040d914'}, {'domainID': u'1e67ccf5-8331-4e69-a0f6-c5ed58381217', 'leaseOffset': 0, 'path': u'data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/f2380cbb-1e16-452e-b597-2066c979cbc2', 'volumeID': u'f2380cbb-1e16-452e-b597-2066c979cbc2', 'leasePath': u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/f2380cbb-1e16-452e-b597-2066c979cbc2.lease', 'imageID': u'908c0355-ceb8-41c8-a644-5a551040d914'}]} (vm:3739)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3727, in diskReplicateStart
    if utils.isBlockDevice(replica['path']):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 96, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)
OSError: [Errno 2] No such file or directory: '/data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721'
2017-08-31 11:20:03,486+0530 INFO (jsonrpc/2) [vdsm.api] FINISH diskReplicateStart return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.70.34.124,42020, flow_id=b9830c2e-c16a-485f-b91b-45058698f0aa (api:52)
2017-08-31 11:20:03,487+0530 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateStart failed (error 55) in 0.37 seconds (__init__:539)
2017-08-31 11:20:03,710+0530 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:53150 (protocoldetector:72)
2017-08-31 11:20:03,719+0530 INFO (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:53150 (protocoldetector:127)

Tested migrating a disk from one storage domain to another using glusterfs with libgfapi enabled while the VM is offline, and I see that it works fine.
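The OSError above is the gfapi-path variant of the same problem. A minimal sketch (an assumption based on the traceback: the replica 'path' is a gluster volume-relative path, valid only for libgfapi access; is_block_device mirrors vdsm's utils.isBlockDevice but is not the verbatim code):

import os
import stat

def is_block_device(path):
    # os.stat() requires an existing local filesystem path, so a
    # gfapi-style path like '/data/<sd>/images/<img>/<vol>' that is
    # not mounted locally raises OSError (Errno 2) instead of
    # returning False.
    return stat.S_ISBLK(os.stat(path).st_mode)

# The ids below are placeholders; any non-existent path reproduces it.
is_block_device('/data/<sd_id>/images/<img_id>/<vol_id>')  # OSError: [Errno 2]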
Engine and vdsm logs can be found in the link below:
============================================================
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1484070/
*** Bug 1484070 has been marked as a duplicate of this bug. ***
Treating gluster storage as file storage for non-LSM purposes doesn't work. On the other hand, LSM from gluster to non-gluster should work, and it looks like it sends the network disk type incorrectly.

I think that we should return an error from the engine for the "to gluster" LSM and make the "to non-gluster" LSM work.
(In reply to Denis Chaplygin from comment #26)
> Treating gluster storage as file storage for non-LSM purposes doesn't work.

This always worked; all storage operations use the gluster mount, for example monitoring storage. What "does not work" now?

> On the other hand, LSM from gluster to non-gluster should work, and it
> looks like it sends the network disk type incorrectly.
>
> I think that we should return an error from the engine for the "to gluster"
> LSM and make the "to non-gluster" LSM work.

LSM from any storage to gluster also used to work, and I don't see any reason that it will not work now.

Please be more specific about the problem you see with the current code.
(In reply to Nir Soffer from comment #27)
> This always worked; all storage operations use the gluster mount, for
> example monitoring storage.

LSM never worked, due to a libvirt limitation, documented at [1].

> What "does not work" now?

The same libvirt. It fails with:

libvirtError: argument unsupported: non-file destination not supported yet

> LSM from any storage to gluster also used to work, and I don't see any
> reason that it will not work now.

No, because:

> Please be more specific about the problem you see with the current code.

The engine is sending disks as network even for non-gluster disks. This needs to be fixed.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1306562
The patch is merged for 4.2. Denis - do we need this backported?
The BZ target version is 4.2, but I think it needs to be backported to 4.1.
(In reply to Denis Chaplygin from comment #30)
> The BZ target version is 4.2, but I think it needs to be backported to 4.1.

Agreed. I've reset the target milestone to 4.1.8.
The referenced patch was included in the 4.1.7.6 tag; moving to QE.
Storage migration tests passed with all storage types on rhevm-4.1.8.1-0.1.el7.noarch and vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64. Moving to verified.
I ran the migration test today on 4.2 and the issue still exists - exactly the same errors. The same test passes on 4.1. Adding 4.2 logs for you to look at.

correlationID - disks_syncAction_d0063abf-482d-4835

2017-11-28 20:01:23,202+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [disks_syncAction_d0063abf-482d-4835] Command 'VmReplicateDiskStartVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{hostId='04f8a786-fc94-4629-bf66-79d0a3d1944c', vmId='8d874efe-dd99-4451-ba31-33b600436444', storagePoolId='4a0d48a7-2018-4bdb-9bef-b4ba21cb22c5', srcStorageDomainId='70562927-a370-46b3-b827-fe223031a085', targetStorageDomainId='5f1676a5-3a75-47b6-9571-22f107badeb4', imageGroupId='95655bef-3a82-43c1-b76c-30032045c025', imageId='da1047da-abc9-438e-af24-61941f255497'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskStartVDS, error = Drive replication error, code = 55
Created attachment 1360243 [details]
4.2-logs
Relevant excerpt from the logs:

gluster01.scl.lab.tlv.redhat.com:_storage__jenkins__ge5__volume__0/5f1676a5-3a75-47b6-9571-22f107badeb4/images/95655bef-3a82-43c1-b76c-30032045c025/da1047da-abc9-438e-af24-61941f255497.lease', 'imageID': '95655bef-3a82-43c1-b76c-30032045c025'}]} (vm:4367)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4361, in diskReplicateStart
    self._startDriveReplication(drive)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4497, in _startDriveReplication
    self._dom.blockCopy(drive.name, destxml, flags=flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 556, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 688, in blockCopy
    if ret == -1: raise libvirtError ('virDomainBlockCopy() failed', dom=self)
libvirtError: argument unsupported: non-file destination not supported yet
2017-11-28 20:01:23,194+0200 INFO (jsonrpc/6) [api.virt] FINISH diskReplicateStart return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.35.69.27,46362, flow_id=disks_syncAction_d0063abf-482d-4835 (api:52)
2017-11-28 20:01:23,196+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateStart failed (error 55) in 0.60 seconds (__init__:573)
2017-11-28 20:01:23,484+0200 DEBUG (check/loop) [storage.check] START check '/dev/58056ef0-77de-4e86-85ce-c4d38c46be4b/metadata' (delay=0.00) (check:282)
2017-11-28 20:01:23,540+0200 DEBUG (check/loop) [storage.asyncevent] Process <subprocess32.Popen object at 0x2cb2250> terminated (count=1) (asyncevent:659)
2017-11-28 20:01:23,542+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/58056ef0-77de-4e86-85ce-c4d38c46be4b/metadata' (rc=0, elapsed=0.06) (check:328)
2017-11-28 20:01:24,234+0200 INFO (jsonrpc/3) [api.virt] START diskReplicateFinish(srcDisk={'device': 'disk', 'poolID': '4a0d48a7-2018-4bdb-9bef-b4ba21cb22c5', 'volumeID': 'da1047da-abc9-438e-af24-61941f255497', 'domainID': '70562927-a370-46b3-b827-fe223031a085', 'imageID': '95655bef-3a82-43c1-b76c-30032045c025'}, dstDisk={'device': 'disk', 'poolID': '4a0d48a7-2018-4bdb-9bef-b4ba21cb22c5', 'volumeID': 'da1047da-abc9-438e-af24-61941f255497', 'domainID': '70562927-a370-46b3-b827-fe223031a085', 'imageID': '95655bef-3a82-43c1-b76c-30032045c025'}) from=::ffff:10.35.69.27,46362, flow_id=disks_syncAction_d0063abf-482d-4835 (api:46)
2017-11-28 20:01:24,236+0200 ERROR (jsonrpc/3) [virt.vm] (vmId='8d874efe-dd99-4451-ba31-33b600436444') No replication in progress (drive: 'vdc', srcDisk: {'device': 'disk', 'poolID': '4a0d48a7-2018-4bdb-9bef-b4ba21cb22c5', 'volumeID': 'da1047da-abc9-438e-af24-61941f255497', 'domainID': '70562927-a370-46b3-b827-fe223031a085', 'imageID': '95655bef-3a82-43c1-b76c-30032045c025'}) (vm:4402)
2017-11-28 20:01:24,236+0200 INFO (jsonrpc/3) [api.virt] FINISH diskReplicateFinish return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.35.69.27,46362, flow_id=disks_syncAction_d0063abf-482d-4835 (api:52)
2017-11-28 20:01:24,237+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateFinish failed (error 55) in 0.00 seconds (__init__:573)
Lilach, what 4.2 build are you using?
> libvirtError: argument unsupported: non-file destination not supported yet

This is expected behavior. LSM to gluster disks will not work until https://bugzilla.redhat.com/show_bug.cgi?id=1306562 is fixed.
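Since the libvirt limitation is known, vdsm could fail fast with a clearer message instead of surfacing the raw libvirt error. A sketch of such a guard (illustrative only, not the actual vdsm code; the only real API used is libvirt's virDomain.blockCopy, the same call that fails in the traceback above):

class ReplicationError(Exception):
    pass

def start_drive_replication(dom, drive_name, replica, dest_xml, flags=0):
    # libvirt's virDomainBlockCopy rejects non-file destinations
    # ("non-file destination not supported yet"), so reject a
    # network-typed replica up front with an explicit error.
    if replica.get('diskType') == 'network':
        raise ReplicationError(
            "blockCopy to a network disk is not supported by libvirt yet, "
            "see https://bugzilla.redhat.com/show_bug.cgi?id=1306562")
    # dom is a libvirt.virDomain; this is the call from the traceback.
    dom.blockCopy(drive_name, dest_xml, flags=flags)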
(In reply to Allon Mureinik from comment #37)
> Lilach, what 4.2 build are you using?

ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch

(In reply to Denis Chaplygin from comment #38)
> This is expected behavior. LSM to gluster disks will not work until
> https://bugzilla.redhat.com/show_bug.cgi?id=1306562 is fixed.

OK, thanks for the clarification.
*** Bug 1519149 has been marked as a duplicate of this bug. ***