Bug 1481688 - Storage migration fails in VmReplicateDiskFinishVDSCommand stage
Summary: Storage migration fails in VmReplicateDiskFinishVDSCommand stage
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.1.8
Assignee: Denis Chaplygin
QA Contact: Lilach Zitnitski
URL:
Whiteboard: Hyper Convergence
Duplicates: 1484070
Depends On: 1507994
Blocks: 1489361
 
Reported: 2017-08-15 13:00 UTC by Lilach Zitnitski
Modified: 2021-09-09 12:33 UTC (History)
CC List: 12 users

Fixed In Version:
Clone Of:
Cloned to: 1489361
Environment:
Last Closed: 2017-12-11 16:30:26 UTC
oVirt Team: Gluster
Embargoed:
ykaul: ovirt-4.1+
rule-engine: ovirt-4.2+
rule-engine: blocker+


Attachments
logs (106.25 KB, application/zip), 2017-08-15 13:00 UTC, Lilach Zitnitski
new run logs (814.91 KB, application/zip), 2017-08-20 11:08 UTC, Lilach Zitnitski
4.2-logs (95.30 KB, application/zip), 2017-11-29 09:40 UTC, Lilach Zitnitski


Links
Red Hat Issue Tracker RHV-43451, last updated 2021-09-09 12:32:59 UTC
oVirt gerrit 81909 (master, MERGED): storage: Fixed destination SD type detection during LSM, last updated 2020-10-02 09:43:30 UTC
oVirt gerrit 83394 (ovirt-engine-4.1, MERGED): storage: Fixed destination SD type detection during LSM, last updated 2020-10-02 09:43:30 UTC

Description Lilach Zitnitski 2017-08-15 13:00:09 UTC
Description of problem:
Storage migration is failing in the VmReplicateDiskFinishVDSCommand stage with the error 'Failed to change disk image'.

I saw this error mainly in our automation tests and I can't reproduce it locally, but the problem was found in a few tests, not just one.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170801132145.git3e12cf0.el7.centos.noarch
vdsm-4.20.2-60.git06231e5.el7.centos.x86_64

How reproducible:
~50%, maybe more

Steps to Reproduce:
1. Not entirely sure; just migrate disks (cold or live migration, it doesn't matter) between storage domains.

Actual results:
In some cases the migration fails with the error 'Failed to change disk image'.

Expected results:
Migration should end successfully.

Additional info:

engine.log

2017-08-12 10:39:26,349+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler9) [disks_syncAction_764a9c1c-f5bf-4ca5] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=41, message=Failed to change disk image]]'

2017-08-12 10:39:27,546+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [disks_syncAction_764a9c1c-f5bf-4ca5] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command VmReplicateDiskFinishVDS failed: Drive replication error

2017-08-12 10:39:27,546+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler1) [disks_syncAction_764a9c1c-f5bf-4ca5] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{hostId='824bbda7-cfbb-400b-a021-068dc060832f', vmId='f13ca114-d739-43b7-93ca-a97c4a951bc5', storagePoolId='19791d8f-814e-4241-a028-e103c310060a', srcStorageDomainId='e8045448-f2e3-421b-b92e-34f79f2a687d', targetStorageDomainId='e8045448-f2e3-421b-b92e-34f79f2a687d', imageGroupId='6d61ec9f-5a0a-4840-af54-04dd27762ad8', imageId='5a1fcef4-81ff-42e3-83be-33e013245673'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error, code = 55

Comment 1 Lilach Zitnitski 2017-08-15 13:00:47 UTC
Created attachment 1313680 [details]
logs

engine and vdsm logs

Comment 2 Allon Mureinik 2017-08-17 12:17:53 UTC
From VDSM's log:

2017-08-12 10:39:26,109+0300 ERROR (jsonrpc/4) [virt.vm] (vmId='f13ca114-d739-43b7-93ca-a97c4a951bc5') Unable to stop the replication for the drive: vda (vm:4295)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4292, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 586, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 743, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)

Daniel/Nir can you take a look?

Comment 3 Daniel Erez 2017-08-17 12:31:46 UTC
@Lilach - can you please attach qemu and libvirt logs as well?

Comment 4 Lilach Zitnitski 2017-08-20 11:08:59 UTC
Created attachment 1315939 [details]
new run logs

Since the first logs were from the automation tests and I couldn't get the libvirt and qemu logs, I reproduced this bug. The new zip also contains new engine and vdsm logs, along with the libvirt and qemu logs.

correlationID - disks_syncAction_81a244b3-0f7c-4eeb
error in new engine log at 17:39:15

Comment 5 Daniel Erez 2017-08-20 18:19:39 UTC
According to the log [1], there's a KeyError for a missing 'protocol' key during the replication start phase. It seems the cause is 'diskType': 'network', i.e. the disk has a path to a file domain but is marked as a network disk.

@Sahina - Could it be related to https://gerrit.ovirt.org/#/c/79745/

@Lilach - Do you use GlusterFS for any of the tests?

[1]
2017-08-17 17:39:15,611+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='d3f3c5c5-a337-45f6-b30c-26e2fc739951') Unable to start replication for vda to {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeInfo': {'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volType': 'path'}, 'format': 'cow', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'propagateErrors': 'off', 'diskType': 'network', 'cache': 'none', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867', 'volumeChain': [{'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 116391936, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 114294784, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/b8bd4c59-ac36-4847-82ce-d09fc3552146', 'volumeID': 'b8bd4c59-ac36-4847-82ce-d09fc3552146', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 109051904, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/d5508161-5232-4453-b81d-4a65f1240d38', 'volumeID': 'd5508161-5232-4453-b81d-4a65f1240d38', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}]} (vm:4166)
KeyError: 'protocol'
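
For illustration, a minimal sketch of this failure mode (simplified and hypothetical, not the actual vdsm code; the point is that the 'protocol' key only exists for libgfapi/network volumes):

    # Hypothetical, simplified version of the vdsm volume-preparation dispatch.
    def prepare_volume_path(drive, volinfo):
        # The engine-supplied diskType decides which preparation path is taken.
        if drive.get('diskType') == 'network':
            # Network (libgfapi) volumes are expected to carry a 'protocol'
            # key (e.g. 'gluster') in their volume info.
            drive['protocol'] = volinfo['protocol']  # KeyError for a path-based volume
        return volinfo['path']

    # The request above: a path-based volume wrongly labelled as a network disk.
    volinfo = {'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/...',
               'volType': 'path'}
    prepare_volume_path({'diskType': 'network'}, volinfo)  # raises KeyError: 'protocol'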

Comment 6 Lilach Zitnitski 2017-08-21 06:44:05 UTC
Daniel, the first logs are not from tests using GlusterFS; the new logs are. I will try to re-run the tests on iSCSI and NFS only.

Comment 7 Allon Mureinik 2017-08-21 10:52:52 UTC
(In reply to Daniel Erez from comment #5)
> According to the log[1], there's a KeyError of missing 'protocol' key during
> replication start phase. Seems the cause is 'diskType': 'network', i.e. the
> disk has a path to a file domain, but is marked as a network disk.
> 
> @Sahina - Could it be related to https://gerrit.ovirt.org/#/c/79745/
Denis, wasn't there a caveat about LSM and libgfapi?

Comment 8 Denis Chaplygin 2017-08-21 10:58:11 UTC
There was, but at a lower level - libvirt still doesn't support block copy to network disks. This one seems to be a different issue.

Comment 9 Daniel Erez 2017-08-21 11:31:05 UTC
(In reply to Denis Chaplygin from comment #8)
> There was, but on lower level - libvirt still doesn't support block copy to
> the network disks. This one seems to be different issue

@Denis - so why isn't it disabled in the engine for now? Currently, passing diskType as network for an image domain throws a KeyError in vdsm (as explained in comment 5). I.e. I'm not sure I'm following the change introduced in https://gerrit.ovirt.org/#/c/79745/4/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/lsm/LiveMigrateDiskCommand.java

Comment 10 Denis Chaplygin 2017-08-21 12:49:32 UTC
I'm afraid that issue was introduced later and is described in [1].

Anyway, libgfapi is disabled by default everywhere except 4.2.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165

Comment 11 Daniel Erez 2017-08-21 13:00:49 UTC
(In reply to Denis Chaplygin from comment #10)
> I'm afraid hat issue was introduced later and described in [1]
> 
> Anyway, libgfapi is by default disabled everywhere, except 4.2 
> 
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165

Thanks Denis!

@Francesco - did you test storage live migration flow as part of bug 1481165 ?

Comment 12 Francesco Romani 2017-08-21 13:27:23 UTC
(In reply to Daniel Erez from comment #11)
> (In reply to Denis Chaplygin from comment #10)
> > I'm afraid hat issue was introduced later and described in [1]
> > 
> > Anyway, libgfapi is by default disabled everywhere, except 4.2 
> > 
> > 
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165
> 
> Thanks Denis!
> 
> @Francesco - did you test storage live migration flow as part of bug 1481165
> ?

I am testing the LSM flow in general. We have one more fix in this area: https://gerrit.ovirt.org/#/c/80831/5

Comment 13 Francesco Romani 2017-08-21 17:09:34 UTC
(In reply to Francesco Romani from comment #12)
> (In reply to Daniel Erez from comment #11)
> > (In reply to Denis Chaplygin from comment #10)
> > > I'm afraid hat issue was introduced later and described in [1]
> > > 
> > > Anyway, libgfapi is by default disabled everywhere, except 4.2 
> > > 
> > > 
> > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165
> > 
> > Thanks Denis!
> > 
> > @Francesco - did you test storage live migration flow as part of bug 1481165
> > ?
> 
> I am testing the LSM flow in general. We have one more fix in this area:
> https://gerrit.ovirt.org/#/c/80831/5

To be clear, I didn't test LSM on glusterfs, only VM start/stop (IIUC it isn't supported and https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c7 seems to confirm that)

Please note that from what I've seen here from the logs, the fix I'm working on [1] should not affect this bz.

[1] 80831 is about recovery; if Vdsm doesn't crash, there is no issue.

Comment 14 Daniel Erez 2017-08-21 17:44:56 UTC
(In reply to Francesco Romani from comment #13)
> (In reply to Francesco Romani from comment #12)
> > (In reply to Daniel Erez from comment #11)
> > > (In reply to Denis Chaplygin from comment #10)
> > > > I'm afraid hat issue was introduced later and described in [1]
> > > > 
> > > > Anyway, libgfapi is by default disabled everywhere, except 4.2 
> > > > 
> > > > 
> > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165
> > > 
> > > Thanks Denis!
> > > 
> > > @Francesco - did you test storage live migration flow as part of bug 1481165
> > > ?
> > 
> > I am testing the LSM flow in general. We have one more fix in this area:
> > https://gerrit.ovirt.org/#/c/80831/5
> 
> To be clear, I didn't test LSM on glusterfs, only VM start/stop (IIUC it
> isn't supported and https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c7
> seems to confirm that)
> 
> Please note that from what I've seen here from the logs, the fix I'm working
> on [1] should not affect this bz.
> 
> [1] 80831 is about recovery, if Vdsm doesn't crash there is no issue.

@Lilach - so we have two separate issues:
* LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.
* LSM on nfs - can you please attach full logs of the tests?

Comment 15 Nir Soffer 2017-08-21 18:02:59 UTC
(In reply to Daniel Erez from comment #14)
> @Lilach - so we have two separate issues:
> * LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.

LSM on glusterfs is supported.

> * LSM on nfs - can you please attach full logs of the tests?

Comment 16 Daniel Erez 2017-08-21 18:28:54 UTC
(In reply to Nir Soffer from comment #15)
> (In reply to Daniel Erez from comment #14)
> > @Lilach - so we have two separate issues:
> > * LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.
> 
> LSM on glusterfs is supported.

@Nir/Denis - I'm not following then:
1. According to https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c8
I understand that we don't support LSM yet?
2. If we do support it, why do we get a KeyError in start replication? And why do we set diskType to network?

[1]
2017-08-17 17:39:15,611+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='d3f3c5c5-a337-45f6-b30c-26e2fc739951') Unable to start replication for vda to {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeInfo': {'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volType': 'path'}, 'format': 'cow', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'propagateErrors': 'off', 'diskType': 'network', 'cache': 'none', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867', 'volumeChain': [{'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 116391936, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 114294784, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/b8bd4c59-ac36-4847-82ce-d09fc3552146', 'volumeID': 'b8bd4c59-ac36-4847-82ce-d09fc3552146', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 109051904, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/d5508161-5232-4453-b81d-4a65f1240d38', 'volumeID': 'd5508161-5232-4453-b81d-4a65f1240d38', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}]} (vm:4166)
KeyError: 'protocol'


> 
> > * LSM on nfs - can you please attach full logs of the tests?

Comment 17 Nir Soffer 2017-08-21 18:42:47 UTC
Regarding comment 7 - the error is:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4149, in diskReplicateStart
    replica['path'] = self.cif.prepareVolumePath(replica)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 345, in prepareVolumePath
    volPath = self._prepare_network_drive(drive, res)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 632, in _prepare_network_drive
    drive['protocol'] = volinfo['protocol']
KeyError: 'protocol'

_prepare_network_drive() was called with a non-network drive; the error
is expected when sending invalid data to vdsm.

The bad request was:

# local_vdsm.log
2017-08-17 17:39:11,428+0300 INFO  (jsonrpc/1) [vdsm.api] START diskReplicateStart(srcDisk={'device': 'disk', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'domainID': '25893777-25bb-4085-ac34-f0e88d4aa10f', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, dstDisk={'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'diskType': 'network', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}) from=::ffff:10.35.69.27,55492, flow_id=disks_syncAction_81a244b3-0f7c-4eeb (api:46)

The destination storage domain is a block storage domain based on:

2017-08-17 17:04:56,686+0300 INFO  (monitor/00d54fa) [storage.LVM] Refreshing lvs: vg=00d54fa8-ac31-4190-addb-0085335e7272 lvs=['metadata', 'leases', 'ids', 'inbox', 'outbox', 'xleases', 'master'] (lvm:1294)

The root cause is the engine, sending a network disk type for a block-based disk.

A network disk should never be sent as the LSM destination disk, even for a gluster
disk, not in 4.1 and not in 4.2, since libvirt does not yet support LSM to a
network disk.

Denis, can you check this?
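
The direction of the fix that was eventually merged (gerrit 81909/83394, "Fixed destination SD type detection during LSM") is to detect the destination disk type from the actual destination storage domain. A rough sketch of that decision in Python-style pseudologic (the real engine code is Java; the class and method names here are illustrative only):

    class StorageDomain:
        # Hypothetical stand-in for the engine's storage domain object.
        def __init__(self, storage_type):
            self.storage_type = storage_type

        def is_block(self):
            return self.storage_type in ('iscsi', 'fcp')

    def lsm_destination_disk_type(dest_domain):
        # libvirt cannot yet blockCopy to a network (libgfapi) destination
        # (see bug 1306562), so LSM must never request diskType 'network' here.
        if dest_domain.is_block():      # iSCSI / FC domains
            return 'block'
        return 'file'                   # NFS, and gluster via its fuse mount

    print(lsm_destination_disk_type(StorageDomain('glusterfs')))  # file
    print(lsm_destination_disk_type(StorageDomain('iscsi')))      # block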

Comment 18 Nir Soffer 2017-08-21 18:47:03 UTC
(In reply to Daniel Erez from comment #16)
> (In reply to Nir Soffer from comment #15)
> > LSM on glusterfs is supported.
> 
> @Nir/Denis - I'm not following then:
> 1. According to https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c8
> I understand that we don't support LSM yet?

We do support LSM to gluster, but not using a network disk. If the source disk is a
gluster disk using libgfapi (networkDisk=True, protocol="gluster"), LSM will
convert the disk to a file-based gluster disk, degrading performance - but
the operation will complete. Until libvirt is fixed to do blockCopy to a network
disk, we don't have a choice.

> 2. If we do support, why do we get a KeyError in start replication? and why
> do we set diskType to network?

This is an engine bug.

Comment 19 Lilach Zitnitski 2017-08-22 06:39:06 UTC
(In reply to Daniel Erez from comment #14)
> @Lilach - so we have two separate issues:
> * LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.
> * LSM on nfs - can you please attach full logs of the tests?

For now I have failed to reproduce these errors with NFS/iSCSI storage domains, only with GlusterFS on my local environment.
The errors I saw on NFS/iSCSI were from the automation tests, so if I see them again on the same environment I'll attach new logs.

Comment 20 Daniel Erez 2017-08-22 07:04:49 UTC
(In reply to Nir Soffer from comment #17)
> Regarding comment 7 - the error is:
> 
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4149, in
> diskReplicateStart
>     replica['path'] = self.cif.prepareVolumePath(replica)
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 345, in
> prepareVolumePath
>     volPath = self._prepare_network_drive(drive, res)
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 632, in
> _prepare_network_drive
>     drive['protocol'] = volinfo['protocol']
> KeyError: 'protocol'
> 
> _prepare_netwrok_drive() was called with a non network drive, the error
> is expected when sending invalid data to vdsm.
> 
> The bad request was:
> 
> # local_vdsm.log
> 2017-08-17 17:39:11,428+0300 INFO  (jsonrpc/1) [vdsm.api] START
> diskReplicateStart(srcDisk={'device': 'disk', 'poolID':
> '30312c9a-2bff-4ebe-9335-120b48e52d91', 'volumeID': '0e8861ca-7a38-4bd
> 0-8f04-b2344661b432', 'domainID': '25893777-25bb-4085-ac34-f0e88d4aa10f',
> 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, dstDisk={'domainID':
> '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeID':
> '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'diskType': 'network', 'poolID':
> '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'imageID':
> '81738b6d-1693-46e6-b470-df38353ac867'}
> ) from=::ffff:10.35.69.27,55492, flow_id=disks_syncAction_81a244b3-0f7c-4eeb
> (api:46)
> 
> The destination storage domain is a block storage domain based on:
> 
> 2017-08-17 17:04:56,686+0300 INFO  (monitor/00d54fa) [storage.LVM]
> Refreshing lvs: vg=00d54fa8-ac31-4190-addb-0085335e7272 lvs=['metadata',
> 'leases', 'ids', 'inbox', 'outbox', 'xleases', 'master'] (lvm:1294)
> 
> The root cause is engine, sending networkDisk for block based disk.
> 
> Network disk should never be sent to LSM destination disk, even for gluster
> disk, not in 4.1 and not in 4.2, since libvirt does not support yet LSM to 
> network disk.
> 
> Denis check this?

Comment 21 Daniel Erez 2017-08-27 09:16:07 UTC
As per comment 17, diskType 'network' should not be sent for the LSM destination disk.
Moving to the Gluster oVirt team, as the issue affects only Gluster and was introduced by https://gerrit.ovirt.org/#/c/79745/

Comment 22 Sahina Bose 2017-08-30 05:57:24 UTC
If this breaks live storage migration when libgfapi access is not enabled, I think we should target 4.1.6.
Denis?

Comment 23 RamaKasturi 2017-09-07 09:32:01 UTC
Hi,

  I have tested LSM with both source and target domains being glusterfs with libgfapi enabled, and I see that it fails with the following errors in the engine and vdsm logs.

engine.log:
==================================

2017-08-31 01:50:00,390-04 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-28) [b9830c2e-c16a-485f-b91b-45058698f0aa] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity '3c1e7646-5a69-49cb-862f-2ee431fd2961'
2017-08-31 01:50:01,879-04 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] FINISH, GlusterServersListVDSCommand, return: [10.70.36.78/23:CONNECTED, 10.70.36.76:CONNECTED, 10.70.36.77:CONNECTED], log id: 62ba0d5f
2017-08-31 01:50:01,891-04 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] START, GlusterVolumesListVDSCommand(HostName = yarrow.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='11972b68-a720-4236-bde0-3c066ceff3b6'}), log id: 2ddaf087
2017-08-31 01:50:02,095-04 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler4) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateVmDisks' (id: '7a857239-50a8-412a-922d-09228065bcda') waiting on child command id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' type:'LiveMigrateDisk' to complete
2017-08-31 01:50:03,116-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] START, VmReplicateDiskStartVDSCommand(HostName = zod.lab.eng.blr.redhat.com, VmReplicateDiskParameters:{runAsync='true', hostId='6a90a343-10e4-4c5c-afc7-e5477594d2b6', vmId='b6da275e-649b-470d-ac32-89738faa3be5', storagePoolId='59a5429a-0236-034c-00d4-000000000250', srcStorageDomainId='d58a1d51-da25-454e-9edd-9754e419d668', targetStorageDomainId='1e67ccf5-8331-4e69-a0f6-c5ed58381217', imageGroupId='908c0355-ceb8-41c8-a644-5a551040d914', imageId='209455eb-4b82-417e-97ea-d0bab2fce721'}), log id: 59c08196
2017-08-31 01:50:03,147-04 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] FINISH, GlusterVolumesListVDSCommand, return: {9eae9600-7185-4fbb-a42c-bbe9c6eb86ea=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@9669622b, 0763bd54-ea1c-4707-8abb-3d80d62df20f=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@7aaadd95, 0dccff81-2ffa-4876-b4c3-15930fab2c65=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@844a93e1}, log id: 2ddaf087
2017-08-31 01:50:04,367-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed in 'VmReplicateDiskStartVDS' method
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM zod.lab.eng.blr.redhat.com command VmReplicateDiskStartVDS failed: Drive replication error
2017-08-31 01:50:04,377-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand' return value 'StatusOnlyReturn [status=Status [code=55, message=Drive replication error]]'
2017-08-31 01:50:04,377-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] HostName = zod.lab.eng.blr.redhat.com
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'VmReplicateDiskStartVDSCommand(HostName = zod.lab.eng.blr.redhat.com, VmReplicateDiskParameters:{runAsync='true', hostId='6a90a343-10e4-4c5c-afc7-e5477594d2b6', vmId='b6da275e-649b-470d-ac32-89738faa3be5', storagePoolId='59a5429a-0236-034c-00d4-000000000250', srcStorageDomainId='d58a1d51-da25-454e-9edd-9754e419d668', targetStorageDomainId='1e67ccf5-8331-4e69-a0f6-c5ed58381217', imageGroupId='908c0355-ceb8-41c8-a644-5a551040d914', imageId='209455eb-4b82-417e-97ea-d0bab2fce721'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskStartVDS, error = Drive replication error, code = 55
2017-08-31 01:50:04,377-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] FINISH, VmReplicateDiskStartVDSCommand, log id: 59c08196
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed VmReplicateDiskStart (Disk '908c0355-ceb8-41c8-a644-5a551040d914' , VM 'b6da275e-649b-470d-ac32-89738faa3be5')
2017-08-31 01:50:04,378-04 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateDisk' id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' failed when attempting to perform the next operation, marking as FAILED '[3c1e7646-5a69-49cb-862f-2ee431fd2961]'
2017-08-31 01:50:04,378-04 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateDisk' id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' child commands '[3c1e7646-5a69-49cb-862f-2ee431fd2961]' executions were completed, status 'FAILED'
2017-08-31 01:50:05,400-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler7) [b9830c2e-c16a-485f-b91b-45058698f0aa] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2017-08-31 01:50:05,400-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler7) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed durin

vdsm.log file:
==================================================
fce721', 'volumeChain': [{'domainID': u'1e67ccf5-8331-4e69-a0f6-c5ed58381217', 'leaseOffset': 0, 'path': u'data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721', 'volumeID': u'209455eb-4b82-417e-97ea-d0bab2fce721', 'leasePath': u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721.lease', 'imageID': u'908c0355-ceb8-41c8-a644-5a551040d914'}, {'domainID': u'1e67ccf5-8331-4e69-a0f6-c5ed58381217', 'leaseOffset': 0, 'path': u'data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/f2380cbb-1e16-452e-b597-2066c979cbc2', 'volumeID': u'f2380cbb-1e16-452e-b597-2066c979cbc2', 'leasePath': u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/f2380cbb-1e16-452e-b597-2066c979cbc2.lease', 'imageID': u'908c0355-ceb8-41c8-a644-5a551040d914'}]} (vm:3739)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3727, in diskReplicateStart
    if utils.isBlockDevice(replica['path']):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 96, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)
OSError: [Errno 2] No such file or directory: '/data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721'
2017-08-31 11:20:03,486+0530 INFO  (jsonrpc/2) [vdsm.api] FINISH diskReplicateStart return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.70.34.124,42020, flow_id=b9830c2e-c16a-485f-b91b-45058698f0aa (api:52)
2017-08-31 11:20:03,487+0530 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateStart failed (error 55) in 0.37 seconds (__init__:539)
2017-08-31 11:20:03,710+0530 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:53150 (protocoldetector:72)
2017-08-31 11:20:03,719+0530 INFO  (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:53150 (protocoldetector:127)

Tested migrating a disk from one storage domain to another using glusterfs with libgfapi enabled while the VM is offline, and I see that it works fine.

Comment 24 RamaKasturi 2017-09-07 09:32:15 UTC
Engine and vdsm logs can be found in the link below:
============================================================

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1484070/

Comment 25 Yaniv Kaul 2017-09-14 12:48:14 UTC
*** Bug 1484070 has been marked as a duplicate of this bug. ***

Comment 26 Denis Chaplygin 2017-09-15 14:27:57 UTC
Treating gluster storage as file storage for non-LSM purposes doesn't work. On the other hand, LSM from gluster to non-gluster should work, and it looks like the engine sends the network disk type incorrectly.

I think that we should return an error from the engine for the "to gluster" LSM and make the "to non-gluster" LSM work.

Comment 27 Nir Soffer 2017-09-15 15:04:06 UTC
(In reply to Denis Chaplygin from comment #26)
> Treating gluster storage as file storage for non LSM purposes doesn't work.

This has always worked; all storage operations use the gluster mount, for example storage monitoring.

What "does not work" now?

> On the other hand, LSM from gluster to non-gluster should work and it looks
> like it sends network disk type incorrectly. 
> 
> I think that we should return error from the engine for the "to gluster" LSM
> and make "to non-gluster" LSM working.

LSM from any storage to gluster also used to work and I don't see any reason that
it will not work now.

Please be more specific with the problem you see with current code.

Comment 28 Denis Chaplygin 2017-09-18 08:39:52 UTC
(In reply to Nir Soffer from comment #27)
> This always worked, all storage operations use the gluster mount, for example
> monitoring storage.

LSM never worked, due to a libvirt limitation documented in [1].
 
> What "does not work" now?

The same libvirt limitation. It fails with:

libvirtError: argument unsupported: non-file destination not supported yet

> > On the other hand, LSM from gluster to non-gluster should work and it looks
> > like it sends network disk type incorrectly. 
> > 
> > I think that we should return error from the engine for the "to gluster" LSM
> > and make "to non-gluster" LSM working.
> 
> LSM from any storage to gluster also used to work and I don't see any reason
> that
> it will not work now.

No, cause 
 
> Please be more specific with the problem you see with current code.

The engine is sending disks as network even for non-gluster disks. This needs to be fixed.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1306562

Comment 29 Allon Mureinik 2017-10-30 13:59:23 UTC
Patch is merged for 4.2.
Denis - do we need this backported?

Comment 30 Denis Chaplygin 2017-10-30 14:00:58 UTC
The BZ target version is 4.2, but I think it needs to be backported to 4.1.

Comment 31 Allon Mureinik 2017-11-01 09:56:23 UTC
(In reply to Denis Chaplygin from comment #30)
> BZ target version is 4.2, but i think it needs to be backported to 4.1
Agreed.
I've reset the target milestone to 4.1.8.

Comment 32 Sandro Bonazzola 2017-11-24 12:50:30 UTC
The referenced patch was included in the 4.1.7.6 tag; moving to QE.

Comment 33 Lilach Zitnitski 2017-11-28 14:06:44 UTC
Storage migration tests passed with all storage types on rhevm-4.1.8.1-0.1.el7.noarch, vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64.

Moving to verified.

Comment 34 Lilach Zitnitski 2017-11-29 09:39:54 UTC
I ran the migration test on 4.2 today and the issue still exists, with exactly the same errors. The same test passes on 4.1.

Adding 4.2 logs for you to look at.

correlationID - disks_syncAction_d0063abf-482d-4835

2017-11-28 20:01:23,202+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [disks_syncAction_d0063abf-482d-4835] Command 'VmReplicateDiskStartVDSCommand(HostName = host_mixed_3, VmReplicateDiskParameters:{hostId='04f8a786-fc94-4629-bf66-79d0a3d1944c', vmId='8d874efe-dd99-4451-ba31-33b600436444', storagePoolId='4a0d48a7-2018-4bdb-9bef-b4ba21cb22c5', srcStorageDomainId='70562927-a370-46b3-b827-fe223031a085', targetStorageDomainId='5f1676a5-3a75-47b6-9571-22f107badeb4', imageGroupId='95655bef-3a82-43c1-b76c-30032045c025', imageId='da1047da-abc9-438e-af24-61941f255497'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskStartVDS, error = Drive replication error, code = 55

Comment 35 Lilach Zitnitski 2017-11-29 09:40:22 UTC
Created attachment 1360243 [details]
4.2-logs

Comment 36 Francesco Romani 2017-11-29 09:58:53 UTC
Relevant excerpt from logs:

gluster01.scl.lab.tlv.redhat.com:_storage__jenkins__ge5__volume__0/5f1676a5-3a75-47b6-9571-22f107badeb4/images/95655bef-3a82-43c1-b76c-30032045c025/da1047da-abc9-438e-af24-61941f255497.lease', 'imageID': '95655bef-3a82-43c1-b76c-30032045c025'}]} (vm:4367)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4361, in diskReplicateStart
    self._startDriveReplication(drive)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4497, in _startDriveReplication
    self._dom.blockCopy(drive.name, destxml, flags=flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 556, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 688, in blockCopy
    if ret == -1: raise libvirtError ('virDomainBlockCopy() failed', dom=self)
libvirtError: argument unsupported: non-file destination not supported yet
2017-11-28 20:01:23,194+0200 INFO  (jsonrpc/6) [api.virt] FINISH diskReplicateStart return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.35.69.27,46362, flow_id=disks_syncAction_d0063abf-482d-4835 (api:52)
2017-11-28 20:01:23,196+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateStart failed (error 55) in 0.60 seconds (__init__:573)
2017-11-28 20:01:23,484+0200 DEBUG (check/loop) [storage.check] START check '/dev/58056ef0-77de-4e86-85ce-c4d38c46be4b/metadata' (delay=0.00) (check:282)
2017-11-28 20:01:23,540+0200 DEBUG (check/loop) [storage.asyncevent] Process <subprocess32.Popen object at 0x2cb2250> terminated (count=1) (asyncevent:659)
2017-11-28 20:01:23,542+0200 DEBUG (check/loop) [storage.check] FINISH check '/dev/58056ef0-77de-4e86-85ce-c4d38c46be4b/metadata' (rc=0, elapsed=0.06) (check:328)
2017-11-28 20:01:24,234+0200 INFO  (jsonrpc/3) [api.virt] START diskReplicateFinish(srcDisk={'device': 'disk', 'poolID': '4a0d48a7-2018-4bdb-9bef-b4ba21cb22c5', 'volumeID': 'da1047da-abc9-438e-af24-61941f255497', 'domainID': '70562927-a370-46b3-b827-fe223031a085', 'imageID': '95655bef-3a82-43c1-b76c-30032045c025'}, dstDisk={'device': 'disk', 'poolID': '4a0d48a7-2018-4bdb-9bef-b4ba21cb22c5', 'volumeID': 'da1047da-abc9-438e-af24-61941f255497', 'domainID': '70562927-a370-46b3-b827-fe223031a085', 'imageID': '95655bef-3a82-43c1-b76c-30032045c025'}) from=::ffff:10.35.69.27,46362, flow_id=disks_syncAction_d0063abf-482d-4835 (api:46)
2017-11-28 20:01:24,236+0200 ERROR (jsonrpc/3) [virt.vm] (vmId='8d874efe-dd99-4451-ba31-33b600436444') No replication in progress (drive: 'vdc', srcDisk: {'device': 'disk', 'poolID': '4a0d48a7-2018-4bdb-9bef-b4ba21cb22c5', 'volumeID': 'da1047da-abc9-438e-af24-61941f255497', 'domainID': '70562927-a370-46b3-b827-fe223031a085', 'imageID': '95655bef-3a82-43c1-b76c-30032045c025'}) (vm:4402)
2017-11-28 20:01:24,236+0200 INFO  (jsonrpc/3) [api.virt] FINISH diskReplicateFinish return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.35.69.27,46362, flow_id=disks_syncAction_d0063abf-482d-4835 (api:52)
2017-11-28 20:01:24,237+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateFinish failed (error 55) in 0.00 seconds (__init__:573)
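
For context, virDomainBlockCopy() takes a destination disk XML, and a network-type destination is what this libvirt version rejects. A minimal sketch of the two destination shapes (illustrative only; the element names follow libvirt's disk XML, and the helper itself is hypothetical, not vdsm code):

    # Hypothetical helper sketching the two destination XML shapes.
    def replica_dest_xml(disk_type, source):
        if disk_type == 'network':
            # Rejected by virDomainBlockCopy() in the libvirt used here:
            # "argument unsupported: non-file destination not supported yet"
            return ("<disk type='network' device='disk'>"
                    "<source protocol='gluster' name='%s'/>"
                    "</disk>" % source)
        # A file destination (including a gluster path under the fuse mount
        # at /rhev/data-center/mnt/glusterSD/...) is accepted.
        return ("<disk type='file' device='disk'>"
                "<source file='%s'/>"
                "</disk>" % source)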

Comment 37 Allon Mureinik 2017-11-29 10:04:28 UTC
Lilach, what 4.2 build are you using?

Comment 38 Denis Chaplygin 2017-11-29 10:45:30 UTC
> libvirtError: argument unsupported: non-file destination not supported yet

This is expected behavior. LSM to gluster disks will not work until https://bugzilla.redhat.com/show_bug.cgi?id=1306562 is fixed.

Comment 39 Lilach Zitnitski 2017-11-29 12:09:45 UTC
(In reply to Allon Mureinik from comment #37)
> Lilach, what 4.2 build are you using?

ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch

(In reply to Denis Chaplygin from comment #38)
> > libvirtError: argument unsupported: non-file destination not supported yet
> 
> This is expected behavior. LSM to gluster disks will not work, until
> https://bugzilla.redhat.com/show_bug.cgi?id=1306562 is fixed.

OK thanks for the clarification.

Comment 40 Allon Mureinik 2017-11-30 09:58:44 UTC
*** Bug 1519149 has been marked as a duplicate of this bug. ***

