Bug 1489361 - [gfapi] Storage migration fails in VmReplicateDiskFinishVDSCommand stage w/ gfapi access
Status: CLOSED DEFERRED
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhhi
Version: rhhi-1.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Sahina Bose
QA Contact: SATHEESARAN
Depends On: 1481688
 
Reported: 2017-09-07 09:26 UTC by Sahina Bose
Modified: 2020-08-01 02:53 UTC

Clone Of: 1481688
Last Closed: 2020-08-01 02:53:17 UTC


Description Sahina Bose 2017-09-07 09:26:03 UTC
+++ This bug was initially created as a clone of Bug #1481688 +++

Description of problem:
Storage migration is failing in VmReplicateDiskFinishVDSCommand with the error "Failed to change disk image".

I saw this error mainly in our automation tests and I can't reproduce it locally, but the problem was found in a few tests, not only one.

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20170801132145.git3e12cf0.el7.centos.noarch
vdsm-4.20.2-60.git06231e5.el7.centos.x86_64

How reproducible:
~50%, maybe more.

Steps to Reproduce:
1. Not really sure; just migrate disks between storage domains (cold or live migration, it doesn't matter).

Actual results:
In some cases the migration fails with the error "Failed to change disk image".

Expected results:
Migration should end successfully.

Additional info:

engine.log

2017-08-12 10:39:26,349+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler9) [disks_syncAction_764a9c1c-f5bf-4ca5] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand' return value 'StatusOnlyReturn [status=Status [code=41, message=Failed to change disk image]]'

2017-08-12 10:39:27,546+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [disks_syncAction_764a9c1c-f5bf-4ca5] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command VmReplicateDiskFinishVDS failed: Drive replication error

2017-08-12 10:39:27,546+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskFinishVDSCommand] (DefaultQuartzScheduler1) [disks_syncAction_764a9c1c-f5bf-4ca5] Command 'VmReplicateDiskFinishVDSCommand(HostName = host_mixed_2, VmReplicateDiskParameters:{hostId='824bbda7-cfbb-400b-a021-068dc060832f', vmId='f13ca114-d739-43b7-93ca-a97c4a951bc5', storagePoolId='19791d8f-814e-4241-a028-e103c310060a', srcStorageDomainId='e8045448-f2e3-421b-b92e-34f79f2a687d', targetStorageDomainId='e8045448-f2e3-421b-b92e-34f79f2a687d', imageGroupId='6d61ec9f-5a0a-4840-af54-04dd27762ad8', imageId='5a1fcef4-81ff-42e3-83be-33e013245673'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskFinishVDS, error = Drive replication error, code = 55

--- Additional comment from Lilach Zitnitski on 2017-08-15 09:00 EDT ---

engine and vdsm logs

--- Additional comment from Allon Mureinik on 2017-08-17 08:17:53 EDT ---

From VDSM's log:

2017-08-12 10:39:26,109+0300 ERROR (jsonrpc/4) [virt.vm] (vmId='f13ca114-d739-43b7-93ca-a97c4a951bc5') Unable to stop the replication for the drive: vda (vm:4295)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4292, in diskReplicateFinish
    self._dom.blockJobAbort(drive.name, blockJobFlags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 125, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 586, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 743, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)

Daniel/Nir, can you take a look?

--- Additional comment from Daniel Erez on 2017-08-17 08:31:46 EDT ---

@Lilach - can you please attach qemu and libvirt logs as well?

--- Additional comment from Lilach Zitnitski on 2017-08-20 07:08 EDT ---

Since the first logs were from the automation tests and I couldn't get the libvirt and qemu logs, I reproduced this bug. The new zip also contains new engine and vdsm logs along with the libvirt and qemu logs.

correlationID - disks_syncAction_81a244b3-0f7c-4eeb
error in new engine log at 17:39:15

--- Additional comment from Daniel Erez on 2017-08-20 14:19:39 EDT ---

According to the log [1], there's a KeyError for the missing 'protocol' key during the replication start phase. The cause seems to be 'diskType': 'network', i.e. the disk has a path to a file domain but is marked as a network disk.

@Sahina - Could it be related to https://gerrit.ovirt.org/#/c/79745/

@Lilach - Do you use GlusterFS for any of the tests?

[1]
2017-08-17 17:39:15,611+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='d3f3c5c5-a337-45f6-b30c-26e2fc739951') Unable to start replication for vda to {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeInfo': {'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volType': 'path'}, 'format': 'cow', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'propagateErrors': 'off', 'diskType': 'network', 'cache': 'none', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867', 'volumeChain': [{'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 116391936, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 114294784, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/b8bd4c59-ac36-4847-82ce-d09fc3552146', 'volumeID': 'b8bd4c59-ac36-4847-82ce-d09fc3552146', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 109051904, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/d5508161-5232-4453-b81d-4a65f1240d38', 'volumeID': 'd5508161-5232-4453-b81d-4a65f1240d38', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}]} (vm:4166)
KeyError: 'protocol'

--- Additional comment from Lilach Zitnitski on 2017-08-21 02:44:05 EDT ---

Daniel, the first logs are not from tests using GlusterFS; the new logs are. I will try to re-run the tests on iSCSI and NFS only.

--- Additional comment from Allon Mureinik on 2017-08-21 06:52:52 EDT ---

(In reply to Daniel Erez from comment #5)
> According to the log[1], there's a KeyError of missing 'protocol' key during
> replication start phase. Seems the cause is 'diskType': 'network', i.e. the
> disk has a path to a file domain, but is marked as a network disk.
> 
> @Sahina - Could it be related to https://gerrit.ovirt.org/#/c/79745/
Denis, wasn't there a caveat about LSM and libgfapi?

--- Additional comment from Denis Chaplygin on 2017-08-21 06:58:11 EDT ---

There was, but at a lower level: libvirt still doesn't support block copy to network disks. This one seems to be a different issue.

--- Additional comment from Daniel Erez on 2017-08-21 07:31:05 EDT ---

(In reply to Denis Chaplygin from comment #8)
> There was, but on lower level - libvirt still doesn't support block copy to
> the network disks. This one seems to be different issue

@Denis - so why isn't it disabled in the engine for now? Currently, passing diskType as network for an image domain throws a KeyError in vdsm (as explained in comment 5). I.e., I'm not sure I'm following the change introduced in https://gerrit.ovirt.org/#/c/79745/4/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/storage/lsm/LiveMigrateDiskCommand.java

--- Additional comment from Denis Chaplygin on 2017-08-21 08:49:32 EDT ---

I'm afraid that issue was introduced later and is described in [1].

Anyway, libgfapi is disabled by default everywhere except in 4.2.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165

--- Additional comment from Daniel Erez on 2017-08-21 09:00:49 EDT ---

(In reply to Denis Chaplygin from comment #10)
> I'm afraid hat issue was introduced later and described in [1]
> 
> Anyway, libgfapi is by default disabled everywhere, except 4.2 
> 
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165

Thanks Denis!

@Francesco - did you test the storage live migration flow as part of bug 1481165?

--- Additional comment from Francesco Romani on 2017-08-21 09:27:23 EDT ---

(In reply to Daniel Erez from comment #11)
> (In reply to Denis Chaplygin from comment #10)
> > I'm afraid hat issue was introduced later and described in [1]
> > 
> > Anyway, libgfapi is by default disabled everywhere, except 4.2 
> > 
> > 
> > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165
> 
> Thanks Denis!
> 
> @Francesco - did you test storage live migration flow as part of bug 1481165
> ?

I am testing the LSM flow in general. We have one more fix in this area: https://gerrit.ovirt.org/#/c/80831/5

--- Additional comment from Francesco Romani on 2017-08-21 13:09:34 EDT ---

(In reply to Francesco Romani from comment #12)
> (In reply to Daniel Erez from comment #11)
> > (In reply to Denis Chaplygin from comment #10)
> > > I'm afraid hat issue was introduced later and described in [1]
> > > 
> > > Anyway, libgfapi is by default disabled everywhere, except 4.2 
> > > 
> > > 
> > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165
> > 
> > Thanks Denis!
> > 
> > @Francesco - did you test storage live migration flow as part of bug 1481165
> > ?
> 
> I am testing the LSM flow in general. We have one more fix in this area:
> https://gerrit.ovirt.org/#/c/80831/5

To be clear, I didn't test LSM on glusterfs, only VM start/stop (IIUC it isn't supported and https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c7 seems to confirm that)

Please note that, from what I've seen in the logs here, the fix I'm working on [1] should not affect this BZ.

[1] 80831 is about recovery; if Vdsm doesn't crash, there is no issue.

--- Additional comment from Daniel Erez on 2017-08-21 13:44:56 EDT ---

(In reply to Francesco Romani from comment #13)
> (In reply to Francesco Romani from comment #12)
> > (In reply to Daniel Erez from comment #11)
> > > (In reply to Denis Chaplygin from comment #10)
> > > > I'm afraid hat issue was introduced later and described in [1]
> > > > 
> > > > Anyway, libgfapi is by default disabled everywhere, except 4.2 
> > > > 
> > > > 
> > > > [1] https://bugzilla.redhat.com/show_bug.cgi?id=1481165
> > > 
> > > Thanks Denis!
> > > 
> > > @Francesco - did you test storage live migration flow as part of bug 1481165
> > > ?
> > 
> > I am testing the LSM flow in general. We have one more fix in this area:
> > https://gerrit.ovirt.org/#/c/80831/5
> 
> To be clear, I didn't test LSM on glusterfs, only VM start/stop (IIUC it
> isn't supported and https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c7
> seems to confirm that)
> 
> Please note that from what I've seen here from the logs, the fix I'm working
> on [1] should not affect this bz.
> 
> [1] 80831 is about recovery, if Vdsm doesn't crash there is no issue.

@Lilach - so we have two separate issues:
* LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.
* LSM on nfs - can you please attach full logs of the tests?

--- Additional comment from Nir Soffer on 2017-08-21 14:02:59 EDT ---

(In reply to Daniel Erez from comment #14)
> @Lilach - so we have two separate issues:
> * LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.

LSM on glusterfs is supported.

> * LSM on nfs - can you please attach full logs of the tests?

--- Additional comment from Daniel Erez on 2017-08-21 14:28:54 EDT ---

(In reply to Nir Soffer from comment #15)
> (In reply to Daniel Erez from comment #14)
> > @Lilach - so we have two separate issues:
> > * LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.
> 
> LSM on glusterfs is supported.

@Nir/Denis - I'm not following then:
1. According to https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c8, I understand that we don't support LSM yet?
2. If we do support it, why do we get a KeyError in start replication, and why do we set diskType to network?

[1]
2017-08-17 17:39:15,611+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='d3f3c5c5-a337-45f6-b30c-26e2fc739951') Unable to start replication for vda to {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeInfo': {'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volType': 'path'}, 'format': 'cow', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'propagateErrors': 'off', 'diskType': 'network', 'cache': 'none', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867', 'volumeChain': [{'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 116391936, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 114294784, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/b8bd4c59-ac36-4847-82ce-d09fc3552146', 'volumeID': 'b8bd4c59-ac36-4847-82ce-d09fc3552146', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, {'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'leaseOffset': 109051904, 'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/images/81738b6d-1693-46e6-b470-df38353ac867/d5508161-5232-4453-b81d-4a65f1240d38', 'volumeID': 'd5508161-5232-4453-b81d-4a65f1240d38', 'leasePath': '/dev/00d54fa8-ac31-4190-addb-0085335e7272/leases', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}]} (vm:4166)
KeyError: 'protocol'


> 
> > * LSM on nfs - can you please attach full logs of the tests?

--- Additional comment from Nir Soffer on 2017-08-21 14:42:47 EDT ---

Regarding comment 7 - the error is:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4149, in diskReplicateStart
    replica['path'] = self.cif.prepareVolumePath(replica)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 345, in prepareVolumePath
    volPath = self._prepare_network_drive(drive, res)
  File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 632, in _prepare_network_drive
    drive['protocol'] = volinfo['protocol']
KeyError: 'protocol'

_prepare_network_drive() was called with a non-network drive; the error
is expected when sending invalid data to vdsm.

The bad request was:

# local_vdsm.log
2017-08-17 17:39:11,428+0300 INFO  (jsonrpc/1) [vdsm.api] START diskReplicateStart(srcDisk={'device': 'disk', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'domainID': '25893777-25bb-4085-ac34-f0e88d4aa10f', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, dstDisk={'domainID': '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeID': '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'diskType': 'network', 'poolID': '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}) from=::ffff:10.35.69.27,55492, flow_id=disks_syncAction_81a244b3-0f7c-4eeb (api:46)

The destination storage domain is a block storage domain based on:

2017-08-17 17:04:56,686+0300 INFO  (monitor/00d54fa) [storage.LVM] Refreshing lvs: vg=00d54fa8-ac31-4190-addb-0085335e7272 lvs=['metadata', 'leases', 'ids', 'inbox', 'outbox', 'xleases', 'master'] (lvm:1294)

The root cause is the engine, which sends a network disk for a block-based disk.

A network disk should never be sent as the LSM destination disk, even for a gluster
disk, not in 4.1 and not in 4.2, since libvirt does not yet support LSM to a
network disk.

Denis, can you check this?
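
To make the failure mode concrete, here is a minimal sketch (hypothetical code, not vdsm's actual implementation) of the dispatch seen in the traceback above: when the engine marks the destination replica as diskType 'network' but the prepared volume info is path-based, as it is for a block storage domain, the 'protocol' lookup has nothing to read:

# Hypothetical simplification; names and structure are illustrative only.
def prepare_volume_path(drive, volinfo):
    if drive.get('diskType') == 'network':
        # A real network (libgfapi) volume carries a 'protocol' key;
        # a path-based volume prepared from a block domain does not.
        drive['protocol'] = volinfo['protocol']  # raises KeyError: 'protocol'
    return volinfo['path']

# Data shaped like the failing request: block-domain volume info,
# but the engine sent diskType='network'.
volinfo = {'volType': 'path',
           'path': '/rhev/data-center/mnt/blockSD/00d54fa8-ac31-4190-addb-0085335e7272/'
                   'images/81738b6d-1693-46e6-b470-df38353ac867/0e8861ca-7a38-4bd0-8f04-b2344661b432'}
drive = {'device': 'disk', 'diskType': 'network'}

try:
    prepare_volume_path(drive, volinfo)
except KeyError as err:
    print('missing key:', err)  # missing key: 'protocol'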

--- Additional comment from Nir Soffer on 2017-08-21 14:47:03 EDT ---

(In reply to Daniel Erez from comment #16)
> (In reply to Nir Soffer from comment #15)
> > LSM on glusterfs is supported.
> 
> @Nir/Denis - I'm not following then:
> 1. According to https://bugzilla.redhat.com/show_bug.cgi?id=1481688#c8
> I understand that we don't support LSM yet?

We do support LSM to gluster, but not using a network disk. If the source disk is
a gluster disk using libgfapi (networkDisk=True, protocol="gluster"), LSM will
convert the disk to a file-based gluster disk, degrading performance - but
the operation will complete. Until libvirt is fixed to do blockCopy to a network
disk, we don't have a choice.

> 2. If we do support, why do we get a KeyError in start replication? and why
> do we set diskType to network?

This is an engine bug.
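
As an illustrative sketch only (assumed helper names, not actual engine or vdsm code), the constraint described above could be expressed as:

# Hypothetical guard capturing the rule above; the actual fix belongs on the
# engine side (LiveMigrateDiskCommand), not in vdsm.
def lsm_destination_disk_type(dst_domain_is_block):
    # Until libvirt supports blockCopy to a network disk, the replica
    # destination is always path-based -- even for a gluster domain, and even
    # when the source disk is accessed via libgfapi (it degrades to a
    # file-based gluster disk, as described above).
    return 'block' if dst_domain_is_block else 'file'

def validate_lsm_destination(dst_disk):
    # Reject what the engine sent in this bug: diskType 'network' for the replica.
    if dst_disk.get('diskType') == 'network':
        raise ValueError("LSM destination disk must not be diskType 'network'")

try:
    validate_lsm_destination({'device': 'disk', 'diskType': 'network'})
except ValueError as err:
    print(err)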

--- Additional comment from Lilach Zitnitski on 2017-08-22 02:39:06 EDT ---

(In reply to Daniel Erez from comment #14)
> @Lilach - so we have two separate issues:
> * LSM on glusterfs - IIUC, unsupported yet, so can be removed from tests.
> * LSM on nfs - can you please attach full logs of the tests?

For now I have failed to reproduce these errors with NFS/iSCSI storage domains, only with GlusterFS on my local environment.
The errors I saw on NFS/iSCSI were from the automation tests, so if I see them again on the same environment I'll attach new logs.

--- Additional comment from Daniel Erez on 2017-08-22 03:04:49 EDT ---

(In reply to Nir Soffer from comment #17)
> Regarding comment 7 - the error is:
> 
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 4149, in
> diskReplicateStart
>     replica['path'] = self.cif.prepareVolumePath(replica)
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 345, in
> prepareVolumePath
>     volPath = self._prepare_network_drive(drive, res)
>   File "/usr/lib/python2.7/site-packages/vdsm/clientIF.py", line 632, in
> _prepare_network_drive
>     drive['protocol'] = volinfo['protocol']
> KeyError: 'protocol'
> 
> _prepare_netwrok_drive() was called with a non network drive, the error
> is expected when sending invalid data to vdsm.
> 
> The bad request was:
> 
> # local_vdsm.log
> 2017-08-17 17:39:11,428+0300 INFO  (jsonrpc/1) [vdsm.api] START
> diskReplicateStart(srcDisk={'device': 'disk', 'poolID':
> '30312c9a-2bff-4ebe-9335-120b48e52d91', 'volumeID': '0e8861ca-7a38-4bd
> 0-8f04-b2344661b432', 'domainID': '25893777-25bb-4085-ac34-f0e88d4aa10f',
> 'imageID': '81738b6d-1693-46e6-b470-df38353ac867'}, dstDisk={'domainID':
> '00d54fa8-ac31-4190-addb-0085335e7272', 'volumeID':
> '0e8861ca-7a38-4bd0-8f04-b2344661b432', 'diskType': 'network', 'poolID':
> '30312c9a-2bff-4ebe-9335-120b48e52d91', 'device': 'disk', 'imageID':
> '81738b6d-1693-46e6-b470-df38353ac867'}
> ) from=::ffff:10.35.69.27,55492, flow_id=disks_syncAction_81a244b3-0f7c-4eeb
> (api:46)
> 
> The destination storage domain is a block storage domain based on:
> 
> 2017-08-17 17:04:56,686+0300 INFO  (monitor/00d54fa) [storage.LVM]
> Refreshing lvs: vg=00d54fa8-ac31-4190-addb-0085335e7272 lvs=['metadata',
> 'leases', 'ids', 'inbox', 'outbox', 'xleases', 'master'] (lvm:1294)
> 
> The root cause is engine, sending networkDisk for block based disk.
> 
> Network disk should never be sent to LSM destination disk, even for gluster
> disk, not in 4.1 and not in 4.2, since libvirt does not support yet LSM to 
> network disk.
> 
> Denis check this?

--- Additional comment from Daniel Erez on 2017-08-27 05:16:07 EDT ---

As per comment 17, diskType 'network' should not be sent for the LSM destination disk.
Moving to the Gluster oVirt team, as the issue affects only Gluster and was introduced by https://gerrit.ovirt.org/#/c/79745/

--- Additional comment from Sahina Bose on 2017-08-30 01:57:24 EDT ---

If this breaks live storage migration when libgfapi access is not enabled, I think we should target this for 4.1.6.
Denis?

Comment 1 RamaKasturi 2017-09-07 09:32:40 UTC
Hi,

  I have tested LSM with both the source and target domains being glusterfs with libgfapi enabled, and I see that it fails with the following errors in the engine and vdsm logs.

engine.log:
==================================

2017-08-31 01:50:00,390-04 INFO  [org.ovirt.engine.core.bll.tasks.CommandAsyncTask] (org.ovirt.thread.pool-6-thread-28) [b9830c2e-c16a-485f-b91b-45058698f0aa] CommandAsyncTask::HandleEndActionResult [within thread]: Removing CommandMultiAsyncTasks object for entity '3c1e7646-5a69-49cb-862f-2ee431fd2961'
2017-08-31 01:50:01,879-04 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterServersListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] FINISH, GlusterServersListVDSCommand, return: [10.70.36.78/23:CONNECTED, 10.70.36.76:CONNECTED, 10.70.36.77:CONNECTED], log id: 62ba0d5f
2017-08-31 01:50:01,891-04 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] START, GlusterVolumesListVDSCommand(HostName = yarrow.lab.eng.blr.redhat.com, GlusterVolumesListVDSParameters:{runAsync='true', hostId='11972b68-a720-4236-bde0-3c066ceff3b6'}), log id: 2ddaf087
2017-08-31 01:50:02,095-04 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler4) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateVmDisks' (id: '7a857239-50a8-412a-922d-09228065bcda') waiting on child command id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' type:'LiveMigrateDisk' to complete
2017-08-31 01:50:03,116-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] START, VmReplicateDiskStartVDSCommand(HostName = zod.lab.eng.blr.redhat.com, VmReplicateDiskParameters:{runAsync='true', hostId='6a90a343-10e4-4c5c-afc7-e5477594d2b6', vmId='b6da275e-649b-470d-ac32-89738faa3be5', storagePoolId='59a5429a-0236-034c-00d4-000000000250', srcStorageDomainId='d58a1d51-da25-454e-9edd-9754e419d668', targetStorageDomainId='1e67ccf5-8331-4e69-a0f6-c5ed58381217', imageGroupId='908c0355-ceb8-41c8-a644-5a551040d914', imageId='209455eb-4b82-417e-97ea-d0bab2fce721'}), log id: 59c08196
2017-08-31 01:50:03,147-04 INFO  [org.ovirt.engine.core.vdsbroker.gluster.GlusterVolumesListVDSCommand] (DefaultQuartzScheduler8) [7097b76a] FINISH, GlusterVolumesListVDSCommand, return: {9eae9600-7185-4fbb-a42c-bbe9c6eb86ea=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@9669622b, 0763bd54-ea1c-4707-8abb-3d80d62df20f=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@7aaadd95, 0dccff81-2ffa-4876-b4c3-15930fab2c65=org.ovirt.engine.core.common.businessentities.gluster.GlusterVolumeEntity@844a93e1}, log id: 2ddaf087
2017-08-31 01:50:04,367-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed in 'VmReplicateDiskStartVDS' method
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM zod.lab.eng.blr.redhat.com command VmReplicateDiskStartVDS failed: Drive replication error
2017-08-31 01:50:04,377-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand' return value 'StatusOnlyReturn [status=Status [code=55, message=Drive replication error]]'
2017-08-31 01:50:04,377-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] HostName = zod.lab.eng.blr.redhat.com
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'VmReplicateDiskStartVDSCommand(HostName = zod.lab.eng.blr.redhat.com, VmReplicateDiskParameters:{runAsync='true', hostId='6a90a343-10e4-4c5c-afc7-e5477594d2b6', vmId='b6da275e-649b-470d-ac32-89738faa3be5', storagePoolId='59a5429a-0236-034c-00d4-000000000250', srcStorageDomainId='d58a1d51-da25-454e-9edd-9754e419d668', targetStorageDomainId='1e67ccf5-8331-4e69-a0f6-c5ed58381217', imageGroupId='908c0355-ceb8-41c8-a644-5a551040d914', imageId='209455eb-4b82-417e-97ea-d0bab2fce721'})' execution failed: VDSGenericException: VDSErrorException: Failed to VmReplicateDiskStartVDS, error = Drive replication error, code = 55
2017-08-31 01:50:04,377-04 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] FINISH, VmReplicateDiskStartVDSCommand, log id: 59c08196
2017-08-31 01:50:04,377-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed VmReplicateDiskStart (Disk '908c0355-ceb8-41c8-a644-5a551040d914' , VM 'b6da275e-649b-470d-ac32-89738faa3be5')
2017-08-31 01:50:04,378-04 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateDisk' id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' failed when attempting to perform the next operation, marking as FAILED '[3c1e7646-5a69-49cb-862f-2ee431fd2961]'
2017-08-31 01:50:04,378-04 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler10) [b9830c2e-c16a-485f-b91b-45058698f0aa] Command 'LiveMigrateDisk' id: '1b57116f-ba8f-40a9-b0d9-df4f4bfec93f' child commands '[3c1e7646-5a69-49cb-862f-2ee431fd2961]' executions were completed, status 'FAILED'
2017-08-31 01:50:05,400-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler7) [b9830c2e-c16a-485f-b91b-45058698f0aa] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand' with failure.
2017-08-31 01:50:05,400-04 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateDiskCommand] (DefaultQuartzScheduler7) [b9830c2e-c16a-485f-b91b-45058698f0aa] Failed durin

vdsm.log file:
==================================================
fce721', 'volumeChain': [{'domainID': u'1e67ccf5-8331-4e69-a0f6-c5ed58381217', 'leaseOffset': 0, 'path': u'data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721', 'volumeID': u'209455eb-4b82-417e-97ea-d0bab2fce721', 'leasePath': u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721.lease', 'imageID': u'908c0355-ceb8-41c8-a644-5a551040d914'}, {'domainID': u'1e67ccf5-8331-4e69-a0f6-c5ed58381217', 'leaseOffset': 0, 'path': u'data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/f2380cbb-1e16-452e-b597-2066c979cbc2', 'volumeID': u'f2380cbb-1e16-452e-b597-2066c979cbc2', 'leasePath': u'/rhev/data-center/mnt/glusterSD/10.70.36.78:_data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/f2380cbb-1e16-452e-b597-2066c979cbc2.lease', 'imageID': u'908c0355-ceb8-41c8-a644-5a551040d914'}]} (vm:3739)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3727, in diskReplicateStart
    if utils.isBlockDevice(replica['path']):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 96, in isBlockDevice
    return stat.S_ISBLK(os.stat(path).st_mode)
OSError: [Errno 2] No such file or directory: '/data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721'
2017-08-31 11:20:03,486+0530 INFO  (jsonrpc/2) [vdsm.api] FINISH diskReplicateStart return={'status': {'message': 'Drive replication error', 'code': 55}} from=::ffff:10.70.34.124,42020, flow_id=b9830c2e-c16a-485f-b91b-45058698f0aa (api:52)
2017-08-31 11:20:03,487+0530 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call VM.diskReplicateStart failed (error 55) in 0.37 seconds (__init__:539)
2017-08-31 11:20:03,710+0530 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:53150 (protocoldetector:72)
2017-08-31 11:20:03,719+0530 INFO  (Reactor thread) [ProtocolDetector.Detector] Detected protocol stomp from ::1:53150 (protocoldetector:127)

Tested migrating a disk from one storage domain to another using glusterfs with libgfapi enabled when the VM is offline, and I see that it works fine.
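
For reference, a small sketch of why the check in the traceback above fails (the is_block_device helper mirrors vdsm.utils.isBlockDevice from the traceback; the rest is a hypothetical illustration): with libgfapi the replica path is a gluster "volume/image" string rather than a local filesystem path, so os.stat() cannot find it:

import os
import stat

def is_block_device(path):
    # Same stat()-based check as vdsm.utils.isBlockDevice in the traceback above.
    return stat.S_ISBLK(os.stat(path).st_mode)

# gfapi-style path from the volumeChain above: "<gluster volume>/<image path>",
# which does not exist as a local file on the host.
gfapi_path = ('data/1e67ccf5-8331-4e69-a0f6-c5ed58381217/images/'
              '908c0355-ceb8-41c8-a644-5a551040d914/209455eb-4b82-417e-97ea-d0bab2fce721')

try:
    is_block_device(gfapi_path)
except OSError as err:
    print(err)  # [Errno 2] No such file or directory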

Comment 2 RamaKasturi 2017-09-07 09:32:59 UTC
Engine and vdsm logs can be found in the link below:
============================================================

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1484070/

Comment 3 Sahina Bose 2017-10-11 11:32:57 UTC
Sas, can you confirm whether live storage migration fails if gfapi access is not turned on? If so, we will need to target this for RHHI 1.1.
If not, we can move it to the next release.

Comment 4 RamaKasturi 2017-11-29 10:39:50 UTC
Hi Sahina,

     When gfapi access is not turned on, i.e. with FUSE access, live storage migration works as expected. Clearing the needinfo on sas.

Thanks
kasturi.

Comment 5 SATHEESARAN 2017-12-18 10:31:14 UTC
Sahina,

The dependent bug is CLOSED. We should also close this bug if the issue is resolved.

Comment 6 Sahina Bose 2017-12-18 13:50:33 UTC
Moving to ON_QA if the issue has been resolved.

Comment 9 SATHEESARAN 2018-07-20 19:12:50 UTC
Sahina,

RHHI 2.0 doesn't support the gfapi access mechanism.
The availability of this bug in the errata will make the user believe that a gfapi access related fix has been made.

Should we defer this bug from RHHI 2.0 and remove it from the errata?

Comment 10 Sahina Bose 2018-07-23 05:24:02 UTC
(In reply to SATHEESARAN from comment #9)
> Sahina,
> 
> RHHI 2.0 doesn't support gfapi access mechanism.
> Availability of this bug in errata will make the user believe that gfapi
> access mechanism related fix has been made.
> 
> Should we defer this bug from RHHI 2.0 and remove this bug from errata ?

Yes, we can, though the dependent bug in oVirt is closed.
I'll move it to 2.0.z when we qualify with gfapi access.

Comment 14 SATHEESARAN 2020-08-01 02:53:17 UTC
(In reply to Sahina Bose from comment #10)
> (In reply to SATHEESARAN from comment #9)
> > Sahina,
> > 
> > RHHI 2.0 doesn't support gfapi access mechanism.
> > Availability of this bug in errata will make the user believe that gfapi
> > access mechanism related fix has been made.
> > 
> > Should we defer this bug from RHHI 2.0 and remove this bug from errata ?
> 
> Yes, we can, though the dependent bug in oVirt is closed.
> I'll move to 2.0.z when we qualify with gfapi access.

There is no near-term goal for RHHI to make use of gfapi.
I'm closing this bug.

