Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1458548

Summary: [vdsm] Live storage migration fails on "libvirtError: Requested operation is not valid: domain is not transient" during diskReplicateStart
Product: [oVirt] vdsm Reporter: Elad <ebenahar>
Component: Core Assignee: Milan Zamazal <mzamazal>
Status: CLOSED CURRENTRELEASE QA Contact: Elad <ebenahar>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.20.0 CC: amureini, bugs, bzlotnik, eshenitz, fromani, michal.skrivanek, mzamazal, tnisan
Target Milestone: ovirt-4.2.0 Keywords: Automation, Regression
Target Release: --- Flags: rule-engine: ovirt-4.2+
rule-engine: blocker+
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-12-20 10:49:48 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1459113, 1459183    
Bug Blocks:    
Attachments:
Description Flags
logs from engine and vdsm none

Description Elad 2017-06-04 09:09:34 UTC
Created attachment 1284721 [details]
logs from engine and vdsm

Description of problem:
Live storage migration fails during diskReplicateStart on vdsm with the following error:

2017-06-03 02:44:07,578+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='47181715-61ef-4b08-b00d-a75085a6a4ca') Unable to start replication for vdc to {u'domainID': u'2e5c20f3-7797-4ad3-bc12-8c6b4be
2225a', 'volumeInfo': {'domainID': u'2e5c20f3-7797-4ad3-bc12-8c6b4be2225a', 'volType': 'path', 'leaseOffset': 119537664, 'path': u'/rhev/data-center/mnt/blockSD/2e5c20f3-7797-4ad3-bc12-8c6b
4be2225a/images/9cdedc5b-f04d-4e38-a9dd-6e6a66d76316/205c5196-e8f7-45d6-8734-93522232799e', 'volumeID': '205c5196-e8f7-45d6-8734-93522232799e', 'leasePath': '/dev/2e5c20f3-7797-4ad3-bc12-8c
6b4be2225a/leases', 'imageID': u'9cdedc5b-f04d-4e38-a9dd-6e6a66d76316'}, 'diskType': 'block', 'format': 'cow', 'cache': 'none', u'volumeID': u'205c5196-e8f7-45d6-8734-93522232799e', u'image
ID': u'9cdedc5b-f04d-4e38-a9dd-6e6a66d76316', u'poolID': u'587973cc-da01-4950-866e-2f03fe9d71e8', u'device': 'disk', 'path': u'/rhev/data-center/587973cc-da01-4950-866e-2f03fe9d71e8/2e5c20f
3-7797-4ad3-bc12-8c6b4be2225a/images/9cdedc5b-f04d-4e38-a9dd-6e6a66d76316/205c5196-e8f7-45d6-8734-93522232799e', 'propagateErrors': u'off', 'volumeChain': [{'domainID': u'2e5c20f3-7797-4ad3
-bc12-8c6b4be2225a', 'volType': 'path', 'leaseOffset': 119537664, 'path': u'/rhev/data-center/mnt/blockSD/2e5c20f3-7797-4ad3-bc12-8c6b4be2225a/images/9cdedc5b-f04d-4e38-a9dd-6e6a66d76316/20
5c5196-e8f7-45d6-8734-93522232799e', 'volumeID': '205c5196-e8f7-45d6-8734-93522232799e', 'leasePath': '/dev/2e5c20f3-7797-4ad3-bc12-8c6b4be2225a/leases', 'imageID': u'9cdedc5b-f04d-4e38-a9d
d-6e6a66d76316'}, {'domainID': u'2e5c20f3-7797-4ad3-bc12-8c6b4be2225a', 'volType': 'path', 'leaseOffset': 117440512, 'path': u'/rhev/data-center/mnt/blockSD/2e5c20f3-7797-4ad3-bc12-8c6b4be2
225a/images/9cdedc5b-f04d-4e38-a9dd-6e6a66d76316/834c71f6-88b0-4f37-9cdc-6c5d6b9595e5', 'volumeID': '834c71f6-88b0-4f37-9cdc-6c5d6b9595e5', 'leasePath': '/dev/2e5c20f3-7797-4ad3-bc12-8c6b4b
e2225a/leases', 'imageID': u'9cdedc5b-f04d-4e38-a9dd-6e6a66d76316'}, {'domainID': u'2e5c20f3-7797-4ad3-bc12-8c6b4be2225a', 'volType': 'path', 'leaseOffset': 116391936, 'path': u'/rhev/data-
center/mnt/blockSD/2e5c20f3-7797-4ad3-bc12-8c6b4be2225a/images/9cdedc5b-f04d-4e38-a9dd-6e6a66d76316/217b3b27-a708-4716-bf1f-0df43ce9c409', 'volumeID': '217b3b27-a708-4716-bf1f-0df43ce9c409'
, 'leasePath': '/dev/2e5c20f3-7797-4ad3-bc12-8c6b4be2225a/leases', 'imageID': u'9cdedc5b-f04d-4e38-a9dd-6e6a66d76316'}, {'domainID': u'2e5c20f3-7797-4ad3-bc12-8c6b4be2225a', 'volType': 'pat
h', 'leaseOffset': 115343360, 'path': u'/rhev/data-center/mnt/blockSD/2e5c20f3-7797-4ad3-bc12-8c6b4be2225a/images/9cdedc5b-f04d-4e38-a9dd-6e6a66d76316/cd64932d-87b1-4b55-a3b5-93498221c5df',
 'volumeID': 'cd64932d-87b1-4b55-a3b5-93498221c5df', 'leasePath': '/dev/2e5c20f3-7797-4ad3-bc12-8c6b4be2225a/leases', 'imageID': u'9cdedc5b-f04d-4e38-a9dd-6e6a66d76316'}, {'domainID': u'2e5
c20f3-7797-4ad3-bc12-8c6b4be2225a', 'volType': 'path', 'leaseOffset': 118489088, 'path': u'/rhev/data-center/mnt/blockSD/2e5c20f3-7797-4ad3-bc12-8c6b4be2225a/images/9cdedc5b-f04d-4e38-a9dd-
6e6a66d76316/6dd6ae67-39fe-466a-9960-75ce0146a299', 'volumeID': '6dd6ae67-39fe-466a-9960-75ce0146a299', 'leasePath': '/dev/2e5c20f3-7797-4ad3-bc12-8c6b4be2225a/leases', 'imageID': u'9cdedc5
b-f04d-4e38-a9dd-6e6a66d76316'}]} (vm:3985)
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/vm.py", line 3979, in diskReplicateStart
    self._startDriveReplication(drive)
  File "/usr/share/vdsm/virt/vm.py", line 4104, in _startDriveReplication
    self._dom.blockCopy(drive.name, destxml, flags=flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 86, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 751, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 684, in blockCopy
    if ret == -1: raise libvirtError ('virDomainBlockCopy() failed', dom=self)
libvirtError: Requested operation is not valid: domain is not transient


Version-Release number of selected component (if applicable):
vdsm-4.20.0-959.gitd0c51cd.el7.x86_64
libvirt-daemon-2.0.0-10.el7_3.9.x86_64
ovirt-engine-4.2.0-0.0.master.20170531203202.git1bf6667.el7.centos.noarch


How reproducible:
Discovered in automation test:
https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-6057 
Reproduced on iSCSI, NFS and GlusterFS


Steps to Reproduce:
From https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-6057 
1. Create a VM with 4 disks and OS installed 
2. Write file 1 to all disks
3. Create a snapshot of the VM with all disks
4. Write file 2 to all disks
5. Create a snapshot 2 of the VM with all disks
6. Write file 3 to all disks
7. Create snapshot 3 of the VM with all disks
8. Write file 4 to all disks
9. Perform Disk Migration of the disks from the current SD to another SD
10. On completion of the Disk Migration, delete snapshot 2

Actual results:
Live storage migration fails in vdsm with the error mentioned above

Expected results:
Live storage migration should succeed

Additional info:
logs from engine and vdsm
* Can't upload libvirtd.log since it's too big; please contact me for it
Engine.log:

_SNAPSHOT_FINISHED_SUCCESS(356), Snapshot 'Auto-generated for Live Storage Migration' deletion for VM 'vm_TestCase6057_REST_ISCSI_0302325155' has been completed.
2017-06-03 02:45:08,614+03 INFO  [org.ovirt.engine.core.bll.SerialChildCommandsExecutionCallback] (DefaultQuartzScheduler2) [disks_syncAction_ca5be808-3e18-4714] Command 'LiveMigrateVmDisks' id: '965ed87a-85ac-42fd-9ded-dc196f0b4bf0' child commands '[e854f725-ab6c-44b8-8211-e8d5219eceaa, 949a5625-1709-450c-a51b-8952b3f72786, 6cbd8533-79ae-4301-ac14-73a71947d2bd]' executions were completed, status 'FAILED'
2017-06-03 02:45:09,663+03 ERROR [org.ovirt.engine.core.bll.storage.lsm.LiveMigrateVmDisksCommand] (DefaultQuartzScheduler7) [disks_syncAction_ca5be808-3e18-4714] Ending command 'org.ovirt.engine.core.bll.storage.lsm.LiveMigrateVmDisksCommand' with failure.

Comment 1 Allon Mureinik 2017-06-04 14:42:21 UTC
Milan/Francesco - the very recent change 15808ddf9035287575a59d9aa879598a33a2b6bc changed our domains to be persistent, which I'm guessing is the root cause here.

Does this make any sense to you?

Comment 2 Francesco Romani 2017-06-05 06:57:32 UTC
Well, this is very surprising. It looks like we actually hit a QEMU-related restriction.
libvirt actively and explicitly prevents virDomainBlockCopy on persistent domains. Totally unexpected.

In src/qemu/qemu_driver.c we have:

commit c1eb38053d616d764c0c5381301b4cd5d2c45921
Author: Eric Blake <eblake>
Date:   Fri Oct 19 17:46:08 2012 -0600

    if (vm->persistent) {
        /* XXX if qemu ever lets us start a new domain with mirroring
         * already active, we can relax this; but for now, the risk of
         * 'managedsave' due to libvirt-guests means we can't risk
         * this on persistent domains.  */
        virReportError(VIR_ERR_OPERATION_INVALID, "%s",
                       _("domain is not transient"));
        goto cleanup;
    }

Now we need to see whether this is still relevant after 5 years.
We'll need at the very least a libvirt bug to depend on.
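The guard quoted from qemu_driver.c above can be mirrored in a small Python sketch. The names here (block_copy, OperationInvalid) are illustrative stand-ins, not libvirt's real API; the point is only to show why vdsm's switch to persistent domains makes every blockCopy call hit this branch:

```python
# Illustrative mirror of the libvirt guard quoted above.
# OperationInvalid stands in for libvirt's VIR_ERR_OPERATION_INVALID;
# none of these names are part of the real libvirt API.
class OperationInvalid(Exception):
    """Stand-in for libvirt's "operation invalid" error class."""

def block_copy(persistent):
    """Refuse a block copy on a persistent domain, as libvirt did
    at the time of this report (before the restriction was lifted)."""
    if persistent:
        # Same message that vdsm surfaces in the traceback above.
        raise OperationInvalid(
            "Requested operation is not valid: domain is not transient")
    return "mirror job started"
```

Since the change referenced in comment 1 made vdsm define its domains persistently, the `persistent` branch is now always taken, which is consistent with the failure being seen on iSCSI, NFS and GlusterFS alike.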

Comment 3 Allon Mureinik 2017-06-05 11:36:45 UTC
Assigning to Milan in the meantime while he works with the libvirt developers to see whether this limitation can/should be lifted.

Comment 4 Benny Zlotnik 2017-06-06 10:05:14 UTC
Note: this bug only occurs when the volume from which the LSM auto-generated snapshot is taken has COW format.

Comment 5 Milan Zamazal 2017-06-06 10:49:21 UTC
I got information about the problem from libvirt developers. libvirt could probably add a small feature to permit running block-copy operations also on persistent domains, with the same limitations as for transient domains, see https://bugzilla.redhat.com/1459113.

Comment 6 Ala Hino 2017-06-25 12:44:18 UTC
*** Bug 1461468 has been marked as a duplicate of this bug. ***

Comment 7 Elad 2017-07-19 14:56:22 UTC
Tested the scenario described in the bug description (https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/workitem?id=RHEVM3-6057)

It passed:


2017-07-19 17:28:11,004 - MainThread - art.ll_lib.jobs - INFO - JOB 'Migrating Disk disk_virtiocow_1916450315 from iscsi_1 to iscsi_2' TOOK 79.777 seconds
2017-07-19 17:28:11,004 - MainThread - art.ll_lib.jobs - INFO - All jobs are gone

2017-07-19 17:41:16,309 - MainThread - art.logging - INFO - Status: passed


Tested using:
ovirt-engine-4.2.0-0.0.master.20170717104433.gita1ba045.el7.centos.noarch
vdsm-4.20.1-202.git9f953f3.el7.centos.x86_64
libvirt-daemon-3.2.0-14.el7.x86_64

Comment 8 Sandro Bonazzola 2017-12-20 10:49:48 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.