Bug 906246 - Live mirroring of multiple disks simultaneously fails in RHEVM
Summary: Live mirroring of multiple disks simultaneously fails in RHEVM
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.2.0
Assignee: Daniel Erez
QA Contact: Dafna Ron
URL:
Whiteboard: storage
Depends On: 913253
Blocks:
 
Reported: 2013-01-31 09:57 UTC by Sibiao Luo
Modified: 2016-02-10 17:25 UTC
CC List: 20 users

Fixed In Version: SF9
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-06-11 08:19:12 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
screenshot for win2k8r2. (33.40 KB, image/png)
2013-01-31 10:03 UTC, Sibiao Luo
screenshot for rhel6.4-32. (28.50 KB, image/png)
2013-01-31 10:10 UTC, Sibiao Luo
vdsm.log.txt (48.05 MB, text/plain)
2013-02-01 02:45 UTC, Sibiao Luo
engine.log.txt (50.01 MB, text/plain)
2013-02-01 02:46 UTC, Sibiao Luo

Description Sibiao Luo 2013-01-31 09:57:04 UTC
Description of problem:
Live mirroring of multiple disks simultaneously fails in a RHEVM 3.2 environment: the disks stay in locked status and the operation does not finish even after waiting a full day.
Note that this issue cannot be reproduced by manual testing on the qemu-kvm command line: there I can live-mirror multiple disks simultaneously with the speed initially set to 10M, and after the jobs reach steady state I can reopen to the targets successfully.
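
For reference, the manual qemu-kvm test was driven through the QMP monitor roughly as in the following sketch. This is only a minimal sketch: the socket path, device names and target image paths are placeholders rather than the ones used in the actual test, and mode "existing" assumes the target qcow2 files were created beforehand (the arguments mirror the single-disk example further down).

import json
import socket

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("/tmp/qmp-sock")   # placeholder; qemu started with -qmp unix:/tmp/qmp-sock,server,nowait
reader = sock.makefile()        # one buffered reader for all replies

def qmp(execute, arguments=None):
    # Send one QMP command and return the next reply line (asynchronous
    # QMP events are not filtered out in this sketch).
    cmd = {"execute": execute}
    if arguments is not None:
        cmd["arguments"] = arguments
    sock.sendall((json.dumps(cmd) + "\r\n").encode())
    return json.loads(reader.readline())

reader.readline()            # discard the QMP greeting banner
qmp("qmp_capabilities")      # leave capabilities-negotiation mode

# Start a mirror job on both disks at 10 MB/s (speed is in bytes/second).
for dev, target in [("drive-virtio-disk0", "/tmp/mirror-disk0.qcow2"),
                    ("drive-virtio-disk1", "/tmp/mirror-disk1.qcow2")]:
    print(qmp("__com.redhat_drive-mirror",
              {"device": dev, "target": target,
               "speed": 10 * 1024 * 1024,
               "full": False, "mode": "existing", "format": "qcow2"}))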

Version-Release number of selected component (if applicable):
host info:
2.6.32-355.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.352.el6.x86_64
vdsm-4.10.2-4.0.el6ev.x86_64
rhevm-3.2.0-5.el6ev.noarch 
libvirt-0.10.2-16.el6.x86_64
guest info:
rhel6.4-32bit: kernel-2.6.32-355.el6.i686
windows: windows-2k8R2

How reproducible:
100%

Steps to Reproduce:
- Scenario 1:
1. Prepare a qcow2-format rhel6.4-32bit guest with one disk in RHEVM.
2. Live-mirror the system disk and the data disk simultaneously by moving the running VM and its disks from one storage domain to another in RHEVM.

- Scenario 2:
1. Prepare a qcow2-format windows-2k8R2 guest with two disks in RHEVM.
2. Live-mirror just the two data disks simultaneously from one storage domain to another in RHEVM.

Actual results:
The mirrored disks stayed in locked status, I could not see any mirroring-related commands on the monitor, and the mirroring failed to finish even after waiting a whole day.
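
One way to confirm whether the mirror jobs were ever started on the host is to ask qemu directly for its active block jobs; a minimal sketch, again with a placeholder QMP socket path (asynchronous QMP events are not filtered out). On a RHEV host the monitor is owned by libvirt, so in practice one would check with something like "virsh blockjob <domain> <disk> --info" rather than opening a raw QMP socket.

import json
import socket

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("/tmp/qmp-sock")   # placeholder QMP socket path
reader = sock.makefile()
reader.readline()               # QMP greeting
sock.sendall(b'{"execute":"qmp_capabilities"}\r\n')
reader.readline()

# List the block jobs currently active in qemu. An empty list while the
# disks are shown as locked in the UI means the mirror was never started.
sock.sendall(b'{"execute":"query-block-jobs"}\r\n')
for job in json.loads(reader.readline()).get("return", []):
    print("%(device)s: %(type)s %(offset)d/%(len)d bytes" % job)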

Expected results:
Live mirroring of multiple disks simultaneously completes successfully in RHEVM.

Additional info:
If I live-mirror only one disk at a time in RHEVM, it completes successfully and I can see the mirroring-related commands on the monitor, for example:
{"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":"existing","format":"qcow2"},"id":"libvirt-4338"}
{"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-4484"}
{"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk0","new-image-file":"/rhev/data-center/xxxx","format":"qcow2"},"id":"libvirt-4485"}

Comment 1 Sibiao Luo 2013-01-31 10:03:22 UTC
Created attachment 690810 [details]
screenshot for win2k8r2.

Comment 2 Sibiao Luo 2013-01-31 10:10:30 UTC
Created attachment 690813 [details]
screenshot for rhel6.4-32.

Comment 3 Sibiao Luo 2013-01-31 10:29:59 UTC
(In reply to comment #0)
> {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-
> disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":
> "existing","format":"qcow2"},"id":"libvirt-4338"}
> {"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},
> "id":"libvirt-4484"}
for the "block-job-complete" commands, and i check the HMP and QMP monitor commands that did not find it. I am very curious that qemu cann't provide it but rhevm tools need this command, how they works well ? Does it the RHEVM bug ? does we need to open a new bug for it ?
Btw, the "block-job-complete" only existing in rhel7 qemu, the rhel6.4 did not have it.
> {"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-
> disk0","new-image-file":"/rhev/data-center/xxxx","format":"qcow2"},"id":
> "libvirt-4485"}

Comment 4 Daniel Erez 2013-01-31 20:26:46 UTC
Sibiao, can you please attach engine.log and vdsm.log

Comment 5 Sibiao Luo 2013-02-01 02:45:23 UTC
Created attachment 691325 [details]
vdsm.log.txt

Comment 6 Sibiao Luo 2013-02-01 02:46:16 UTC
Created attachment 691336 [details]
engine.log.txt

Comment 7 Sibiao Luo 2013-02-01 03:00:09 UTC
(In reply to comment #3)
> (In reply to comment #0)
> > {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-
> > disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":
> > "existing","format":"qcow2"},"id":"libvirt-4338"}
> > {"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},
> > "id":"libvirt-4484"}
> for the "block-job-complete" commands, and i check the HMP and QMP monitor
> commands that did not find it. I am very curious that qemu cann't provide it
> but rhevm tools need this command, how they works well ? Does it the RHEVM
> bug ? does we need to open a new bug for it ?
> Btw, the "block-job-complete" only existing in rhel7 qemu, the rhel6.4 did
> not have it.
I chatted with kwolf about this problem on IRC, and he thought it may be a RHEVM bug, so I will split this issue out into a new bug. Please correct me if I am mistaken.

Best Regards.
sluo

Comment 8 Daniel Erez 2013-02-13 16:55:12 UTC
Verification depends on bug 906620.
Backend side is fixed by change-id I899e55c995a96f68023e2ad7b31daac57d1e8dbb

Comment 9 Federico Simoncelli 2013-02-18 20:29:30 UTC
(In reply to comment #8)
> Verification depends on bug 906620.
> Backend side is fixed by change-id I899e55c995a96f68023e2ad7b31daac57d1e8dbb

From the logs it looks to me that the engine is not able to communicate with the HSM where the VM is running (Connection refused):

2013-01-30 18:16:38,502 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-3-thread-45) [4127afd8] Running command: LiveMigrateDiskCommandTask handler: VmReplicateDiskStartTaskHandler internal: false. Entities affected :  ID: 4ef3f12f-f6f2-4573-9022-41d5940cb02f Type: Disk,  ID: 4dae5421-9c9b-499e-a91a-9da8f6830c8c Type: Storage
2013-01-30 18:16:38,504 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) START, SnapshotVDSCommand(HostName = dhcp-4-121, HostId = 90a18316-234c-41c8-a20c-76369b4cb49f, vmId=6158948a-c1ee-4a29-ab31-2d24f4575d75), log id: 48a6a6dc
2013-01-30 18:16:38,504 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (pool-3-thread-45) [4127afd8] START, VmReplicateDiskStartVDSCommand(HostName = dchp-6-222, HostId = 4c43c53e-aa12-49d9-9691-7bd2704f61c0, vmId=4aec63be-5616-4d21-b29d-8747434f8992), log id: 6c275a0a
2013-01-30 18:16:38,506 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) FINISH, SnapshotVDSCommand, log id: 48a6a6dc
2013-01-30 18:16:38,506 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-47) transaction rolled back
2013-01-30 18:16:38,508 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-47) transaction rolled back
2013-01-30 18:16:38,523 ERROR [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-3-thread-47) EntityAsyncTask::EndCommandAction [within thread]: EndAction for action type LiveMigrateVmDisks threw an exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused

This happens 4 times, which is why you don't see any __com.redhat_drive-mirror/block-job-complete on the host. I don't think this is related to VDSM at all; bug 906620 can probably be closed. At the time of this writing I don't see any reason to block the verification, so I'm removing the dependency.
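
A quick way to check the "Connection refused" theory from the engine side is to probe the vdsm port on the host running the VM; a minimal sketch, assuming vdsm's default port 54321 and using the host name that appears in the log above:

import socket

host, port = "dchp-6-222", 54321    # host name from the log above; 54321 is vdsm's default port
try:
    socket.create_connection((host, port), timeout=5).close()
    print("vdsm port reachable on %s" % host)
except socket.error as err:
    # A "Connection refused" here matches the VDSNetworkException in engine.log:
    # vdsmd is not listening (stopped or crashed) or the port is being rejected.
    print("cannot reach vdsm on %s:%d: %s" % (host, port, err))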

I'm not sure the solution proposed in change-id I899e55c995a96f68023e2ad7b31daac57d1e8dbb exactly addresses this issue. http://gerrit.ovirt.org/#/c/11311/
To me it looks unsafe to move ImageStatus.LOCKED from LiveSnapshotTaskHandler to CreateImagePlaceholderTaskHandler. In fact, the first step that should succeed (and that requires the lock) is the live snapshot. I would have expected a fix that unlocks the image if the live snapshot fails, rather than one that removes the locking.
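
In other words, something along the lines of the following sketch. This is illustrative Python pseudocode, not the engine's actual Java task handlers; every name in it is made up for the example.

# Lock the image before the live snapshot and release the lock if any step
# fails, instead of moving the locking to a later task handler.
class Image(object):
    def __init__(self, name):
        self.name, self.status = name, "OK"
    def set_status(self, status):
        self.status = status
        print("%s -> %s" % (self.name, status))

def live_migrate_disk(image, live_snapshot, copy_data):
    image.set_status("LOCKED")      # lock before the first risky step
    try:
        live_snapshot()             # first step that actually needs the lock
        copy_data()                 # placeholder for the remaining steps
    except Exception:
        image.set_status("OK")      # unlock on failure so the disk is not
        raise                       # left LOCKED forever, as seen in this bug
    image.set_status("OK")          # unlock after a successful migration

# Example: a failing live snapshot leaves the image unlocked.
img = Image("disk0")
try:
    live_migrate_disk(img, live_snapshot=lambda: 1 / 0, copy_data=lambda: None)
except ZeroDivisionError:
    pass
print("final status: %s" % img.status)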

Anyway, I'm not an expert on this part, so I'll leave it to you to decide whether to proceed with the current fix.

Comment 10 Dafna Ron 2013-03-18 20:02:28 UTC
Verified on SF10.

Comment 11 Itamar Heim 2013-06-11 08:19:12 UTC
3.2 has been released

Comment 12 Itamar Heim 2013-06-11 08:23:13 UTC
3.2 has been released

