Bug 906246

Summary: live mirroring of multiple disks simultaneously fails in RHEVM
Product: Red Hat Enterprise Virtualization Manager Reporter: Sibiao Luo <sluo>
Component: ovirt-engine Assignee: Daniel Erez <derez>
Status: CLOSED CURRENTRELEASE QA Contact: Dafna Ron <dron>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.2.0 CC: abaron, acathrow, amureini, bazulay, chayang, dyasny, flang, fsimonce, hateya, iheim, juzhang, lpeer, michen, mtosatti, qzhang, Rhev-m-bugs, scohen, sluo, yeylon, ykaul
Target Milestone: ---   
Target Release: 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: SF9 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-06-11 08:19:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 913253    
Bug Blocks:    
Attachments:
Description Flags
screenshot for win2k8r2.
none
screenshot for rhel6.4-32.
none
vdsm.log.txt
none
engine.log.txt
none

Description Sibiao Luo 2013-01-31 09:57:04 UTC
Description of problem:
Live mirroring of multiple disks simultaneously fails in a RHEVM 3.2 environment; the disks stay in locked status and the operation does not finish even after waiting for a whole day.
BTW, this issue cannot be reproduced by manual testing on the qemu-kvm command line: I can live-mirror multiple disks simultaneously via the qemu-kvm monitor with the speed initially set to 10M, and after the jobs reach steady state the disks can be reopened on their targets successfully.
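
For reference, a minimal sketch of the QMP sequence used in the manual command-line test (device names and target image paths are placeholders, and the 10M limit is written out as 10485760 bytes/second; the commands match the __com.redhat_drive-mirror/__com.redhat_drive-reopen calls shown in the additional info below):
{"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk0","target":"/path/to/target0.qcow2","speed":10485760,"full":false,"mode":"existing","format":"qcow2"},"id":"mirror-0"}
{"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk1","target":"/path/to/target1.qcow2","speed":10485760,"full":false,"mode":"existing","format":"qcow2"},"id":"mirror-1"}
(poll with {"execute":"query-block-jobs","id":"poll-0"} until both jobs report offset equal to len, i.e. steady state)
{"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk0","new-image-file":"/path/to/target0.qcow2","format":"qcow2"},"id":"reopen-0"}
{"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk1","new-image-file":"/path/to/target1.qcow2","format":"qcow2"},"id":"reopen-1"}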

Version-Release number of selected component (if applicable):
host info:
2.6.32-355.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.352.el6.x86_64
vdsm-4.10.2-4.0.el6ev.x86_64
rhevm-3.2.0-5.el6ev.noarch 
libvirt-0.10.2-16.el6.x86_64
guest info:
rhel6.4-32bit: kernel-2.6.32-355.el6.i686
windows: windows-2k8R2

How reproducible:
100%

Steps to Reproduce:
-scenario 1:
1. Prepare a qcow2-format rhel6.4-32bit guest with one data disk in RHEVM.
2. Live-mirror the system disk and the data disk simultaneously by moving the running VM and its disks from one storage domain to another in RHEVM.

-scenario 2:
1. Prepare a qcow2-format windows-2k8R2 guest with two data disks in RHEVM.
2. Live-mirror just the two data disks simultaneously from one storage domain to another in RHEVM.

Actual results:
The mirrored disks stay in locked status, none of the mirroring-related commands can be seen on the monitor, and the mirroring fails to finish even after waiting a whole day.

Expected results:
Live mirroring of multiple disks simultaneously completes successfully in RHEVM.

Additional info:
If live mirroring is done on just one disk at a time in RHEVM, it completes successfully and the mirroring-related commands can be seen on the monitor, e.g.:
{"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":"existing","format":"qcow2"},"id":"libvirt-4338"}
{"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-4484"}
{"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk0","new-image-file":"/rhev/data-center/xxxx","format":"qcow2"},"id":"libvirt-4485"}

Comment 1 Sibiao Luo 2013-01-31 10:03:22 UTC
Created attachment 690810 [details]
screenshot for win2k8r2.

Comment 2 Sibiao Luo 2013-01-31 10:10:30 UTC
Created attachment 690813 [details]
screenshot for rhel6.4-32.

Comment 3 Sibiao Luo 2013-01-31 10:29:59 UTC
(In reply to comment #0)
> {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-
> disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":
> "existing","format":"qcow2"},"id":"libvirt-4338"}
> {"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},
> "id":"libvirt-4484"}
Regarding the "block-job-complete" command: I checked the HMP and QMP monitor commands and could not find it. I am very curious how this can work when qemu does not provide the command but the RHEVM tools need it. Is this a RHEVM bug? Do we need to open a new bug for it?
Btw, "block-job-complete" only exists in the RHEL 7 qemu; the RHEL 6.4 qemu does not have it.
> {"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-
> disk0","new-image-file":"/rhev/data-center/xxxx","format":"qcow2"},"id":
> "libvirt-4485"}
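
A quick way to check which block-job commands the running qemu-kvm binary actually provides, assuming access to its QMP monitor socket, is the standard query-commands call:
{"execute":"query-commands","id":"check-1"}
On the RHEL 6.4 qemu-kvm-rhev build the returned list would be expected to include __com.redhat_drive-mirror and __com.redhat_drive-reopen but not block-job-complete, which matches the observation above.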

Comment 4 Daniel Erez 2013-01-31 20:26:46 UTC
Sibiao, can you please attach engine.log and vdsm.log?

Comment 5 Sibiao Luo 2013-02-01 02:45:23 UTC
Created attachment 691325 [details]
vdsm.log.txt

Comment 6 Sibiao Luo 2013-02-01 02:46:16 UTC
Created attachment 691336 [details]
engine.log.txt

Comment 7 Sibiao Luo 2013-02-01 03:00:09 UTC
(In reply to comment #3)
> (In reply to comment #0)
> > {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-
> > disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":
> > "existing","format":"qcow2"},"id":"libvirt-4338"}
> > {"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},
> > "id":"libvirt-4484"}
> Regarding the "block-job-complete" command: I checked the HMP and QMP
> monitor commands and could not find it. I am very curious how this can work
> when qemu does not provide the command but the RHEVM tools need it. Is this
> a RHEVM bug? Do we need to open a new bug for it?
> Btw, "block-job-complete" only exists in the RHEL 7 qemu; the RHEL 6.4 qemu
> does not have it.
I chatted with kwolf about this problem on IRC, and he thought this may be a RHEVM bug, so I will split this issue out into a new bug. Please correct me if I am mistaken.

Best Regards.
sluo

Comment 8 Daniel Erez 2013-02-13 16:55:12 UTC
Verification depends on bug 906620.
Backend side is fixed by change-id I899e55c995a96f68023e2ad7b31daac57d1e8dbb

Comment 9 Federico Simoncelli 2013-02-18 20:29:30 UTC
(In reply to comment #8)
> Verification depends on bug 906620.
> Backend side is fixed by change-id I899e55c995a96f68023e2ad7b31daac57d1e8dbb

From the logs it looks to me that the engine is not able to communicate with the HSM where the VM is running (Connection refused):

2013-01-30 18:16:38,502 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-3-thread-45) [4127afd8] Running command: LiveMigrateDiskCommandTask handler: VmReplicateDiskStartTaskHandler internal: false. Entities affected :  ID: 4ef3f12f-f6f2-4573-9022-41d5940cb02f Type: Disk,  ID: 4dae5421-9c9b-499e-a91a-9da8f6830c8c Type: Storage
2013-01-30 18:16:38,504 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) START, SnapshotVDSCommand(HostName = dhcp-4-121, HostId = 90a18316-234c-41c8-a20c-76369b4cb49f, vmId=6158948a-c1ee-4a29-ab31-2d24f4575d75), log id: 48a6a6dc
2013-01-30 18:16:38,504 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (pool-3-thread-45) [4127afd8] START, VmReplicateDiskStartVDSCommand(HostName = dchp-6-222, HostId = 4c43c53e-aa12-49d9-9691-7bd2704f61c0, vmId=4aec63be-5616-4d21-b29d-8747434f8992), log id: 6c275a0a
2013-01-30 18:16:38,506 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) FINISH, SnapshotVDSCommand, log id: 48a6a6dc
2013-01-30 18:16:38,506 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-47) transaction rolled back
2013-01-30 18:16:38,508 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-47) transaction rolled back
2013-01-30 18:16:38,523 ERROR [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-3-thread-47) EntityAsyncTask::EndCommandAction [within thread]: EndAction for action type LiveMigrateVmDisks threw an exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused

And this happens 4 times.
That is why you don't see any __com.redhat_drive-mirror/block-job-complete on the host. I don't think this is related to VDSM at all, and bug 906620 can probably be closed. As of this writing I don't see any reason to block the verification, so I'm removing the dependency.

I'm not sure the solution proposed in I899e55c995a96f68023e2ad7b31daac57d1e8dbb exactly addresses this issue: http://gerrit.ovirt.org/#/c/11311/
To me it looks unsafe to move ImageStatus.LOCKED from LiveSnapshotTaskHandler to CreateImagePlaceholderTaskHandler. In fact the first step that should succeed (and that requires the lock) is the live snapshot. I would have expected a fix that unlocks the image if the live snapshot fails (rather than removing the locking).

Anyway, I'm not an expert on this part, so I'll leave it to you to decide whether to proceed with the current fix.

Comment 10 Dafna Ron 2013-03-18 20:02:28 UTC
Verified on SF10.

Comment 11 Itamar Heim 2013-06-11 08:19:12 UTC
3.2 has been released
