Bug 906246

Summary: live mirroring of multiple disks simultaneously fails in RHEVM
Product: Red Hat Enterprise Virtualization Manager Reporter: Sibiao Luo <sluo>
Component: ovirt-engine Assignee: Daniel Erez <derez>
Status: CLOSED CURRENTRELEASE QA Contact: Dafna Ron <dron>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.2.0 CC: abaron, acathrow, amureini, bazulay, chayang, dyasny, flang, fsimonce, hateya, iheim, juzhang, lpeer, michen, mtosatti, qzhang, Rhev-m-bugs, scohen, sluo, yeylon, ykaul
Target Milestone: ---   
Target Release: 3.2.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: storage
Fixed In Version: SF9 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-06-11 08:19:12 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 913253    
Bug Blocks:    
Attachments:
Description Flags
screenshot for win2k8r2.
none
screenshot for rhel6.4-32.
none
vdsm.log.txt
none
engine.log.txt
none

Description Sibiao Luo 2013-01-31 09:57:04 UTC
Description of problem:
Live mirroring of multiple disks simultaneously fails in a RHEVM 3.2 environment; the disks stay in locked status and the operation does not finish even after waiting for a whole day.
BTW, this issue cannot be reproduced by manual testing on the qemu-kvm command line: I can live-mirror multiple disks simultaneously via the qemu-kvm monitor with the speed initially set to 10M, and after the jobs reach steady state the disks can be reopened on their targets successfully.
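
For reference, a minimal sketch of the QMP sequence used in the manual command-line test (device names and target image paths are placeholders, and the 10M limit is written out as 10485760 bytes/second; the commands match the __com.redhat_drive-mirror/__com.redhat_drive-reopen calls shown in the additional info below):
{"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk0","target":"/path/to/target0.qcow2","speed":10485760,"full":false,"mode":"existing","format":"qcow2"},"id":"mirror-0"}
{"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk1","target":"/path/to/target1.qcow2","speed":10485760,"full":false,"mode":"existing","format":"qcow2"},"id":"mirror-1"}
(poll with {"execute":"query-block-jobs","id":"poll-0"} until both jobs report offset equal to len, i.e. steady state)
{"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk0","new-image-file":"/path/to/target0.qcow2","format":"qcow2"},"id":"reopen-0"}
{"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk1","new-image-file":"/path/to/target1.qcow2","format":"qcow2"},"id":"reopen-1"}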

Version-Release number of selected component (if applicable):
host info:
2.6.32-355.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.352.el6.x86_64
vdsm-4.10.2-4.0.el6ev.x86_64
rhevm-3.2.0-5.el6ev.noarch 
libvirt-0.10.2-16.el6.x86_64
guest info:
rhel6.4-32bit: kernel-2.6.32-355.el6.i686
windows: windows-2k8R2

How reproducible:
100%

Steps to Reproduce:
-scenario 1:
1. Prepare a qcow2-format rhel6.4-32bit guest with one data disk in RHEVM.
2. Live-mirror the system disk and the data disk simultaneously by moving the running VM and its disks from one storage domain to another in RHEVM.

-scenario 2:
1. Prepare a qcow2-format windows-2k8R2 guest with two data disks in RHEVM.
2. Live-mirror just the two data disks simultaneously from one storage domain to another in RHEVM.

Actual results:
The mirrored disks stay in locked status, none of the mirroring-related commands can be seen on the monitor, and the mirroring fails to finish even after waiting a whole day.

Expected results:
Live mirroring of multiple disks simultaneously completes successfully in RHEVM.

Additional info:
If live mirroring is done on just one disk at a time in RHEVM, it completes successfully and the mirroring-related commands can be seen on the monitor, e.g.:
{"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":"existing","format":"qcow2"},"id":"libvirt-4338"}
{"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-4484"}
{"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk0","new-image-file":"/rhev/data-center/xxxx","format":"qcow2"},"id":"libvirt-4485"}

Comment 1 Sibiao Luo 2013-01-31 10:03:22 UTC
Created attachment 690810 [details]
screenshot for win2k8r2.

Comment 2 Sibiao Luo 2013-01-31 10:10:30 UTC
Created attachment 690813 [details]
screenshot for rhel6.4-32.

Comment 3 Sibiao Luo 2013-01-31 10:29:59 UTC
(In reply to comment #0)
> {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-
> disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":
> "existing","format":"qcow2"},"id":"libvirt-4338"}
> {"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},
> "id":"libvirt-4484"}
Regarding the "block-job-complete" command: I checked the HMP and QMP monitor commands and could not find it. I am very curious how this can work when qemu does not provide the command but the RHEVM tools need it. Is this a RHEVM bug? Do we need to open a new bug for it?
Btw, "block-job-complete" only exists in the RHEL 7 qemu; the RHEL 6.4 qemu does not have it.
> {"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-
> disk0","new-image-file":"/rhev/data-center/xxxx","format":"qcow2"},"id":
> "libvirt-4485"}
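
A quick way to check which block-job commands the running qemu-kvm binary actually provides, assuming access to its QMP monitor socket, is the standard query-commands call:
{"execute":"query-commands","id":"check-1"}
On the RHEL 6.4 qemu-kvm-rhev build the returned list would be expected to include __com.redhat_drive-mirror and __com.redhat_drive-reopen but not block-job-complete, which matches the observation above.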

Comment 4 Daniel Erez 2013-01-31 20:26:46 UTC
Sibiao, can you please attach engine.log and vdsm.log?

Comment 5 Sibiao Luo 2013-02-01 02:45:23 UTC
Created attachment 691325 [details]
vdsm.log.txt

Comment 6 Sibiao Luo 2013-02-01 02:46:16 UTC
Created attachment 691336 [details]
engine.log.txt

Comment 7 Sibiao Luo 2013-02-01 03:00:09 UTC
(In reply to comment #3)
> (In reply to comment #0)
> > {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-
> > disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":
> > "existing","format":"qcow2"},"id":"libvirt-4338"}
> > {"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},
> > "id":"libvirt-4484"}
> Regarding the "block-job-complete" command: I checked the HMP and QMP
> monitor commands and could not find it. I am very curious how this can work
> when qemu does not provide the command but the RHEVM tools need it. Is this
> a RHEVM bug? Do we need to open a new bug for it?
> Btw, "block-job-complete" only exists in the RHEL 7 qemu; the RHEL 6.4 qemu
> does not have it.
I chatted with kwolf about this problem on IRC, and he thought this may be a RHEVM bug, so I will split this issue out into a new bug. Please correct me if I am mistaken.

Best Regards.
sluo

Comment 8 Daniel Erez 2013-02-13 16:55:12 UTC
Verification depends on bug 906620.
Backend side is fixed by change-id I899e55c995a96f68023e2ad7b31daac57d1e8dbb

Comment 9 Federico Simoncelli 2013-02-18 20:29:30 UTC
(In reply to comment #8)
> Verification depends on bug 906620.
> Backend side is fixed by change-id I899e55c995a96f68023e2ad7b31daac57d1e8dbb

From the logs it looks to me that the engine is not able to communicate with the HSM where the VM is running (Connection refused):

2013-01-30 18:16:38,502 INFO  [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-3-thread-45) [4127afd8] Running command: LiveMigrateDiskCommandTask handler: VmReplicateDiskStartTaskHandler internal: false. Entities affected :  ID: 4ef3f12f-f6f2-4573-9022-41d5940cb02f Type: Disk,  ID: 4dae5421-9c9b-499e-a91a-9da8f6830c8c Type: Storage
2013-01-30 18:16:38,504 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) START, SnapshotVDSCommand(HostName = dhcp-4-121, HostId = 90a18316-234c-41c8-a20c-76369b4cb49f, vmId=6158948a-c1ee-4a29-ab31-2d24f4575d75), log id: 48a6a6dc
2013-01-30 18:16:38,504 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (pool-3-thread-45) [4127afd8] START, VmReplicateDiskStartVDSCommand(HostName = dchp-6-222, HostId = 4c43c53e-aa12-49d9-9691-7bd2704f61c0, vmId=4aec63be-5616-4d21-b29d-8747434f8992), log id: 6c275a0a
2013-01-30 18:16:38,506 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) FINISH, SnapshotVDSCommand, log id: 48a6a6dc
2013-01-30 18:16:38,506 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-47) transaction rolled back
2013-01-30 18:16:38,508 INFO  [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-47) transaction rolled back
2013-01-30 18:16:38,523 ERROR [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-3-thread-47) EntityAsyncTask::EndCommandAction [within thread]: EndAction for action type LiveMigrateVmDisks threw an exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused

And this happens 4 times.
That is why you don't see any __com.redhat_drive-mirror/block-job-complete on the host. I don't think this is related to VDSM at all, and bug 906620 can probably be closed. As of this writing I don't see any reason to block the verification, so I'm removing the dependency.

I'm not sure the solution proposed in I899e55c995a96f68023e2ad7b31daac57d1e8dbb exactly addresses this issue: http://gerrit.ovirt.org/#/c/11311/
To me it looks unsafe to move ImageStatus.LOCKED from LiveSnapshotTaskHandler to CreateImagePlaceholderTaskHandler. In fact the first step that should succeed (and that requires the lock) is the live snapshot. I would have expected a fix that unlocks the image if the live snapshot fails (rather than removing the locking).

Anyway, I'm not an expert on this part, so I'll leave it to you to decide whether to proceed with the current fix.

Comment 10 Dafna Ron 2013-03-18 20:02:28 UTC
Verified on SF10.

Comment 11 Itamar Heim 2013-06-11 08:19:12 UTC
3.2 has been released
