Description of problem:
Live mirroring of multiple disks simultaneously fails in a RHEVM 3.2 environment: the disks stay in locked status and the operation does not finish even after waiting a full day. Note that this issue cannot be reproduced by testing manually on the qemu-kvm command line: there I can live-mirror multiple disks simultaneously with the speed initially limited to 10M, and after the jobs reach steady state I can reopen to the targets successfully.

Version-Release number of selected component (if applicable):
host info:
2.6.32-355.el6.x86_64
qemu-kvm-rhev-0.12.1.2-2.352.el6.x86_64
vdsm-4.10.2-4.0.el6ev.x86_64
rhevm-3.2.0-5.el6ev.noarch
libvirt-0.10.2-16.el6.x86_64
guest info:
rhel6.4-32bit: kernel-2.6.32-355.el6.i686
windows: windows-2k8R2

How reproducible:
100%

Steps to Reproduce:
scenario 1:
1. Prepare a qcow2-format RHEL 6.4 32-bit guest with one data disk in RHEVM.
2. Live-mirror the system disk and the data disk simultaneously by moving the running VM's disks from one storage domain to another in RHEVM.
scenario 2:
1. Prepare a qcow2-format Windows 2008 R2 guest with two data disks in RHEVM.
2. Live-mirror just the two data disks simultaneously from one storage domain to another in RHEVM.

Actual results:
The mirrored disks stay in locked status, the mirroring-related commands never show up on the monitor, and the mirroring fails to finish even after waiting a whole day.

Expected results:
Live mirroring of multiple disks simultaneously in RHEVM completes successfully.

Additional info:
If I live-mirror only one disk at a time in RHEVM, it completes successfully and I can see the mirroring-related commands on the monitor, e.g.:
{"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":"existing","format":"qcow2"},"id":"libvirt-4338"}
{"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},"id":"libvirt-4484"}
{"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk0","new-image-file":"/rhev/data-center/xxxx","format":"qcow2"},"id":"libvirt-4485"}
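For reference, a rough sketch of the manual qemu-kvm command-line test mentioned above (the QMP socket path, device names and target image paths are placeholders, not taken from the actual setup):

  # start the guest with a QMP socket exposed (all other options omitted)
  /usr/libexec/qemu-kvm ... -qmp unix:/tmp/qmp.sock,server,nowait

  # talk to the socket, e.g. with "nc -U /tmp/qmp.sock":
  {"execute":"qmp_capabilities"}
  # start the mirror jobs with speed limited to 10M (10485760 bytes/s)
  {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk0","target":"/path/to/target0.qcow2","speed":10485760,"full":false,"mode":"existing","format":"qcow2"}}
  {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-disk1","target":"/path/to/target1.qcow2","speed":10485760,"full":false,"mode":"existing","format":"qcow2"}}
  # once the jobs reach steady state (check with query-block-jobs), switch over to the targets
  {"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk0","new-image-file":"/path/to/target0.qcow2","format":"qcow2"}}
  {"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-disk1","new-image-file":"/path/to/target1.qcow2","format":"qcow2"}}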
Created attachment 690810 [details] screenshot for win2k8r2.
Created attachment 690813 [details] screenshot for rhel6.4-32.
(In reply to comment #0)
> {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-
> disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":
> "existing","format":"qcow2"},"id":"libvirt-4338"}
> {"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},
> "id":"libvirt-4484"}
Regarding the "block-job-complete" command: I checked the HMP and QMP monitor commands and could not find it. I am curious how this works at all if qemu does not provide the command but the RHEVM tooling needs it. Is this an RHEVM bug? Do we need to open a new bug for it?
Btw, "block-job-complete" only exists in the RHEL 7 qemu; the RHEL 6.4 qemu does not have it.
> {"execute":"__com.redhat_drive-reopen","arguments":{"device":"drive-virtio-
> disk0","new-image-file":"/rhev/data-center/xxxx","format":"qcow2"},"id":
> "libvirt-4485"}
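If it helps, the set of QMP commands actually exposed by the running qemu can be listed with query-commands (a sketch; the socket path is a placeholder):

  # e.g. nc -U /tmp/qmp.sock
  {"execute":"qmp_capabilities"}
  {"execute":"query-commands"}
  # grep the returned list for "block-job-complete" and "__com.redhat_drive-reopen"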
Sibiao, can you please attach engine.log and vdsm.log?
Created attachment 691325 [details] vdsm.log.txt
Created attachment 691336 [details] engine.log.txt
(In reply to comment #3)
> (In reply to comment #0)
> > {"execute":"__com.redhat_drive-mirror","arguments":{"device":"drive-virtio-
> > disk0","target":"/rhev/data-center/xxxxx","speed":0,"full":false,"mode":
> > "existing","format":"qcow2"},"id":"libvirt-4338"}
> > {"execute":"block-job-complete","arguments":{"device":"drive-virtio-disk0"},
> > "id":"libvirt-4484"}
> Regarding the "block-job-complete" command: I checked the HMP and QMP
> monitor commands and could not find it. I am curious how this works at all
> if qemu does not provide the command but the RHEVM tooling needs it. Is this
> an RHEVM bug? Do we need to open a new bug for it?
> Btw, "block-job-complete" only exists in the RHEL 7 qemu; the RHEL 6.4 qemu
> does not have it.
I chatted with kwolf about this problem on IRC, and he thought it may be an RHEVM bug, so I will split this issue out into a new bug. Please correct me if I am mistaken.
Best Regards.
sluo
Verification depends on bug 906620.
Backend side is fixed by change-id I899e55c995a96f68023e2ad7b31daac57d1e8dbb
(In reply to comment #8)
> Verification depends on bug 906620.
> Backend side is fixed by change-id I899e55c995a96f68023e2ad7b31daac57d1e8dbb

From the logs it looks to me that the engine is not able to communicate with the HSM where the VM is running (Connection refused):

2013-01-30 18:16:38,502 INFO [org.ovirt.engine.core.bll.lsm.LiveMigrateDiskCommand] (pool-3-thread-45) [4127afd8] Running command: LiveMigrateDiskCommandTask handler: VmReplicateDiskStartTaskHandler internal: false. Entities affected : ID: 4ef3f12f-f6f2-4573-9022-41d5940cb02f Type: Disk, ID: 4dae5421-9c9b-499e-a91a-9da8f6830c8c Type: Storage
2013-01-30 18:16:38,504 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) START, SnapshotVDSCommand(HostName = dhcp-4-121, HostId = 90a18316-234c-41c8-a20c-76369b4cb49f, vmId=6158948a-c1ee-4a29-ab31-2d24f4575d75), log id: 48a6a6dc
2013-01-30 18:16:38,504 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.VmReplicateDiskStartVDSCommand] (pool-3-thread-45) [4127afd8] START, VmReplicateDiskStartVDSCommand(HostName = dchp-6-222, HostId = 4c43c53e-aa12-49d9-9691-7bd2704f61c0, vmId=4aec63be-5616-4d21-b29d-8747434f8992), log id: 6c275a0a
2013-01-30 18:16:38,506 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SnapshotVDSCommand] (pool-3-thread-47) FINISH, SnapshotVDSCommand, log id: 48a6a6dc
2013-01-30 18:16:38,506 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-47) transaction rolled back
2013-01-30 18:16:38,508 INFO [org.ovirt.engine.core.utils.transaction.TransactionSupport] (pool-3-thread-47) transaction rolled back
2013-01-30 18:16:38,523 ERROR [org.ovirt.engine.core.bll.EntityAsyncTask] (pool-3-thread-47) EntityAsyncTask::EndCommandAction [within thread]: EndAction for action type LiveMigrateVmDisks threw an exception: org.ovirt.engine.core.common.errors.VdcBLLException: VdcBLLException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: java.net.ConnectException: Connection refused

And this happens 4 times. That is why you don't see any __com.redhat_drive-mirror/block-job-complete on the host. I don't think this is related at all to VDSM; bug 906620 can probably be closed. At the moment of this writing I don't see any reason to block the verification, so I'm removing the dependency.

I'm not sure the solution proposed in I899e55c995a96f68023e2ad7b31daac57d1e8dbb exactly addresses this issue:

http://gerrit.ovirt.org/#/c/11311/

To me it looks unsafe to move ImageStatus.LOCKED from LiveSnapshotTaskHandler to CreateImagePlaceholderTaskHandler. In fact the first step that should succeed (and that requires the lock) is the live snapshot. I would have expected a fix that unlocks the image if the live snapshot fails (rather than removing the locking). Anyway, I'm not an expert on this part, so I'll leave it to you to decide whether to proceed with the current fix.
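A quick way to confirm the "Connection refused" diagnosis could be something like the following (a sketch assuming VDSM's default port 54321 and the standard RHEL 6 service tooling; the host name is the one from the log above):

  # on the HSM host (dchp-6-222), check that vdsmd is actually running
  service vdsmd status
  # from the engine host, check that the VDSM port is reachable
  nc -v dchp-6-222 54321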
verified on sf10
3.2 has been released