Description of problem:
A live merge of a disk snapshot is initiated on a running VM. The task aborts, but the engine still sees the job as running.

Version-Release number of selected component (if applicable):
oVirt 3.6.7
VDSM 4.17.28

How reproducible:
Don't know

Steps to Reproduce:
1. Start a snapshot live merge on a running VM

Actual results:
The job fails, but the task stays active:

Thread-15824378::ERROR::2016-09-15 11:17:36,599::utils::739::root::(wrapper) Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 736, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 5266, in run
    self.tryPivot()
  File "/usr/share/vdsm/virt/vm.py", line 5235, in tryPivot
    ret = self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/share/vdsm/virt/virdomain.py", line 68, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 124, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 1313, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 733, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: Block Kopieren ist immer noch aktiv: Disk 'vda' noch nicht bereit für Pivot
(German; in English: "block copy still active: disk 'vda' not ready for pivot yet")

Expected results:
The snapshot merge should finish successfully. If the error is legitimate, the engine should correctly detect the error state and cancel the running task.

Additional info:
Log files and screenshots attached
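For context, a minimal sketch of the failing step (not vdsm's actual code): tryPivot() calls blockJobAbort() with the pivot flag, and libvirt rejects the call while the block-copy job has not yet caught up. The `get_job_info` and `pivot` callables and the job-info dict below are illustrative stand-ins for virDomainGetBlockJobInfo() and virDomainBlockJobAbort(..., VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT).

```python
class NotReadyForPivot(Exception):
    """Stand-in for the libvirtError raised by virDomainBlockJobAbort()."""


def try_pivot(get_job_info, pivot):
    # A block-copy job is only safe to pivot once the mirror has caught
    # up, i.e. the reported current offset has reached the end offset.
    info = get_job_info()
    if info["cur"] < info["end"]:
        # This corresponds to the libvirt error seen in the traceback:
        # "block copy still active: disk 'vda' not ready for pivot yet"
        raise NotReadyForPivot("disk not ready for pivot yet")
    return pivot()
```

The bug report suggests this exception path was left unhandled, so the engine never learned the pivot attempt failed.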
Created attachment 1201372 [details] Engine Log
Created attachment 1201373 [details] VDSM Log
Created attachment 1201374 [details] Task State
Yesterday the error occurred once again, this time on a VM with 6 disks, with a disk snapshot that contained 4 disks. Result: the task is shown as running, but vdsm shows the abort traceback. Very interesting for us: after the live merge was initiated, qemu produced some kind of high disk I/O for more than 6 hours. So we assume that something was doing merge operations, but we do not know what was really going on. See screenshot attached.
Created attachment 1210227 [details] io during failing live merge
VDSM log of second error attached
Created attachment 1210247 [details] vdsm log 2nd error
I hit the same traceback with a customer:

  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 372, in wrapper
    return f(*a, **kw)
  File "/usr/share/vdsm/virt/vm.py", line 4924, in run
    self.tryPivot()
  File "/usr/share/vdsm/virt/vm.py", line 4893, in tryPivot
    ret = self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 69, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 123, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/utils.py", line 917, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 733, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirtError: block copy still active: disk 'vda' not ready for pivot yet

A VM with high disk I/O (hosting MySQL, Nagios) is not able to live migrate disks; the job fails.
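The high-I/O pattern in comments 9 and 13 can be illustrated with a small simulation (not vdsm code, numbers made up): the copy job's current offset chases an end offset that keeps growing as the guest writes, so under sustained I/O the job may never report ready for pivot.

```python
def converges(samples, max_polls):
    """Return True if the copy (cur) catches up with the disk (end)
    within max_polls status samples, as virDomainGetBlockJobInfo
    polling would observe."""
    for cur, end in samples[:max_polls]:
        if cur >= end:
            return True
    return False


# Idle guest: end is stable, so the copy catches up after a few polls.
idle = [(50, 100), (90, 100), (100, 100)]

# Busy guest (e.g. MySQL/Nagios workload): end keeps moving ahead of
# cur, so the job never becomes ready within the polling window.
busy = [(50, 100), (90, 140), (130, 180), (170, 220)]
```

This matches the observation of hours of qemu disk I/O with no completed merge: the mirror keeps copying but never converges.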
*** Bug 1406851 has been marked as a duplicate of this bug. ***
4.0.6 has been the last oVirt 4.0 release, please re-target this bug.
*** Bug 1419767 has been marked as a duplicate of this bug. ***
Ala, the attached patch is merged, but not backported to any stable branch. Is this intentional? What is the next step here?
Actually, this is intentional - this area of the code isn't tested, and any change is considered risky. This patch handles the (expected) exception that is thrown, by not logging it. The next step here is to handle the libvirt event that is fired when the disk is ready for pivot.
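A sketch of that proposed next step (this is an assumption about the design, not a merged implementation): instead of retrying blockJobAbort, register a libvirt block-job event callback and pivot only once the job reports ready. The callback shape and the READY status mirror libvirt's Python binding (VIR_DOMAIN_BLOCK_JOB_READY delivered via domainEventRegisterAny), but this simulation is pure Python so it runs without libvirtd.

```python
import threading

# Assumed to match libvirt's VIR_DOMAIN_BLOCK_JOB_READY enum value.
BLOCK_JOB_READY = 3


class PivotWaiter:
    def __init__(self):
        self._ready = threading.Event()

    def on_block_job_event(self, disk, job_type, status):
        # In vdsm this would be wired up through the libvirt event loop
        # (conn.domainEventRegisterAny with a block-job event ID).
        if status == BLOCK_JOB_READY:
            self._ready.set()

    def wait_and_pivot(self, pivot, timeout=1.0):
        # Block until the ready event arrives, then perform the pivot;
        # a timeout avoids hanging forever on a job that never converges.
        if not self._ready.wait(timeout):
            raise RuntimeError("timed out waiting for block job ready event")
        return pivot()
```

The design choice here is to let libvirt push readiness to vdsm rather than vdsm guessing, which removes the race that produced the "not ready for pivot yet" error above.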
(In reply to Ala Hino from comment #19) > Actually, this is intentional - this area of the code isn't tested and any > change is considered risky. This patch handles the (expected) exception that > is thrown by not logging it. > > The next step here is to handle libvirt event that is fired when disk is > ready for pivot. Please clone this bug so we have a different BZ to track the event handling and rename this one to reflect what was actually done there.
Renamed and cloned to BZ 1438850.
Ala, is there anything we need to document here?
Not really, we now just print the traceback of the exception. The behavior and user experience didn't change
INFO: Bug status wasn't changed from MODIFIED to ON_QA due to the following reason: [NO RELEVANT PATCHES FOUND] For more info please contact: infra
This bug has an open patch attached to it (POST). Can you provide info on whether it is pending another patch being backported to 4.1, or whether it can move to ON_QA? If not, it might be best to move it back to POST status.
Moving back to POST while waiting for needed info
Target release should be set once a package build is known to fix an issue. Since this bug is not in MODIFIED, the target version has been reset. Please use the target milestone to plan a fix for an oVirt release.
(In reply to Eyal Edri from comment #25) > This bug has an open patch attached it (POST), can you provide info on it is > pending another patch backported to 4.1 or it can move to ON_QA? > If not, it might be best to move it back to POST status. The attached patch should be attached to a different bug. I removed it and changed the bug status back to MODIFIED.
Nir - you just posted another patch for a bug that's already ON_QA. Does this mean the ON_QA status was wrong and the bug needs to be moved back to POST?
Verified with the following code:
------------------------------------
ovirt-engine-4.1.2.1-0.1.el7.noarch
rhevm-4.1.2.1-0.1.el7.noarch
vdsm-4.19.12-1.el7ev.x86_64

Verified with the following scenario:
--------------------------------------
1. Create a VM with 2 disks of 12g and 6g
2. Install the OS on the 12g disk and copy 1g of data to the 2nd disk
3. Create 3 snapshots (sna1, sna2, sna3), adding at least 500MB of new data between each snapshot
4. Start copying 2GB of new data to the disk and delete sna2, which performs a live merge operation

>>>> The live merge operation is successful and is not aborted.

Tier 1, 2 and 3 of automation passed and no live merge failures were reported.

Will not move to VERIFIED until Nir replies as per comment 29.
(In reply to Allon Mureinik from comment #29) > Nir - you just posted another patch for a bug that's already ON_QA. > Does this mean the ON_QA status was wrong and the bug needs to be moved back > to POST? The posted patch fixes the root cause, avoiding calling blockCommit before the block job is ready. However, this bug is about failing nicely when a blockCommit is not ready, and we do want to keep the new behavior. I removed this patch from this bug and added it to bug 1438850. Kevin, I think we can mark this as verified now.
Based on comment 31 moving to VERIFIED!