+++ This bug was initially created as a clone of Bug #1206355 +++

Description of problem:
After completing an active layer merge, vdsm's representation of the new volume chain might not be properly synchronized. When this happens, engine will report a failure to delete the snapshot even though libvirt successfully merged it.

Version-Release number of selected component (if applicable):
libvirt-1.2.8-16.el7_1.2.x86_64 from http://download.devel.redhat.com/brewroot/packages/libvirt/1.2.8/16.el7_1.2/x86_64/ which is a candidate build for RHEL-7.1 zStream
vdsm-4.16.12-28.gitf03bb74.el7.x86_64 : 3.5 branch tip

How reproducible:
Intermittent but easy to reproduce

Steps to Reproduce:
1. Create a VM with 3 BLOCK disks (1 thin and 2 preallocated)
2. Start the VM
3. Create a snapshot
4. Delete the snapshot

Actual results:
Snapshot may fail to delete on at least one disk

Expected results:
Snapshot removal is successful

Additional info:
The vdsm log reveals errors such as the following:

Thread-50::ERROR::2015-03-26 10:53:09,314::sampling::488::vm.Vm::(collect) vmId=`b3fbc637-9cc9-4b15-ba94-5a2ee1607785`::Stats function failed: <AdvancedStatsFunction _highWrite at 0x3133870>
Traceback (most recent call last):
  File "/usr/share/vdsm/virt/sampling.py", line 484, in collect
    statsFunction()
  File "/usr/share/vdsm/virt/sampling.py", line 359, in __call__
    retValue = self._function(*args, **kwargs)
  File "/usr/share/vdsm/virt/vm.py", line 292, in _highWrite
    self._vm.extendDrivesIfNeeded()
  File "/usr/share/vdsm/virt/vm.py", line 2537, in extendDrivesIfNeeded
    extend = [x for x in self._getExtendCandidates()
  File "/usr/share/vdsm/virt/vm.py", line 2489, in _getExtendCandidates
    capacity, alloc, physical = self._dom.blockInfo(drive.path, 0)
  File "/usr/share/vdsm/virt/vm.py", line 689, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/libvirtconnection.py", line 111, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 646, in blockInfo
    if ret is None: raise libvirtError ('virDomainGetBlockInfo() failed', dom=self)
libvirtError: invalid argument: invalid path /rhev/data-center/dee92f86-2fef-4edf-ac51-135d12646bde/01f563fa-1aef-41ab-92d3-0d6561d4a731/images/31cfd267-e627-4cfe-8f33-f66b0db627c3/1863e178-c40a-42eb-803a-3d1f1ac6658a not assigned to domain

After talking to Eric Blake on IRC, it looks like another problem with libvirt where the domain XML is not updated before the virDomainBlockJobAbort API (used for pivoting an active layer merge) returns. Therefore, our volume chain sync code gets an old value for the disk path which causes _highWrite to fail in this manner.
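For illustration, here is a minimal sketch of the call that fails above, assuming libvirt-python and purely illustrative names ('example-vm', the stale path); this is not the actual vdsm code. The _highWrite stats function asks libvirt for block allocation using the disk path from vdsm's cached volume chain, and when that path is the pre-pivot top volume that libvirt has already dropped from the domain, virDomainGetBlockInfo rejects it exactly as in the traceback.

# Minimal sketch, assuming libvirt-python; domain name and path are illustrative.
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('example-vm')          # hypothetical domain name

# Path cached by the client before the pivot completed (placeholder, not a real volume).
stale_path = '/rhev/data-center/.../images/<img-uuid>/<old-top-vol-uuid>'

try:
    # virDomainGetBlockInfo: returns (capacity, allocation, physical)
    capacity, alloc, physical = dom.blockInfo(stale_path, 0)
except libvirt.libvirtError as e:
    # If the path was already pivoted away, libvirt no longer knows about it:
    # "invalid argument: invalid path ... not assigned to domain"
    print('blockInfo failed: %s' % e)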
This is likely fallout from Bug 1202719, and the working theory is that there is a new race window: a virDomainBlockCommit involving the active layer is completed (pivoted) by a call to virDomainBlockJobAbort. The Abort returns and the caller immediately retrieves the domain XML, but that call returns the XML before libvirt has updated it to represent the new (post-pivot) backing chain.
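To make the window concrete, a minimal client-side sketch of the sequence, assuming libvirt-python with illustrative domain/disk names ('example-vm', 'vda'); this is not the vdsm code itself. An active-layer commit is started, the job is polled until it is ready, the pivot is requested synchronously, and the XML is fetched immediately afterwards. With the affected libvirt builds, the disk <source> in that XML could still name the pre-pivot top volume.

# Sketch of the racy sequence, assuming libvirt-python; names are illustrative.
import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('example-vm')
disk = 'vda'

# Start an active-layer commit (merge the active layer down into its backing file).
dom.blockCommit(disk, None, None, 0,
                libvirt.VIR_DOMAIN_BLOCK_COMMIT_ACTIVE)

# Simplified readiness check: poll until the job reports cur == end.
while True:
    info = dom.blockJobInfo(disk, 0)
    if info and info.get('end') and info['cur'] == info['end']:
        break
    time.sleep(1)

# Synchronous pivot: switch the disk over to the committed base volume.
dom.blockJobAbort(disk, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)

# Immediately read back the domain XML. Before the fix, the <disk>/<source>
# here could still point at the old (pre-pivot) volume, which is what
# confuses the caller's volume-chain sync code.
xml = dom.XMLDesc(0)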
This fix is required to support Live Merge in RHEVM. I don't have the permissions to raise flags on this BZ, but I'd like to propose it for RHEL 7.1.z.
Well, I can reproduce it with a RHEV-M 3.5 environment, although not 100% of the time, with the following packages:
rhevm-3.5.0-0.31.el6ev.noarch
libvirt-1.2.8-16.el7_1.2.x86_64
vdsm-4.16.12.1-3.el7ev.x86_64
qemu-kvm-rhev-2.1.2-23.el7_1.1.x86_64
Patches fixing the regression have been proposed upstream. Please note that upstream released the broken code only in a release candidate for the upcoming release, so there should not be any upstream release with the flaw once this is fixed.

https://www.redhat.com/archives/libvir-list/2015-March/msg01524.html
Fixed upstream:

commit 630ee5ac6cf4e3be3f3e986897a289865dd2604b
Author: Peter Krempa <pkrempa>
Date:   Mon Mar 30 11:26:20 2015 +0200

    qemu: blockjob: Synchronously update backing chain in XML on ABORT/PIVOT

    When the synchronous pivot option is selected, libvirt would not update
    the backing chain until the job was exitted. Some applications then
    received invalid data as their job serialized first.

    This patch removes polling to wait for the ABORT/PIVOT job completion
    and replaces it with a condition. If a synchronous operation is
    requested the update of the XML is executed in the job of the caller of
    the synchronous request. Otherwise the monitor event callback uses a
    separate worker to update the backing chain with a new job.

    This is a regression since 1a92c719101e5bfa6fe2b78006ad04c7f075ea28

    When the ABORT job is finished synchronously you get the following call
    stack:
     #0 qemuBlockJobEventProcess
     #1 qemuDomainBlockJobImpl
     #2 qemuDomainBlockJobAbort
     #3 virDomainBlockJobAbort

    While previously or while using the _ASYNC flag you'd get:
     #0 qemuBlockJobEventProcess
     #1 processBlockJobEvent
     #2 qemuProcessEventHandler
     #3 virThreadPoolWorker

commit 0c4474df4e10d27e27dbcda80b1f9cc14f4bdd8a
Author: Peter Krempa <pkrempa>
Date:   Mon Mar 30 11:26:19 2015 +0200

    qemu: Extract internals of processBlockJobEvent into a helper

    Later on I'll be adding a condition that will allow to synchronise a
    SYNC block job abort. The approach will require this code to be called
    from two different places so it has to be extracted into a helper.

commit 6b6c4ab8a6d2096bd5f50d2ae9b0a929fbaaf076
Author: Peter Krempa <pkrempa>
Date:   Mon Mar 30 11:26:18 2015 +0200

    qemu: processBlockJob: Don't unlock @vm twice

    Commit 1a92c719 moved code to handle block job events to a different
    function that is executed in a separate thread. The caller of
    processBlockJob handles locking and unlocking of @vm, so we should not
    do it in the function itself.

v1.2.14-rc2
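From a client's point of view, the practical effect of the first commit is that a synchronous pivot (blockJobAbort without the _ASYNC flag) should only return after the backing chain in the live XML has been updated. A hedged sketch of how a caller might verify that, assuming libvirt-python and illustrative names ('example-vm', 'vda', old_top_path); this is not code from vdsm or libvirt:

# Sketch of a post-pivot sanity check, assuming libvirt-python.
import xml.etree.ElementTree as ET
import libvirt

def disk_source(dom, target_dev):
    # Return the <source> path of the disk whose <target dev=...> matches.
    root = ET.fromstring(dom.XMLDesc(0))
    for disk in root.findall('./devices/disk'):
        target = disk.find('target')
        if target is not None and target.get('dev') == target_dev:
            source = disk.find('source')
            if source is not None:
                # block disks use dev=, file disks use file=
                return source.get('dev') or source.get('file')
    return None

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('example-vm')
old_top_path = '/rhev/data-center/.../<old-top-volume>'   # placeholder

# With the fixed libvirt, a synchronous pivot returning means the XML is current,
# so the disk source should no longer be the old top volume.
dom.blockJobAbort('vda', libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
assert disk_source(dom, 'vda') != old_top_path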
I can register a host with the latest libvirt and vdsm to RHEV-M, so this one is not blocked by bug 1232665.

I can reproduce it with libvirt-1.2.13-1.el7.x86_64 and vdsm-4.16.21-1.el7ev.x86_64.

Verified it with libvirt-1.2.17-1.el7 and vdsm-4.16.21-1.el7ev.x86_64.

Steps:
1. Prepare a running guest with 4 disks (2 thin and 2 preallocated)
2. Create 4 snapshots: s4-->s3-->s2-->s1-->base
3. Delete 3 snapshots: s4, s3, s1
4. Create 2 snapshots: ss2-->ss1-->s2-->base
5. Delete 1 snapshot: s2
6. Create 2 snapshots: ss4-->ss3-->ss2-->ss1-->base
7. Delete 4 snapshots: ss4, ss2, ss3, ss1

All operations work well, so moving it to verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-2202.html