Created attachment 1701277 [details]
engine, vdsm, and vm log

Description of problem:
Performing a VM live snapshot on a VM with 2 disks: a snapshot is created and then deleted via Ansible. On removal of the snapshot, the task appears to hang endlessly in the UI, while VDSM has already thrown the exception shown below.

The following Ansible task is executed:

- name: Delete snapshot for VM "{{ regression_vm_name }}"
  ovirt_snapshot:
    auth: "{{ ovirt_auth }}"
    state: absent
    vm_name: "{{ regression_vm_name }}"
    snapshot_id: "{{ snapshot_id }}"
    wait: True
    timeout: 1800
  register: delete_snapshot_attempt

The live snapshot removal continues to run endlessly, while VDSM throws this exception:

vdsm[7618]: ERROR Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite

Version-Release number of selected component (if applicable):
rhv-release-4.4.1-10-001.noarch
vdsm-4.40.22-1.el8ev.x86_64

How reproducible:
Seems to happen on the 3rd create/delete cycle.

Steps to Reproduce:
1. Create a VM live snapshot
2. Delete the VM live snapshot
3. Repeat; the failure appears on the 3rd iteration

Actual results:
The live snapshot delete continues to run for a very long time in Tasks, while VDSM has already thrown an exception that appears to come from libvirt.

Relevant vdsm.log excerpt:
2020-07-15 12:34:49,939-0400 INFO (merge/93ad982c) [vdsm.api] FINISH getVolumeInfo return={'info': {'uuid': 'fe95fcaf-be53-4e70-8042-94bfc51695ca', 'type': 'SPARSE', 'format': 'COW', 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '107374182400', 'parent': '96baeeb5-c817-44db-a21c-e0b3c926ff4e', 'description': '', 'pool': '', 'domain': 'ef9ec245-1905-4d6f-a25c-9cf0148d60bc', 'image': '8b2d5d66-a07b-4fe7-8194-d76c3daf64e8', 'ctime': '1594827156', 'mtime': '0', 'legality': 'ILLEGAL', 'generation': 0, 'apparentsize': '2147483648', 'truesize': '2147483648', 'status': 'ILLEGAL', 'lease': {'path': '/dev/ef9ec245-1905-4d6f-a25c-9cf0148d60bc/leases', 'offset': 807403520, 'owners': [], 'version': None}, 'children': []}} from=internal, task_id=2f9fbce1-887d-4d33-a573-4cee23e36fd3 (api:54)
2020-07-15 12:34:49,941-0400 INFO (merge/93ad982c) [vdsm.api] START setVolumeSize(sdUUID='ef9ec245-1905-4d6f-a25c-9cf0148d60bc', spUUID='17770700-4e03-46de-820b-e110cf05d51e', imgUUID='8b2d5d66-a07b-4fe7-8194-d76c3daf64e8', volUUID='96baeeb5-c817-44db-a21c-e0b3c926ff4e', capacity='107374182400') from=internal, task_id=20c8950a-4d72-413a-9d4c-e290c9b3bf3f (api:48)
2020-07-15 12:34:49,956-0400 INFO (merge/1d06804a) [vdsm.api] FINISH setVolumeSize return=None from=internal, task_id=60531e67-5b87-4919-9b11-4a5eece63473 (api:54)
2020-07-15 12:34:49,957-0400 INFO (merge/1d06804a) [vdsm.api] START imageSyncVolumeChain(sdUUID='ef9ec245-1905-4d6f-a25c-9cf0148d60bc', imgUUID='93b42c76-4eb3-499f-843d-6bb170c94ba8', volUUID='ba635360-ae72-4ec0-9ed0-1b59e22b68a7', newChain=['62b698cb-7b05-4f6f-a99a-ff4b2d848c48']) from=internal, task_id=df10c4a8-f0bb-41ed-9655-3a00997b06f2 (api:48)
2020-07-15 12:34:49,973-0400 INFO (merge/93ad982c) [vdsm.api] FINISH setVolumeSize return=None from=internal, task_id=20c8950a-4d72-413a-9d4c-e290c9b3bf3f (api:54)
2020-07-15 12:34:49,975-0400 INFO (merge/93ad982c) [vdsm.api] START imageSyncVolumeChain(sdUUID='ef9ec245-1905-4d6f-a25c-9cf0148d60bc', imgUUID='8b2d5d66-a07b-4fe7-8194-d76c3daf64e8', volUUID='fe95fcaf-be53-4e70-8042-94bfc51695ca', newChain=['96baeeb5-c817-44db-a21c-e0b3c926ff4e']) from=internal, task_id=7b5bceb8-d6ea-468d-8e2c-b7d491eacfa4 (api:48)
2020-07-15 12:34:50,141-0400 INFO (merge/1d06804a) [storage.Image] Current chain=62b698cb-7b05-4f6f-a99a-ff4b2d848c48 < ba635360-ae72-4ec0-9ed0-1b59e22b68a7 (top) (image:871)
2020-07-15 12:34:50,141-0400 INFO (merge/1d06804a) [storage.Image] Unlinking subchain: ['ba635360-ae72-4ec0-9ed0-1b59e22b68a7'] (image:881)
2020-07-15 12:34:50,151-0400 INFO (merge/1d06804a) [storage.Image] Leaf volume ba635360-ae72-4ec0-9ed0-1b59e22b68a7 is being removed from the chain. Marking it ILLEGAL to prevent data corruption (image:889)
2020-07-15 12:34:50,152-0400 INFO (merge/1d06804a) [storage.VolumeManifest] sdUUID=ef9ec245-1905-4d6f-a25c-9cf0148d60bc imgUUID=93b42c76-4eb3-499f-843d-6bb170c94ba8 volUUID = ba635360-ae72-4ec0-9ed0-1b59e22b68a7 legality = ILLEGAL (volume:385)
2020-07-15 12:34:50,266-0400 INFO (merge/1d06804a) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=df10c4a8-f0bb-41ed-9655-3a00997b06f2 (api:54)
2020-07-15 12:34:50,268-0400 INFO (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Disabling drive monitoring (drivemonitor:60)
2020-07-15 12:34:50,268-0400 INFO (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Requesting pivot to complete active layer commit (job 1d06804a-ba2a-404c-bcb2-809b9d68ebf7) (vm:5681)
2020-07-15 12:34:50,271-0400 INFO (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Enabling drive monitoring (drivemonitor:56)
2020-07-15 12:34:50,271-0400 ERROR (merge/1d06804a) [root] Unhandled exception (logutils:450)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite
2020-07-15 12:34:50,272-0400 ERROR (merge/1d06804a) [root] FINISH thread <Thread(merge/1d06804a, started daemon 139697930086144)> failed (concurrent:267)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/concurrent.py", line 260, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite
2020-07-15 12:34:50,287-0400 INFO (merge/93ad982c) [storage.Image] Current chain=96baeeb5-c817-44db-a21c-e0b3c926ff4e < fe95fcaf-be53-4e70-8042-94bfc51695ca (top) (image:871)
2020-07-15 12:34:50,287-0400 INFO (merge/93ad982c) [storage.Image] Unlinking subchain: ['fe95fcaf-be53-4e70-8042-94bfc51695ca'] (image:881)
2020-07-15 12:34:50,297-0400 INFO (merge/93ad982c) [storage.Image] Leaf volume fe95fcaf-be53-4e70-8042-94bfc51695ca is being removed from the chain. Marking it ILLEGAL to prevent data corruption (image:889)
2020-07-15 12:34:50,297-0400 INFO (merge/93ad982c) [storage.VolumeManifest] sdUUID=ef9ec245-1905-4d6f-a25c-9cf0148d60bc imgUUID=8b2d5d66-a07b-4fe7-8194-d76c3daf64e8 volUUID = fe95fcaf-be53-4e70-8042-94bfc51695ca legality = ILLEGAL (volume:385)
2020-07-15 12:34:50,309-0400 INFO (merge/93ad982c) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=7b5bceb8-d6ea-468d-8e2c-b7d491eacfa4 (api:54)
2020-07-15 12:34:50,309-0400 INFO (merge/93ad982c) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Disabling drive monitoring (drivemonitor:60)
2020-07-15 12:34:50,309-0400 INFO (merge/93ad982c) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Requesting pivot to complete active layer commit (job 93ad982c-9d00-4e0c-9cac-2c6780b2493e) (vm:5681)
2020-07-15 12:34:50,312-0400 INFO (merge/93ad982c) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Enabling drive monitoring (drivemonitor:56)
2020-07-15 12:34:50,312-0400 ERROR (merge/93ad982c) [root] Unhandled exception (logutils:450

Expected results:
Live snapshot deletion and creation to work without exception.

Additional info:
Stopping and then starting the VM from the UI will allow the removal of the snapshot to complete, but the task is still continuing to execute from the UI task view.
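For reference only, here is a rough, hedged equivalent of the reproduction loop written with the oVirt Python SDK (ovirtsdk4) instead of the ovirt_snapshot Ansible task shown above. The engine URL, credentials, and VM name are placeholders; the actual test was driven by the Ansible playbook, not this script.

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.com/ovirt-engine/api',  # placeholder engine URL
    username='admin@internal',                          # placeholder credentials
    password='secret',
    ca_file='ca.pem',
)
vms_service = connection.system_service().vms_service()
vm = vms_service.list(search='name=regression_vm')[0]   # placeholder VM name
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

for cycle in range(3):  # the failure was seen on the 3rd create/delete cycle
    snap = snapshots_service.add(
        types.Snapshot(description='cycle-%d' % cycle))
    snap_service = snapshots_service.snapshot_service(snap.id)
    while snap_service.get().snapshot_status != types.SnapshotStatus.OK:
        time.sleep(5)   # wait for the live snapshot to be created
    snap_service.remove()
    while True:         # wait for the live merge (snapshot removal) to finish
        try:
            snap_service.get()
            time.sleep(5)
        except sdk.NotFoundError:
            break

connection.close()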
Mordechai, this is a libvirt bug, but we must have libvirt debug logs. Please see this page on how to enable debug logs:
https://wiki.libvirt.org/page/DebugLogs

When we reproduce with debug logs, we can move this bug to libvirt.
Mordechai, why is this filed as an RHV bug?

@Nir, the unhandled exception must be handled regardless. We can't have infinite tasks in the engine.
Please make sure to also capture the previous operations in the debug log. This kind of failure was most probably caused by either a failed or forgotten cleanup after a previous operation.
Created attachment 1701370 [details]
vdsm, engine, and libvirt on debug level

Attached libvirt-debug.tar.gz. If further libvirt debug flags are needed, please let me know.
(In reply to Michal Skrivanek from comment #2)
> Mordechai, why is this filed as an RHV bug?
>
> @Nir, the unhandled exception must be handled regardless. We can't have
> infinite tasks in the engine.

This was filed against VDSM because the exception occurs and VDSM does not handle it: the task continues to appear as running in the engine, and the VM snapshot removal does not complete until the VM is forced to power down.

The steps are:
1) Create a live VM snapshot
2) Remove it

Repeat twice; on the third iteration the exception occurs.

I've attached the libvirt debug-level log together with the vdsm and engine logs for this flow.
(In reply to mlehrer from comment #5)
...
> This was filed against VDSM because the exception occurs and VDSM does not
> handle it: the task continues to appear as running in the engine, and the
> VM snapshot removal does not complete until the VM is forced to power down.

Good point, it looks like vdsm does not handle the error in tryPivot properly:

2020-07-15 12:34:50,268-0400 INFO (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Requesting pivot to complete active layer commit (job 1d06804a-ba2a-404c-bcb2-809b9d68ebf7) (vm:5681)
2020-07-15 12:34:50,271-0400 INFO (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Enabling drive monitoring (drivemonitor:56)
2020-07-15 12:34:50,271-0400 ERROR (merge/1d06804a) [root] Unhandled exception (logutils:450)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite
2020-07-15 12:34:50,272-0400 ERROR (merge/1d06804a) [root] FINISH thread <Thread(merge/1d06804a, started daemon 139697930086144)> failed (concurrent:267)

Pivot to the new top layer failed, but we don't see the expected log:

    Pivot failed ...

5679         self.vm.log.info("Requesting pivot to complete active layer commit "
5680                          "(job %s)", self.job['jobID'])
5681         try:
5682             flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
5683             self.vm._dom.blockJobAbort(self.drive.name, flags)
5684         except libvirt.libvirtError as e:
5685             self.vm.drive_monitor.enable()
5686             if e.get_error_code() != libvirt.VIR_ERR_BLOCK_COPY_ACTIVE:
5687                 raise
5688             raise BlockCopyActiveError(self.job['jobID'])
5689         except:
5690             self.vm.drive_monitor.enable()
5691             raise

The code handles only the libvirt.VIR_ERR_BLOCK_COPY_ACTIVE error for some reason, and treats any other error as an unexpected error. An unexpected error results in failure of the cleanup thread, and a new cleanup thread will be started on the next job monitoring cycle.

However, the fact that vdsm retries the cleanup is expected: it is designed to retry merge cleanup after errors, since libvirt reports that the merge was complete and the block job can be completed by pivoting.

It looks like vdsm needs to improve the error handling in this case to avoid an endless loop.

Peter, regardless of the bug causing this error, do we have a way to detect that the block job would never be completed successfully, so we can abort it without pivot and stop the endless loop?

Maybe we are not handling events properly?

Or maybe we should have a limited number of retries (e.g. 3), and abort the merge without pivoting?
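For illustration, a minimal self-contained sketch (not the actual vdsm code and not a proposed patch) of the kind of handling discussed above: keep treating VIR_ERR_BLOCK_COPY_ACTIVE as a transient condition, but turn any other libvirt error into an explicit job failure instead of an unhandled exception. The function and exception names (try_pivot, JobPivotError) and the drive_monitor/log parameters are hypothetical stand-ins for the corresponding vdsm objects.

import libvirt


class BlockCopyActiveError(Exception):
    """Transient: the block job is not ready to pivot yet (mirrors vdsm's exception)."""


class JobPivotError(Exception):
    """Fatal: the active layer commit cannot be completed by pivoting."""


def try_pivot(dom, drive_name, job_id, drive_monitor, log):
    """Ask libvirt to pivot to the new top layer and classify failures."""
    log.info("Requesting pivot to complete active layer commit (job %s)", job_id)
    try:
        dom.blockJobAbort(drive_name, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
    except libvirt.libvirtError as e:
        drive_monitor.enable()
        if e.get_error_code() == libvirt.VIR_ERR_BLOCK_COPY_ACTIVE:
            # The job is not synchronized yet; retrying later makes sense.
            raise BlockCopyActiveError(job_id)
        # Any other libvirt error (e.g. "Bitmap already exists") is unlikely to
        # succeed on retry, so fail the job explicitly instead of leaving an
        # unhandled exception and an endlessly retried cleanup.
        log.exception("Pivot failed for job %s", job_id)
        raise JobPivotError(job_id)
    except Exception:
        drive_monitor.enable()
        raise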
Mordechai, let's file a new bug for the libvirt issue and handle the vdsm issue in this bug, or the other way around.
(In reply to Nir Soffer from comment #6)
> (In reply to mlehrer from comment #5)

[...]

I've figured out the problem on libvirt's side in the logs. Definitely file a libvirt bug if you plan to do something in VDSM.

> Peter, regardless of the bug causing this error, do we have a way to
> detect that the block job would never be completed successfully, so we
> can abort it without pivot and stop the endless loop?
>
> Maybe we are not handling events properly?

Well, in this instance I don't think there is a recoverable error besides VIR_ERR_BLOCK_COPY_ACTIVE, which means that the job was not synchronized yet.

> Or maybe we should have a limited number of retries (e.g. 3), and abort
> the merge without pivoting?

That is another option, but if there is a different error it is unlikely to ever succeed, and thus it seems pointless to retry.
Since we have good info on the vdsm side, let's keep this bug for vdsm. I updated the title to reflect the issue in vdsm.

We need to either add a limited retry policy or fail on the first unexpected error. This will fail the merge on the engine side, and the user can retry the merge to complete the operation.

Since we mark the volume as ILLEGAL before the pivot, this will probably cause the VM to fail on the next restart, but that is another issue.
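A rough sketch of the limited-retry option mentioned above (assumed names and job layout, not the fix that was actually merged): the cleanup path counts pivot attempts per job and gives up after a small bound, so the merge fails on the engine side instead of looping forever.

# Hypothetical retry-cap sketch; MAX_PIVOT_RETRIES and the job dict layout
# are assumptions, not vdsm's real data structures.
MAX_PIVOT_RETRIES = 3


class PivotRetriesExceeded(Exception):
    """Raised when cleanup gave up retrying the pivot for a merge job."""


def run_cleanup_once(job, pivot):
    """Attempt one pivot for `job`; `pivot` is a callable that raises on failure."""
    attempts = job.setdefault("pivot_attempts", 0)
    if attempts >= MAX_PIVOT_RETRIES:
        # Give up: the caller should abort the block job without pivoting and
        # report the merge as failed, so engine does not show an endless task.
        raise PivotRetriesExceeded(job["jobID"])
    job["pivot_attempts"] = attempts + 1
    pivot()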
(In reply to Peter Krempa from comment #8)
> (In reply to Nir Soffer from comment #6)
> > (In reply to mlehrer from comment #5)
>
> [...]
>
> I've figured out the problem on libvirt's side in the logs. Definitely file
> a libvirt bug if you plan to do something in VDSM.

I've created https://bugzilla.redhat.com/show_bug.cgi?id=1857735 for the libvirt side of the issue.

> > Peter, regardless of the bug causing this error, do we have a way to
> > detect that the block job would never be completed successfully, so we
> > can abort it without pivot and stop the endless loop?
> >
> > Maybe we are not handling events properly?
>
> Well, in this instance I don't think there is a recoverable error besides
> VIR_ERR_BLOCK_COPY_ACTIVE, which means that the job was not synchronized yet.
>
> > Or maybe we should have a limited number of retries (e.g. 3), and abort
> > the merge without pivoting?
>
> That is another option, but if there is a different error it is unlikely to
> ever succeed, and thus it seems pointless to retry.
Tested on:
engine: Red-01
cluster: L0_Group_0
version: rhv-release-4.4.3-9-001.noarch
vdsm-4.40.34-1.el8ev.x86_64

Ran 5 cycles of the VM live snapshot flow on a VM with 2 disks, creating a snapshot and then deleting it via Ansible. Live snapshot deletion and creation now work without exceptions.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.3]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5213
Due to QE capacity constraints, we are not going to cover this issue in our automation.