Bug 1857347 - Live merge in endless loop trying to pivot after unrecoverable error from libvirt
Summary: Live merge in endless loop trying to pivot after unrecoverable error from libvirt
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.3
Target Release: 4.4.3
Assignee: Amit Bawer
QA Contact: Tzahi Ashkenazi
URL:
Whiteboard:
Depends On: 1857735 1861681
Blocks:
 
Reported: 2020-07-15 16:54 UTC by mlehrer
Modified: 2024-06-13 22:48 UTC (History)
13 users

Fixed In Version: vdsm-4.40.31
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-24 13:11:27 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
engine, vdsm, and vm log (2.51 MB, application/gzip)
2020-07-15 16:54 UTC, mlehrer
no flags Details
vdsm, engine, and libvirt on debug level (3.81 MB, application/gzip)
2020-07-16 11:26 UTC, mlehrer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5375411 0 None None None 2020-09-07 03:30:02 UTC
Red Hat Product Errata RHBA-2020:5213 0 None None None 2020-11-24 13:11:48 UTC
oVirt gerrit 110950 0 master MERGED vm: Stop live merge cleanup attempts for unrecoverable pivot error 2021-01-15 15:01:36 UTC
oVirt gerrit 110961 0 master MERGED vm: Use vdsm.common.errors.Base for internal module exceptions base 2021-01-15 15:02:16 UTC
oVirt gerrit 111034 0 master MERGED vm: Use state indication for live merge cleanup thread 2021-01-15 15:01:37 UTC
oVirt gerrit 111347 0 master MERGED tests: Add merge_test module to virt tests 2021-01-15 15:02:17 UTC

Description mlehrer 2020-07-15 16:54:08 UTC
Created attachment 1701277 [details]
engine, vdsm, and vm log

Description of problem:

Performing a VM live snapshot on a VM with 2 disks, by creating a snapshot and then deleting it via ansible:
On removal of the snapshot, the task seems to hang endlessly in the UI, but VDSM has already thrown the exception shown below.

The following ansible task is executed:

- name: Delete snapshot for VM "{{ regression_vm_name }}"
  ovirt_snapshot:
    auth: "{{ ovirt_auth }}"
    state: absent
    vm_name: "{{ regression_vm_name }}"
    snapshot_id: "{{ snapshot_id }}"
    wait: True
    timeout: 1800
  register: delete_snapshot_attempt




The live snapshot removal continues to run endlessly, while VDSM throws an exception: 
vdsm[7618]: ERROR Unhandled exception#012Traceback (most recent call last):#012  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper#012    return f(*a, **kw)#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run#012    self.tryPivot()#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot#012    self.vm._dom.blockJobAbort(self.drive.name, flags)#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f#012    ret = attr(*args, **kwargs)#012  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper#012    ret = f(*args, **kwargs)#012  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper#012    return func(inst, *args, **kwargs)#012  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort#012    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)#012libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite


Version-Release number of selected component (if applicable):

rhv-release-4.4.1-10-001.noarch
vdsm-4.40.22-1.el8ev.x86_64

How reproducible:
Seems to happen on the 3rd attempt.

Steps to Reproduce:
1. Create VM live snapshot
2. Delete VM live snapshot


Actual results:

Live snapshot delete continues to execute for a very long time in Tasks, while vdsm has already thrown an exception which seems to come from libvirt.


2020-07-15 12:34:49,939-0400 INFO  (merge/93ad982c) [vdsm.api] FINISH getVolumeInfo return={'info': {'uuid': 'fe95fcaf-be53-4e70-8042-94bfc51695ca', 'type': 'SPARSE', 'format': 'COW', 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '107374182400', 'parent': '96baeeb5-c817-44db-a21c-e0b3c926ff4e', 'description': '', 'pool': '', 'domain': 'ef9ec245-1905-4d6f-a25c-9cf0148d60bc', 'image': '8b2d5d66-a07b-4fe7-8194-d76c3daf64e8', 'ctime': '1594827156', 'mtime': '0', 'legality': 'ILLEGAL', 'generation': 0, 'apparentsize': '2147483648', 'truesize': '2147483648', 'status': 'ILLEGAL', 'lease': {'path': '/dev/ef9ec245-1905-4d6f-a25c-9cf0148d60bc/leases', 'offset': 807403520, 'owners': [], 'version': None}, 'children': []}} from=internal, task_id=2f9fbce1-887d-4d33-a573-4cee23e36fd3 (api:54)
2020-07-15 12:34:49,941-0400 INFO  (merge/93ad982c) [vdsm.api] START setVolumeSize(sdUUID='ef9ec245-1905-4d6f-a25c-9cf0148d60bc', spUUID='17770700-4e03-46de-820b-e110cf05d51e', imgUUID='8b2d5d66-a07b-4fe7-8194-d76c3daf64e8', volUUID='96baeeb5-c817-44db-a21c-e0b3c926ff4e', capacity='107374182400') from=internal, task_id=20c8950a-4d72-413a-9d4c-e290c9b3bf3f (api:48)
2020-07-15 12:34:49,956-0400 INFO  (merge/1d06804a) [vdsm.api] FINISH setVolumeSize return=None from=internal, task_id=60531e67-5b87-4919-9b11-4a5eece63473 (api:54)
2020-07-15 12:34:49,957-0400 INFO  (merge/1d06804a) [vdsm.api] START imageSyncVolumeChain(sdUUID='ef9ec245-1905-4d6f-a25c-9cf0148d60bc', imgUUID='93b42c76-4eb3-499f-843d-6bb170c94ba8', volUUID='ba635360-ae72-4ec0-9ed0-1b59e22b68a7', newChain=['62b698cb-7b05-4f6f-a99a-ff4b2d848c48']) from=internal, task_id=df10c4a8-f0bb-41ed-9655-3a00997b06f2 (api:48)
2020-07-15 12:34:49,973-0400 INFO  (merge/93ad982c) [vdsm.api] FINISH setVolumeSize return=None from=internal, task_id=20c8950a-4d72-413a-9d4c-e290c9b3bf3f (api:54)
2020-07-15 12:34:49,975-0400 INFO  (merge/93ad982c) [vdsm.api] START imageSyncVolumeChain(sdUUID='ef9ec245-1905-4d6f-a25c-9cf0148d60bc', imgUUID='8b2d5d66-a07b-4fe7-8194-d76c3daf64e8', volUUID='fe95fcaf-be53-4e70-8042-94bfc51695ca', newChain=['96baeeb5-c817-44db-a21c-e0b3c926ff4e']) from=internal, task_id=7b5bceb8-d6ea-468d-8e2c-b7d491eacfa4 (api:48)
2020-07-15 12:34:50,141-0400 INFO  (merge/1d06804a) [storage.Image] Current chain=62b698cb-7b05-4f6f-a99a-ff4b2d848c48 < ba635360-ae72-4ec0-9ed0-1b59e22b68a7 (top)  (image:871)
2020-07-15 12:34:50,141-0400 INFO  (merge/1d06804a) [storage.Image] Unlinking subchain: ['ba635360-ae72-4ec0-9ed0-1b59e22b68a7'] (image:881)
2020-07-15 12:34:50,151-0400 INFO  (merge/1d06804a) [storage.Image] Leaf volume ba635360-ae72-4ec0-9ed0-1b59e22b68a7 is being removed from the chain. Marking it ILLEGAL to prevent data corruption (image:889)
2020-07-15 12:34:50,152-0400 INFO  (merge/1d06804a) [storage.VolumeManifest] sdUUID=ef9ec245-1905-4d6f-a25c-9cf0148d60bc imgUUID=93b42c76-4eb3-499f-843d-6bb170c94ba8 volUUID = ba635360-ae72-4ec0-9ed0-1b59e22b68a7 legality = ILLEGAL  (volume:385)
2020-07-15 12:34:50,266-0400 INFO  (merge/1d06804a) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=df10c4a8-f0bb-41ed-9655-3a00997b06f2 (api:54)
2020-07-15 12:34:50,268-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Disabling drive monitoring (drivemonitor:60)
2020-07-15 12:34:50,268-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Requesting pivot to complete active layer commit (job 1d06804a-ba2a-404c-bcb2-809b9d68ebf7) (vm:5681)
2020-07-15 12:34:50,271-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Enabling drive monitoring (drivemonitor:56)
2020-07-15 12:34:50,271-0400 ERROR (merge/1d06804a) [root] Unhandled exception (logutils:450)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite
2020-07-15 12:34:50,272-0400 ERROR (merge/1d06804a) [root] FINISH thread <Thread(merge/1d06804a, started daemon 139697930086144)> failed (concurrent:267)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/concurrent.py", line 260, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite
2020-07-15 12:34:50,287-0400 INFO  (merge/93ad982c) [storage.Image] Current chain=96baeeb5-c817-44db-a21c-e0b3c926ff4e < fe95fcaf-be53-4e70-8042-94bfc51695ca (top)  (image:871)
2020-07-15 12:34:50,287-0400 INFO  (merge/93ad982c) [storage.Image] Unlinking subchain: ['fe95fcaf-be53-4e70-8042-94bfc51695ca'] (image:881)
2020-07-15 12:34:50,297-0400 INFO  (merge/93ad982c) [storage.Image] Leaf volume fe95fcaf-be53-4e70-8042-94bfc51695ca is being removed from the chain. Marking it ILLEGAL to prevent data corruption (image:889)
2020-07-15 12:34:50,297-0400 INFO  (merge/93ad982c) [storage.VolumeManifest] sdUUID=ef9ec245-1905-4d6f-a25c-9cf0148d60bc imgUUID=8b2d5d66-a07b-4fe7-8194-d76c3daf64e8 volUUID = fe95fcaf-be53-4e70-8042-94bfc51695ca legality = ILLEGAL  (volume:385)
2020-07-15 12:34:50,309-0400 INFO  (merge/93ad982c) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=7b5bceb8-d6ea-468d-8e2c-b7d491eacfa4 (api:54)
2020-07-15 12:34:50,309-0400 INFO  (merge/93ad982c) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Disabling drive monitoring (drivemonitor:60)
2020-07-15 12:34:50,309-0400 INFO  (merge/93ad982c) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Requesting pivot to complete active layer commit (job 93ad982c-9d00-4e0c-9cac-2c6780b2493e) (vm:5681)
2020-07-15 12:34:50,312-0400 INFO  (merge/93ad982c) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Enabling drive monitoring (drivemonitor:56)
2020-07-15 12:34:50,312-0400 ERROR (merge/93ad982c) [root] Unhandled exception (logutils:450)


Expected results:
Live snapshot deletion and creation work without exception.

Additional info:
Stopping and then starting the VM from the UI allows the removal of the snapshot to complete, but the task still continues to execute in the UI task view.

Comment 1 Nir Soffer 2020-07-15 20:29:47 UTC
Mordechai, this is a libvirt bug, but we must have libvirt debug logs.

Please see this page on how to enable debug logs:
https://wiki.libvirt.org/page/DebugLogs

When we reproduce with debug logs we can move this bug to libvirt.

Comment 2 Michal Skrivanek 2020-07-16 04:28:05 UTC
Mordechai, why is this filed as an RHV bug?

@Nir, an unhandled exception must be handled regardless. We can’t have infinite tasks in the engine.

Comment 3 Peter Krempa 2020-07-16 07:52:31 UTC
Please make sure to also capture the previous operations in the debug log. This kind of failure was most probably caused by either a failed or a forgotten cleanup after a previous operation.

Comment 4 mlehrer 2020-07-16 11:26:28 UTC
Created attachment 1701370 [details]
vdsm, engine, and libvirt on debug level

Attached libvirt-debug.tar.gz. If further libvirt debug flags are needed, please let me know.

Comment 5 mlehrer 2020-07-16 11:28:20 UTC
(In reply to Michal Skrivanek from comment #2)
> Mordechai, why is this filed as RHV bug?
> 
> @Nir, unhandled exception must be handled regardless. We can’t have infinite
> tasks in engine

This is opened on VDSM because the exception occurs and VDSM does not handle it: the task continues to appear to run from the engine, and the VM snapshot removal does not complete until the VM is forced to power down.

The steps are:
1) Create a live VM snapshot
2) Remove it
Repeat twice; on the third time the exception occurs.

I've attached libvirt debug-level logs, along with vdsm and engine logs, for this flow.

Comment 6 Nir Soffer 2020-07-16 12:27:46 UTC
(In reply to mlehrer from comment #5)
...
> Regarding why this is opened up on VDSM is because the exception occurs and
> VDSM is not handling it, and the tasks continues to appear to run from the
> engine, in addition to the VM Snapshot removal not completing until the VM
> is forced to power down.

Good point, it looks like vdsm does not handle the error in tryPivot properly:

2020-07-15 12:34:50,268-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Requesting pivot to complete active layer commit (job 1d06804a-ba2a-404c-bcb2-809b9d68ebf7) (vm:5681)
2020-07-15 12:34:50,271-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Enabling drive monitoring (drivemonitor:56)
2020-07-15 12:34:50,271-0400 ERROR (merge/1d06804a) [root] Unhandled exception (logutils:450)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite
2020-07-15 12:34:50,272-0400 ERROR (merge/1d06804a) [root] FINISH thread <Thread(merge/1d06804a, started daemon 139697930086144)> failed (concurrent:267)

Pivot to the new top layer failed, but we don't see the expected log:

   Pivot failed ...

5679         self.vm.log.info("Requesting pivot to complete active layer commit "
5680                          "(job %s)", self.job['jobID'])
5681         try:
5682             flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
5683             self.vm._dom.blockJobAbort(self.drive.name, flags)
5684         except libvirt.libvirtError as e:
5685             self.vm.drive_monitor.enable()
5686             if e.get_error_code() != libvirt.VIR_ERR_BLOCK_COPY_ACTIVE:
5687                 raise
5688             raise BlockCopyActiveError(self.job['jobID'])
5689         except:
5690             self.vm.drive_monitor.enable()
5691             raise

The code handles only the libvirt.VIR_ERR_BLOCK_COPY_ACTIVE error for
some reason, and treats any other error as unexpected. An unexpected
error results in failure of the cleanup thread, and a new cleanup
thread will be started on the next job monitoring cycle.

However, the fact that vdsm retries the cleanup is expected; it is designed
to retry merge cleanup after errors, since libvirt reports that the merge
was complete and the block job can be completed by pivoting.

Looks like vdsm needs to improve error handling in this case to avoid an
endless loop.

Peter, regardless of the bug causing this error, do we have a way to
detect that the block job will never complete successfully, so we
can abort it without pivoting and stop the endless loop?

Maybe we are not handling events properly?

Or maybe we should have a limited number of retries (e.g. 3), and abort
the merge without pivoting?
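
For illustration, a minimal sketch of that retry-limit option. This is not vdsm's actual code; MAX_CLEANUP_ATTEMPTS, try_pivot() and the attempts counter are hypothetical names, and it assumes the cleanup thread knows how many times it has already tried to pivot:

import libvirt

MAX_CLEANUP_ATTEMPTS = 3  # hypothetical limit, e.g. 3 as suggested above

def try_pivot(dom, drive_name, attempts):
    """Pivot to the new top layer, giving up after a few failed attempts."""
    flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
    try:
        dom.blockJobAbort(drive_name, flags)
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_BLOCK_COPY_ACTIVE:
            # Job not synchronized yet - the only recoverable case,
            # so let the next monitoring cycle retry.
            raise
        if attempts >= MAX_CLEANUP_ATTEMPTS:
            # Unrecoverable error and retries exhausted: abort the block
            # job without pivoting so the merge fails instead of looping.
            dom.blockJobAbort(drive_name, 0)
        raise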

Comment 7 Nir Soffer 2020-07-16 12:28:47 UTC
Mordechai, let's file a new bug for the libvirt issue and handle the
vdsm issue in this bug, or the other way around.

Comment 8 Peter Krempa 2020-07-16 12:38:27 UTC
(In reply to Nir Soffer from comment #6)
> (In reply to mlehrer from comment #5)

[...]

I've figured out the problem on libvirt's side in the logs. Definitely file a libvirt bug if you plan to do something in VDSM.

> Peter, regardless of the bug causing this error, do we have a way to 
> detect that the block job would never be completed successfully, so we
> can abort it without pivot and stop the endless loop?
> 
> Maybe we are not handling events properly?

Well, in this instance I don't think there is a recoverable error besides VIR_ERR_BLOCK_COPY_ACTIVE, which means that the job was not synchronized yet.

> Or maybe we should have limited number of retries, (e.g. 3), and abort
> the merge without pivoting?

That is another option, but if there is a different error it's unlikely that it will ever succeed, and thus it seems pointless to retry.

Comment 9 Nir Soffer 2020-07-16 13:00:50 UTC
Since we have good info for the vdsm side, let's keep this bug for vdsm.

I updated the title to reflect the issue in vdsm. We need to either add
a limited retry policy or fail on the first unexpected error.

This will fail the merge on the engine side. The user can retry the merge
to complete the operation.

Since we mark the volume as ILLEGAL before the pivot, this will probably
cause the VM to fail on the next restart, but that is another issue.
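
A minimal sketch of the "fail on the first unexpected error" direction, assuming a state flag on the cleanup thread that the job monitor can check before starting a new cleanup attempt. CleanupState, CleanupThread and try_pivot() are illustrative names only, not the merged vdsm code:

import enum
import libvirt

class CleanupState(enum.Enum):
    RUNNING = "running"
    RETRY = "retry"   # recoverable, try again on the next monitoring cycle
    ABORT = "abort"   # unrecoverable, do not start another cleanup attempt

class CleanupThread:
    def __init__(self, dom, drive_name):
        self.dom = dom
        self.drive_name = drive_name
        self.state = CleanupState.RUNNING

    def try_pivot(self):
        flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
        try:
            self.dom.blockJobAbort(self.drive_name, flags)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_BLOCK_COPY_ACTIVE:
                # Job not synchronized yet: safe to retry later.
                self.state = CleanupState.RETRY
            else:
                # Any other libvirt error is treated as unrecoverable:
                # record it so the monitor stops respawning this thread.
                self.state = CleanupState.ABORT
            raise

The job monitoring cycle would then start a new cleanup thread only when the previous one ended in RETRY, never after ABORT.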

Comment 10 mlehrer 2020-07-16 13:06:07 UTC
(In reply to Peter Krempa from comment #8)
> (In reply to Nir Soffer from comment #6)
> > (In reply to mlehrer from comment #5)
> 
> [...]
> 
> I've figured out the problem on libvirt's side in the logs. Definitely file
> a libvirt bug if you plan to do something in VDSM

I've created https://bugzilla.redhat.com/show_bug.cgi?id=1857735 for libvirt side of the issue.



> 
> > Peter, regardless of the bug causing this error, do we have a way to 
> > detect that the block job would never be completed successfully, so we
> > can abort it without pivot and stop the endless loop?
> > 
> > Maybe we are not handling events properly?
> 
> Well, in this instance I don't think there is an recoverable error besides
> VIR_ERR_BLOCK_COPY_ACTIVE which means that the job was not synchronized yet.
> 
> > Or maybe we should have limited number of retries, (e.g. 3), and abort
> > the merge without pivoting?
> 
> That is another option, but if there is a different error it's unlikely that
> it will ever succeed, and thus it seems to be pointless to retry.

Comment 17 Tzahi Ashkenazi 2020-11-02 13:39:11 UTC
Tested on:

engine: Red-01
cluster: L0_Group_0
version:
    rhv-release-4.4.3-9-001.noarch
    vdsm-4.40.34-1.el8ev.x86_64

Ran 5 cycles of the VM live snapshot flow on a VM with 2 disks, creating a
snapshot and then deleting it via Ansible. Live snapshot deletion and
creation work without exception.

Comment 21 errata-xmlrpc 2020-11-24 13:11:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.3]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5213

Comment 22 meital avital 2022-08-14 13:57:30 UTC
Due to QE capacity, we are not going to cover this issue in our automation.

