Bug 1857347 - Live merge in endless loop trying to pivot after unrecoverable error from libvirt
Summary: Live merge in endless loop trying to pivot after unrecoverable error from libvirt
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.4.3
Target Release: 4.4.3
Assignee: Amit Bawer
QA Contact: Tzahi Ashkenazi
URL:
Whiteboard:
Depends On: 1857735 1861681
Blocks:
 
Reported: 2020-07-15 16:54 UTC by mlehrer
Modified: 2024-06-13 22:48 UTC (History)
13 users

Fixed In Version: vdsm-4.40.31
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-24 13:11:27 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
engine, vdsm, and vm log (2.51 MB, application/gzip)
2020-07-15 16:54 UTC, mlehrer
no flags Details
vdsm, engine, and libvirt on debug level (3.81 MB, application/gzip)
2020-07-16 11:26 UTC, mlehrer
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5375411 0 None None None 2020-09-07 03:30:02 UTC
Red Hat Product Errata RHBA-2020:5213 0 None None None 2020-11-24 13:11:48 UTC
oVirt gerrit 110950 0 master MERGED vm: Stop live merge cleanup attempts for unrecoverable pivot error 2021-01-15 15:01:36 UTC
oVirt gerrit 110961 0 master MERGED vm: Use vdsm.common.errors.Base for internal module exceptions base 2021-01-15 15:02:16 UTC
oVirt gerrit 111034 0 master MERGED vm: Use state indication for live merge cleanup thread 2021-01-15 15:01:37 UTC
oVirt gerrit 111347 0 master MERGED tests: Add merge_test module to virt tests 2021-01-15 15:02:17 UTC

Description mlehrer 2020-07-15 16:54:08 UTC
Created attachment 1701277 [details]
engine, vdsm, and vm log

Description of problem:

Performing a VM live snapshot on a VM with 2 disks, by creating a snapshot and then deleting it via ansible:
On removal of the snapshot, the task seems to hang endlessly in the UI, but VDSM has already thrown the exception shown below.

The following ansible task is executed:

- name: Delete snapshot for VM "{{ regression_vm_name }}"
  ovirt_snapshot:
    auth: "{{ ovirt_auth }}"
    state: absent
    vm_name: "{{ regression_vm_name }}"
    snapshot_id: "{{ snapshot_id }}"
    wait: True
    timeout: 1800
  register: delete_snapshot_attempt




The live snapshot removal continues to run endlessly, while VDSM throws an exception: 
vdsm[7618]: ERROR Unhandled exception#012Traceback (most recent call last):#012  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper#012    return f(*a, **kw)#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run#012    self.tryPivot()#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot#012    self.vm._dom.blockJobAbort(self.drive.name, flags)#012  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f#012    ret = attr(*args, **kwargs)#012  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper#012    ret = f(*args, **kwargs)#012  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper#012    return func(inst, *args, **kwargs)#012  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort#012    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)#012libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite


Version-Release number of selected component (if applicable):

rhv-release-4.4.1-10-001.noarch
vdsm-4.40.22-1.el8ev.x86_64

How reproducible:
Seems to happen on the 3rd attempt.

Steps to Reproduce:
1. Create VM live snapshot
2. Delete VM live snapshot


Actual results:

Live snapshot delete continues to execute for a very long time in Tasks, while vdsm has already thrown an exception which seems to come from libvirt.


2020-07-15 12:34:49,939-0400 INFO  (merge/93ad982c) [vdsm.api] FINISH getVolumeInfo return={'info': {'uuid': 'fe95fcaf-be53-4e70-8042-94bfc51695ca', 'type': 'SPARSE', 'format': 'COW', 'disktype': 'DATA', 'voltype': 'LEAF', 'capacity': '107374182400', 'parent': '96baeeb5-c817-44db-a21c-e0b3c926ff4e', 'description': '', 'pool': '', 'domain': 'ef9ec245-1905-4d6f-a25c-9cf0148d60bc', 'image': '8b2d5d66-a07b-4fe7-8194-d76c3daf64e8', 'ctime': '1594827156', 'mtime': '0', 'legality': 'ILLEGAL', 'generation': 0, 'apparentsize': '2147483648', 'truesize': '2147483648', 'status': 'ILLEGAL', 'lease': {'path': '/dev/ef9ec245-1905-4d6f-a25c-9cf0148d60bc/leases', 'offset': 807403520, 'owners': [], 'version': None}, 'children': []}} from=internal, task_id=2f9fbce1-887d-4d33-a573-4cee23e36fd3 (api:54)
2020-07-15 12:34:49,941-0400 INFO  (merge/93ad982c) [vdsm.api] START setVolumeSize(sdUUID='ef9ec245-1905-4d6f-a25c-9cf0148d60bc', spUUID='17770700-4e03-46de-820b-e110cf05d51e', imgUUID='8b2d5d66-a07b-4fe7-8194-d76c3daf64e8', volUUID='96baeeb5-c817-44db-a21c-e0b3c926ff4e', capacity='107374182400') from=internal, task_id=20c8950a-4d72-413a-9d4c-e290c9b3bf3f (api:48)
2020-07-15 12:34:49,956-0400 INFO  (merge/1d06804a) [vdsm.api] FINISH setVolumeSize return=None from=internal, task_id=60531e67-5b87-4919-9b11-4a5eece63473 (api:54)
2020-07-15 12:34:49,957-0400 INFO  (merge/1d06804a) [vdsm.api] START imageSyncVolumeChain(sdUUID='ef9ec245-1905-4d6f-a25c-9cf0148d60bc', imgUUID='93b42c76-4eb3-499f-843d-6bb170c94ba8', volUUID='ba635360-ae72-4ec0-9ed0-1b59e22b68a7', newChain=['62b698cb-7b05-4f6f-a99a-ff4b2d848c48']) from=internal, task_id=df10c4a8-f0bb-41ed-9655-3a00997b06f2 (api:48)
2020-07-15 12:34:49,973-0400 INFO  (merge/93ad982c) [vdsm.api] FINISH setVolumeSize return=None from=internal, task_id=20c8950a-4d72-413a-9d4c-e290c9b3bf3f (api:54)
2020-07-15 12:34:49,975-0400 INFO  (merge/93ad982c) [vdsm.api] START imageSyncVolumeChain(sdUUID='ef9ec245-1905-4d6f-a25c-9cf0148d60bc', imgUUID='8b2d5d66-a07b-4fe7-8194-d76c3daf64e8', volUUID='fe95fcaf-be53-4e70-8042-94bfc51695ca', newChain=['96baeeb5-c817-44db-a21c-e0b3c926ff4e']) from=internal, task_id=7b5bceb8-d6ea-468d-8e2c-b7d491eacfa4 (api:48)
2020-07-15 12:34:50,141-0400 INFO  (merge/1d06804a) [storage.Image] Current chain=62b698cb-7b05-4f6f-a99a-ff4b2d848c48 < ba635360-ae72-4ec0-9ed0-1b59e22b68a7 (top)  (image:871)
2020-07-15 12:34:50,141-0400 INFO  (merge/1d06804a) [storage.Image] Unlinking subchain: ['ba635360-ae72-4ec0-9ed0-1b59e22b68a7'] (image:881)
2020-07-15 12:34:50,151-0400 INFO  (merge/1d06804a) [storage.Image] Leaf volume ba635360-ae72-4ec0-9ed0-1b59e22b68a7 is being removed from the chain. Marking it ILLEGAL to prevent data corruption (image:889)
2020-07-15 12:34:50,152-0400 INFO  (merge/1d06804a) [storage.VolumeManifest] sdUUID=ef9ec245-1905-4d6f-a25c-9cf0148d60bc imgUUID=93b42c76-4eb3-499f-843d-6bb170c94ba8 volUUID = ba635360-ae72-4ec0-9ed0-1b59e22b68a7 legality = ILLEGAL  (volume:385)
2020-07-15 12:34:50,266-0400 INFO  (merge/1d06804a) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=df10c4a8-f0bb-41ed-9655-3a00997b06f2 (api:54)
2020-07-15 12:34:50,268-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Disabling drive monitoring (drivemonitor:60)
2020-07-15 12:34:50,268-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Requesting pivot to complete active layer commit (job 1d06804a-ba2a-404c-bcb2-809b9d68ebf7) (vm:5681)
2020-07-15 12:34:50,271-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Enabling drive monitoring (drivemonitor:56)
2020-07-15 12:34:50,271-0400 ERROR (merge/1d06804a) [root] Unhandled exception (logutils:450)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite
2020-07-15 12:34:50,272-0400 ERROR (merge/1d06804a) [root] FINISH thread <Thread(merge/1d06804a, started daemon 139697930086144)> failed (concurrent:267)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/concurrent.py", line 260, in run
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite
2020-07-15 12:34:50,287-0400 INFO  (merge/93ad982c) [storage.Image] Current chain=96baeeb5-c817-44db-a21c-e0b3c926ff4e < fe95fcaf-be53-4e70-8042-94bfc51695ca (top)  (image:871)
2020-07-15 12:34:50,287-0400 INFO  (merge/93ad982c) [storage.Image] Unlinking subchain: ['fe95fcaf-be53-4e70-8042-94bfc51695ca'] (image:881)
2020-07-15 12:34:50,297-0400 INFO  (merge/93ad982c) [storage.Image] Leaf volume fe95fcaf-be53-4e70-8042-94bfc51695ca is being removed from the chain. Marking it ILLEGAL to prevent data corruption (image:889)
2020-07-15 12:34:50,297-0400 INFO  (merge/93ad982c) [storage.VolumeManifest] sdUUID=ef9ec245-1905-4d6f-a25c-9cf0148d60bc imgUUID=8b2d5d66-a07b-4fe7-8194-d76c3daf64e8 volUUID = fe95fcaf-be53-4e70-8042-94bfc51695ca legality = ILLEGAL  (volume:385)
2020-07-15 12:34:50,309-0400 INFO  (merge/93ad982c) [vdsm.api] FINISH imageSyncVolumeChain return=None from=internal, task_id=7b5bceb8-d6ea-468d-8e2c-b7d491eacfa4 (api:54)
2020-07-15 12:34:50,309-0400 INFO  (merge/93ad982c) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Disabling drive monitoring (drivemonitor:60)
2020-07-15 12:34:50,309-0400 INFO  (merge/93ad982c) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Requesting pivot to complete active layer commit (job 93ad982c-9d00-4e0c-9cac-2c6780b2493e) (vm:5681)
2020-07-15 12:34:50,312-0400 INFO  (merge/93ad982c) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Enabling drive monitoring (drivemonitor:56)
2020-07-15 12:34:50,312-0400 ERROR (merge/93ad982c) [root] Unhandled exception (logutils:450)


Expected results:
Live snapshot deletion and creation work without exception.

Additional info:
Stopping and then starting the VM from the UI allows the removal of the snapshot to complete, but the task still continues to execute in the UI task view.

Comment 1 Nir Soffer 2020-07-15 20:29:47 UTC
Mordechai, this is a libvirt bug, but we must have libvirt debug logs.

Please see this page on how to enable debug logs:
https://wiki.libvirt.org/page/DebugLogs

When we reproduce with debug logs we can move this bug to libvirt.

Comment 2 Michal Skrivanek 2020-07-16 04:28:05 UTC
Mordechai, why is this filed as an RHV bug?

@Nir, an unhandled exception must be handled regardless. We can’t have infinite tasks in the engine.

Comment 3 Peter Krempa 2020-07-16 07:52:31 UTC
Please make sure to also capture the previous operations in the debug log. This kind of failure was most probably caused by either a failed or a forgotten cleanup after a previous operation.

Comment 4 mlehrer 2020-07-16 11:26:28 UTC
Created attachment 1701370 [details]
vdsm, engine, and libvirt on debug level

Attached libvirt-debug.tar.gz. If further libvirt debug flags are needed, please let me know.

Comment 5 mlehrer 2020-07-16 11:28:20 UTC
(In reply to Michal Skrivanek from comment #2)
> Mordechai, why is this filed as RHV bug?
> 
> @Nir, unhandled exception must be handled regardless. We can’t have infinite
> tasks in engine

This is opened on VDSM because the exception occurs and VDSM does not handle it: the task continues to appear to run from the engine, and the VM snapshot removal does not complete until the VM is forced to power down.

The steps are:
1) Create a live VM snapshot
2) Remove it
Repeat twice; on the third time the exception occurs.

I've attached libvirt debug-level logs, along with vdsm and engine logs, for this flow.

Comment 6 Nir Soffer 2020-07-16 12:27:46 UTC
(In reply to mlehrer from comment #5)
...
> Regarding why this is opened up on VDSM is because the exception occurs and
> VDSM is not handling it, and the tasks continues to appear to run from the
> engine, in addition to the VM Snapshot removal not completing until the VM
> is forced to power down.

Good point, it looks like vdsm does not handle the error in tryPivot properly:

2020-07-15 12:34:50,268-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Requesting pivot to complete active layer commit (job 1d06804a-ba2a-404c-bcb2-809b9d68ebf7) (vm:5681)
2020-07-15 12:34:50,271-0400 INFO  (merge/1d06804a) [virt.vm] (vmId='062bd65f-df9e-4858-8c18-4a61bddd7269') Enabling drive monitoring (drivemonitor:56)
2020-07-15 12:34:50,271-0400 ERROR (merge/1d06804a) [root] Unhandled exception (logutils:450)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/logutils.py", line 447, in wrapper
    return f(*a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5723, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 5684, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'transaction': Bitmap already exists: libvirt-tmp-activewrite
2020-07-15 12:34:50,272-0400 ERROR (merge/1d06804a) [root] FINISH thread <Thread(merge/1d06804a, started daemon 139697930086144)> failed (concurrent:267)

Pivot to the new top layer failed, but we don't see the expected log:

   Pivot failed ...

5679         self.vm.log.info("Requesting pivot to complete active layer commit "
5680                          "(job %s)", self.job['jobID'])
5681         try:
5682             flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
5683             self.vm._dom.blockJobAbort(self.drive.name, flags)
5684         except libvirt.libvirtError as e:
5685             self.vm.drive_monitor.enable()
5686             if e.get_error_code() != libvirt.VIR_ERR_BLOCK_COPY_ACTIVE:
5687                 raise
5688             raise BlockCopyActiveError(self.job['jobID'])
5689         except:
5690             self.vm.drive_monitor.enable()
5691             raise

The code handles only the libvirt.VIR_ERR_BLOCK_COPY_ACTIVE error for
some reason, and treats any other error as unexpected. An unexpected
error results in failure of the cleanup thread, and a new cleanup
thread will be started on the next job monitoring cycle.

However, the fact that vdsm retries the cleanup is expected; it is designed
to retry merge cleanup after errors, since libvirt reports that the merge
was complete and the block job can be completed by pivoting.

Looks like vdsm needs to improve error handling in this case to avoid an
endless loop.

Peter, regardless of the bug causing this error, do we have a way to
detect that the block job will never complete successfully, so we
can abort it without pivoting and stop the endless loop?

Maybe we are not handling events properly?

Or maybe we should have a limited number of retries (e.g. 3), and abort
the merge without pivoting?
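
For illustration, a minimal sketch of that retry-limit option. This is not vdsm's actual code; MAX_CLEANUP_ATTEMPTS, try_pivot() and the attempts counter are hypothetical names, and it assumes the cleanup thread knows how many times it has already tried to pivot:

import libvirt

MAX_CLEANUP_ATTEMPTS = 3  # hypothetical limit, e.g. 3 as suggested above

def try_pivot(dom, drive_name, attempts):
    """Pivot to the new top layer, giving up after a few failed attempts."""
    flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
    try:
        dom.blockJobAbort(drive_name, flags)
    except libvirt.libvirtError as e:
        if e.get_error_code() == libvirt.VIR_ERR_BLOCK_COPY_ACTIVE:
            # Job not synchronized yet - the only recoverable case,
            # so let the next monitoring cycle retry.
            raise
        if attempts >= MAX_CLEANUP_ATTEMPTS:
            # Unrecoverable error and retries exhausted: abort the block
            # job without pivoting so the merge fails instead of looping.
            dom.blockJobAbort(drive_name, 0)
        raise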

Comment 7 Nir Soffer 2020-07-16 12:28:47 UTC
Mordechai, let's file a new bug for the libvirt issue and handle the
vdsm issue in this bug, or the other way around.

Comment 8 Peter Krempa 2020-07-16 12:38:27 UTC
(In reply to Nir Soffer from comment #6)
> (In reply to mlehrer from comment #5)

[...]

I've figured out the problem on libvirt's side in the logs. Definitely file a libvirt bug if you plan to do something in VDSM.

> Peter, regardless of the bug causing this error, do we have a way to 
> detect that the block job would never be completed successfully, so we
> can abort it without pivot and stop the endless loop?
> 
> Maybe we are not handling events properly?

Well, in this instance I don't think there is a recoverable error besides VIR_ERR_BLOCK_COPY_ACTIVE, which means that the job was not synchronized yet.

> Or maybe we should have limited number of retries, (e.g. 3), and abort
> the merge without pivoting?

That is another option, but if there is a different error it's unlikely that it will ever succeed, and thus it seems pointless to retry.

Comment 9 Nir Soffer 2020-07-16 13:00:50 UTC
Since we have good info for the vdsm side, let's keep this bug for vdsm.

I updated the title to reflect the issue in vdsm. We need to either add
a limited retry policy or fail on the first unexpected error.

This will fail the merge on the engine side. The user can retry the merge
to complete the operation.

Since we mark the volume as ILLEGAL before the pivot, this will probably
cause the VM to fail on the next restart, but that is another issue.
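
A minimal sketch of the "fail on the first unexpected error" direction, assuming a state flag on the cleanup thread that the job monitor can check before starting a new cleanup attempt. CleanupState, CleanupThread and try_pivot() are illustrative names only, not the merged vdsm code:

import enum
import libvirt

class CleanupState(enum.Enum):
    RUNNING = "running"
    RETRY = "retry"   # recoverable, try again on the next monitoring cycle
    ABORT = "abort"   # unrecoverable, do not start another cleanup attempt

class CleanupThread:
    def __init__(self, dom, drive_name):
        self.dom = dom
        self.drive_name = drive_name
        self.state = CleanupState.RUNNING

    def try_pivot(self):
        flags = libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT
        try:
            self.dom.blockJobAbort(self.drive_name, flags)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_BLOCK_COPY_ACTIVE:
                # Job not synchronized yet: safe to retry later.
                self.state = CleanupState.RETRY
            else:
                # Any other libvirt error is treated as unrecoverable:
                # record it so the monitor stops respawning this thread.
                self.state = CleanupState.ABORT
            raise

The job monitoring cycle would then start a new cleanup thread only when the previous one ended in RETRY, never after ABORT.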

Comment 10 mlehrer 2020-07-16 13:06:07 UTC
(In reply to Peter Krempa from comment #8)
> (In reply to Nir Soffer from comment #6)
> > (In reply to mlehrer from comment #5)
> 
> [...]
> 
> I've figured out the problem on libvirt's side in the logs. Definitely file
> a libvirt bug if you plan to do something in VDSM

I've created https://bugzilla.redhat.com/show_bug.cgi?id=1857735 for libvirt side of the issue.



> 
> > Peter, regardless of the bug causing this error, do we have a way to 
> > detect that the block job would never be completed successfully, so we
> > can abort it without pivot and stop the endless loop?
> > 
> > Maybe we are not handling events properly?
> 
> Well, in this instance I don't think there is an recoverable error besides
> VIR_ERR_BLOCK_COPY_ACTIVE which means that the job was not synchronized yet.
> 
> > Or maybe we should have limited number of retries, (e.g. 3), and abort
> > the merge without pivoting?
> 
> That is another option, but if there is a different error it's unlikely that
> it will ever succeed, and thus it seems to be pointless to retry.

Comment 17 Tzahi Ashkenazi 2020-11-02 13:39:11 UTC
Tested on:

engine: Red-01
cluster: L0_Group_0
version:
    rhv-release-4.4.3-9-001.noarch
    vdsm-4.40.34-1.el8ev.x86_64

Ran 5 cycles of the VM live snapshot flow on a VM with 2 disks, creating a
snapshot and then deleting it via Ansible. Live snapshot deletion and
creation work without exception.

Comment 21 errata-xmlrpc 2020-11-24 13:11:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.3]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5213

Comment 22 meital avital 2022-08-14 13:57:30 UTC
Due to QE capacity, we are not going to cover this issue in our automation.

