Bug 1945675 - delete-snapshot stress with extreme load - Vdsm aborts live merge job without aborting the libvirt block job
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.30.31
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ovirt-4.4.6
Target Release: 4.40.60.4
Assignee: Nir Soffer
QA Contact: mlehrer
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-04-01 15:36 UTC by Nir Soffer
Modified: 2022-08-11 02:01 UTC
CC: 8 users

Fixed In Version: vdsm-4.40.60.4
Doc Type: Bug Fix
Doc Text:
Cause: If deleting a snapshot failed in libvirt because of an unexpected temporary error, vdsm aborted the delete snapshot operation without aborting the underlying libvirt block job.
Consequence: Future attempts to delete the snapshot would fail, because the libvirt block job was still running. If the VM was stopped, it could not be started again because the top volume was marked as illegal; this required a manual fix of the volume metadata.
Fix: Vdsm retries completing the delete snapshot operation after unexpected errors.
Result: The delete snapshot operation completes successfully after temporary unexpected errors.
Clone Of:
Environment:
Last Closed: 2021-05-05 05:36:15 UTC
oVirt Team: Storage
Embargoed:
aoconnor: blocker-
lsvaty: testing_plan_complete-


Attachments
Test scripts and logs from run with 3 failures out of 10 runs (6.58 MB, application/gzip)
2021-04-01 15:39 UTC, Nir Soffer
Test configuration that reproduces the issue on a virtualized environment (4.79 KB, application/gzip)
2021-04-14 10:35 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 6129321 0 None None None 2021-06-20 01:36:09 UTC
oVirt gerrit 114192 0 master MERGED livemerge: Always retry failed pivot 2021-04-14 10:19:48 UTC
oVirt gerrit 114193 0 master MERGED livemerge: Keep pivot state in job 2021-04-14 10:19:51 UTC
oVirt gerrit 114194 0 master MERGED livemerge: Fix cleanup if active commit job has gone 2021-04-14 10:19:54 UTC

Description Nir Soffer 2021-04-01 15:36:14 UTC
Description of problem:

In bug 1857347 we tried to fix a case where a libvirt block commit job failed
with an unrecoverable error. Unfortunately, the fix was not correct, making the
situation even worse.

The fix was to abort the merge job after an unexpected libvirt error, but
it did not abort the libvirt block job. Since the block job was left
running, it is not possible to retry the live merge; it will fail
immediately.

Because the top volume is marked as ILLEGAL before trying to complete the
libvirt block job, the VM cannot run without manually fixing the volume
metadata.
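
To make the failure mode concrete, here is a minimal Python sketch of the
failing sequence, not vdsm's actual code: mark_volume_illegal() is a
hypothetical stand-in for vdsm's volume metadata update, while the
blockJobAbort() call with the PIVOT flag matches the traceback below.

import libvirt

def mark_volume_illegal(drive_name):
    """Hypothetical placeholder for vdsm's volume metadata update."""

def try_pivot(dom, drive_name):
    # The top volume is marked ILLEGAL *before* the pivot is attempted,
    # so a failure after this point leaves the VM unable to restart.
    mark_volume_illegal(drive_name)
    try:
        # Pivot = abort the active commit with the PIVOT flag.
        dom.blockJobAbort(
            drive_name, flags=libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
    except libvirt.libvirtError:
        # The broken fix aborted vdsm's merge job here, but never aborted
        # the libvirt block job itself, e.g.:
        #     dom.blockJobAbort(drive_name, flags=0)
        # so the block job kept running and every retry failed.
        raise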


Example flow showing this bug:

1. Vdsm tries to complete a block commit job:

2021-03-21 01:07:53,511+0200 INFO  (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Requesting pivot to complete active layer commit (job 7963caa8-20d0-4975-8afc-f17f463dc9da) (livemerge:771)

2. Completing the job fails:

2021-03-21 01:07:56,122+0200 ERROR (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Pivot failed (job: 7963caa8-20d0-4975-8afc-f17f463dc9da): Job 7963caa8-20d0-4975-8afc-f17f463dc9da failed with libvirt error: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete', aborting due to an unrecoverable error (livemerge:840)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 774, in tryPivot
    self.vm._dom.blockJobAbort(self.drive.name, flags=flags)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort
    if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 820, in run
    self.tryPivot()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 778, in tryPivot
    raise JobUnrecoverableError(self.job.id, e)
vdsm.virt.livemerge.JobUnrecoverableError: Job 7963caa8-20d0-4975-8afc-f17f463dc9da failed with libvirt error: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete'

3. Vdsm aborts the job in the CLEANUP state

2021-03-21 01:08:07,242+0200 ERROR (periodic/1) [virt.livemerge] Cleanup aborted, untracking job 7963caa8-20d0-4975-8afc-f17f463dc9da (livemerge:659)


The libvirt block commit job is left running!

Trying to delete the snapshot again will always fail, since libvirt already
has a block job for this disk.

If the VM is stopped, it will not start again without fixing the volume
metadata.


Version-Release number of selected component (if applicable):
The regression was introduced in 4.30.31
Tested with 4.40.50.7.

How reproducible:
Always, but the libvirt error is extremely hard to reproduce.

Steps to Reproduce:
1. Run delete-snapshot stress with extreme load, introducing a delay of more than
   30 seconds between the block commit ready event and trying to complete the job.
2. Repeat many times

Actual results:
Snapshot cannot be deleted
VM cannot run after stopping it.

Expected results:
Vdsm must abort the libvirt block job before untracking the block job.

Additional info:

Before the fix, if a libvirt block commit failed, vdsm would try to complete
the block job every 15 seconds. According to Peter Krempa, these attempts
are unlikely to be successful, but bug 1945635 is an example where retrying
would succeed.

So I think the best way to fix this is (a minimal sketch of the resulting
state machine follows the expected flow below):

- Keep the cleanup method in the job (e.g. "pivot", "abort")
- Keep a pivot_attempts counter in the job (like extend_attempts)
- When pivot fails, increase the pivot counter.
- When starting cleanup, if the pivot counter exceeds the maximum number,
  change the job cleanup method to "abort". From this point, the job
  will try to abort instead of pivot.
- There is no limit on the number of abort attempts; we must not
  leave a libvirt block job running.

Expected flow, starting at the point we start the cleanup, assuming maximum 3
pivot attempts:

00:00 try to pivot, fail: wait for next update
00:15 try to pivot, fail: wait for next update
00:30 try to pivot, fail: switch job to cleanup="abort"
00:45 try to abort, success: untrack job
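
A minimal sketch of the proposed cleanup state machine, assuming the job
persists a cleanup method and a pivot_attempts counter as described above;
the names besides pivot_attempts/extend_attempts are illustrative, not
vdsm's actual code:

MAX_PIVOT_ATTEMPTS = 3

class Job:
    def __init__(self, job_id):
        self.id = job_id
        self.cleanup = "pivot"    # cleanup method kept in the job
        self.pivot_attempts = 0   # counter kept in the job, like extend_attempts

def cleanup_step(job, pivot, abort):
    # One periodic cleanup attempt, running every ~15 seconds.
    # pivot() and abort() stand in for blockJobAbort() with and
    # without the PIVOT flag; both return True on success.
    if job.cleanup == "pivot":
        job.pivot_attempts += 1
        if pivot():
            return "untrack"
        if job.pivot_attempts >= MAX_PIVOT_ATTEMPTS:
            # Too many failed pivots: switch the job to abort mode.
            job.cleanup = "abort"
        return "retry"
    # No limit on abort attempts; a libvirt block job must never be
    # left running.
    return "untrack" if abort() else "retry"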

Comment 1 Nir Soffer 2021-04-01 15:39:10 UTC
Created attachment 1768291 [details]
Test scripts and logs from run with 3 failures out of 10 runs

Comment 3 Nir Soffer 2021-04-05 11:47:44 UTC
Based on a discussion with John Snow regarding bug 1945635, and the mailing list thread:
https://lists.nongnu.org/archive/html/qemu-block/2021-04/msg00035.html

I don't expect this issue to be solved soon, so having retries in vdsm is likely
the only thing we can do for 4.4.6 (RHEL 8.4). If a live merge job is flipping
between the "ready" and "standby" states, retrying the pivot several times is very
likely to succeed.

The most important change we need to make is to revert the fix for bug 1857347, and go
back to pivot retry.

Once we have pivot retries, limiting the number of retries and switching to a new
cleanup mode when we abort the block commit is nice to have in 4.4.6, but can also
be completed later.
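
The merged patch "livemerge: Always retry failed pivot" (gerrit 114192 in the
links above) follows this approach. A rough Python sketch of the change, under
the assumption of a recoverable exception type; the exception name is
illustrative, not necessarily vdsm's:

import libvirt

class JobPivotError(Exception):
    # Recoverable: the periodic cleanup will run again and retry the pivot.
    pass

def try_pivot(dom, drive_name, job_id):
    try:
        dom.blockJobAbort(
            drive_name, flags=libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)
    except libvirt.libvirtError as e:
        # Before the fix this raised JobUnrecoverableError and the job was
        # untracked while the libvirt block job kept running. Now the error
        # is logged as a warning and the pivot is retried on the next
        # cleanup cycle.
        raise JobPivotError(
            "Pivot failed for job {}: {}".format(job_id, e))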

Comment 4 Nir Soffer 2021-04-13 16:12:19 UTC
Example log with the fix, showing how we recover from a temporary pivot error
due to bug 1945635.

1. Starting merge

2021-04-12 02:31:40,720+0300 INFO  (jsonrpc/6) [api.virt] START merge(drive={'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': '5f76b8de-1396-472e-966b-4258ec4a7f3c', 'imageID': '09ae0748-f465-4c06-b828-e8c60a980660', 'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b'}, baseVolUUID='be3e6031-7386-4188-acac-e3c9d3f373d0', topVolUUID='5f76b8de-1396-472e-966b-4258ec4a7f3c', bandwidth='0', jobUUID='e80636f1-66f0-4fdf-a1a7-233ba80a56db') from=::ffff:192.168.122.11,51588, flow_id=4cfcd6a0-40db-4914-a18f-f13795f2fdb9, vmId=cd61cfb5-8863-44b8-b287-50f3b65f0067 (api:48)

2. Normal EXTEND flow

2021-04-12 02:31:40,786+0300 INFO  (jsonrpc/6) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db switching state from INIT to EXTEND (livemerge:170)

2021-04-12 02:31:40,838+0300 INFO  (jsonrpc/6) [virt.livemerge] Starting extend 1/10 for job=e80636f1-66f0-4fdf-a1a7-233ba80a56db drive=vdh volume=be3e6031-7386-4188-acac-e3c9d3f373d0 (livemerge:466)

2021-04-12 02:31:40,840+0300 INFO  (jsonrpc/6) [vdsm.api] START sendExtendMsg(spUUID='fb7797bd-3cca-4baa-bcb3-35e8851015f7', volDict={'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b', 'imageID': '09ae0748-f465-4c06-b828-e8c60a980660', 'internal': True, 'name': 'vdh', 'newSize': 3544186880, 'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': 'be3e6031-7386-4188-acac-e3c9d3f373d0', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'callback': functools.partial(<bound method DriveMerger._extend_completed of <vdsm.virt.livemerge.DriveMerger object at 0x7f04343d1080>>, job_id='e80636f1-66f0-4fdf-a1a7-233ba80a56db')}, newSize=3544186880, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f04341c3518>>) from=::ffff:192.168.122.11,51588, flow_id=4cfcd6a0-40db-4914-a18f-f13795f2fdb9, task_id=a860dae6-a753-4197-b208-75bd7293a0a8 (api:48)

2021-04-12 02:31:43,064+0300 INFO  (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 7288bf8d-1516-41fc-bcce-32790db907fa (cmd=<function runTask at 0x7f0481111840>, args=(<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f04341c3518>>, {'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b', 'imageID': '09ae0748-f465-4c06-b828-e8c60a980660', 'internal': True, 'name': 'vdh', 'newSize': 3544186880, 'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': 'be3e6031-7386-4188-acac-e3c9d3f373d0', 'clock': <Clock(total=2.23*, extend-volume=2.23, refresh-volume=0.01*)>, 'callback': functools.partial(<bound method DriveMerger._extend_completed of <vdsm.virt.livemerge.DriveMerger object at 0x7f04343d1080>>, job_id='e80636f1-66f0-4fdf-a1a7-233ba80a56db')})) (threadPool:149)

3. EXTEND completed, starting COMMIT

2021-04-12 02:31:43,395+0300 INFO  (mailbox-hsm/2) [virt.livemerge] Extend completed for job e80636f1-66f0-4fdf-a1a7-233ba80a56db, starting commit (livemerge:545)

2021-04-12 02:31:43,395+0300 INFO  (mailbox-hsm/2) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db switching state from EXTEND to COMMIT (livemerge:170)

2021-04-12 02:31:43,417+0300 INFO  (mailbox-hsm/2) [virt.livemerge] Starting merge with job_id='e80636f1-66f0-4fdf-a1a7-233ba80a56db', original chain=be3e6031-7386-4188-acac-e3c9d3f373d0 < 5f76b8de-1396-472e-966b-4258ec4a7f3c (top), disk='vdh', base='vdh[5]', top=None, bandwidth=0, flags=12 (livemerge:408)

4. Receiving first ready event

2021-04-12 02:34:15,133+0300 INFO  (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh is ready (vm:5672)

5. COMMIT completed, job is ready for pivot, starting CLEANUP

2021-04-12 02:34:42,379+0300 INFO  (periodic/0) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db is ready for pivot (livemerge:687)

2021-04-12 02:34:42,379+0300 INFO  (periodic/0) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db switching state from COMMIT to CLEANUP (livemerge:170)

6. Starting first cleanup attempt

2021-04-12 02:34:42,400+0300 INFO  (periodic/0) [virt.livemerge] Starting cleanup for job e80636f1-66f0-4fdf-a1a7-233ba80a56db (livemerge:748)

2021-04-12 02:34:50,614+0300 INFO  (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Requesting pivot to complete active layer commit (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:835)

7. Receiving second ready event (job is flipping between ready
   and standby states on qemu side)

2021-04-12 02:34:51,312+0300 INFO  (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh is ready (vm:5672)

8. Pivot failed because the job is in the standby state

2021-04-12 02:34:51,312+0300 WARN  (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Pivot failed for Job e80636f1-66f0-4fdf-a1a7-233ba80a56db: internal error: unable to execute QEMU command 'job-complete': Job 'commit-vdh-libvirt-23-format' in state 'standby' cannot accept command verb 'complete' (livemerge:898)

9. Starting second cleanup

2021-04-12 02:35:12,379+0300 INFO  (periodic/2) [virt.livemerge] Starting cleanup for job e80636f1-66f0-4fdf-a1a7-233ba80a56db (livemerge:748)

2021-04-12 02:35:21,993+0300 INFO  (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Requesting pivot to complete active layer commit (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:835)

10. Third ready event (job is still flipping on qemu side)

2021-04-12 02:35:24,782+0300 INFO  (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh is ready (vm:5672)

11. Receiving completed event

2021-04-12 02:35:25,856+0300 INFO  (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh has completed (vm:5663)

12. Pivot completed, finishing cleanup

2021-04-12 02:35:25,882+0300 INFO  (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Pivot completed (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:849)

2021-04-12 02:35:25,882+0300 INFO  (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Synchronizing volume chain after live merge (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:886)

2021-04-12 02:35:32,025+0300 INFO  (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Synchronization completed (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:895)

2021-04-12 02:35:42,347+0300 INFO  (periodic/2) [virt.livemerge] Cleanup completed, untracking job e80636f1-66f0-4fdf-a1a7-233ba80a56db (livemerge:727)

Comment 5 Nir Soffer 2021-04-14 10:35:52 UTC
Created attachment 1771854 [details]
Test configuration that reproduces the issue on a virtualized environment

Comment 6 Nir Soffer 2021-04-14 10:39:22 UTC
The issue can be reproduced and tested using the delete-snapshot stress test.

To reproduce, get the ovirt-stress repo:
https://gitlab.com/nirs/ovirt-stress.git

Create a conf.yml file from conf.yml.example and adapt it to your engine
details as explained in:
https://gitlab.com/nirs/ovirt-stress/-/blob/master/delete-snapshot/README.md

Reproducing on a real server with fast storage is very hard and can take a
few days, but if you run the stress test on a virtual environment with
slow storage, the issue can be reproduced quickly, in 1-2 hours.

I reproduced it on a VM-based system:
- 1 engine VM
- 2 host VMs
- 1 storage VM

All running on a laptop.

I used this configuration (changed from the default):
- setup.yml
  - 8 virtio thin disks instead of 3 virtio, 3 virtio-scsi-thin, 3 virtio-scsi raw
- conf.yml:
  - write_data_mb: 1024
  - vm_count: 4
  - iterations: 20

See attachment 1771854 [details] for an example configuration and test script.

With vdsm < 4.40.60.4 we expect to see at least one failed iteration,
and in the vdsm log you will find an error like:

2021-03-21 01:07:56,122+0200 ERROR (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Pivot failed (job: 7963caa8-20d0-4975-8afc-f17f463dc9da): Job 7963caa8-20d0-4975-8afc-f17f463dc9da failed with libvirt error: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete', aborting due to an unrecoverable error (livemerge:840)

And the delete snapshot operation will fail.

With vdsm >= 4.40.60.4 this libvirt error is logged as a warning:

vdsm.log.2.xz:2021-04-13 21:21:05,327+0300 WARN  (merge/a2219037) [virt.vm] (vmId='ccfbb456-be77-4a0a-a158-6887ccc62313') Pivot failed for Job a2219037-d0e0-47f1-a6ee-88680afe1741: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-11-format' in state 'standby' cannot accept command verb 'complete' (livemerge:898)

And vdsm will retry the pivot operation. The next pivot retry is likely to
succeed; if not, vdsm will retry the operation again.
See the example recovery flow in comment 4.

The test run should complete without any failures.

There may be test errors like failing to create a VM or a snapshot, which
are not related to this fix. We have other bugs tracking these issues.
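
For reference, a minimal sketch of the create/delete snapshot loop the stress
test performs, using the oVirt Python SDK (ovirtsdk4). The connection details
and VM name are placeholders, and the real test in ovirt-stress also writes
data in the guest between iterations (write_data_mb); see the repo for the
actual scripts.

import time

import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url="https://engine.example.com/ovirt-engine/api",  # placeholder
    username="admin@internal",
    password="password",
    ca_file="ca.pem",
)

vms_service = connection.system_service().vms_service()
vm = vms_service.list(search="name=stress-vm-1")[0]  # placeholder VM name
snapshots_service = vms_service.vm_service(vm.id).snapshots_service()

for i in range(20):  # iterations: 20
    # Create a snapshot and wait until it is ready.
    snap = snapshots_service.add(
        types.Snapshot(description="stress-%d" % i,
                       persist_memorystate=False))
    snap_service = snapshots_service.snapshot_service(snap.id)
    while snap_service.get().snapshot_status != types.SnapshotStatus.OK:
        time.sleep(5)

    # Deleting the snapshot while the VM is running triggers a live
    # merge (block commit) in vdsm, the operation under test.
    snap_service.remove()
    while True:
        try:
            snap_service.get()
            time.sleep(5)
        except sdk.NotFoundError:
            break

connection.close()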

Comment 7 Nir Soffer 2021-04-14 11:15:48 UTC
This fix does not limit the number of pivot attempts. Fixing this is
tracked in bug 1949470.

Comment 9 Daniel Gur 2021-05-03 06:44:08 UTC
The verification scenario mentioned in comment 4 and comment 6 was already run and verified by Nir with the fix (vdsm-4.40.60.4).
Moving to verified.

Comment 10 Sandro Bonazzola 2021-05-05 05:36:15 UTC
This bugzilla is included in oVirt 4.4.6 release, published on May 4th 2021.

Since the problem described in this bug report should be resolved in oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

