Description of problem: In bug 1857347 we tried to fix a case when libvirt block commit job failed with unrecoverable error. Unfortunately the fix was not correct, making the situation even worse. The fix was to abort the merge job after unexpected libvirt error, but the fix did not abort the libvirt block job. Since the block job was left running, it is not possible to to retry the live merge again, it will fail immediately. Because the top volume is marked as ILLEGAL before trying to complete the libvirt block job, the VM cannot run without manually fixing the volume metadata. Eample flow showing this bug: 1. Vdsm try to complete a block commit job: 2021-03-21 01:07:53,511+0200 INFO (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Requesting pivot to complete active layer commit (job 7963caa8-20d0-4975-8afc-f17f463dc9da) (livemerge:771) 2. Compelting the job fail: 2021-03-21 01:07:56,122+0200 ERROR (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Pivot failed (job: 7963caa8-20d0-4975-8afc-f17f463dc9da): Job 7963caa8-20d0-4975-8afc-f17f463dc9da failed with libvirt error: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete', aborting due to an unrecoverable error (livemerge:840) Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 774, in tryPivot self.vm._dom.blockJobAbort(self.drive.name, flags=flags) File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f ret = attr(*args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper ret = f(*args, **kwargs) File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper return func(inst, *args, **kwargs) File "/usr/lib64/python3.6/site-packages/libvirt.py", line 888, in blockJobAbort if ret == -1: raise libvirtError ('virDomainBlockJobAbort() failed', dom=self) libvirt.libvirtError: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete' During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 820, in run self.tryPivot() File "/usr/lib/python3.6/site-packages/vdsm/virt/livemerge.py", line 778, in tryPivot raise JobUnrecoverableError(self.job.id, e) vdsm.virt.livemerge.JobUnrecoverableError: Job 7963caa8-20d0-4975-8afc-f17f463dc9da failed with libvirt error: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete' 3. Vdsm abort the job in CLEANUP state 2021-03-21 01:08:07,242+0200 ERROR (periodic/1) [virt.livemerge] Cleanup aborted, untracking job 7963caa8-20d0-4975-8afc-f17f463dc9da (livemerge:659) Libvirt block commit job is left running! Trying to delete snapshot again will always fail, since libvirt already has a block job for this disk. If the vm is stopped, it will not start again without fixing the volume metadata. Version-Release number of selected component (if applicable): The regression was introduced in 4.30.31 Tested with 4.40.50.7. How reproducible: Always, but the libvirt error is extremly hard to reproduce. Steps to Reproduce: 1. Run delete-snapshot stress with extreme load, introducing delay of more than 30 seconds between block commit ready event and trying to complete the job. 2. Repeat many times Actual results: Snapshot cannot be deleted VM cannot run after stoopping it. Expected results: Vdsm must abort the libvirt block job before untracking the block job. Additional info: Before the fix, if a libvirt block commit fails, vdsm would try to complete the block job every 15 seconds. According to Peter krempa, these attempts are unlikely to be successful, but bug 1945635 is example when rertrying would succeed. So I think the best way to fix this is: - Keep the cleanup method in the job (e.g. "pivot", "abort") - Keep pivot_attempts counter in the job (like extend_attempts) - When pivot fail, increase the pivot counter. - When starting cleanup, if pivot counter exceed the maximum number, change the job cleanp method to "abort". From this point, the job will try to abort instead of pivot. - There is not limit the the number of abort atempts, we must not leave libvirt block job running. Expected flow, starting at the point we start the cleanup, assuming maximum 3 pivot attempts: 00:00 try to pivot, fail: wait for next update 00:15 try to pivot, fail: wait for next update 00:30 try to pivot, fail: switch job to cleanup="abort" 09:45 try to abort, success: untrack job
Created attachment 1768291 [details] Test scripts and logs from run with 3 failures out of 10 runs
Based on discussion with John Snow regarding bug 1945635, and mailing list thread: https://lists.nongnu.org/archive/html/qemu-block/2021-04/msg00035.html I don't expect this issue to be solved soon, so having retries in vdsm is likely the only thing we can do for 4.4.6 (rhel 8.4). If a live merge job is flipping between "ready" and "standby" state, retrying the pivot several times is very likely to succeed. The most important change we need to do is to revert the fix for bug 1857347, and go back to pivot retry. Once we have pivot retries, limiting the number of retries and switching to new cleanup mode when we abort the block commit is nice to have in 4.4.6, but can also be completed later.
Example log with the fix, showing how we recover from temporary pivot error due to bug 1945635. 1. Starting merge 2021-04-12 02:31:40,720+0300 INFO (jsonrpc/6) [api.virt] START merge(drive={'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': '5f76b8de-1396-472e-966b-4258ec4a7f3c', 'imageID': '09ae0748-f465-4c06-b828-e8c60a980660', 'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b'}, baseVolUUID='be3e6031-7386-4188-acac-e3c9d3f373d0', topVolUUID='5f76b8de-1396-472e-966b-4258ec4a7f3c', bandwidth='0', jobUUID='e80636f1-66f0-4fdf-a1a7-233ba80a56db') from=::ffff:192.168.122.11,51588, flow_id=4cfcd6a0-40db-4914-a18f-f13795f2fdb9, vmId=cd61cfb5-8863-44b8-b287-50f3b65f0067 (api:48) 2. Normal EXTEND flow 2021-04-12 02:31:40,786+0300 INFO (jsonrpc/6) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db switching state from INIT to EXTEND (livemerge:170) 2021-04-12 02:31:40,838+0300 INFO (jsonrpc/6) [virt.livemerge] Starting extend 1/10 for job=e80636f1-66f0-4fdf-a1a7-233ba80a56db drive=vdh volume=be3e6031-7386-4188-acac-e3c9d3f373d0 (livemerge:466) 2021-04-12 02:31:40,840+0300 INFO (jsonrpc/6) [vdsm.api] START sendExtendMsg(spUUID='fb7797bd-3cca-4baa-bcb3-35e8851015f7', volDict={'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b', 'imageID': '09ae0748-f465-4c06-b828-e8c60a980660', 'internal': True, 'name': 'vdh', 'newSize': 3544186880, 'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': 'be3e6031-7386-4188-acac-e3c9d3f373d0', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'callback': functools.partial(<bound method DriveMerger._extend_completed of <vdsm.virt.livemerge.DriveMerger object at 0x7f04343d1080>>, job_id='e80636f1-66f0-4fdf-a1a7-233ba80a56db')}, newSize=3544186880, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f04341c3518>>) from=::ffff:192.168.122.11,51588, flow_id=4cfcd6a0-40db-4914-a18f-f13795f2fdb9, task_id=a860dae6-a753-4197-b208-75bd7293a0a8 (api:48) 2021-04-12 02:31:43,064+0300 INFO (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 7288bf8d-1516-41fc-bcce-32790db907fa (cmd=<function runTask at 0x7f0481111840>, args=(<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f04341c3518>>, {'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b', 'imageID': '09ae0748-f465-4c06-b828-e8c60a980660', 'internal': True, 'name': 'vdh', 'newSize': 3544186880, 'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': 'be3e6031-7386-4188-acac-e3c9d3f373d0', 'clock': <Clock(total=2.23*, extend-volume=2.23, refresh-volume=0.01*)>, 'callback': functools.partial(<bound method DriveMerger._extend_completed of <vdsm.virt.livemerge.DriveMerger object at 0x7f04343d1080>>, job_id='e80636f1-66f0-4fdf-a1a7-233ba80a56db')})) (threadPool:149) 3. EXTEND completed, starting COMMIT 2021-04-12 02:31:43,395+0300 INFO (mailbox-hsm/2) [virt.livemerge] Extend completed for job e80636f1-66f0-4fdf-a1a7-233ba80a56db, starting commit (livemerge:545) 2021-04-12 02:31:43,395+0300 INFO (mailbox-hsm/2) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db switching state from EXTEND to COMMIT (livemerge:170) 2021-04-12 02:31:43,417+0300 INFO (mailbox-hsm/2) [virt.livemerge] Starting merge with job_id='e80636f1-66f0-4fdf-a1a7-233ba80a56db', original chain=be3e6031-7386-4188-acac-e3c9d3f373d0 < 5f76b8de-1396-472e-966b-4258ec4a7f3c (top), disk='vdh', base='vdh[5]', top=None, bandwidth=0, flags=12 (livemerge:408) 4. Receiving first ready event 2021-04-12 02:34:15,133+0300 INFO (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh is ready (vm:5672) 5. COMMIT completed, job is read for pivot, starting CLEANUP 2021-04-12 02:34:42,379+0300 INFO (periodic/0) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db is ready for pivot (livemerge:687) 2021-04-12 02:34:42,379+0300 INFO (periodic/0) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db switching state from COMMIT to CLEANUP (livemerge:170) 6. Starting first cleanup attempt 2021-04-12 02:34:42,400+0300 INFO (periodic/0) [virt.livemerge] Starting cleanup for job e80636f1-66f0-4fdf-a1a7-233ba80a56db (livemerge:748) 2021-04-12 02:34:50,614+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Requesting pivot to complete active layer commit (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:835) 6. Receiving second ready event (job is flipping between ready and standby states on qemu side) 2021-04-12 02:34:51,312+0300 INFO (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh is ready (vm:5672) 7. Pivot failed because job is in standby state 2021-04-12 02:34:51,312+0300 WARN (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Pivot failed for Job e80636f1-66f0-4fdf-a1a7-233ba80a56db: internal error: unable to execute QEMU command 'job-complete': Job 'commit-vdh-libvirt-23-format' in state 'standby' cannot accept command verb 'complete' (livemerge:898) 8. Staring second cleanup 2021-04-12 02:35:12,379+0300 INFO (periodic/2) [virt.livemerge] Starting cleanup for job e80636f1-66f0-4fdf-a1a7-233ba80a56db (livemerge:748) 2021-04-12 02:35:21,993+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Requesting pivot to complete active layer commit (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:835) 9. Third ready event (job is still flipping on qemu side) 2021-04-12 02:35:24,782+0300 INFO (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh is ready (vm:5672) 10. Receiving completed event 2021-04-12 02:35:25,856+0300 INFO (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh has completed (vm:5663) 11. Pivot completed, finishing cleanup 2021-04-12 02:35:25,882+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Pivot completed (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:849) 2021-04-12 02:35:25,882+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Synchronizing volume chain after live merge (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:886) 2021-04-12 02:35:32,025+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Synchronization completed (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:895) 2021-04-12 02:35:42,347+0300 INFO (periodic/2) [virt.livemerge] Cleanup completed, untracking job e80636f1-66f0-4fdf-a1a7-233ba80a56db (livemerge:727)
Created attachment 1771854 [details] Test configuration that reproduce the issue on visualized environment
The issue can be reproduced and tested using delete-snapshot stress test. To reproduce, get the ovirt-test repo: https://gitlab.com/nirs/ovirt-stress.git Create conf.yml file from conf.yml.example and configure to adapt to your engine details as explained in: https://gitlab.com/nirs/ovirt-stress/-/blob/master/delete-snapshot/README.md Reproducing on a real server with fast storage is very hard and can take few days, but if you run the stress test on a virtual environment with slow storage, the issue can be reproduce quickly in 1-2 hours. I reproduced it on a vm based system: - engine vm - 2 hosts vm - 1 storage vm All running on a laptop. I used this configuration (changed from the defalt): - setup.yml - 8 virtio thin disks instead of 3 virtio, 3 virtio-scsi-thin, 3 virtio-scsi raw - conf.yml: - write_data_mb: 1024 - vm_count: 4 - iterations: 20 See attachment 1771854 [details] for example configuration and test script. With vdsm < 4.40.60.4 we expect to see at least one failed iteration, and in vdsm log you will find error like: 2021-03-21 01:07:56,122+0200 ERROR (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Pivot failed (job: 7963caa8-20d0-4975-8afc-f17f463dc9da): Job 7963caa8-20d0-4975-8afc-f17f463dc9da failed with libvirt error: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete', aborting due to an unrecoverable error (livemerge:840) And the delete snapshot operation will fail. With vdsm >= 4.40.60.4 this libvirt error is logged as warning: vdsm.log.2.xz:2021-04-13 21:21:05,327+0300 WARN (merge/a2219037) [virt.vm] (vmId='ccfbb456-be77-4a0a-a158-6887ccc62313') Pivot failed for Job a2219037-d0e0-47f1-a6ee-88680afe1741: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-11-format' in state 'standby' cannot accept command verb 'complete' (livemerge:898) And vdsm will retry the pivot operation. The next pivot retry is likely to succeed, and if not vdsm will retry the operation again. See example recovery flow in comment 4. The test run should complete without any failures. There may be tests errors like failing to create a vm or a snapshot, which are not related to this fix. We have other bug tracking these issues.
This fix do not limit the number of pivot attempts. Fixing this is tracked in bug 1949470.
The verification scenario mentioned in comments 4 and comment 6 was already run and verified by Nir with the fix(vdsm-4.40.60.4). Moving to verified.
This bugzilla is included in oVirt 4.4.6 release, published on May 4th 2021. Since the problem described in this bug report should be resolved in oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.