Bug 1945675
| Summary: | delete-snapshot stress with extreme load - Vdsm aborts live merge job without aborting the libvirt block job | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | Nir Soffer <nsoffer> | ||||||
| Component: | Core | Assignee: | Nir Soffer <nsoffer> | ||||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | mlehrer | ||||||
| Severity: | urgent | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 4.30.31 | CC: | aefrat, aoconnor, bugs, dagur, eshenitz, gveitmic, lsvaty, mavital | ||||||
| Target Milestone: | ovirt-4.4.6 | Keywords: | Performance | ||||||
| Target Release: | 4.40.60.4 | Flags: | aoconnor: blocker-, lsvaty: testing_plan_complete- | ||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | vdsm-4.40.60.4 | Doc Type: | Bug Fix | ||||||
| Doc Text: |
Cause:
If deleting a snapshot failed in libvirt because of an unexpected temporary error, vdsm aborted the delete snapshot operation without aborting the underlying libvirt block job.
Consequence:
A future attempt to delete the snapshot would fail, because the libvirt block job was still running. If the VM was stopped, it could not be started again because the top volume was marked as illegal. This required a manual fix of the volume metadata.
Fix:
Vdsm now retries completing the delete snapshot operation after unexpected errors.
Result:
The delete snapshot operation completes successfully after temporary unexpected errors.
|
| Story Points: | --- | ||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2021-05-05 05:36:15 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Attachments: | | ||||||||
Description
Nir Soffer
2021-04-01 15:36:14 UTC
Created attachment 1768291 [details]
Test scripts and logs from a run with 3 failures out of 10 runs
Based on discussion with John Snow regarding bug 1945635, and the mailing list thread https://lists.nongnu.org/archive/html/qemu-block/2021-04/msg00035.html, I don't expect this issue to be solved soon, so having retries in vdsm is likely the only thing we can do for 4.4.6 (RHEL 8.4).

If a live merge job is flipping between "ready" and "standby" state, retrying the pivot several times is very likely to succeed.

The most important change we need to do is to revert the fix for bug 1857347, and go back to pivot retry. Once we have pivot retries, limiting the number of retries and switching to a new cleanup mode when we abort the block commit is nice to have in 4.4.6, but can also be completed later.

Example log with the fix, showing how we recover from a temporary pivot error due to bug 1945635:

1. Starting merge

    2021-04-12 02:31:40,720+0300 INFO (jsonrpc/6) [api.virt] START merge(drive={'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': '5f76b8de-1396-472e-966b-4258ec4a7f3c', 'imageID': '09ae0748-f465-4c06-b828-e8c60a980660', 'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b'}, baseVolUUID='be3e6031-7386-4188-acac-e3c9d3f373d0', topVolUUID='5f76b8de-1396-472e-966b-4258ec4a7f3c', bandwidth='0', jobUUID='e80636f1-66f0-4fdf-a1a7-233ba80a56db') from=::ffff:192.168.122.11,51588, flow_id=4cfcd6a0-40db-4914-a18f-f13795f2fdb9, vmId=cd61cfb5-8863-44b8-b287-50f3b65f0067 (api:48)

2. Normal EXTEND flow

    2021-04-12 02:31:40,786+0300 INFO (jsonrpc/6) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db switching state from INIT to EXTEND (livemerge:170)
    2021-04-12 02:31:40,838+0300 INFO (jsonrpc/6) [virt.livemerge] Starting extend 1/10 for job=e80636f1-66f0-4fdf-a1a7-233ba80a56db drive=vdh volume=be3e6031-7386-4188-acac-e3c9d3f373d0 (livemerge:466)
    2021-04-12 02:31:40,840+0300 INFO (jsonrpc/6) [vdsm.api] START sendExtendMsg(spUUID='fb7797bd-3cca-4baa-bcb3-35e8851015f7', volDict={'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b', 'imageID': '09ae0748-f465-4c06-b828-e8c60a980660', 'internal': True, 'name': 'vdh', 'newSize': 3544186880, 'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': 'be3e6031-7386-4188-acac-e3c9d3f373d0', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'callback': functools.partial(<bound method DriveMerger._extend_completed of <vdsm.virt.livemerge.DriveMerger object at 0x7f04343d1080>>, job_id='e80636f1-66f0-4fdf-a1a7-233ba80a56db')}, newSize=3544186880, callbackFunc=<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f04341c3518>>) from=::ffff:192.168.122.11,51588, flow_id=4cfcd6a0-40db-4914-a18f-f13795f2fdb9, task_id=a860dae6-a753-4197-b208-75bd7293a0a8 (api:48)
    2021-04-12 02:31:43,064+0300 INFO (mailbox-hsm/2) [storage.ThreadPool.WorkerThread] START task 7288bf8d-1516-41fc-bcce-32790db907fa (cmd=<function runTask at 0x7f0481111840>, args=(<bound method Vm.after_volume_extension of <vdsm.virt.vm.Vm object at 0x7f04341c3518>>, {'domainID': 'a682d0d8-b135-43ca-ab3c-48155cebd40b', 'imageID': '09ae0748-f465-4c06-b828-e8c60a980660', 'internal': True, 'name': 'vdh', 'newSize': 3544186880, 'poolID': 'fb7797bd-3cca-4baa-bcb3-35e8851015f7', 'volumeID': 'be3e6031-7386-4188-acac-e3c9d3f373d0', 'clock': <Clock(total=2.23*, extend-volume=2.23, refresh-volume=0.01*)>, 'callback': functools.partial(<bound method DriveMerger._extend_completed of <vdsm.virt.livemerge.DriveMerger object at 0x7f04343d1080>>, job_id='e80636f1-66f0-4fdf-a1a7-233ba80a56db')})) (threadPool:149)

3. EXTEND completed, starting COMMIT

    2021-04-12 02:31:43,395+0300 INFO (mailbox-hsm/2) [virt.livemerge] Extend completed for job e80636f1-66f0-4fdf-a1a7-233ba80a56db, starting commit (livemerge:545)
    2021-04-12 02:31:43,395+0300 INFO (mailbox-hsm/2) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db switching state from EXTEND to COMMIT (livemerge:170)
    2021-04-12 02:31:43,417+0300 INFO (mailbox-hsm/2) [virt.livemerge] Starting merge with job_id='e80636f1-66f0-4fdf-a1a7-233ba80a56db', original chain=be3e6031-7386-4188-acac-e3c9d3f373d0 < 5f76b8de-1396-472e-966b-4258ec4a7f3c (top), disk='vdh', base='vdh[5]', top=None, bandwidth=0, flags=12 (livemerge:408)

4. Receiving first ready event

    2021-04-12 02:34:15,133+0300 INFO (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh is ready (vm:5672)

5. COMMIT completed, job is ready for pivot, starting CLEANUP

    2021-04-12 02:34:42,379+0300 INFO (periodic/0) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db is ready for pivot (livemerge:687)
    2021-04-12 02:34:42,379+0300 INFO (periodic/0) [virt.livemerge] Job e80636f1-66f0-4fdf-a1a7-233ba80a56db switching state from COMMIT to CLEANUP (livemerge:170)

6. Starting first cleanup attempt

    2021-04-12 02:34:42,400+0300 INFO (periodic/0) [virt.livemerge] Starting cleanup for job e80636f1-66f0-4fdf-a1a7-233ba80a56db (livemerge:748)
    2021-04-12 02:34:50,614+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Requesting pivot to complete active layer commit (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:835)

7. Receiving second ready event (job is flipping between ready and standby states on the qemu side)

    2021-04-12 02:34:51,312+0300 INFO (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh is ready (vm:5672)

8. Pivot failed because job is in standby state

    2021-04-12 02:34:51,312+0300 WARN (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Pivot failed for Job e80636f1-66f0-4fdf-a1a7-233ba80a56db: internal error: unable to execute QEMU command 'job-complete': Job 'commit-vdh-libvirt-23-format' in state 'standby' cannot accept command verb 'complete' (livemerge:898)

9. Starting second cleanup

    2021-04-12 02:35:12,379+0300 INFO (periodic/2) [virt.livemerge] Starting cleanup for job e80636f1-66f0-4fdf-a1a7-233ba80a56db (livemerge:748)
    2021-04-12 02:35:21,993+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Requesting pivot to complete active layer commit (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:835)

10. Third ready event (job is still flipping on the qemu side)

    2021-04-12 02:35:24,782+0300 INFO (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh is ready (vm:5672)

11. Receiving completed event

    2021-04-12 02:35:25,856+0300 INFO (libvirt/events) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Block job e80636f1-66f0-4fdf-a1a7-233ba80a56db type ACTIVE_COMMIT for drive vdh has completed (vm:5663)

12. Pivot completed, finishing cleanup

    2021-04-12 02:35:25,882+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Pivot completed (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:849)
    2021-04-12 02:35:25,882+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Synchronizing volume chain after live merge (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:886)
    2021-04-12 02:35:32,025+0300 INFO (merge/e80636f1) [virt.vm] (vmId='cd61cfb5-8863-44b8-b287-50f3b65f0067') Synchronization completed (job e80636f1-66f0-4fdf-a1a7-233ba80a56db) (livemerge:895)
    2021-04-12 02:35:42,347+0300 INFO (periodic/2) [virt.livemerge] Cleanup completed, untracking job e80636f1-66f0-4fdf-a1a7-233ba80a56db (livemerge:727)

Created attachment 1771854 [details]
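The recovery flow in the log above can be sketched as a small retry loop. This is a minimal illustration only: the names `FakeBlockJob`, `TemporaryPivotError`, and `cleanup` are invented for the sketch and are not vdsm's actual API, and in real vdsm the retries are driven by periodic cleanup checks rather than a tight loop.

```python
class TemporaryPivotError(Exception):
    """Raised when qemu rejects 'job-complete', e.g. job in 'standby'."""


class FakeBlockJob:
    """Stands in for a libvirt active-commit job flipping ready/standby."""

    def __init__(self, failures):
        self.failures = failures  # how many pivot attempts will fail
        self.completed = False

    def complete(self):
        if self.failures > 0:
            self.failures -= 1
            raise TemporaryPivotError(
                "Job in state 'standby' cannot accept command verb 'complete'")
        self.completed = True


def cleanup(job, max_attempts=10):
    """Retry the pivot after temporary errors, keeping the block job
    alive between attempts instead of aborting it (aborting the vdsm job
    while the libvirt block job kept running is the bug being fixed)."""
    for attempt in range(1, max_attempts + 1):
        try:
            job.complete()
            return attempt  # pivot succeeded on this attempt
        except TemporaryPivotError:
            # Log a warning and leave the job running; the next periodic
            # cleanup retries the pivot.
            continue
    raise RuntimeError("Exceeded pivot retries")


job = FakeBlockJob(failures=2)
print(cleanup(job))     # 3: third attempt succeeds, like the log above
print(job.completed)    # True
```

As in the log, the first pivot attempt fails while the job is in standby, and a later attempt succeeds once the job is ready again.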
Test configuration that reproduces the issue on a virtualized environment
The issue can be reproduced and tested using the delete-snapshot stress test. To reproduce, get the ovirt-stress repo: https://gitlab.com/nirs/ovirt-stress.git

Create a conf.yml file from conf.yml.example and adapt it to your engine details as explained in: https://gitlab.com/nirs/ovirt-stress/-/blob/master/delete-snapshot/README.md

Reproducing on a real server with fast storage is very hard and can take a few days, but if you run the stress test on a virtual environment with slow storage, the issue can be reproduced quickly, in 1-2 hours. I reproduced it on a vm based system:
- engine vm
- 2 host vms
- 1 storage vm

All running on a laptop. I used this configuration (changed from the default):
- setup.yml: 8 virtio thin disks instead of 3 virtio, 3 virtio-scsi-thin, 3 virtio-scsi raw
- conf.yml:
  - write_data_mb: 1024
  - vm_count: 4
  - iterations: 20

See attachment 1771854 [details] for example configuration and test script.

With vdsm < 4.40.60.4 we expect to see at least one failed iteration, and in the vdsm log you will find an error like:

    2021-03-21 01:07:56,122+0200 ERROR (merge/7963caa8) [virt.vm] (vmId='8d68ed91-ab0c-42fe-9c42-96bcb5975e19') Pivot failed (job: 7963caa8-20d0-4975-8afc-f17f463dc9da): Job 7963caa8-20d0-4975-8afc-f17f463dc9da failed with libvirt error: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-15-format' in state 'standby' cannot accept command verb 'complete', aborting due to an unrecoverable error (livemerge:840)

And the delete snapshot operation will fail.

With vdsm >= 4.40.60.4 this libvirt error is logged as a warning:

    vdsm.log.2.xz:2021-04-13 21:21:05,327+0300 WARN (merge/a2219037) [virt.vm] (vmId='ccfbb456-be77-4a0a-a158-6887ccc62313') Pivot failed for Job a2219037-d0e0-47f1-a6ee-88680afe1741: internal error: unable to execute QEMU command 'job-complete': Job 'commit-sda-libvirt-11-format' in state 'standby' cannot accept command verb 'complete' (livemerge:898)

And vdsm will retry the pivot operation.
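The difference between the two vdsm versions boils down to classifying this qemu error as temporary instead of fatal. A rough sketch of such a check, using the error text from the logs above (the helper name is hypothetical; vdsm's real logic lives in its livemerge code and may match errors differently):

```python
# Marker text qemu returns when 'job-complete' is sent while the block
# job is in 'standby'; the job will flip back to 'ready', so the pivot
# can safely be retried instead of aborting the merge.
STANDBY_MARKER = "cannot accept command verb 'complete'"


def is_temporary_pivot_error(message):
    """Return True if a failed pivot should be retried (WARN and retry)
    rather than treated as unrecoverable (ERROR and abort)."""
    return STANDBY_MARKER in message


msg = ("internal error: unable to execute QEMU command 'job-complete': "
       "Job 'commit-sda-libvirt-11-format' in state 'standby' "
       "cannot accept command verb 'complete'")
print(is_temporary_pivot_error(msg))  # True
```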
The next pivot retry is likely to succeed, and if not, vdsm will retry the operation again. See the example recovery flow in comment 4.

The test run should complete without any failures. There may be test errors, like failing to create a vm or a snapshot, which are not related to this fix; we have other bugs tracking these issues.

This fix does not limit the number of pivot attempts. Fixing this is tracked in bug 1949470.

The verification scenario mentioned in comment 4 and comment 6 was already run and verified by Nir with the fix (vdsm-4.40.60.4). Moving to verified.

This bugzilla is included in the oVirt 4.4.6 release, published on May 4th 2021. Since the problem described in this bug report should be resolved in the oVirt 4.4.6 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.