Description of problem:

Filing this bug against VDSM, but it may well turn into an LVM2 bug.

The customer sometimes fails to start VMs with this:

~~~
2021-10-28 12:02:16,058+0200 WARN (vm/89788eb9) [storage.LVM] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', \'--config\', \'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600601603cc0450024da7861542452d7$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }\', \'--refresh\', \'15dacc79-bbb7-4eb3-b1de-82066e97d5fa/ca897bf5-5079-40aa-b54d-cc0522ea9c69\'] rc=5 out=[] err=[\' device-mapper: resume ioctl on (253:119) failed: Invalid argument\', \' Unable to resume 15dacc79--bbb7--4eb3--b1de--82066e97d5fa-ca897bf5--5079--40aa--b54d--cc0522ea9c69 (253:119).\', \' Failed to reactivate 15dacc79-bbb7-4eb3-b1de-82066e97d5fa/ca897bf5-5079-40aa-b54d-cc0522ea9c69.\', \' Releasing activation in critical section.\', \' Releasing activation in critical section.\', \' libdevmapper exiting with 1 device(s) still suspended.\']' (lvm:534)
~~~

So prepareImage() fails and the VM won't start.

Looking at the setup, we have dozens of thin VMs based on that template volume 15dacc79-bbb7-4eb3-b1de-82066e97d5fa/ca897bf5-5079-40aa-b54d-cc0522ea9c69, and we are starting more VMs in parallel, with the parent being the same volume. So we do things like 2 refreshes at the same time, with the LV already active and in use by other VMs.

For example, just before that error, we have 2 VMs starting and doing prepareImage():

2021-10-28 12:02:15,799+0200 INFO (vm/aa471eae) [storage.LVM] Refreshing active lvs: vg=15dacc79-bbb7-4eb3-b1de-82066e97d5fa lvs=['ca897bf5-5079-40aa-b54d-cc0522ea9c69'] (lvm:1785)
2021-10-28 12:02:15,810+0200 INFO (vm/89788eb9) [storage.LVM] Refreshing active lvs: vg=15dacc79-bbb7-4eb3-b1de-82066e97d5fa lvs=['ca897bf5-5079-40aa-b54d-cc0522ea9c69'] (lvm:1785)

There are 2 parts to this:

a) lvchange succeeds, but with this warning, as if something went wrong even though the command did not fail. This is super easy to reproduce and I hit it every time.

~~~
Releasing activation in critical section.
Releasing activation in critical section.
libdevmapper exiting with 1 device(s) still suspended.
~~~
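The warning in (a) can be triggered outside of VDSM with nothing more than two concurrent refreshes of the same already-active LV. A minimal sketch, assuming the VG/LV names from the log above and leaving out VDSM's --config options for brevity:

~~~
# Two refreshes of the same active LV started in parallel; each refresh
# suspends and resumes the LV's dm device, so the two can race on the
# suspend/resume ioctls.
VG=15dacc79-bbb7-4eb3-b1de-82066e97d5fa
LV=ca897bf5-5079-40aa-b54d-cc0522ea9c69
lvchange --refresh "$VG/$LV" &
lvchange --refresh "$VG/$LV" &
wait
~~~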
Example, 2 VMs starting:

2021-10-28 15:55:26,116+0200 INFO (vm/de14dc94) [storage.LVM] Refreshing active lvs: vg=15dacc79-bbb7-4eb3-b1de-82066e97d5fa lvs=['ca897bf5-5079-40aa-b54d-cc0522ea9c69'] (lvm:1785)
2021-10-28 15:55:26,127+0200 INFO (vm/1d41df32) [storage.LVM] Refreshing active lvs: vg=15dacc79-bbb7-4eb3-b1de-82066e97d5fa lvs=['ca897bf5-5079-40aa-b54d-cc0522ea9c69'] (lvm:1785)
2021-10-28 15:55:26,366+0200 WARN (vm/1d41df32) [storage.LVM] Command ['/sbin/lvm', 'lvchange', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/3600601603cc0450024da7861542452d7$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }', '--refresh', '15dacc79-bbb7-4eb3-b1de-82066e97d5fa/ca897bf5-5079-40aa-b54d-cc0522ea9c69'] succeeded with warnings: [' Releasing activation in critical section.', ' Releasing activation in critical section.', ' libdevmapper exiting with 1 device(s) still suspended.'] (lvm:355)

b) lvchange fails, causing the VM to fail to start. Then, in addition to the warning above, we get the full failure. This is a bit harder to hit; it takes a few retries.

~~~
device-mapper: resume ioctl on (253:119) failed: Invalid argument
Unable to resume 15dacc79--bbb7--4eb3--b1de--82066e97d5fa-ca897bf5--5079--40aa--b54d--cc0522ea9c69 (253:119)
Releasing activation in critical section.
Releasing activation in critical section.
libdevmapper exiting with 1 device(s) still suspended.
~~~

I can reproduce both on my test env by doing this:

1. Create a template on the SD
2. Create a few thin VMs based on that template
3. Start 1-2 VMs

At this point the LVs on the host look like this:

b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292 -wi-ao---- IU_edbe5c3c-e562-4cb4-8137-521bd37e1991,MD_4, PU_00000000-0000-0000-0000-000000000000 <-- template
05cc5eb7-62c1-4b1f-ab09-f561d3938b7d -wi------- IU_42b4b2ef-91cb-42d4-83dc-b40e6bba2205,MD_9, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292
261d2265-5c9c-41fc-9a8d-b3b9ec9e141f -wi------- IU_438d055b-548a-43da-bd4e-fff8ad247274,MD_6, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292
77564096-a75f-4011-9df7-86ea6edc305e -wi-ao---- IU_fa25d518-c16d-44c6-abad-d9167e0ff5fd,MD_8, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292 <-- running VM
8d9cd490-241e-463b-ae4d-806c1bdd56be -wi------- IU_f182f31b-d4c4-40dc-8751-75dade8cc882,MD_10,PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292
b4eac262-ca7b-4223-ab4d-e4c1a458073a -wi-ao---- IU_51d225c3-9cab-4e10-a6db-c5769853a381,MD_3, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292 <-- running VM
bc81835c-7649-4fcd-a166-01bade4fdc67 -wi------- IU_f5fa1b3a-339d-4b6c-9f46-631d84c2c94a,MD_5, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292
bef7ce3a-bc05-4ad8-a534-158abd58d1f4 -wi------- IU_4c3a192b-04ba-42d1-a9cd-f4bdccf68b20,MD_11,PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292

4. On the host running the VMs, run the exact same lvchange --refresh that VDSM is doing, but in a quick loop.
In my case, the template LV is that b2ab8b07, so I do this:

~~~
[root@rhvh-1 ~]# while [ true ]; do lvchange --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/360014058e776cbadb02fd473edb913d2$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --refresh 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292; done
~~~

5. Start the other VMs

~~~
device-mapper: resume ioctl on (252:8) failed: Invalid argument
Unable to resume 2f93a595--a602--4bfc--b35e--9e78b89f27b4-b2ab8b07--a2aa--408a--a4b4--4cd64a7f0292 (252:8).
Failed to reactivate 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292.
~~~

Version-Release number of selected component (if applicable):
lvm2-2.03.11-5.el8.x86_64
vdsm-4.40.80.6-1.el8ev.x86_64
kernel-4.18.0-305.17.1.el8_4.x86_64

How reproducible:
* Always

Steps to Reproduce:
* As above

Actual results:
* lvchange fails sometimes

Expected results:
* lvchange always succeeds and the VM runs
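When the refresh fails like this, the messages also report libdevmapper exiting with a device still suspended. A hedged way to check whether the template LV's dm device was actually left suspended on the host (device name taken from the error above; whether it really stays suspended after these failures has not been verified here):

~~~
# "State: SUSPENDED" would mean I/O to the device is blocked until it is resumed.
dmsetup info 2f93a595--a602--4bfc--b35e--9e78b89f27b4-b2ab8b07--a2aa--408a--a4b4--4cd64a7f0292 | grep State

# If it really is stuck in SUSPENDED, it can be resumed by hand:
# dmsetup resume 2f93a595--a602--4bfc--b35e--9e78b89f27b4-b2ab8b07--a2aa--408a--a4b4--4cd64a7f0292
~~~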
It sounds like two issues, and both sound similar to issues we've seen on RHV before:

1. Concurrent LV refreshes on one host.
2. Concurrent LV refresh on one host and a VG modification on another host (checksum error).

It would be helpful if you could reproduce these with lvm debug logging enabled (lvm.conf log/file=<path> and log/level=7).
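If editing lvm.conf on the hosts is awkward, the same log settings can likely be passed per command via --config for the reproduction loop instead. A sketch, where the log file path is only an example:

~~~
# Add a log section to the --config string used in the loop above
# (keep the existing devices/global/backup sections as they are).
lvchange --config 'log { file="/tmp/lvm-debug.log" level=7 }' \
    --refresh 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292
~~~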
David, thanks for helping on this one too :)

First of all, I've updated everything to the latest 8.5:

vdsm-4.40.90.4-1.el8ev.x86_64
lvm2-2.03.12-10.el8.x86_64
kernel-4.18.0-348.2.1.el8_5.x86_64

So, while looping this on the host:

~~~
[root@rhvh-1 ~]# while [ true ]; do lvchange --config 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/360014058e776cbadb02fd473edb913d2$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }' --refresh 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292; done
~~~

I got this on the terminal while starting 7 VMs:

~~~
Releasing activation in critical section.
Releasing activation in critical section.
libdevmapper exiting with 1 device(s) still suspended.
device-mapper: resume ioctl on (253:22) failed: Invalid argument
Unable to resume 2f93a595--a602--4bfc--b35e--9e78b89f27b4-b2ab8b07--a2aa--408a--a4b4--4cd64a7f0292 (253:22).
Failed to reactivate 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292.
~~~

VDSM also saw some problems executing its own LVM commands, including the main one that prevents VMs from running in the customer ticket (Unable to resume...):

~~~
2021-11-17 15:55:30,375+1000 WARN (vm/24e01cb6) [storage.LVM] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', \'--config\', \'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/360014058e776cbadb02fd473edb913d2$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }\', \'--refresh\', \'2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292\'] rc=5 out=[] err=[\' device-mapper: resume ioctl on (253:22) failed: Invalid argument\', \' Unable to resume 2f93a595--a602--4bfc--b35e--9e78b89f27b4-b2ab8b07--a2aa--408a--a4b4--4cd64a7f0292 (253:22).\', \' Failed to reactivate 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292.\']' (lvm:534)
~~~

~~~
2021-11-17 15:56:12,986+1000 WARN (vm/2840d10f) [storage.LVM] Command ['/sbin/lvm', 'lvchange', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/360014058e776cbadb02fd473edb913d2$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 use_lvmpolld=1 } backup { retain_min=50 retain_days=0 }', '--refresh', '2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292'] succeeded with warnings: [' Attempted to decrement suspended device counter below zero.'] (lvm:355)
~~~

Logs uploading...
We received a patch to fix this exact problem in August:
https://sourceware.org/git/?p=lvm2.git;a=commit;h=47bcb446b83f9aa4fca74301fda109a8bf670fbb

lvchange: fix lvchange refresh failed for dm suspend or resume failed

When multiple lvchange refresh processes executed at the same time,
suspend/resume ioctl on the same dm, some of these commands will be failed
for dm aready change status, and ioctl will return EINVAL in _do_dm_ioctl
function.
to avoid this problem, add READ_FOR_ACTIVATE flags in lvchange refresh
process, it will hold LCK_WRITE lock and avoid suspend/resume dm at the
same time.

Signed-off-by: Long YunJian <long.yunjian.cn>
Signed-off-by: Yi Wang <wang.yi59.cn>
(In reply to David Teigland from comment #10)
> We received a patch to fix this exact problem in August:
> https://sourceware.org/git/?p=lvm2.git;a=commit;
> h=47bcb446b83f9aa4fca74301fda109a8bf670fbb
>
> lvchange: fix lvchange refresh failed for dm suspend or resume failed
>
> When multiple lvchange refresh processes executed at the same time,
> suspend/resume ioctl on the same dm, some of these commands will be failed
> for dm aready change status, and ioctl will return EINVAL in _do_dm_ioctl
> function.
> to avoid this problem, add READ_FOR_ACTIVATE flags in lvchange refresh
> process,
> it will hold LCK_WRITE lock and avoid suspend/resume dm at the same time.
>
> Signed-off-by: Long YunJian <long.yunjian.cn>
> Signed-off-by: Yi Wang <wang.yi59.cn>

David, so if I understand correctly, this patch should fix it. Do we have an LVM bug so we can close this one as a duplicate?
We'll need an lvm bz created to track/test the bug, or have it backported. (The fix will already be included in 8.6 and 9.0 from lvm rebases.)
(In reply to David Teigland from comment #12)
> We'll need an lvm bz created to track/test the bug, or have it backported.
> (The fix will already be included in 8.6 and 9.0 from lvm rebases.)

OK, thanks David. Germano, can you please open an LVM bug so this one can be made dependent on it?
Of course, done: BZ2025367
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat Virtualization security, bug fix, and enhancement update [ovirt-4.5.0]), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4896