Bug 2020497 - Need to include lvm2-2.03.14-1.el8
Summary: Need to include lvm2-2.03.14-1.el8
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: redhat-virtualization-host
Version: 4.4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.0-1
Target Release: 4.5.0
Assignee: Vojtech Juranek
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On: 2025367
Blocks:
 
Reported: 2021-11-05 06:10 UTC by Germano Veit Michel
Modified: 2022-08-21 07:58 UTC
12 users

Fixed In Version: vdsm-4.50.0.5
Doc Type: If docs needed, set a value
Doc Text:
Previously, concurrent executions of LV refresh (lvchange) could fail, preventing simultaneous starts of virtual machines with thin-provisioned disks based on the same template disk on a block storage domain. In this release, concurrent execution of LV refresh has been fixed in LVM2.
Clone Of:
Environment:
Last Closed: 2022-06-03 13:48:51 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43929 0 None None None 2021-11-05 06:13:43 UTC
Red Hat Knowledge Base (Solution) 6530261 0 None None None 2021-11-22 00:54:34 UTC
Red Hat Product Errata RHSA-2022:4896 0 None None None 2022-06-03 13:49:10 UTC

Description Germano Veit Michel 2021-11-05 06:10:54 UTC
Description of problem:

Filing this bug against VDSM, but it may well turn out to be an LVM2 bug.

The customer sometimes fails to start VMs with this error:
~~~
2021-10-28 12:02:16,058+0200 WARN  (vm/89788eb9) [storage.LVM] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/3600601603cc0450024da7861542452d7$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--refresh\', \'15dacc79-bbb7-4eb3-b1de-82066e97d5fa/ca897bf5-5079-40aa-b54d-cc0522ea9c69\'] rc=5 out=[] err=[\'  device-mapper: resume ioctl on  (253:119) failed: Invalid argument\', \'  Unable to resume 15dacc79--bbb7--4eb3--b1de--82066e97d5fa-ca897bf5--5079--40aa--b54d--cc0522ea9c69 (253:119).\', \'  Failed to reactivate 15dacc79-bbb7-4eb3-b1de-82066e97d5fa/ca897bf5-5079-40aa-b54d-cc0522ea9c69.\', \'  Releasing activation in critical section.\', \'  Releasing activation in critical section.\', \'  libdevmapper exiting with 1 device(s) still suspended.\']' (lvm:534)
~~~

So prepareImage() fails and the VM won't start.

Looking at the setup, we have dozens of thin VMs based on that template volume 15dacc79-bbb7-4eb3-b1de-82066e97d5fa/ca897bf5-5079-40aa-b54d-cc0522ea9c69, and we are starting more VMs in parallel, all with the same parent volume.

So we end up doing 2 refreshes at the same time, with the LV already active and in use by other VMs. For example, just before that error, there are 2 VMs starting and doing prepareImage():
2021-10-28 12:02:15,799+0200 INFO  (vm/aa471eae) [storage.LVM] Refreshing active lvs: vg=15dacc79-bbb7-4eb3-b1de-82066e97d5fa lvs=['ca897bf5-5079-40aa-b54d-cc0522ea9c69'] (lvm:1785)
2021-10-28 12:02:15,810+0200 INFO  (vm/89788eb9) [storage.LVM] Refreshing active lvs: vg=15dacc79-bbb7-4eb3-b1de-82066e97d5fa lvs=['ca897bf5-5079-40aa-b54d-cc0522ea9c69'] (lvm:1785)
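
To see the race in isolation, the same two refreshes can be raced by hand (a sketch using the VG/LV names above, without VDSM's --config filter):
~~~
# Sketch: race two concurrent refreshes of the same active LV, standing
# in for two VMs running prepareImage() at the same time.
VG=15dacc79-bbb7-4eb3-b1de-82066e97d5fa
LV=ca897bf5-5079-40aa-b54d-cc0522ea9c69
lvchange --refresh "$VG/$LV" &
lvchange --refresh "$VG/$LV" &
wait
~~~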

There are 2 parts to this:

a) lvchange succeeds, but with this warning, as if something went wrong even though the command did not fail. This is very easy to reproduce; I hit it every time.
~~~
  Releasing activation in critical section.
  Releasing activation in critical section.
  libdevmapper exiting with 1 device(s) still suspended.
~~~

Example, 2 VMs starting:

2021-10-28 15:55:26,116+0200 INFO  (vm/de14dc94) [storage.LVM] Refreshing active lvs: vg=15dacc79-bbb7-4eb3-b1de-82066e97d5fa lvs=['ca897bf5-5079-40aa-b54d-cc0522ea9c69'] (lvm:1785)
2021-10-28 15:55:26,127+0200 INFO  (vm/1d41df32) [storage.LVM] Refreshing active lvs: vg=15dacc79-bbb7-4eb3-b1de-82066e97d5fa lvs=['ca897bf5-5079-40aa-b54d-cc0522ea9c69'] (lvm:1785)
2021-10-28 15:55:26,366+0200 WARN  (vm/1d41df32) [storage.LVM] Command ['/sbin/lvm', 'lvchange', '--config', 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/3600601603cc0450024da7861542452d7$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }', '--refresh', '15dacc79-bbb7-4eb3-b1de-82066e97d5fa/ca897bf5-5079-40aa-b54d-cc0522ea9c69'] succeeded with warnings: ['  Releasing activation in critical section.', '  Releasing activation in critical section.', '  libdevmapper exiting with 1 device(s) still suspended.'] (lvm:355)


b) lvchange fails, causing the VM to fail to start. Then, in addition to the warning above, we get the full failure. This is a bit harder to hit; it takes a few retries.
~~~
  device-mapper: resume ioctl on  (253:119) failed: Invalid argument
  Unable to resume 15dacc79--bbb7--4eb3--b1de--82066e97d5fa-ca897bf5--5079--40aa--b54d--cc0522ea9c69 (253:119)
  Releasing activation in critical section.
  Releasing activation in critical section.
  libdevmapper exiting with 1 device(s) still suspended.
~~~
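
The "device(s) still suspended" message suggests a dm device is left in the suspended state. A way to check on the host (a sketch, not taken from the original report):
~~~
# Sketch: list any device-mapper devices left suspended.
# "dmsetup info" prints a "State:" line (ACTIVE or SUSPENDED) per device;
# -B 1 also shows the "Name:" line above it.
dmsetup info | grep -B 1 'State: *SUSPENDED'
~~~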

I can reproduce both on my test env by doing this:

1. Create template on SD
2. Create a few thin VMs based on that template
3. Start 1-2 VMs

At this point the LVs on the host look like this:

b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292 -wi-ao---- IU_edbe5c3c-e562-4cb4-8137-521bd37e1991,MD_4, PU_00000000-0000-0000-0000-000000000000  <-- template
05cc5eb7-62c1-4b1f-ab09-f561d3938b7d -wi------- IU_42b4b2ef-91cb-42d4-83dc-b40e6bba2205,MD_9, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292  
261d2265-5c9c-41fc-9a8d-b3b9ec9e141f -wi------- IU_438d055b-548a-43da-bd4e-fff8ad247274,MD_6, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292
77564096-a75f-4011-9df7-86ea6edc305e -wi-ao---- IU_fa25d518-c16d-44c6-abad-d9167e0ff5fd,MD_8, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292  <-- running VM
8d9cd490-241e-463b-ae4d-806c1bdd56be -wi------- IU_f182f31b-d4c4-40dc-8751-75dade8cc882,MD_10,PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292
b4eac262-ca7b-4223-ab4d-e4c1a458073a -wi-ao---- IU_51d225c3-9cab-4e10-a6db-c5769853a381,MD_3, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292  <-- running VM
bc81835c-7649-4fcd-a166-01bade4fdc67 -wi------- IU_f5fa1b3a-339d-4b6c-9f46-631d84c2c94a,MD_5, PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292
bef7ce3a-bc05-4ad8-a534-158abd58d1f4 -wi------- IU_4c3a192b-04ba-42d1-a9cd-f4bdccf68b20,MD_11,PU_b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292
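
(The listing above was presumably produced with something like the following; the exact command is not shown in this report.)
~~~
# Assumed command for the LV listing above: LV name, attributes and
# tags for the storage domain VG (VG name taken from step 4 below).
lvs -o lv_name,lv_attr,lv_tags 2f93a595-a602-4bfc-b35e-9e78b89f27b4
~~~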


4. On the host running the VMs, run the exact same LV refresh (lvchange --refresh) that VDSM is doing, but in a quick loop. In my case the template LV is that b2ab8b07, so I do this:
~~~
[root@rhvh-1 ~]# while [ true ]; do lvchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/360014058e776cbadb02fd473edb913d2$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --refresh 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292; done
~~~

5. Start the other VMs
~~~
  device-mapper: resume ioctl on  (252:8) failed: Invalid argument
  Unable to resume 2f93a595--a602--4bfc--b35e--9e78b89f27b4-b2ab8b07--a2aa--408a--a4b4--4cd64a7f0292 (252:8).
  Failed to reactivate 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292.
~~~

Version-Release number of selected component (if applicable):
lvm2-2.03.11-5.el8.x86_64
vdsm-4.40.80.6-1.el8ev.x86_64
kernel-4.18.0-305.17.1.el8_4.x86_64

How reproducible:
* Always

Steps to Reproduce:
* As above

Actual results:
* lvchange fails sometimes

Expected results:
* lvchange always succeeds and the VM runs

Comment 7 David Teigland 2021-11-15 16:47:50 UTC
It sounds like two issues, and both sound similar to issues we've seen on RHV before.

1. Concurrent LV refreshes on one host.
2. Concurrent LV refresh on one host and a VG modification on another host (Checksum error).

It would be helpful if you could reproduce these with lvm debug logging enabled (lvm.conf log/file=<path> and log/level=7).
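
For reference, those settings correspond to this lvm.conf fragment (the log file path here is a placeholder):
~~~
# lvm.conf fragment matching the suggestion above; the file path is a
# placeholder, level 7 is the requested debug level.
log {
    file = "/tmp/lvm-debug.log"
    level = 7
}
~~~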

Comment 8 Germano Veit Michel 2021-11-17 06:05:49 UTC
David, thanks for helping on this one too  :)

First of all, I've updated all to latest 8.5:

vdsm-4.40.90.4-1.el8ev.x86_64
lvm2-2.03.12-10.el8.x86_64
kernel-4.18.0-348.2.1.el8_5.x86_64

So, while looping this on the host:

[root@rhvh-1 ~]# while [ true ]; do lvchange --config 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/360014058e776cbadb02fd473edb913d2$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }' --refresh 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292; done

Got this on the terminal while starting 7 VMs:
~~~
  Releasing activation in critical section.
  Releasing activation in critical section.
  libdevmapper exiting with 1 device(s) still suspended.
  device-mapper: resume ioctl on  (253:22) failed: Invalid argument
  Unable to resume 2f93a595--a602--4bfc--b35e--9e78b89f27b4-b2ab8b07--a2aa--408a--a4b4--4cd64a7f0292 (253:22).
  Failed to reactivate 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292.
~~~

VDSM also saw some problems executing its own LVM commands, including the main one that prevents VMs from running in the customer ticket (Unable to resume...):
~~~
2021-11-17 15:55:30,375+1000 WARN  (vm/24e01cb6) [storage.LVM] Command with specific filter failed or returned no data, retrying with a wider filter: LVM command failed: 'cmd=[\'/sbin/lvm\', \'lvchange\', \'--config\', \'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/360014058e776cbadb02fd473edb913d2$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }\', \'--refresh\', \'2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292\'] rc=5 out=[] err=[\'  device-mapper: resume ioctl on  (253:22) failed: Invalid argument\', \'  Unable to resume 2f93a595--a602--4bfc--b35e--9e78b89f27b4-b2ab8b07--a2aa--408a--a4b4--4cd64a7f0292 (253:22).\', \'  Failed to reactivate 2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292.\']' (lvm:534)
~~~

~~~
2021-11-17 15:56:12,986+1000 WARN  (vm/2840d10f) [storage.LVM] Command ['/sbin/lvm', 'lvchange', '--config', 'devices {  preferred_names=["^/dev/mapper/"]  ignore_suspended_devices=1  write_cache_state=0  disable_after_error_count=3  filter=["a|^/dev/mapper/360014058e776cbadb02fd473edb913d2$|", "r|.*|"]  hints="none"  obtain_device_list_from_udev=0 } global {  locking_type=1  prioritise_write_locks=1  wait_for_locks=1  use_lvmetad=0  use_lvmpolld=1 } backup {  retain_min=50  retain_days=0 }', '--refresh', '2f93a595-a602-4bfc-b35e-9e78b89f27b4/b2ab8b07-a2aa-408a-a4b4-4cd64a7f0292'] succeeded with warnings: ['  Attempted to decrement suspended device counter below zero.'] (lvm:355)
~~~

Logs uploading...

Comment 10 David Teigland 2021-11-17 17:40:52 UTC
We received a patch to fix this exact problem in August:
https://sourceware.org/git/?p=lvm2.git;a=commit;h=47bcb446b83f9aa4fca74301fda109a8bf670fbb

lvchange: fix lvchange refresh failed for dm suspend or resume failed

When multiple lvchange refresh processes executed at the same time,
suspend/resume ioctl on the same dm, some of these commands will be failed
for dm aready change status, and ioctl will return EINVAL in _do_dm_ioctl function.
to avoid this problem, add READ_FOR_ACTIVATE flags in lvchange refresh process,
it will hold LCK_WRITE lock and avoid suspend/resume dm at the same time.

Signed-off-by: Long YunJian <long.yunjian.cn>
Signed-off-by: Yi Wang <wang.yi59.cn>
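
A quick way to check whether a host already carries the fix is to compare the installed lvm2 build against the fixed version named in this bug's summary (lvm2-2.03.14-1.el8); a sketch:
~~~
# Sketch: compare the installed lvm2 build against the fixed version
# tracked by this bug (lvm2-2.03.14-1.el8).
rpm -q lvm2
lvm version
~~~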

Comment 11 Eyal Shenitzky 2021-11-18 14:55:17 UTC
(In reply to David Teigland from comment #10)
> We received a patch to fix this exact problem in August:
> https://sourceware.org/git/?p=lvm2.git;a=commit;h=47bcb446b83f9aa4fca74301fda109a8bf670fbb
> 
> lvchange: fix lvchange refresh failed for dm suspend or resume failed
> 
> When multiple lvchange refresh processes executed at the same time,
> suspend/resume ioctl on the same dm, some of these commands will be failed
> for dm aready change status, and ioctl will return EINVAL in _do_dm_ioctl
> function.
> to avoid this problem, add READ_FOR_ACTIVATE flags in lvchange refresh
> process,
> it will hold LCK_WRITE lock and avoid suspend/resume dm at the same time.
> 
> Signed-off-by: Long YunJian <long.yunjian.cn>
> Signed-off-by: Yi Wang <wang.yi59.cn>

David, so if I understand correctly, this patch should fix it.

Do we have an LVM bug so we can close this one as a duplicate?

Comment 12 David Teigland 2021-11-18 15:04:57 UTC
We'll need an lvm bz created to track/test the bug, or have it backported.  (The fix will already be included in 8.6 and 9.0 from lvm rebases.)

Comment 13 Eyal Shenitzky 2021-11-21 12:55:08 UTC
(In reply to David Teigland from comment #12)
> We'll need an lvm bz created to track/test the bug, or have it backported. 
> (The fix will already be included in 8.6 and 9.0 from lvm rebases.)

OK, Thanks David.

Germano, can you please open an LVM bug so that this one can depend on it?

Comment 14 Germano Veit Michel 2021-11-22 00:38:25 UTC
Of course, done: BZ2025367

Comment 32 errata-xmlrpc 2022-06-03 13:48:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Virtualization security, bug fix, and enhancement update [ovirt-4.5.0]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:4896

