Bug 1842053 - LVM commands may fail with "Failed to udev_enumerate_scan_devices"
Summary: LVM commands may fail with "Failed to udev_enumerate_scan_devices"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.40.14
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.4.2
Assignee: Nir Soffer
QA Contact: David Vaanunu
URL:
Whiteboard:
Depends On: 1812801
Blocks:
 
Reported: 2020-05-30 14:15 UTC by Nir Soffer
Modified: 2020-09-18 07:12 UTC
CC List: 7 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2020-09-18 07:12:45 UTC
oVirt Team: Storage
Embargoed:
mtessun: ovirt-4.4?
mtessun: planning_ack+


Attachments
run-01.log.xz - failed run 1 (377.86 KB, application/x-xz) - 2020-05-30 14:15 UTC, Nir Soffer
run-02.log.xz - failed run 2 (73.51 KB, application/x-xz) - 2020-05-30 14:16 UTC, Nir Soffer
run-04.log.xz - successful run using obtain_device_list_from_udev=0 (1.63 MB, application/x-xz) - 2020-05-30 14:17 UTC, Nir Soffer


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 109343 0 master MERGED tests: reload: Add option to disable udev 2020-09-21 06:36:19 UTC
oVirt gerrit 109921 0 master MERGED lvm: Disable obtain_device_list_from_udev 2020-09-21 06:36:19 UTC

Description Nir Soffer 2020-05-30 14:15:34 UTC
Created attachment 1693658 [details]
run-01.log.xz - failed run 1

Description of problem:

On RHEL 8.2.1 we see random failures in lvcreate and lvchange:

    Failed to udev_enumerate_scan_devices.
    Volume group "bz1837199-000000000000000000000-0006" not found.
    Cannot process volume group bz1837199-000000000000000000000-0006

We don't know if real use cases are affected. So far I have seen a few errors
when running a stress test doing thousands of lvm operations.

run-01.log.xz:
errors:              1
create lv:        4860
change lv tags:   9359
activate lv:      9000
deactivate lv:   13859
remove lv:        4500

run-02.log.xz (interrupted):
errors:              2
create lv:        3712
change lv tags:   3708


These errors may cause volume or snapshot creation (or other flows) to fail,
leading to locked disks and other unwanted issues.
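
For context, the stress workload is essentially a tight loop of LV create,
tag, activate, deactivate and remove operations, matching the counters above.
A rough sketch of such a loop (hypothetical names, assuming a scratch volume
group and root privileges; the real test is tests/storage/stress/reload.py
in vdsm):

    # Minimal sketch of an LVM stress loop, not the actual vdsm test.
    # Assumes a scratch volume group "stressvg" exists and we run as root.
    import subprocess

    VG = "stressvg"  # hypothetical scratch VG

    def run(*args):
        # check=True raises on failure, so errors like
        # "Failed to udev_enumerate_scan_devices" surface immediately.
        subprocess.run(args, check=True)

    for i in range(1000):
        lv = f"stress-{i:04d}"
        run("lvcreate", "--name", lv, "--size", "128m", VG)
        run("lvchange", "--addtag", "stress-test", f"{VG}/{lv}")
        run("lvchange", "--activate", "n", f"{VG}/{lv}")
        run("lvchange", "--activate", "y", f"{VG}/{lv}")
        run("lvremove", "--force", f"{VG}/{lv}")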

Version-Release number of selected component (if applicable):

# uname -a
Linux host4 4.18.0-193.5.1.el8_2.x86_64 #1 SMP Thu May 21 14:23:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

# rpm -q lvm2
lvm2-2.03.08-3.el8.x86_64

David Teigland suggested trying obtain_device_list_from_udev=0 to
mitigate this. Testing shows that this works:

run-04.log.xz:
errors:              0
create lv:       20000
change lv tags:  40000
activate lv:     40000
deactivate lv:   60000
remove lv:       20000

I think we need to apply this change until the platform bug is resolved.

So far I did not see this issue on RHEL 7.8.
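
For reference, the mitigation does not require editing /etc/lvm/lvm.conf on
the host: LVM commands accept the setting per invocation via --config, so it
can be applied only to the commands we run. A rough sketch (hypothetical
helper, not the actual vdsm code; the real change is the patch referenced in
comment 8):

    # Sketch only: pass obtain_device_list_from_udev=0 per command via
    # LVM's --config option instead of editing lvm.conf.
    import subprocess

    LVM_CONFIG = "devices { obtain_device_list_from_udev=0 }"

    def lvm(subcmd, *args):
        cmd = ["lvm", subcmd, "--config", LVM_CONFIG] + list(args)
        subprocess.run(cmd, check=True)

    # Example: create an LV in a hypothetical VG without consulting udev
    # for the device list.
    lvm("lvcreate", "--name", "test-lv", "--size", "128m", "test-vg")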

Comment 1 Nir Soffer 2020-05-30 14:16:14 UTC
Created attachment 1693659 [details]
run-02.log.xz - failed run 2

Comment 2 Nir Soffer 2020-05-30 14:17:06 UTC
Created attachment 1693660 [details]
run-04.log.xz - successful run using obtain_device_list_from_udev=0

Comment 3 Nir Soffer 2020-05-30 14:25:41 UTC
Mordechai, do we see this error in scale tests?

    Failed to udev_enumerate_scan_devices

Do we check and report errors on all hosts during scale tests?

Comment 4 Nir Soffer 2020-05-31 15:49:01 UTC
The attached patch adds an option to the stress test for testing this setting.

We still need a vdsm patch to add this option.

Comment 5 David Vaanunu 2020-06-03 12:07:57 UTC
(In reply to Nir Soffer from comment #3)
> Mordechai, do we see this error in scale tests?
> 
>     Failed to udev_enumerate_scan_devices
> 
> Do we check and report errors on all hosts during scale tests?



I checked all VDSM log files (9 hosts) after a 60-hour scale test.
The error does not appear in any of them.

Comment 6 Tal Nisan 2020-06-15 14:11:39 UTC
Nir, I think we got the answer now, should we send a patch for this fix?

Comment 7 Nir Soffer 2020-06-15 14:46:26 UTC
(In reply to Tal Nisan from comment #6)
I'm waiting for clarification on:
https://bugzilla.redhat.com/show_bug.cgi?id=1812801#c12

Comment 8 Nir Soffer 2020-06-26 20:41:51 UTC
(In reply to Tal Nisan from comment #6)
> Nir, I think we got the answer now, should we send a patch for this fix?

David confirmed that this should be safe for the vdsm use case, so I posted
https://gerrit.ovirt.org/c/109921/

Since this is a trivial fix, and we have a stress test showing that
it works, I think we can include this in 4.4.2, based on QE acks.

How to test this:

- No change in behavior is expected, except that the random errors seen
  in the stress test should no longer happen.

- You can run the vdsm stress test from:
  tests/storage/stress/reload.py

- Otherwise, existing functional and scale tests should be enough to verify
  that there are no regressions (see also the configuration check sketched
  after this list).
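
To confirm which value an LVM command actually sees on a host, lvmconfig can
print the effective configuration. A small check (a sketch, assuming
lvmconfig accepts --config overrides the same way other LVM commands do):

    # Print the effective obtain_device_list_from_udev value as LVM
    # resolves it (lvm.conf plus any --config override passed here).
    import subprocess

    def effective_udev_setting(extra_config=None):
        cmd = ["lvmconfig", "devices/obtain_device_list_from_udev"]
        if extra_config:
            cmd += ["--config", extra_config]
        return subprocess.check_output(cmd, universal_newlines=True).strip()

    print(effective_udev_setting())  # host default
    print(effective_udev_setting("devices { obtain_device_list_from_udev=0 }"))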

Comment 10 Nir Soffer 2020-08-03 13:40:52 UTC
Yes, this happens only in stress tests.

Comment 11 David Vaanunu 2020-08-24 09:32:17 UTC
Tested version:

rhv-release-4.4.2-4-001
vdsm-4.40.26-1

Tests:
1. VM snapshot with 13 disks
2. VM snapshot x 30 --> 50 users

Comment 12 Sandro Bonazzola 2020-09-18 07:12:45 UTC
This bugzilla is included in the oVirt 4.4.2 release, published on September 17th 2020.

Since the problem described in this bug report should be resolved in the oVirt 4.4.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

