Created attachment 1693658 [details]
run-01.log.xz - failed run 1

Description of problem:

On RHEL 8.2.1 we see random failures in lvcreate and lvchange:

  Failed to udev_enumerate_scan_devices.
  Volume group "bz1837199-000000000000000000000-0006" not found.
  Cannot process volume group bz1837199-000000000000000000000-0006

We don't know if real use cases are affected. So far I have seen a few errors
when running a stress test doing thousands of lvm operations.

run-01.log.xz:
  errors: 1
  create lv: 4860
  change lv tags: 9359
  activate lv: 9000
  deactivate lv: 13859
  remove lv: 4500

run-02.log.xz (interrupted):
  errors: 2
  create lv: 3712
  change lv tags: 3708

These errors may fail creating a volume or a snapshot (or other flows),
leading to locked disks and other unwanted issues.

Version-Release number of selected component (if applicable):

# uname -a
Linux host4 4.18.0-193.5.1.el8_2.x86_64 #1 SMP Thu May 21 14:23:29 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

# rpm -q lvm2
lvm2-2.03.08-3.el8.x86_64

David Teigland suggested trying obtain_device_list_from_udev=0 to mitigate
this. Testing shows that this works:

run-04.log.xz:
  errors: 0
  create lv: 20000
  change lv tags: 40000
  activate lv: 40000
  deactivate lv: 60000
  remove lv: 20000

I think we need to apply this change until the platform bug is resolved.

So far I did not see this issue on RHEL 7.8.
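For reference, the option can be applied per command via lvm's --config
option, equivalent to setting obtain_device_list_from_udev=0 in the devices
section of /etc/lvm/lvm.conf. This is only a sketch, not vdsm code; the
run_lvm helper and the VG/LV names are illustrative:

import subprocess

# Tell lvm to scan /dev itself instead of asking udev for the device list.
LVM_CONFIG = "devices { obtain_device_list_from_udev=0 }"

def run_lvm(args):
    # Illustrative helper: run an lvm subcommand with the extra config.
    cmd = ["lvm", args[0], "--config", LVM_CONFIG] + args[1:]
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

# Example: create a small LV in a test VG (names are made up).
run_lvm(["lvcreate", "-n", "lv-0001", "-L", "128m", "bz1837199-0001"])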
Created attachment 1693659 [details]
run-02.log.xz - failed run 2
Created attachment 1693660 [details]
run-04.log.xz - successful run using obtain_device_list_from_udev=0
Mordechai, do we see this error in scale tests?

  Failed to udev_enumerate_scan_devices

Do we check and report errors on all hosts during scale tests?
The attached patch adds this option to the stress test so we can test it. We
need a vdsm patch to add this option.
(In reply to Nir Soffer from comment #3)
> Mordechai, do we see this error in scale tests?
>
>   Failed to udev_enumerate_scan_devices
>
> Do we check and report errors on all hosts during scale tests?

I checked all VDSM log files (9 hosts) after a 60-hour scale test. The error
does not appear in any of them.
Nir, I think we have the answer now; should we send a patch for this fix?
(In reply to Tal Nisan from comment #6)

I'm waiting for clarification on:
https://bugzilla.redhat.com/show_bug.cgi?id=1812801#c12
(In reply to Tal Nisan from comment #6)
> Nir, I think we have the answer now; should we send a patch for this fix?

David confirmed that this should be safe for the vdsm use case, so I posted
https://gerrit.ovirt.org/c/109921/

Since this is a trivial fix, and we have stress tests showing that it works,
I think we can include this in 4.4.2, based on QE acks.

How to test this:
- No change in behavior is expected, except that the random errors should no
  longer happen in the stress test.
- You can run the vdsm stress test from: tests/storage/stress/reload.py
- Otherwise, the existing functional and scale tests should be good enough to
  verify that there are no regressions.
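For reference, the stress test is essentially a loop of
lvcreate/lvchange/lvremove operations that counts errors. A rough sketch of
that kind of loop (not the actual reload.py code; the VG name and iteration
count are illustrative):

import subprocess

VG = "bz1837199-0001"  # illustrative VG name
ERROR_MARKER = "Failed to udev_enumerate_scan_devices"

def lvm(*args):
    # Run an lvm subcommand and capture its output for inspection.
    return subprocess.run(("lvm",) + args, capture_output=True, text=True)

errors = 0
for i in range(1000):
    for cmd in (
        ("lvcreate", "-n", "lv-test", "-L", "128m", VG),
        ("lvchange", "-ay", VG + "/lv-test"),
        ("lvchange", "-an", VG + "/lv-test"),
        ("lvremove", "-f", VG + "/lv-test"),
    ):
        res = lvm(*cmd)
        if ERROR_MARKER in res.stderr:
            errors += 1

print("errors:", errors)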
Yes, this happens only in stress tests.
Tested version:
rhv-release-4.4.2-4-001
vdsm-4.40.26-1

Tests:
1. VM snapshot with 13 disks
2. VM snapshot x 30 --> 50 users
This bugzilla is included in oVirt 4.4.2 release, published on September 17th 2020.

Since the problem described in this bug report should be resolved in oVirt 4.4.2 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.