Bug 1842600
Summary: | pvs/vgs/lvs fails randomly under load with "Metadata on <device> has wrong VG name "" expected <vg-name>" | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Nir Soffer <nsoffer> | ||||||||||||||||
Component: | lvm2 | Assignee: | David Teigland <teigland> | ||||||||||||||||
lvm2 sub component: | Command-line tools | QA Contact: | cluster-qe <cluster-qe> | ||||||||||||||||
Status: | CLOSED ERRATA | Docs Contact: | |||||||||||||||||
Severity: | urgent | ||||||||||||||||||
Priority: | unspecified | CC: | abawer, agk, cmarthal, corwin, gveitmic, heinzm, jbrassow, jmagrini, jortialc, mcsontos, mkalinin, msnitzer, prajnoha, pvlasin, rhandlin, teigland, thornber, ymankad, zkabelac | ||||||||||||||||
Version: | 7.8 | ||||||||||||||||||
Target Milestone: | rc | ||||||||||||||||||
Target Release: | --- | ||||||||||||||||||
Hardware: | Unspecified | ||||||||||||||||||
OS: | Unspecified | ||||||||||||||||||
Whiteboard: | |||||||||||||||||||
Fixed In Version: | lvm2-2.02.187-6.el7 | Doc Type: | If docs needed, set a value | ||||||||||||||||
Doc Text: | Story Points: | --- | |||||||||||||||||
Clone Of: | Environment: | ||||||||||||||||||
Last Closed: | 2020-09-29 19:56:06 UTC | Type: | Bug | ||||||||||||||||
Regression: | --- | Mount Type: | --- | ||||||||||||||||
Documentation: | --- | CRM: | |||||||||||||||||
Verified Versions: | Category: | --- | |||||||||||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||||||||||
Embargoed: | |||||||||||||||||||
Bug Depends On: | |||||||||||||||||||
Bug Blocks: | 1837199, 1849595 | ||||||||||||||||||
Attachments: |
|
Description
Nir Soffer
2020-06-01 16:17:04 UTC
Created attachment 1694144 [details]
reproducer run log
Created attachment 1694145 [details]
errors collected during test run
We know that using --select increase the chance of this error, since lvm process
all the vgs before applying the selection. We will change RHV to avoid this
issue.
But we see errors even when running without --select, so I run the reproduer
like this:
./reload.py --no-udev --no-select --verbose 4 2>run-no-udev-no-select-verbose-4.log
I interrupted the script once some errors where collected. The errors are in
the attached tarball.
Without having verified, I believe this is a bug in the scan/read optimization that I added to stable/RHEL7 in commit aee27dc7bad5734012885fe9f174def0a3f26771. That was intended to reuse the scanned data blocks for vg_read, however it failed to account for the fact that the mda_header would still be reread from disk, and not reused from scan, making it potentially inconsistent with the rest of the metadata text area being reused from scan. So, the new mda_header would point to new metadata text on disk but that is not present in the cache. This issue is fixed in master/RHEL8 by commit 0c1316cda876849d5d1375d40e8cdc08db37c2b5, where the mda_header is reread from disk and then compared with the mda_header values found during scan, and if they differ the metadata text is reread from disk. I will be looking at using the same fix in stable/RHEL7. Created attachment 1694175 [details]
Reproducer script
New version of the reproducer script with fixed --debug mode.
Created attachment 1694192 [details]
patch
An initial backport of a potential fix for testing.
Created attachment 1694884 [details]
patch
minor updates
Created attachment 1696476 [details] Reproducer runs with lvm2-2.02.187-5.el7.bz1842600_1 Testing patched brew lvm2 rpm from [1] seems to be doing well under bz1837199 reproducer [2] runs over RHEL7.8 with 8 cpus: # ./reload.py --no-select run 2020-06-09 05:40:36,610 INFO (MainThread) Running trials args=Namespace(command='run', debug=False, delay_msec=10, lv_count=500, pv_size=2199023255552, read_only=False, retries=0, trials=1, use_select=False, use_udev=True, verbose=0, vg_count=10) 2020-06-09 05:40:36,618 INFO (MainThread) Starting pv reloader ... 2020-06-09 07:06:20,005 INFO (worker/06) Trial 1 finished 2020-06-09 07:06:20,006 INFO (worker/06) Worker finished 2020-06-09 07:06:20,006 INFO (MainThread) Workers stopped 2020-06-09 07:06:20,083 INFO (reload/pv) Stats: reloads=5171 errors=0 error_rate=0.00% failures=0 avg_time=0.994 med_time=0.857 min_time=0.161 max_time=9.177 2020-06-09 07:06:20,088 INFO (reload/vg) Stats: reloads=5226 errors=0 error_rate=0.00% failures=0 avg_time=0.984 med_time=0.863 min_time=0.157 max_time=9.345 2020-06-09 07:06:20,125 INFO (reload/lv) Stats: reloads=7104 errors=0 error_rate=0.00% failures=0 avg_time=0.723 med_time=0.715 min_time=0.158 max_time=8.594 2020-06-09 07:06:20,126 INFO (MainThread) Reloaders stopped # ./reload.py run 2020-06-09 10:33:07,105 INFO (MainThread) Running trials args=Namespace(command='run', debug=False, delay_msec=10, lv_count=500, pv_size=2199023255552, read_only=False, retries=0, trials=1, use_select=True, use_udev=True, verbose=0, vg_count=10) ... 2020-06-09 12:01:30,350 INFO (worker/09) Trial 1 finished 2020-06-09 12:01:30,351 INFO (worker/09) Worker finished 2020-06-09 12:01:30,351 INFO (MainThread) Workers stopped 2020-06-09 12:01:30,558 INFO (reload/vg) Stats: reloads=1594 errors=0 error_rate=0.00% failures=0 avg_time=3.326 med_time=2.918 min_time=0.437 max_time=18.255 2020-06-09 12:01:30,584 INFO (reload/pv) Stats: reloads=3801 errors=0 error_rate=0.00% failures=0 avg_time=1.395 med_time=1.242 min_time=0.298 max_time=12.502 2020-06-09 12:01:30,593 INFO (reload/lv) Stats: reloads=3263 errors=0 error_rate=0.00% failures=0 avg_time=1.625 med_time=1.530 min_time=0.297 max_time=12.450 2020-06-09 12:01:30,594 INFO (MainThread) Reloaders stopped # ./reload.py --no-udev run 2020-06-09 12:01:51,479 INFO (MainThread) Running trials args=Namespace(command='run', debug=False, delay_msec=10, lv_count=500, pv_size=2199023255552, read_only=False, retries=0, trials=1, use_select=True, use_udev=False, verbose=0, vg_count=10) ... 2020-06-09 13:30:17,445 INFO (worker/08) Trial 1 finished 2020-06-09 13:30:17,445 INFO (worker/08) Worker finished 2020-06-09 13:30:17,445 INFO (MainThread) Workers stopped 2020-06-09 13:30:17,576 INFO (reload/lv) Stats: reloads=3273 errors=0 error_rate=0.00% failures=0 avg_time=1.621 med_time=1.489 min_time=0.252 max_time=12.317 2020-06-09 13:30:17,874 INFO (reload/pv) Stats: reloads=3824 errors=0 error_rate=0.00% failures=0 avg_time=1.387 med_time=1.204 min_time=0.259 max_time=12.076 2020-06-09 13:30:17,902 INFO (reload/vg) Stats: reloads=1703 errors=0 error_rate=0.00% failures=0 avg_time=3.115 med_time=2.412 min_time=0.292 max_time=18.136 2020-06-09 13:30:17,902 INFO (MainThread) Reloaders stopped # ./reload.py --no-udev --no-select run 2020-06-09 14:22:48,117 INFO (MainThread) Running trials args=Namespace(command='run', debug=False, delay_msec=10, lv_count=500, pv_size=2199023255552, read_only=False, retries=0, trials=1, use_select=False, use_udev=False, verbose=0, vg_count=10) ... 2020-06-09 15:48:43,875 INFO (worker/05) Trial 1 finished 2020-06-09 15:48:43,875 INFO (worker/05) Worker finished 2020-06-09 15:48:43,876 INFO (MainThread) Workers stopped 2020-06-09 15:48:44,104 INFO (reload/vg) Stats: reloads=5323 errors=0 error_rate=0.00% failures=0 avg_time=0.968 med_time=0.852 min_time=0.152 max_time=10.829 2020-06-09 15:48:44,169 INFO (reload/pv) Stats: reloads=5186 errors=0 error_rate=0.00% failures=0 avg_time=0.994 med_time=0.845 min_time=0.171 max_time=11.555 2020-06-09 15:48:44,176 INFO (reload/lv) Stats: reloads=7095 errors=0 error_rate=0.00% failures=0 avg_time=0.726 med_time=0.709 min_time=0.153 max_time=9.606 2020-06-09 15:48:44,176 INFO (MainThread) Reloaders stopped [1] http://download.eng.bos.redhat.com/brewroot/work/tasks/7661/29187661/ [2] https://raw.githubusercontent.com/oVirt/vdsm/master/tests/storage/stress/reload.py commit in stable-2.02 branch: https://sourceware.org/git/?p=lvm2.git;a=commit;h=1fb7a9d9e53cbde38250bb9ece72a25191899b78 Marking verified based on full regression testing of the latest 7.9 build as well as a focus on running a multitude of different pvs/vgs/lvs cmds. We also ran the reproducer script given in bug 1837199 (tweaked for of size limitations) as well as the one provided in comment #0. 3.10.0-1154.el7.x86_64 lvm2-2.02.187-6.el7 BUILT: Mon Jun 29 07:02:09 CDT 2020 lvm2-libs-2.02.187-6.el7 BUILT: Mon Jun 29 07:02:09 CDT 2020 lvm2-cluster-2.02.187-6.el7 BUILT: Mon Jun 29 07:02:09 CDT 2020 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (lvm2 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3927 |