Bug 1842600 - pvs/vgs/lvs fails randomly under load with "Metadata on <device> has wrong VG name "" expected <vg-name>"
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: 1837199 1849595
 
Reported: 2020-06-01 16:17 UTC by Nir Soffer
Modified: 2023-12-15 18:03 UTC
CC List: 19 users

Fixed In Version: lvm2-2.02.187-6.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-29 19:56:06 UTC
Target Upstream Version:
Embargoed:


Attachments
- Reproducer script (23.50 KB, text/x-python), 2020-06-01 16:17 UTC, Nir Soffer
- reproducer run log (148.04 KB, application/x-xz), 2020-06-01 16:18 UTC, Nir Soffer
- errors collected during test run (90.43 KB, application/x-xz), 2020-06-01 16:22 UTC, Nir Soffer
- Reproducer script (23.56 KB, text/x-python), 2020-06-01 20:33 UTC, Nir Soffer
- patch (12.60 KB, text/plain), 2020-06-01 21:28 UTC, David Teigland
- patch (14.01 KB, text/plain), 2020-06-03 18:37 UTC, David Teigland
- Reproducer runs with lvm2-2.02.187-5.el7.bz1842600_1 (2.28 MB, application/gzip), 2020-06-10 11:43 UTC, Amit Bawer


Links
- Red Hat Product Errata RHBA-2020:3927 (last updated 2020-09-29 19:56:18 UTC)

Internal Links: 1852344

Description Nir Soffer 2020-06-01 16:17:04 UTC
Created attachment 1694143 [details]
Reproducer script

Description of problem:

pvs/vgs/lvs commands may fail randomly when concurrent lvm commands are
modifying the same VG.  The failing commands cause failures in layered products
(such as RHV), which cannot tell if the failed command was harmless (this
case) or caused by data corruption. See bug 1837199 for more info.

We can reproduce these errors in stress test running pvs, vgs, and lvs in a
loop, and other commands creating lvs, changing lvs, activating lvs,
deactivating lvs, extending lvs, and removing lvs.

We start the stress test with new loop devices backed by empty sparse files.
The commands fail with first this error:

    Metadata on /dev/mapper/delay0000000000000000000000000000 at 52623872 has
    wrong VG name "" expected bz1837199-000000000000000000000-0000.\n

Later, the commands fail with:

    Metadata on /dev/mapper/delay0000000000000000000000000001 at 927232 has
    wrong VG name "= "host1"\nsegment_count = 1\n\nsegment1 {\nstart_extent =
    0\n extent_count = 8\n\ntype = "striped"\nstripe_count = 1\n\nstripes =
    [\n"pv0"," expected bz1837199-000000000000000000000-0001.\n

I think this can be explained by the metadata area wrapping around when it
becomes full.  When we start, the metadata area is all zeroes, and lvm tries to
read metadata from before the metadata was written, so it sees zeroes and
reports an empty VG name "".  Once the area wraps around, the same premature
read returns older metadata instead, which explains the garbage VG name in the
second error.
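
The sketch below is a rough, hypothetical model of that idea (the circular area,
names, and sizes here are made up for illustration; this is not the real lvm2
on-disk format): before anything has been written at the target offset the
reader gets zeroes, and after the area wraps it gets the tail of an older
metadata copy instead.

    # Toy model of a circular metadata area; names and sizes are illustrative only.
    AREA_SIZE = 64 * 1024
    area = bytearray(AREA_SIZE)   # a fresh PV: the metadata area is all zeroes

    def read_at(offset, size):
        """Read 'size' bytes starting at 'offset', wrapping at the end of the area."""
        return bytes(area[(offset + i) % AREA_SIZE] for i in range(size))

    # First error: the read happens before the metadata text was written,
    # so the parsed VG name is the empty string "".
    print(read_at(4096, 16))      # b'\x00\x00...'

    # Second error: after the area has wrapped, the same premature read lands
    # on leftover text from an older metadata copy, so the "VG name" is garbage.
    stale = b'stripes = ["pv0", ...]'
    area[4096:4096 + len(stale)] = stale
    print(read_at(4096, 16))      # starts in the middle of old metadata text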

Running the reproducer with the --verbose option, I collected errors from pvs
-vvvv, vgs -vvvv, and lvs -vvvv. Here is an example failure in a vgs command:

# vgs-error-0001.txt

Scan phase:

...
#label/label.c:320         /dev/mapper/delay0000000000000000000000000000: lvm2 label detected at sector 1
#cache/lvmcache.c:2086          lvmcache /dev/mapper/delay0000000000000000000000000000: now in VG #orphans_lvm2 (#orphans_lvm2) with 0 mda(s).
#format_text/text_label.c:423           /dev/mapper/delay0000000000000000000000000000: PV header extension version 2 found
#format_text/format-text.c:330           Reading mda header sector from /dev/mapper/delay0000000000000000000000000000 at 4096
#format_text/import.c:58            Reading metadata summary from /dev/mapper/delay0000000000000000000000000000 at 66324480 size 164655 (+0)
#format_text/format-text.c:1336          Found metadata summary on /dev/mapper/delay0000000000000000000000000000 at 66324480 size 164655 for VG bz1837199-000000000000000000000-0000
...

Processing phase:

...
#toollib.c:2008        Processing VG bz1837199-000000000000000000000-0000 FR6is4-bTdC-wHve-RpJv-RdvH-G5OX-nmG8Ud
#locking/locking.c:367           Dropping cache for bz1837199-000000000000000000000-0000.
#misc/lvm-flock.c:202         Locking /run/lock/lvm/V_bz1837199-000000000000000000000-0000 RB
#libdm-common.c:984           Preparing SELinux context for /run/lock/lvm/V_bz1837199-000000000000000000000-0000 to system_u:object_r:lvm_lock_t:s0.
#misc/lvm-flock.c:100           _do_flock /run/lock/lvm/V_bz1837199-000000000000000000000-0000:aux WB
#misc/lvm-flock.c:47            _undo_flock /run/lock/lvm/V_bz1837199-000000000000000000000-0000:aux
#misc/lvm-flock.c:100           _do_flock /run/lock/lvm/V_bz1837199-000000000000000000000-0000 RB
#libdm-common.c:987           Resetting SELinux context to default value.
#metadata/metadata.c:3824        Reading VG bz1837199-000000000000000000000-0000 FR6is4-bTdC-wHve-RpJv-RdvH-G5OX-nmG8Ud
#metadata/metadata.c:3923          Skipped rescanning devices for bz1837199-000000000000000000000-0000
#metadata/metadata.c:4006          Reading VG bz1837199-000000000000000000000-0000 from /dev/mapper/delay0000000000000000000000000000
#format_text/format-text.c:330           Reading mda header sector from /dev/mapper/delay0000000000000000000000000000 at 4096
#format_text/format-text.c:484     Metadata on /dev/mapper/delay0000000000000000000000000000 at 66489344 has wrong VG name "" expected bz1837199-000000000000000000000-0000.
...

Note the offsets:

- In the scan phase we detected metadata at 66324480 size 164655
- In the processing phase we detected metadata at 66489344

>>> 66324480 + 164655
66489135

>>> 66489135 - (66489135 % 512) + 512
66489344

66324480 is the committed metadata area
66489344 is the precommitted metadata area
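
In other words, the second offset is the end of the committed copy rounded up to
the next 512-byte sector. A small helper (a hypothetical name, just restating
the arithmetic above) makes that explicit:

    SECTOR = 512

    def next_sector(offset):
        """Round 'offset' up to the start of the next 512-byte sector."""
        return offset - (offset % SECTOR) + SECTOR

    committed_offset, committed_size = 66324480, 164655
    print(next_sector(committed_offset + committed_size))   # 66489344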

I think what happens here is:

1. In the scan phase we read the metadata without locks, and read the precommitted
   raw_locn[1] pointing to an area on storage that does not include the metadata
   yet. Another command wrote the offset to slot 1 before writing the data.

2. In the processing phase we take a lock on the vg:

    #misc/lvm-flock.c:100           _do_flock /run/lock/lvm/V_bz1837199-000000000000000000000-0000 RB

3. And we skipped reading the vg metadata

    #metadata/metadata.c:3923          Skipped rescanning devices for bz1837199-000000000000000000000-0000

4. And we probably think that it is OK to read the precommitted metadata at slot 1
   since we hold the lock, but the data came from the scan-phase read done without
   locks.

    # from lib/format_text/format-text.c line 439

    /* Should we use precommitted metadata? */
    if (*precommitted && rlocn_precommitted->size &&
        (rlocn_precommitted->offset != rlocn->offset)) {
        rlocn = rlocn_precommitted;
    } else 
        *precommitted = 0; 

   On master we have a detailed log message here, but in 2.02 there are no logs.
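
To make the sequence above concrete, here is a toy, purely sequential Python
model of the ordering (not LVM code; all names are invented): the writer
publishes the precommitted offset in slot 1 before the metadata text reaches
disk, the reader caches that offset during the lock-less scan, and then follows
it during processing even though it now holds the VG lock.

    disk = bytearray(1024)   # stand-in for the on-disk metadata text area
    raw_locn = [0, 0]        # slot 0 = committed offset, slot 1 = precommitted offset

    def begin_precommit(offset):
        """Writer: publish the precommitted offset before the text is on disk."""
        raw_locn[1] = offset

    def finish_precommit(offset, text):
        """Writer: the metadata text actually reaches disk only now."""
        disk[offset:offset + len(text)] = text

    def scan_without_lock():
        """Reader, scan phase: snapshot raw_locn with no VG lock held."""
        return list(raw_locn)

    begin_precommit(512)                         # writer publishes slot 1 first
    cached = scan_without_lock()                 # reader caches offset 512
    # Reader, processing phase: the VG lock is taken, but the pre-lock snapshot
    # is reused, so the read lands on an area that is still all zeroes.
    print(bytes(disk[cached[1]:cached[1] + 8]))  # all zeroes -> VG name parsed as ""
    finish_precommit(512, b'vgname { ... }')     # arrives too late for this reader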


Version-Release number of selected component (if applicable):

lvm2-2.02.186-7.el7_8.1.x86_64


How reproducible:

- Always with lvm2-2.02.186-7.el7_8.1.x86_64 using the reproducer script.
- We have one customer case showing this issue: bug 1837199.
- Not reproducible with lvm2-2.03.08-3.el8.x86_64 - no error after 6 hours.


Steps to Reproduce:

Using the attached reload.py script:

1. Setup storage:

    # ./reload.py setup

2. Run the reproducer:

    # ./reload.py run --verbose 4 2>run.log

3. Watch for errors:

    # watch ls *-error-*.txt

4. To delete storage:

    # ./reload.py teardown


Actual results:

pvs/vgs/lvs commands fail with a bogus error about a wrong VG name while the
actual metadata is fine.


Expected results:

No failures in LVM unless the underlying storage fails.


More info:

The reproducer script supports several useful options:

--trials (default 1) - One trial takes about 75 minutes and generates 570
  errors per run. Running several trials is useful to check that a fix actually
  solves the issue.

--vg-count (default 10) - I could not reproduce the issue with one VG. It looks
  like more VGs help to reproduce the issue faster.

--lv-count (default 500) - Number of LVs created, modified, and removed on every
  VG. No error is reproduced with --lv-count=50.

--pv-size (default 2t) - Should be big enough to hold lv-count LVs * 2 MiB (the
  test writes 2 MiB of data to every LV).

--delay-msec (default 10) - The delay argument passed to the dm-delay device on
  top of the loop device. I did not try a longer delay, but it may help to
  reproduce faster. It is very hard to reproduce the issue without a delay
  device; loop device reads/writes are just too fast.

--verbose (default 0) - Pass -v (multiple times) to the underlying LVM pvs,
  vgs, lvs commands. The output is logged if the command fails, or if using
  --debug.

--no-udev (default False) - Use obtain_device_list_from_udev=0 in the
  underlying LVM command --config. Required to avoid random failures on RHEL
  8.2 (see bug 1837199). On RHEL 7.8 this seems to slightly lower the error rate.

--no-select (default False) - Disable usage of --select in the underlying
  pvs/vgs/lvs commands (e.g. pvs --select 'pv_name = /dev/mapper/delay001') and
  specify the entity instead (e.g. pvs /dev/mapper/delay001). Using this option
  doubles the number of pvs/vgs/lvs commands run during one run and gives a 10
  times lower error rate, but we still see errors (e.g. 83 errors out of 21,054
  commands).

--debug (default False) - Log debug messages (command arguments, exit code,
  stdout, stderr). Mainly useful to debug the test script itself.

The test script is maintained in vdsm:
https://gerrit.ovirt.org/c/109355/

The current version of the test script is attached.

Comment 2 Nir Soffer 2020-06-01 16:18:32 UTC
Created attachment 1694144 [details]
reproducer run log

Comment 3 Nir Soffer 2020-06-01 16:22:30 UTC
Created attachment 1694145 [details]
errors collected during test run

We know that using --select increases the chance of this error, since lvm
processes all the VGs before applying the selection. We will change RHV to avoid
this issue.

But we see errors even when running without --select, so I ran the reproducer
like this:

    ./reload.py --no-udev --no-select --verbose 4 2>run-no-udev-no-select-verbose-4.log

I interrupted the script once some errors were collected. The errors are in
the attached tarball.

Comment 4 David Teigland 2020-06-01 20:04:55 UTC
Without having verified, I believe this is a bug in the scan/read optimization that I added to stable/RHEL7 in commit aee27dc7bad5734012885fe9f174def0a3f26771.  That was intended to reuse the scanned data blocks for vg_read; however, it failed to account for the fact that the mda_header would still be reread from disk, and not reused from the scan, making it potentially inconsistent with the rest of the metadata text area being reused from the scan.  So the new mda_header could point to new metadata text on disk that is not present in the cache.

This issue is fixed in master/RHEL8 by commit 0c1316cda876849d5d1375d40e8cdc08db37c2b5, where the mda_header is reread from disk and then compared with the mda_header values found during the scan; if they differ, the metadata text is reread from disk.  I will be looking at using the same fix in stable/RHEL7.
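
A minimal Python sketch of that compare-and-reread idea (an illustration of the
approach described above, not the actual lvm2 code; all function names here are
hypothetical):

    def vg_read_metadata(dev, scanned_header, scanned_text,
                         read_mda_header, read_metadata_text):
        """Reuse the scanned metadata text only if the on-disk mda_header still matches."""
        fresh_header = read_mda_header(dev)
        if fresh_header == scanned_header:
            # Nothing was committed since the scan: the cached text is still valid.
            return scanned_text
        # The header changed (another command committed new metadata), so the
        # cached text may be stale or missing: reread the text from disk.
        return read_metadata_text(dev, fresh_header)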

Comment 5 Nir Soffer 2020-06-01 20:33:37 UTC
Created attachment 1694175 [details]
Reproducer script

New version of the reproducer script with fixed --debug mode.

Comment 6 David Teigland 2020-06-01 21:28:21 UTC
Created attachment 1694192 [details]
patch

An initial backport of a potential fix for testing.

Comment 7 David Teigland 2020-06-03 18:37:24 UTC
Created attachment 1694884 [details]
patch

minor updates

Comment 8 Amit Bawer 2020-06-10 11:43:53 UTC
Created attachment 1696476 [details]
Reproducer runs with lvm2-2.02.187-5.el7.bz1842600_1

Testing the patched brew lvm2 rpm from [1] seems to be doing well under bz1837199 reproducer [2] runs on RHEL 7.8 with 8 CPUs:

# ./reload.py --no-select run

2020-06-09 05:40:36,610 INFO    (MainThread) Running trials args=Namespace(command='run', debug=False, delay_msec=10, lv_count=500, pv_size=2199023255552, read_only=False, retries=0, trials=1, use_select=False, use_udev=True, verbose=0, vg_count=10)
2020-06-09 05:40:36,618 INFO    (MainThread) Starting pv reloader
...
2020-06-09 07:06:20,005 INFO    (worker/06) Trial 1 finished
2020-06-09 07:06:20,006 INFO    (worker/06) Worker finished
2020-06-09 07:06:20,006 INFO    (MainThread) Workers stopped
2020-06-09 07:06:20,083 INFO    (reload/pv) Stats: reloads=5171 errors=0 error_rate=0.00% failures=0 avg_time=0.994 med_time=0.857 min_time=0.161 max_time=9.177
2020-06-09 07:06:20,088 INFO    (reload/vg) Stats: reloads=5226 errors=0 error_rate=0.00% failures=0 avg_time=0.984 med_time=0.863 min_time=0.157 max_time=9.345
2020-06-09 07:06:20,125 INFO    (reload/lv) Stats: reloads=7104 errors=0 error_rate=0.00% failures=0 avg_time=0.723 med_time=0.715 min_time=0.158 max_time=8.594
2020-06-09 07:06:20,126 INFO    (MainThread) Reloaders stopped


# ./reload.py run

2020-06-09 10:33:07,105 INFO    (MainThread) Running trials args=Namespace(command='run', debug=False, delay_msec=10, lv_count=500, pv_size=2199023255552, read_only=False, retries=0, trials=1, use_select=True, use_udev=True, verbose=0, vg_count=10)
...
2020-06-09 12:01:30,350 INFO    (worker/09) Trial 1 finished
2020-06-09 12:01:30,351 INFO    (worker/09) Worker finished
2020-06-09 12:01:30,351 INFO    (MainThread) Workers stopped
2020-06-09 12:01:30,558 INFO    (reload/vg) Stats: reloads=1594 errors=0 error_rate=0.00% failures=0 avg_time=3.326 med_time=2.918 min_time=0.437 max_time=18.255
2020-06-09 12:01:30,584 INFO    (reload/pv) Stats: reloads=3801 errors=0 error_rate=0.00% failures=0 avg_time=1.395 med_time=1.242 min_time=0.298 max_time=12.502
2020-06-09 12:01:30,593 INFO    (reload/lv) Stats: reloads=3263 errors=0 error_rate=0.00% failures=0 avg_time=1.625 med_time=1.530 min_time=0.297 max_time=12.450
2020-06-09 12:01:30,594 INFO    (MainThread) Reloaders stopped
 

# ./reload.py --no-udev run


2020-06-09 12:01:51,479 INFO    (MainThread) Running trials args=Namespace(command='run', debug=False, delay_msec=10, lv_count=500, pv_size=2199023255552, read_only=False, retries=0, trials=1, use_select=True, use_udev=False, verbose=0, vg_count=10)
...
2020-06-09 13:30:17,445 INFO    (worker/08) Trial 1 finished
2020-06-09 13:30:17,445 INFO    (worker/08) Worker finished
2020-06-09 13:30:17,445 INFO    (MainThread) Workers stopped
2020-06-09 13:30:17,576 INFO    (reload/lv) Stats: reloads=3273 errors=0 error_rate=0.00% failures=0 avg_time=1.621 med_time=1.489 min_time=0.252 max_time=12.317
2020-06-09 13:30:17,874 INFO    (reload/pv) Stats: reloads=3824 errors=0 error_rate=0.00% failures=0 avg_time=1.387 med_time=1.204 min_time=0.259 max_time=12.076
2020-06-09 13:30:17,902 INFO    (reload/vg) Stats: reloads=1703 errors=0 error_rate=0.00% failures=0 avg_time=3.115 med_time=2.412 min_time=0.292 max_time=18.136
2020-06-09 13:30:17,902 INFO    (MainThread) Reloaders stopped


# ./reload.py --no-udev --no-select run


2020-06-09 14:22:48,117 INFO    (MainThread) Running trials args=Namespace(command='run', debug=False, delay_msec=10, lv_count=500, pv_size=2199023255552, read_only=False, retries=0, trials=1, use_select=False, use_udev=False, verbose=0, vg_count=10)
...
2020-06-09 15:48:43,875 INFO    (worker/05) Trial 1 finished
2020-06-09 15:48:43,875 INFO    (worker/05) Worker finished
2020-06-09 15:48:43,876 INFO    (MainThread) Workers stopped
2020-06-09 15:48:44,104 INFO    (reload/vg) Stats: reloads=5323 errors=0 error_rate=0.00% failures=0 avg_time=0.968 med_time=0.852 min_time=0.152 max_time=10.829
2020-06-09 15:48:44,169 INFO    (reload/pv) Stats: reloads=5186 errors=0 error_rate=0.00% failures=0 avg_time=0.994 med_time=0.845 min_time=0.171 max_time=11.555
2020-06-09 15:48:44,176 INFO    (reload/lv) Stats: reloads=7095 errors=0 error_rate=0.00% failures=0 avg_time=0.726 med_time=0.709 min_time=0.153 max_time=9.606
2020-06-09 15:48:44,176 INFO    (MainThread) Reloaders stopped




[1] http://download.eng.bos.redhat.com/brewroot/work/tasks/7661/29187661/
[2] https://raw.githubusercontent.com/oVirt/vdsm/master/tests/storage/stress/reload.py

Comment 19 David Teigland 2020-06-26 19:53:50 UTC
commit in stable-2.02 branch:
https://sourceware.org/git/?p=lvm2.git;a=commit;h=1fb7a9d9e53cbde38250bb9ece72a25191899b78

Comment 24 Corey Marthaler 2020-07-09 21:27:31 UTC
Marking verified based on full regression testing of the latest 7.9 build, as well as a focus on running a multitude of different pvs/vgs/lvs commands.
We also ran the reproducer script given in bug 1837199 (tweaked for size limitations) as well as the one provided in comment #0.

3.10.0-1154.el7.x86_64

lvm2-2.02.187-6.el7    BUILT: Mon Jun 29 07:02:09 CDT 2020
lvm2-libs-2.02.187-6.el7    BUILT: Mon Jun 29 07:02:09 CDT 2020
lvm2-cluster-2.02.187-6.el7    BUILT: Mon Jun 29 07:02:09 CDT 2020

Comment 26 errata-xmlrpc 2020-09-29 19:56:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (lvm2 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3927

