1643651 – lvm can overwrite extents beyond metadata area

RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.

Bug 1643651 - lvm can overwrite extents beyond metadata area

Summary: lvm can overwrite extents beyond metadata area

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	lvm2
Sub Component:
Version:	7.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	David Teigland
QA Contact:	cluster-qe@redhat.com
Docs Contact:	Marek Suchánek
URL:
Whiteboard:
Depends On:
Blocks:	1644199 1644206 1644684
TreeView+	depends on / blocked

Reported:	2018-10-26 21:35 UTC by David Teigland
Modified:	2021-09-03 12:51 UTC (History)
CC List:	33 users (show)
Fixed In Version:	lvm2-2.02.184-1.el7
Doc Type:	Bug Fix
Doc Text:	.LVM no longer causes data corruption in the first 128kB of allocatable space of a physical volume Previously, a bug in the I/O layer of LVM might have caused data corruption in rare cases. The bug could manifest only when the following conditions were true at the same time: * A physical volume (PV) was created with a non-default alignment. The default is 1MB. * An LVM command was modifying metadata at the tail end of the metadata region of the PV. * A user or a file system was modifying the same bytes (racing). No cases of the data corruption have been reported. With this update, the problem has been fixed, and LVM can no longer cause data corruption under these conditions.
Clone Of:
Clones:	1644199 1644206 (view as bug list)
Environment:
Last Closed:	2019-08-06 13:10:41 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
tested patch (5.99 KB, text/plain) 2018-10-26 21:45 UTC, David Teigland	no flags	Details
cleanup (2.39 KB, text/plain) 2018-10-26 21:47 UTC, David Teigland	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:2253	0	None	None	None	2019-08-06 13:11:12 UTC

Description David Teigland 2018-10-26 21:35:15 UTC

Description of problem:

In 7.6 (and 8.0), lvm began using a new i/o layer (bcache)
to read and write data blocks.

lvm uses a bcache block size of 128K.  A bcache block
at the end of the metadata area will overlap the PEs
from which LVs are allocated.  How much depends on
alignments.  When lvm reads and writes one of these
bcache blocks to update VG metadata, it can also be
reading and writing PEs that belong to an LV.
    
If these overlapping PEs are being written to by the
LV user (e.g. filesystem) at the same time that lvm
is modifying VG metadata in the overlapping bcache
block, then the user's updates to the PEs can be lost.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 David Teigland 2018-10-26 21:45:52 UTC

Created attachment 1497911 [details]
tested patch

This is a hacky patch that I have tested and it fixes the reproducible corruption I was setting in a test.

This patch needs a bit of cleanup, I'm attaching it as is because it's the one I've tested.

Comment 3 David Teigland 2018-10-26 21:47:40 UTC

Created attachment 1497912 [details]
cleanup

This is some incremental cleanup to the fix that should be tested.

Comment 4 David Teigland 2018-10-26 22:17:34 UTC

This patch I'm testing further which includes fixed cleanup:

https://sourceware.org/git/?p=lvm2.git;a=shortlog;h=refs/heads/dev-dct-last-byte-1

(for some reason buildbot won't build this, not sure why)

Comment 5 David Teigland 2018-10-26 22:26:15 UTC

The reproducible LV corruption I was seeing was from the following:

setup and start sanlock and lvmlockd

vgcreate --shared --metadatasize 1m foo /dev/sdg

(Note that this creates an internal "lvmlock" LV that sanlock uses to store leases.)

lvcreate 500 inactive LVs in foo

vgremove foo

During the vgremove step, sanlock notices that its updates to the internal "lvmlock" LV are periodically lost.  It's because when vgremove writes metadata at the end of the metadata area, it also clobbers PEs that were allocated to the lvmlock LV.  (sanlock reads/writes blocks to the lvmlock LV and notices if data changes out from under it.)

It should be straight forward to reproduce this same issue without lvmlockd and sanlock.  Create an ordinary VG, create an initial small LV (that uses the first PEs in the VG), start a script or program that reads/writes data to that LV and verifies that what it wrote comes back again.  Then start creating 500 other LVs in the VG, and removing those 500 LVs.  This causes the LV metadata to grow large and wrap around the end of the metadata area.  When lvm writes to the end of the metadata area, it will clobber data that the test program wrote and the test program should eventually notice that its last write is missing.

Comment 6 Alasdair Kergon 2018-10-27 01:17:28 UTC

vg_validate() was designed as an independent in-tree check to guarantee only valid metadata can ever hit disk.

You might consider adding some similar hook called and implemented independently of the faulty new code here to try to make this class of bug impossible.

Comment 8 Alasdair Kergon 2018-10-27 14:23:12 UTC

Possible ideas for mitigation (untested)

- Revert to a release before the bad commit

- Don't change any VG metadata (if lvm.conf locking_type is 1 change to 4 to enforce) - not appropriate if you have anything that makes metadata changes automatically (eg dmeventd for LV extension)

- Identify which LVs contain the PEs that might be affected and deactivate those LVs before changing VG metadata

- Use pvmove to move the LEs at risk elsewhere, then lvcreate to allocate "do not use" LVs on those PEs, taking them out of use.

Comment 10 David Teigland 2018-10-29 19:10:25 UTC

I expect this is the fix we will use (waiting for buildbot to finish testing it):

https://sourceware.org/git/?p=lvm2.git;a=commit;h=dabbed4f6af9d37d56671dd68a048b2462cd1da2

Comment 12 David Teigland 2018-10-29 22:33:38 UTC

Pushed out final fixes

stable branch for 7.6
https://sourceware.org/git/?p=lvm2.git;a=commit;h=ab27d5dc2a5c3bf23ab8fed438f1542015dc723d

master branch
https://sourceware.org/git/?p=lvm2.git;a=commit;h=aecf542126640faa17c240afbb1ea61f11355c39

Comment 16 David Teigland 2018-10-30 14:20:34 UTC

"LVM might cause data corruption in the first 128kB of the disk" is not technically correct.  It is the first 128kB of the first lvm physical extent of each PV in the VG.  lvm can potentially use these physical extents anywhere in any LV.

Comment 17 Jonathan Earl Brassow 2018-10-30 14:45:23 UTC

LVM might cause data corruption in the first 128kB of the disk

A bug in the I/O layer of LVM causes LVM to read and write back the first 128kB of data that immediately follows the LVM metadata on the disk. If another program or the file system is modifying these blocks when you use an LVM command, changes might be lost. As a consequence, this might lead to data corruption in rare cases.

To work around this problem, do not use LVM commands that change volume group (VG) metadata, such as "lvcreate" or "lvextend", while logical volumes (LVs) in the VG are in use.

1) s/128k of the disk/128k of allocatable space of a PV/

2) s/sometimes//

3) s/follow/follows/

4) Should there be any word on when this issue will be resolved?  Users will want to know if this is something that will be fixed soon.

Comment 18 Jonathan Earl Brassow 2018-10-30 14:59:27 UTC

5) s/"do not use LVM commands"/"avoid using LVM commands"/

With the likelihood so low of problems, users do have some discretion of performing these commands; but they should try to avoid it.

Comment 19 Lenka Špačková 2018-10-30 15:03:32 UTC

Thanks a lot, David, Jon, and Marek!

(In reply to Jonathan Earl Brassow from comment #17)
> LVM might cause data corruption in the first 128kB of the disk
> 
> A bug in the I/O layer of LVM causes LVM to read and write back the first
> 128kB of data that immediately follows the LVM metadata on the disk. If
> another program or the file system is modifying these blocks when you use an
> LVM command, changes might be lost. As a consequence, this might lead to
> data corruption in rare cases.
> 
> To work around this problem, do not use LVM commands that change volume
> group (VG) metadata, such as "lvcreate" or "lvextend", while logical volumes
> (LVs) in the VG are in use.
> 
> 1) s/128k of the disk/128k of allocatable space of a PV/
> 
> 2) s/sometimes//
> 
> 3) s/follow/follows/


(In reply to Jonathan Earl Brassow from comment #18)
> 5) s/"do not use LVM commands"/"avoid using LVM commands"/
> 
> With the likelihood so low of problems, users do have some discretion of
> performing these commands; but they should try to avoid it.

All above fixed.

> 
> 4) Should there be any word on when this issue will be resolved?  Users will
> want to know if this is something that will be fixed soon.


We cannot make any promises regarding future fixes in docs, sorry. 

I am now going to use this version and republish the Release Notes with it. If you have any further suggestions for improvements, we will be republishing the book quite frequently in the upcoming days (there is a number of last-minute release notes, as always).

Thanks again!

Comment 21 David Teigland 2018-10-30 20:20:32 UTC

By default in RHEL7, lvm seems to consistently create VGs with the first physical extent (PE) at offset 1MB.  When a PE begins at a multiple of 128KB (like 1MB) it is immune to this bug.  To check the PE offset:

$ vgs -o name,pe_start --units k vg
  VG 1st PE  
  vg 1024.00k

The config setting "default_data_alignment" is 1 (MB) by default, and this leads to the common pe_start value of 1MB.  If this setting is turned off (0), then pe_start values may not be aligned with 128KB, and potentially affected by this bug.

Comment 23 David Teigland 2018-10-30 20:31:31 UTC

We have had the fix since yesterday.  Marian, is there anything else you need before creating upstream releases and a 7.6 build?

Comment 24 Elad 2018-10-30 21:31:38 UTC

Hi David, I'm from RHV storage QE. We would like to test RHV sanity with the fix. Can you please provide us the lvm build that contains the fix? Thanks

Comment 26 David Teigland 2018-10-31 14:11:26 UTC

Elad, can you use lvm2-2.02.180-10.el7_6.2 ?

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=791193

Comment 28 David Teigland 2018-10-31 20:16:16 UTC

Email about this bug to the upstream lvm mailing list:
https://www.redhat.com/archives/linux-lvm/2018-October/msg00054.html

Comment 39 David Teigland 2019-05-23 15:43:03 UTC

There's nothing relevant to say about how it's fixed.  I think the doc text may be stating the problem a little too severely:

1. This problem could only arise if an LVM PV was created with a non-default alignment (the default is 1MB).  This is uncommon.

2. This problem could only arise while an LVM command is running and modifying metadata.  This is usually infrequent.

3. The problem could only arise when the LVM command is modifying metadata at the tail end of the metadata region.  This is usually infrequent.

4. The problem could only arise if the previous three items are all true at the same moment that a user or fs happens to be modifying the same bytes (racing).

5. Data loss could occur if the race in 4 went one way.

That is probably too much detail for the doc text, so perhaps saying that it's "rare" is sufficient.

Comment 41 Marek Suchánek 2019-05-24 11:19:51 UTC

Thanks for the additional details. I've rewritten the release note to be more specific and not to give the impression that there were reported cases of the data corruption.

Comment 42 David Teigland 2019-05-24 14:31:42 UTC

Let's remove the "found through code inspection" phrase, because it's not entirely accurate. I discovered this through an unusal test that I was running that revealed this issue.

Comment 43 Marek Suchánek 2019-05-24 15:02:19 UTC

(In reply to David Teigland from comment #42)
> Let's remove the "found through code inspection" phrase, because it's not
> entirely accurate. I discovered this through an unusal test that I was
> running that revealed this issue.

Removed.

Comment 44 Corey Marthaler 2019-07-02 21:21:49 UTC

Marking verified with the latest rpms.

3.10.0-1057.el7.x86_64

lvm2-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-libs-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-cluster-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
lvm2-lockd-2.02.185-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-libs-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-event-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-event-libs-1.02.158-2.el7    BUILT: Fri Jun 21 04:18:48 CDT 2019
device-mapper-persistent-data-0.8.5-1.el7    BUILT: Mon Jun 10 03:58:20 CDT 2019
sanlock-3.7.3-1.el7    BUILT: Tue May 21 10:44:00 CDT 2019
sanlock-lib-3.7.3-1.el7    BUILT: Tue May 21 10:44:00 CDT 2019


[root@hayes-02 ~]# vgcreate --config devices/default_data_alignment=0 --metadatasize 520k gg /dev/sdg1
  Physical volume "/dev/sdg1" successfully created.
  Volume group "gg" successfully created
[root@hayes-02 ~]# vgs -o+pe_start gg
  VG #PV #LV #SN Attr   VSize    VFree    1st PE 
  gg   1   0   0 wz--n- <931.25g <931.25g 576.00k
[root@hayes-02 ~]# lvcreate -l1 -n test gg
  Logical volume "test" created.
[root@hayes-02 ~]# sanlock daemon -w 0
[root@hayes-02 ~]# sanlock client init -s LS:0:/dev/gg/test:0 -o 1
init
init done 0
[root@hayes-02 ~]# sanlock client add_lockspace -s LS:1:/dev/gg/test:0 -o 1
add_lockspace_timeout 1
add_lockspace_timeout done 0
[root@hayes-02 ~]# tail -f /var/log/sanlock.log
2019-07-02 16:07:41 81985 [9327]: sanlock daemon started 3.7.3 host 7856c016-0652-4b98-9b9b-b472c5895c90.hayes-02.l
2019-07-02 16:07:55 81999 [9330]: s1 lockspace LS:1:/dev/gg/test:0
2019-07-02 16:07:58 82002 [9327]: s1 host 1 1 81999 7856c016-0652-4b98-9b9b-b472c5895c90.hayes-02.l



[root@hayes-02 ~]# for i in `seq 1 1000`; do lvcreate -an -l1 -n lv$i gg; done
[...]
  VG gg metadata on /dev/sdg1 (292777 bytes) too large for circular buffer (585216 bytes with 292471 used)
  Failed to write VG gg.
  VG gg metadata on /dev/sdg1 (292777 bytes) too large for circular buffer (585216 bytes with 292471 used)
  Failed to write VG gg.
  VG gg metadata on /dev/sdg1 (292778 bytes) too large for circular buffer (585216 bytes with 292471 used)
  Failed to write VG gg.

[root@hayes-02 ~]# lvremove -f gg
[...]
  Logical volume "lv948" successfully removed
  Logical volume "lv949" successfully removed
  Logical volume "lv950" successfully removed
  Logical volume "lv951" successfully removed
  Logical volume "lv952" successfully removed
  Logical volume "lv953" successfully removed


## No reported corruption
2019-07-02 16:12:19 82263 [9337]: s1 delta_renew long write time 1 sec
2019-07-02 16:19:08 82672 [9337]: s1 delta_renew long write time 1 sec

Comment 46 errata-xmlrpc 2019-08-06 13:10:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2253

Note You need to log in before you can comment on or make changes to this bug.

agk
bruce.howells
bugzilla
carl
cmarthal
ddumas
dominik.mierzejewski
ebenahar
fbrychta
fgarciad
gveitmic
heinzm
jbrassow
jdeenada
jpittman
jpriddy
lkuprova
loberman
mcsontos
mkalinin
msnitzer
pasik
pdwyer
prajnoha
prockai
rhandlin
salmy
sreber
teigland
therman
thornber
xzhou
zkabelac