Red Hat Bugzilla – Bug 1643651
lvm can overwrite extents beyond metadata area
Last modified: 2018-11-05 09:01:04 EST
Description of problem:
In 7.6 (and 8.0), lvm began using a new i/o layer (bcache) to read and write data blocks. lvm uses a bcache block size of 128K. A bcache block at the end of the metadata area can overlap the PEs from which LVs are allocated; how much it overlaps depends on alignment. When lvm reads and writes one of these bcache blocks to update VG metadata, it can also be reading and writing PEs that belong to an LV. If those overlapping PEs are being written to by the LV user (e.g. a filesystem) at the same time that lvm is modifying VG metadata in the overlapping bcache block, then the user's updates to the PEs can be lost.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
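For illustration, here is a rough sketch of the alignment arithmetic involved, assuming bcache blocks are 128K-sized and 128K-aligned from the start of the device and that the metadata area ends where the first PE begins (the pe_start values below are illustrative, not from the report):

  #!/bin/bash
  # Sketch: how far the last bcache block of the metadata area can spill
  # into the first PE, depending on where the first PE starts.
  BCACHE=$((128 * 1024))

  overlap() {
      local pe_start=$1
      # End of the bcache block that contains the last byte of the
      # metadata area (the byte just before pe_start).
      local block_end=$(( ((pe_start - 1) / BCACHE + 1) * BCACHE ))
      echo "pe_start=${pe_start}: bcache block ends at ${block_end}," \
           "overlap into first PE = $((block_end - pe_start)) bytes"
  }

  overlap $((1024 * 1024))   # 1M, a 128K multiple -> 0 bytes of overlap
  overlap $((1092 * 1024))   # not a 128K multiple -> spills 60K into the PE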
Created attachment 1497911 [details]
tested patch

This is a hacky patch that I have tested, and it fixes the reproducible corruption I was seeing in a test. This patch needs a bit of cleanup; I'm attaching it as is because it's the one I've tested.
Created attachment 1497912 [details]
cleanup

This is some incremental cleanup to the fix that should be tested.
I'm testing this branch further, which includes the fixed-up cleanup: https://sourceware.org/git/?p=lvm2.git;a=shortlog;h=refs/heads/dev-dct-last-byte-1 (for some reason buildbot won't build this, not sure why)
The reproducible LV corruption I was seeing was from the following:

setup and start sanlock and lvmlockd

vgcreate --shared --metadatasize 1m foo /dev/sdg
(Note that this creates an internal "lvmlock" LV that sanlock uses to store leases.)

lvcreate 500 inactive LVs in foo

vgremove foo

During the vgremove step, sanlock notices that its updates to the internal "lvmlock" LV are periodically lost. That's because when vgremove writes metadata at the end of the metadata area, it also clobbers PEs that were allocated to the lvmlock LV. (sanlock reads/writes blocks to the lvmlock LV and notices if data changes out from under it.)

It should be straightforward to reproduce this same issue without lvmlockd and sanlock. Create an ordinary VG, create an initial small LV (that uses the first PEs in the VG), and start a script or program that reads/writes data to that LV and verifies that what it wrote comes back again. Then start creating 500 other LVs in the VG, and removing those 500 LVs. This causes the VG metadata to grow large and wrap around the end of the metadata area. When lvm writes to the end of the metadata area, it will clobber data that the test program wrote, and the test program should eventually notice that its last write is missing. (A sketch of this generic reproducer is below.)
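For what it's worth, here is a rough, untested sketch of that generic reproducer. The device name /dev/sdX, the VG/LV names, and the write-verify loop are illustrative placeholders rather than anything taken from this bug:

  #!/bin/bash
  # Untested sketch of the generic reproducer. The scratch device should be
  # a few GB in size and its contents will be destroyed.
  set -e
  DEV=/dev/sdX

  vgcreate --metadatasize 1m testvg "$DEV"
  lvcreate -y -n victim -L 64m testvg        # uses the first PEs in the VG

  # Writer/verifier: repeatedly write a padded counter to the start of the
  # victim LV (direct I/O, bypassing the page cache) and check it reads back.
  (
    i=0
    while true; do
      i=$((i + 1))
      want=$(printf '%-4096s' "$i")
      printf '%s' "$want" | dd of=/dev/testvg/victim bs=4096 count=1 \
          iflag=fullblock oflag=direct conv=fsync 2>/dev/null
      got=$(dd if=/dev/testvg/victim bs=4096 count=1 iflag=direct 2>/dev/null)
      if [ "$got" != "$want" ]; then
        echo "MISMATCH at iteration $i: last write is missing"
        exit 1
      fi
    done
  ) &
  VERIFIER=$!

  # Churn VG metadata so it grows and wraps around the metadata area.
  for n in $(seq 1 500); do lvcreate -y -an -n "lv$n" -l 1 testvg >/dev/null; done
  for n in $(seq 1 500); do lvremove -y "testvg/lv$n" >/dev/null; done

  kill "$VERIFIER" 2>/dev/null || true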
vg_validate() was designed as an independent in-tree check to guarantee that only valid metadata can ever hit disk. You might consider adding a similar hook, called and implemented independently of the faulty new code here, to try to make this class of bug impossible.
Possible ideas for mitigation (untested)

- Revert to a release before the bad commit
- Don't change any VG metadata (if lvm.conf locking_type is 1, change it to 4 to enforce this) - not appropriate if you have anything that makes metadata changes automatically (eg dmeventd for LV extension)
- Identify which LVs contain the PEs that might be affected and deactivate those LVs before changing VG metadata
- Use pvmove to move the LEs at risk elsewhere, then lvcreate to allocate "do not use" LVs on those PEs, taking them out of use (see the command sketch below)
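As a rough, untested illustration of the last two ideas (/dev/sdb, the VG name "vg", the LV name "some_lv", and the extent range are placeholders; which PEs are actually at risk depends on your metadata area size and alignment):

  # Which LVs own the first physical extents of a PV (the PEs most likely
  # to sit just past the metadata area):
  pvs --segments -o pv_name,lv_name,pvseg_start,pvseg_size /dev/sdb

  # Deactivate an affected LV before running commands that change VG metadata:
  lvchange -an vg/some_lv

  # Or move the extent(s) at risk elsewhere in the VG...
  pvmove /dev/sdb:0-0

  # ...then pin those extents with a placeholder LV so nothing else uses them
  # (only the first extent is shown here; adjust the range to your layout):
  lvcreate -n do_not_use -l 1 vg /dev/sdb:0-0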
I expect this is the fix we will use (waiting for buildbot to finish testing it): https://sourceware.org/git/?p=lvm2.git;a=commit;h=dabbed4f6af9d37d56671dd68a048b2462cd1da2
Pushed out final fixes.

stable branch for 7.6:
https://sourceware.org/git/?p=lvm2.git;a=commit;h=ab27d5dc2a5c3bf23ab8fed438f1542015dc723d

master branch:
https://sourceware.org/git/?p=lvm2.git;a=commit;h=aecf542126640faa17c240afbb1ea61f11355c39
"LVM might cause data corruption in the first 128kB of the disk" is not technically correct. It is the first 128kB of the first lvm physical extent of each PV in the VG. lvm can potentially use these physical extents anywhere in any LV.
LVM might cause data corruption in the first 128kB of the disk

A bug in the I/O layer of LVM causes LVM to read and write back the first 128kB of data that immediately follows the LVM metadata on the disk. If another program or the file system is modifying these blocks when you use an LVM command, changes might be lost. As a consequence, this might lead to data corruption in rare cases.

To work around this problem, do not use LVM commands that change volume group (VG) metadata, such as "lvcreate" or "lvextend", while logical volumes (LVs) in the VG are in use.

1) s/128k of the disk/128k of allocatable space of a PV/

2) s/sometimes//

3) s/follow/follows/

4) Should there be any word on when this issue will be resolved? Users will want to know if this is something that will be fixed soon.
5) s/"do not use LVM commands"/"avoid using LVM commands"/ With the likelihood so low of problems, users do have some discretion of performing these commands; but they should try to avoid it.
Thanks a lot, David, Jon, and Marek!

(In reply to Jonathan Earl Brassow from comment #17)
> LVM might cause data corruption in the first 128kB of the disk
>
> A bug in the I/O layer of LVM causes LVM to read and write back the first
> 128kB of data that immediately follows the LVM metadata on the disk. If
> another program or the file system is modifying these blocks when you use an
> LVM command, changes might be lost. As a consequence, this might lead to
> data corruption in rare cases.
>
> To work around this problem, do not use LVM commands that change volume
> group (VG) metadata, such as "lvcreate" or "lvextend", while logical volumes
> (LVs) in the VG are in use.
>
> 1) s/128k of the disk/128k of allocatable space of a PV/
>
> 2) s/sometimes//
>
> 3) s/follow/follows/

(In reply to Jonathan Earl Brassow from comment #18)
> 5) s/"do not use LVM commands"/"avoid using LVM commands"/
>
> With the likelihood of problems so low, users have some discretion in
> performing these commands, but they should try to avoid it.

All above fixed.

> 4) Should there be any word on when this issue will be resolved? Users will
> want to know if this is something that will be fixed soon.

We cannot make any promises regarding future fixes in docs, sorry.

I am now going to use this version and republish the Release Notes with it. If you have any further suggestions for improvements, we will be republishing the book quite frequently in the upcoming days (there are a number of last-minute release notes, as always).

Thanks again!
By default in RHEL7, lvm seems to consistently create VGs with the first physical extent (PE) at offset 1MB. When a PE begins at a multiple of 128KB (like 1MB), it is immune to this bug.

To check the PE offset:

$ vgs -o name,pe_start --units k vg
  VG   1st PE
  vg   1024.00k

The config setting "default_data_alignment" is 1 (MB) by default, and this leads to the common pe_start value of 1MB. If this setting is turned off (0), then pe_start values may not be aligned with 128KB and may potentially be affected by this bug.
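One way to check every PV at once is a small sketch like the following (assuming pvs reports pe_start in bytes with --units b --nosuffix; 131072 bytes = 128KiB):

  # Flag any PV whose first PE does not start on a 128KiB boundary.
  # Field order is vg_name, pv_name, pe_start (in bytes).
  pvs --noheadings --nosuffix --units b -o vg_name,pv_name,pe_start |
  awk '{ if ($3 % 131072 != 0)
           print "potentially affected:", $1, $2, "pe_start=" $3;
         else
           print "ok:", $1, $2, "pe_start=" $3 }'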
We have had the fix since yesterday. Marian, is there anything else you need before creating upstream releases and a 7.6 build?
Hi David, I'm from RHV storage QE. We would like to test RHV sanity with the fix. Can you please provide us the lvm build that contains the fix? Thanks
Elad, can you use lvm2-2.02.180-10.el7_6.2? https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=791193
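For reference, a quick way to check whether a host already has that build installed (a generic rpm query, not something specific to this bug):

  # Compare the installed lvm2 package against the fixed build
  rpm -q lvm2
  # on a fixed 7.6 host the output would start with:
  #   lvm2-2.02.180-10.el7_6.2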
Email about this bug to the upstream lvm mailing list: https://www.redhat.com/archives/linux-lvm/2018-October/msg00054.html