Bug 1723611
| Field | Value |
|---|---|
| Summary: | VM disk creations fail while changing LV tags. |
| Product: | Red Hat Enterprise Linux 7 |
| Reporter: | Siddhant Rao <sirao> |
| Component: | lvm2 |
| Assignee: | David Teigland <teigland> |
| lvm2 sub component: | Default / Unclassified |
| QA Contact: | cluster-qe <cluster-qe> |
| Status: | CLOSED ERRATA |
| Severity: | urgent |
| Priority: | urgent |
| CC: | agk, cmarthal, heinzm, jbrassow, lsurette, mcsontos, msnitzer, nkshirsa, prajnoha, rhandlin, srevivo, teigland, ycui, zkabelac |
| Version: | 7.6 |
| Keywords: | ZStream |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | x86_64 |
| OS: | Linux |
| Fixed In Version: | lvm2-2.02.186-1.el7 |
| Clones: | 1732994 1740521 (view as bug list) |
| Last Closed: | 2020-03-31 20:04:51 UTC |
| Type: | Bug |
| oVirt Team: | Storage |
| Bug Blocks: | 1732994, 1740521 |
Description
Siddhant Rao
2019-06-25 00:51:56 UTC
Please grab 'lvmdump -a -m' and attach.

The physical block size of a device should usually be 512 bytes (or sometimes 4K), but in this case it's 2MB:

/dev/mapper/36000d31000242e0000000000000001c2: Physical block size is 2097152 bytes

I'm still trying to understand what errors are being produced in LVM from using 2MB, but that analysis is somewhat academic, because the value needs to be fixed for LVM to work. Until we can build a version of LVM that ignores invalid block sizes, you might be able to force the devices or the system to use 512 bytes.

Could you please run the blockdev command with each of the following options on /dev/mapper/36000d31000242e0000000000000001c2:

--getss     get logical block (sector) size
--getpbsz   get physical block (sector) size
--getiomin  get minimum I/O size
--getioopt  get optimal I/O size
--getbsz    get blocksize

The mda_header is a 512-byte block at the start of the metadata area, which lvm writes to commit new metadata (it points to the location of the metadata text within the circular metadata buffer following the header). When lvm writes this 512-byte mda_header, it tells bcache not to write beyond the end of the mda_header (metadata area start + 512). This is the region of disk lvm wants to be writing to, and no more. LVM uses this write-limiting mechanism (set_last_byte) to avoid problems in case a bcache block (128KB) happens to extend past the end of the metadata area. That's not common, but it is possible, and we must not write past the metadata area, or we'd be overwriting user data.

When bcache goes to write the 512-byte mda_header from a 128KB bcache block, the write-limiting code goes into effect and reduces the write size to either 512 bytes (when the mda_header is located at the start of the bcache block) or 4608 bytes (when the mda_header is located 4096 bytes from the start of the bcache block). So bcache is writing a partial bcache block: only 512 or 4608 bytes to disk, rather than 128KB.
There is one more wrinkle in this write-limiting mechanism. After reducing the write size according to the limit imposed by lvm (making the write either 512 or 4608 bytes), the resulting write size is checked to see if it's a multiple of the disk sector size; we do not want to write partial disk sectors. If the reduced write size is not a whole number of sectors, it is extended to be a whole number of sectors. So the write size is first reduced per the lvm limit, and then extended again per the sector size.

The sector size is expected to be either 512 or 4096 bytes. When 512, the reduced write size (either 512 or 4608) is always a multiple of 512 and does not need extending. When 4096, the reduced write size is extended to 4096 or 8192 bytes. In both cases, the reduced+extended write size is far below the 128KB bcache block size, and lvm is still only writing a partial bcache block to disk [1].

However, in some cases the sector size returned from BLKPBSZGET is not 512 or 4096; in one case it is 2MB. When it is 2MB, the reduced write size is extended to 2MB, which is far beyond the size of the 128KB bcache block. This leads to writing unknown data to disk, into the metadata area and often into the data area (since the default metadata area is 1MB, we'd often be writing nearly 1MB of junk into the data area).

LVM needs to check the sector size returned from the ioctl, and use only 512 or 4096, regardless of what other values are returned. (We may also want to use BLKSSZGET instead, which the kernel describes as the "lowest possible block size that the storage device can address". BLKPBSZGET is described as "the lowest possible sector size that the hardware can operate on without reverting to read-modify-write operations".)

[1] There is some question here if lvm were to write within the final 512 bytes of a bcache block, and a 4K sector size extended the write to 4096. This may not happen in practice, but we need to handle it correctly.
From the failing command trace, which shows each mda_header write extended to 2097152 bytes (2MB).

Write of the first mda_header (precommit): the bcache block starts at offset 0, and the mda_header is at offset 4096, so the limit on the number of bytes written from the bcache block is 4608, which is where the first mda_header ends within the bcache block. This write succeeds. RHV uses a very large metadata area (128MB?), so the 2MB written when writing the mda_header wrote junk into the circular metadata buffer, in an area not used by the current copy of the metadata. So there was no damage done by this write.

#device/bcache.c:205 Limit write at 0 len 131072 to len 4608 rounded to 2097152
#format_text/format-text.c:331 Reading mda header sector from /dev/mapper/36000d31000242e0000000000000001c2 at 109937741004
#format_text/format-text.c:800 Pre-Committing 0c06f7f6-bf55-4eff-a23b-72f4886675b7 metadata (500) to /dev/mapper/36000d3100
#device/dev-io.c:609 Opened /dev/mapper/36000d31000242e0000000000000001c2 RO O_DIRECT
#device/dev-io.c:166 /dev/mapper/36000d31000242e0000000000000001c2: Block size is 4096 bytes
#device/dev-io.c:177 /dev/mapper/36000d31000242e0000000000000001c2: Physical block size is 2097152 bytes
#device/dev-io.c:658 Closed /dev/mapper/36000d31000242e0000000000000001c2

Write of the second mda_header (precommit): the bcache block starts at offset 1099377410048, and the mda_header is at offset 0 within the bcache block, so the limit on the number of bytes written from the bcache block is 512, which is where the second mda_header ends within the bcache block. This write fails somewhere in bcache.c, but all we see is false returned by the top-level bcache write function, so we don't know what caused the write error. If this write had succeeded, it would not have caused damage, because the current copy of metadata is far enough into the metadata area to not be clobbered.
(When writing the second mda_header at the end of the disk, we don't have to be concerned about clobbering user data, but the incorrect write extension could lead to writing beyond the end of the device.)

#device/bcache.c:205 Limit write at 1099377410048 len 131072 to len 512 rounded to 2097152
#label/label.c:1304 Error writing device /dev/mapper/36000d31000242e0000000000000001c2 at 1099377410048 length 512.
#format_text/format-text.c:407 Failed to write mda header to /dev/mapper/36000d31000242e0000000000000001c2 fd -1

I can't explain why this VG had no problems earlier in its lifetime, especially when the metadata text was within the range of the errant 2MB writes. One theory is that adjacent bcache blocks in memory contained the current blocks of data from disk, so when unintended bcache blocks were written to disk they happened to contain the correct data. Also, until we can find out which bcache.c error is causing the write error, we don't have a complete explanation of the problem.

Initial fix for further review and testing is here:
https://sourceware.org/git/?p=lvm2.git;a=shortlog;h=refs/heads/dev-dct-fix-sector-size-1

> [1] There is some question here if lvm were to write within the final 512
> bytes of a bcache block, and a 4K sector size extended the write to 4096.
> This may not happen in practice, but we need to handle it correctly.
Looked into this and it's not a problem as long as the bcache block size is a multiple of the 512|4096 sector size.
Hi David,

Thanks for your help. As you rightly mentioned, the physical block size is 2MB. Please find the output you requested:

# blockdev -v --getss /dev/mapper/36000d31000242e0000000000000001c2
get logical block (sector) size: 512
# blockdev -v --getpbsz /dev/mapper/36000d31000242e0000000000000001c2
get physical block (sector) size: 2097152
# blockdev -v --getiomin /dev/mapper/36000d31000242e0000000000000001c2
get minimum I/O size: 2097152
# blockdev -v --getioopt /dev/mapper/36000d31000242e0000000000000001c2
get optimal I/O size: 2097152
# blockdev -v --getbsz /dev/mapper/36000d31000242e0000000000000001c2
get blocksize: 4096

Let me know if you need anything else. Thanks!

Second version of the patch, which saves the logical/physical sizes in the dev struct for reuse:
https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=7550665ba49ac7d497d5b212e14b69298ef01361

One thing about this patch that I'm slightly unsure about is how to handle physical_sector_size=512 && logical_sector_size=4096, or physical_sector_size=4096 && logical_sector_size=512. In both of these cases the device should handle 512-byte I/Os, so I'm rounding our I/Os to the nearest 512 bytes.

Ben figured out how to use scsi_debug to test this. Create a 1GB scsi device with a 2MB physical sector size and a 4096-byte logical sector size:

modprobe scsi_debug sector_size=4096 physblk_exp=9 dev_size_mb=1024

Find out which scsi device that is (assuming you're not already using scsi_debug):

SDN=`grep scsi_debug /sys/block/*/device/model | sed 's/^\/sys\/block\/\(.*\)\/device.*/\1/'`
vgcreate test /dev/$SDN
for i in `seq 1 40`; do lvcreate -l1 -an test; done

For me this results in corrupted VG metadata at about lvol38 when using the old version of lvm.

Pushed to the stable-2.02 branch:
https://sourceware.org/git/?p=lvm2.git;a=commitdiff;h=7550665ba49ac7d497d5b212e14b69298ef01361

Fix verified with the latest rpms using the example in comment #25.
3.10.0-1109.el7.x86_64

lvm2-2.02.186-3.el7                        BUILT: Fri Nov 8 07:07:01 CST 2019
lvm2-libs-2.02.186-3.el7                   BUILT: Fri Nov 8 07:07:01 CST 2019
lvm2-cluster-2.02.186-3.el7                BUILT: Fri Nov 8 07:07:01 CST 2019
lvm2-lockd-2.02.186-3.el7                  BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-1.02.164-3.el7               BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-libs-1.02.164-3.el7          BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-event-1.02.164-3.el7         BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-event-libs-1.02.164-3.el7    BUILT: Fri Nov 8 07:07:01 CST 2019
device-mapper-persistent-data-0.8.5-1.el7  BUILT: Mon Jun 10 03:58:20 CDT 2019

With fix (no issues):

[...]
  Logical volume "lvol33" created.
  WARNING: Logical volume test/lvol34 not zeroed.
  Logical volume "lvol34" created.
  WARNING: Logical volume test/lvol35 not zeroed.
  Logical volume "lvol35" created.
  WARNING: Logical volume test/lvol36 not zeroed.
  Logical volume "lvol36" created.
  WARNING: Logical volume test/lvol37 not zeroed.
  Logical volume "lvol37" created.
  WARNING: Logical volume test/lvol38 not zeroed.
  Logical volume "lvol38" created.
  WARNING: Logical volume test/lvol39 not zeroed.
  Logical volume "lvol39" created.

Without fix (lvm2-2.02.185-2.el7.x86_64):

  Logical volume "lvol24" created.
  WARNING: Logical volume test/lvol25 not zeroed.
  Logical volume "lvol25" created.
  WARNING: Logical volume test/lvol26 not zeroed.
  Logical volume "lvol26" created.
  Metadata on /dev/sdl at 139264 has wrong VG name "" expected test.
  Metadata on /dev/sdl at 139264 has wrong VG name "" expected test.
  WARNING: Logical volume test/lvol27 not zeroed.
  Logical volume "lvol27" created.
  WARNING: Logical volume test/lvol28 not zeroed.
  Logical volume "lvol28" created.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:1129