Bug 2169947
| Field | Value |
|---|---|
| Summary | mdb_copy produces a corrupted database on btrfs |
| Product | [Fedora] Fedora |
| Component | kernel |
| Version | 37 |
| Status | CLOSED EOL |
| Severity | unspecified |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Reporter | Andreas Schneider <asn> |
| Assignee | fedora-kernel-btrfs |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | abartlet, acaringi, adam900710, adscvr, airlied, alciregi, asn, bskeggs, bugzilla, davide, fweimer, hdegoede, hpa, hyc, jarodwilson, jforbes, jglisse, josef, jra, jrische, jstanek, kernel-maint, lgoncalv, linville, masami256, mchehab, michele, michel, ngompa13, ptalbert, rsroka, steved, todoleza |
| Last Closed | 2024-01-12 22:45:13 UTC |

Description
Andreas Schneider
2023-02-15 08:07:13 UTC
Will take me a while to scare up space to create a btrfs partition. But just in general, mdb_copy doesn't do anything special - it just writes in multiples of OS pagesize in a loop until it's done. Does btrfs do anything weird like ZFS, that tries to configure its I/O blocksize based on the size of the first I/O to a file?

Ah had a spare SSD handy. A hex dump of the backup file shows that the contents of the 3rd thru 15th pages are missing, all zeroes.

    Dump: samba-dc.ldb.backup
    Offset:   0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f   0123456789abcdef
    00000000: 00 00 00 00 00 00 00 00 00 00 08 00 00 00 00 00 | ................ |
    00000010: de c0 ef be 01 00 00 00 00 00 00 00 00 00 00 00 | ................ |
    00000020: 00 00 00 00 02 00 00 00 00 10 00 00 08 40 01 00 | .............@.. |
    00000030: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................ |
    00000040: 00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 | ................ |
    00000050: 3f 01 00 00 00 00 00 00 00 00 00 00 00 00 03 00 | ?............... |
    00000060: 04 00 00 00 00 00 00 00 73 00 00 00 00 00 00 00 | ........s....... |
    00000070: 13 01 00 00 00 00 00 00 36 06 00 00 00 00 00 00 | ........6....... |
    00000080: 03 00 00 00 00 00 00 00 bf 01 00 00 00 00 00 00 | ................ |
    00000090: 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
    000000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
    *
    00001000: 01 00 00 00 00 00 00 00 00 00 08 00 00 00 00 00 | ................ |
    00001010: de c0 ef be 01 00 00 00 00 00 00 00 00 00 00 00 | ................ |
    00001020: 00 00 00 00 02 00 00 00 00 10 00 00 08 40 01 00 | .............@.. |
    00001030: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................ |
    00001040: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 | ................ |
    00001050: bf 01 00 00 00 00 00 00 00 00 00 00 00 00 03 00 | ................ |
    00001060: 04 00 00 00 00 00 00 00 73 00 00 00 00 00 00 00 | ........s....... |
    00001070: 14 01 00 00 00 00 00 00 3d 06 00 00 00 00 00 00 | ........=....... |
    00001080: 5d 01 00 00 00 00 00 00 bf 01 00 00 00 00 00 00 | ]............... |
    00001090: 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
    000010a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
    *
    00010000: 10 00 00 00 00 00 00 00 00 00 04 00 01 00 00 00 | ................ |
    00010010: 68 19 01 26 10 00 00 00 34 00 00 00 43 4e 3d 43 | h..&....4...CN=C |
    00010020: 6f 6d 70 75 74 65 72 73 2c 44 43 3d 62 61 63 6b | omputers,DC=back |
    00010030: 75 70 64 6f 6d 2c 44 43 3d 73 61 6d 62 61 2c 44 | updom,DC=samba,D |
    00010040: 43 3d 65 78 61 6d 70 6c 65 2c 44 43 3d 63 6f 6d | C=example,DC=com |
    00010050: 00 25 00 00 00 62 61 63 6b 75 70 64 6f 6d 2e 73 | .%...backupdom.s |
    00010060: 61 6d 62 61 2e 65 78 61 6d 70 6c 65 2e 63 6f 6d | amba.example.com |
    00010070: 2f 43 6f 6d 70 75 74 65 72 73 00 80 01 00 00 0b | /Computers...... |
    00010080: 00 00 00 6f 62 6a 65 63 74 43 6c 61 73 73 00 02 | ...objectClass.. |
    00010090: 00 00 00 01 03 09 02 00 00 00 63 6e 00 01 00 00 | ..........cn.... |
    000100a0: 00 01 09 0c 00 00 00 69 6e 73 74 61 6e 63 65 54 | .......instanceT |
    000100b0: 79 70 65 00 01 00 00 00 01 01 0b 00 00 00 77 68 | ype...........wh |
    000100c0: 65 6e 43 72 65 61 74 65 64 00 01 00 00 00 01 11 | enCreated....... |
    000100d0: 0b 00 00 00 77 68 65 6e 43 68 61 6e 67 65 64 00 | ....whenChanged. |
    000100e0: 01 00 00 00 01 11 0a 00 00 00 75 53 4e 43 72 65 | ..........uSNCre |
    000100f0: 61 74 65 64 00 01 00 00 00 01 04 14 00 00 00 6e | ated...........n |
    00010100: 54 53 65 63 75 72 69 74 79 44 65 73 63 72 69 70 | TSecurityDescrip |
    00010110: 74 6f 72 00 01 00 00 00 02 f4 05 04 00 00 00 6e | tor............n |
    00010120: 61 6d 65 00 01 00 00 00 01 09 0a 00 00 00 6f 62 | ame...........ob |

Haven't debugged yet to see why, but that explains the file/hash differences.

Created attachment 1944292 [details]
gdb session single-stepping

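For anyone who wants to repeat the hex-dump check without reading the dump by eye, the following is a minimal sketch of a page-by-page comparison. It is not part of LMDB or of the reproducer in this bug. The function names `zero_pages` and `differing_pages` are made up for illustration, and the 4096-byte page size is an assumption taken from the 0x1000 offsets in the dump.

```python
PAGE_SIZE = 4096  # assumption: matches the 0x1000 page offsets seen in the dump


def zero_pages(path, page_size=PAGE_SIZE):
    """Return 0-based indices of pages that contain only zero bytes."""
    found = []
    with open(path, "rb") as f:
        index = 0
        while True:
            page = f.read(page_size)
            if not page:
                break
            # A page is "missing" in the sense of the dump if every byte is 0.
            if page.count(0) == len(page):
                found.append(index)
            index += 1
    return found


def differing_pages(path_a, path_b, page_size=PAGE_SIZE):
    """Return 0-based indices of pages whose bytes differ between two files."""
    diffs = []
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        index = 0
        while True:
            page_a = a.read(page_size)
            page_b = b.read(page_size)
            if not page_a and not page_b:
                break
            # Pages past the end of the shorter file also count as differing.
            if page_a != page_b:
                diffs.append(index)
            index += 1
    return diffs
```

Running `differing_pages("samba-dc.ldb", "samba-dc.ldb.backup")` would list the damaged pages. `zero_pages` alone can over-report, since a source database may legitimately contain all-zero pages, so the comparison against the original is the more reliable signal.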
As you can see in the attached gdb session transcript, mdb_copy does exactly 2 write() calls here, first for 8192 bytes to copy the first two (meta) pages, and then a single write for all of the remaining data of the DB from the mmap to the destination fd. There can be no LMDB bug here, all the action happens in a single write() syscall.
It looks like btrfs is failing to page-in the contents of the source file into the mmap before writing to the destination file.
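The write pattern described above (one write for the two meta pages, then one write for the rest, both sourced directly from a read-only mmap of the original file) can be approximated as follows. This is a Python stand-in for illustration only, not mdb_copy's actual C code. The names `copy_via_mmap` and `_write_all` are hypothetical, the 8192-byte split is taken from the gdb transcript, and the destination is opened as a plain buffered fd with no special flags.

```python
import mmap
import os

META_BYTES = 8192  # assumption: two 4096-byte meta pages, per the gdb transcript


def _write_all(fd, buf):
    """Write every byte of buf to fd; a single write() may be partial."""
    while len(buf):
        written = os.write(fd, buf)
        buf = buf[written:]


def copy_via_mmap(src_path, dst_path, meta_bytes=META_BYTES):
    """Copy src to dst using write() calls whose source buffer is a mmap."""
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
        try:
            # Map the whole source file read-only; pages are faulted in lazily,
            # so the kernel has to page them in while servicing the write().
            with mmap.mmap(src_fd, 0, access=mmap.ACCESS_READ) as mm:
                with memoryview(mm) as view:
                    _write_all(dst_fd, view[:meta_bytes])  # meta pages
                    _write_all(dst_fd, view[meta_bytes:])  # remaining data
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Because the source buffer is the mapping itself, the data never passes through a user-space copy. That is why a fault while the kernel reads the mapped pages during the write is a plausible place for the copy to go wrong.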
Also a reminder, even if it works, btrfs is a poor choice of filesystem for use with LMDB. Best performance is with JFS with its journal on a separate device. Otherwise use a non-journaling filesystem like ext2, or just use a raw partition. All journaling filesystems are doing redundant work with LMDB's copy-on-write design.
http://www.lmdb.tech/bench/microbench/july/#sec11

Thanks for the analysis Howard. btrfs is the default filesystem for new Fedora installations, so it is really bad if something like this can happen.

By the way, this may have nothing to do with mmap or page-ins at all. A friend pointed me to a report of vaguely similar corruption in a file backup utility:
https://www.reddit.com/r/btrfs/comments/zq44ib/file_corruption_with_minio_dev_blames_btrfs/

Also for completeness' sake, I ran your reproducer on Ubuntu 22.04 with a mainline 5.17 kernel. I'd guess this bug has been around for a while, not just in 6.x kernel series.

(In reply to Howard Chu from comment #6)
> By the way, this may have nothing to do with mmap or page-ins at all. A
> friend pointed me to a report of vaguely similar corruption in a file backup
> utility:
> https://www.reddit.com/r/btrfs/comments/zq44ib/file_corruption_with_minio_dev_blames_btrfs/

It mentions O_DIRECT as well, which is rather suggestive that something is wrong with O_DIRECT handling.

Are there any kernel messages at the time of either the write to database or its verification?

I'm wondering if with O_DIRECT there's an in-flight change of data resulting in wrong btrfs data checksums, and subsequent read results in checksum failure, which would cause btrfs to report EIO to user space, and without handing over the blocks btrfs thinks are corrupt. In other words, there may not be corruption of the file, but because btrfs withholds data blocks it thinks are corrupt, the resulting incomplete read causes the database verification to report corruption.

Any blocks btrfs thinks are corrupt will typically produce prolific kernel messages, so just `dmesg | grep -i btrfs` will let us know if the above hypothesis is the right track.

Two ways to test the hypothesis:

A. `mount -o rescue=ignoredatacsums` and then try to reproduce the problem; OR
B. create a new database file with nodatacow enabled. e.g. use `chattr +C` on a zero length file; OR on the enclosing (empty) directory prior to creating the database file.

We do see this problem with VM's when the qemu cache mode uses O_DIRECT, and the recommended work around is to use nodatacow on such VM images. Libvirt enables nodatacow on enclosing directories on Btrfs when a pool is first activated.

I don't see any btrfs message at all in the kernel log.

Just to clarify: LMDB doesn't use O_DIRECT. That's only relevant to the issue discussed in that reddit post. The mdb_copy uses vanilla write(). The source of the write is a read-only mmap of the original database, and the destination is just a plain fd for the new copy of the file. Since we're writing directly from the mmap, the address being written from will always be page-aligned, if that matters.

Yep, in fact I can reproduce the mismatch whether the enclosing dir (and test files) are datacow or nodatacow.

I started an upstream thread:
https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/T/#u

There is another difference between the files shown by `ls -ls`

    1792 -rw-r--r--. 1 chris chris 1835008 Feb 15 15:08 samba-dc.ldb
    1736 -rw-r--r--. 1 chris chris 1835008 Feb 15 15:09 samba-dc.ldb.backup

It's somehow become a sparse file?

(In reply to Howard Chu from comment #10)
> Just to clarify: LMDB doesn't use O_DIRECT. That's only relevant to the
> issue discussed in that reddit post.

I see this under strace:

    1482685 openat(AT_FDCWD, "samba-dc.ldb.backup", O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 5
    1482685 fcntl(5, F_GETFL) = 0x8001 (flags O_WRONLY|O_LARGEFILE)
    1482685 fcntl(5, F_SETFL, O_WRONLY|O_DIRECT|O_LARGEFILE) = 0

That's before the writing starts.

(In reply to Florian Weimer from comment #12)
> (In reply to Howard Chu from comment #10)
> > Just to clarify: LMDB doesn't use O_DIRECT. That's only relevant to the
> > issue discussed in that reddit post.
>
> I see this under strace:
>
> 1482685 openat(AT_FDCWD, "samba-dc.ldb.backup", O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 5
> 1482685 fcntl(5, F_GETFL) = 0x8001 (flags O_WRONLY|O_LARGEFILE)
> 1482685 fcntl(5, F_SETFL, O_WRONLY|O_DIRECT|O_LARGEFILE) = 0
>
> That's before the writing starts.

Ah that's right, I'd forgotten. mdb_copy uses O_DIRECT because we don't want these writes to pollute the page cache or evict the pages of the DB we're reading from.

Is mdb_copy mixing direct and buffered writes on one file? As I understand it this results in undefined behavior.

(In reply to Chris Murphy from comment #14)
> Is mdb_copy mixing direct and buffered writes on one file? As I understand
> it this results in undefined behavior.

No. The file is opened, flags set with fcntl, then writes are done. No mixing of modes.

OK a few Btrfs devs are looking at it. Stay tuned.

Fix posted upstream:
https://lore.kernel.org/linux-btrfs/ae81e48b0e954bae1c3451c0da1a24ae7146606c.1676684984.git.boris@bur.io/T/#u

Does someone know with which kernel this will be fixed?

It's still being worked on upstream. See link in c17 for that discussion. A patch will go into mainline first, and then is backported to stable branches.

This message is a reminder that Fedora Linux 37 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05. It is Fedora's policy to close all bug reports from releases that are no longer maintained.
At that time this bug will be closed as EOL if it remains open with a 'version' of '37'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 37 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.

Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05.

Fedora Linux 37 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field.

If you are unable to reopen this bug, please file a new report against an active release.

Thank you for reporting this bug and we are sorry it could not be fixed.
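For anyone re-testing on a maintained release, the `ls -ls` comparison from earlier in the thread can be scripted. In that listing the two files have the same length (1835008 bytes), but the backup occupies 1736 blocks against 1792 for the original. Assuming `ls` reports 1 KiB blocks (consistent with 1792 × 1024 = 1835008), the 56 KiB gap equals fourteen 4096-byte pages. The sketch below reads the same numbers through `os.stat`. The helper names `allocation_report` and `looks_sparse` are made up for illustration.

```python
import os


def allocation_report(path):
    """Return (apparent_size, allocated_bytes) for a file."""
    st = os.stat(path)
    # On Linux, st_blocks counts 512-byte units regardless of the
    # filesystem's own block size.
    return st.st_size, st.st_blocks * 512


def looks_sparse(path):
    """True if fewer bytes are allocated on disk than the file's length."""
    size, allocated = allocation_report(path)
    return allocated < size
```

A copy that reports `looks_sparse(...) == True` while its source does not is worth comparing page by page. Allocation alone is not conclusive, though: compression and inline extents on btrfs can also change the block count.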