Created attachment 1944230 [details]
mdb_copy reproducer

1. Please describe the problem:

The mdb_copy operation on btrfs produces a corrupted database. It works just fine on ext4!

Requirements:

sudo dnf install lmdb

## BTRFS

$ tar xf mdb_copy_reproducer.tar.xz
$ ./reproducer.sh
Current directory: /home/asn/mdb
Filesystem information for current directory:
Filesystem                                            Type  1024-blocks      Used Available Capacity Mounted on
/dev/mapper/luks-feea5c33-1033-4e33-a507-4ea83cbaf610 btrfs   248378368 209424480  37336592      85% /home
Checksum of samba-dc.ldb: 2307b6aef35ee923c9f9c2f02541ca72760116a39cad62d8455e972cddcabd0a
Backup samba-dc.ldb with mdb_copy to samba-dc.ldb.backup
LMDB 0.9.29: (March 16, 2021)
Checksum of samba-dc.ldb.backup: e9f8ede21831aec0da3897b5880b3008761fb8942d0f4fe086c637319fde4c61
FATAL: The checksums don't match!

## EXT4

$ ./reproducer.sh
Current directory: /home/asn/mdb
Filesystem information for current directory:
Filesystem         Type 1024-blocks       Used Available Capacity Mounted on
/dev/mapper/cr_md1 ext4  1921722432 1752912108  71118376      97% /home
Checksum of samba-dc.ldb: 2307b6aef35ee923c9f9c2f02541ca72760116a39cad62d8455e972cddcabd0a
Backup samba-dc.ldb with mdb_copy to samba-dc.ldb.backup
LMDB 0.9.29: (March 16, 2021)
Checksum of samba-dc.ldb.backup: 2307b6aef35ee923c9f9c2f02541ca72760116a39cad62d8455e972cddcabd0a
The checksums match

2. What is the Version-Release number of the kernel:

Linux krikkit 6.1.9-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 2 00:21:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

It also fails on:

Linux samba-cli01 6.0.5-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Oct 26 15:55:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

See attachment.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

Yes, it does.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

Nope.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.

Not relevant. Reproducer attached.
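For readers without the attachment, the check the reproducer performs can be approximated in a few lines. This is only a sketch: the reproducer script itself is not shown in this report, and SHA-256 is an assumption based on the 64-hex-digit checksums in the output above. The function names `sha256_of` and `checksums_match` are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_match(original, backup):
    """True if both files have identical content according to SHA-256."""
    return sha256_of(original) == sha256_of(backup)
```

Calling `checksums_match("samba-dc.ldb", "samba-dc.ldb.backup")` after running mdb_copy would correspond to the pass/fail line printed by the reproducer, under the assumptions stated above.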
Will take me a while to scare up space to create a btrfs partition. But just in general, mdb_copy doesn't do anything special - it just writes in multiples of OS pagesize in a loop until it's done. Does btrfs do anything weird like ZFS, that tries to configure its I/O blocksize based on the size of the first I/O to a file?
Ah had a spare SSD handy. A hex dump of the backup file shows that the contents of the 3rd thru 15th pages are missing, all zeroes.

Dump: samba-dc.ldb.backup
Offset:    0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f   0123456789abcdef
00000000: 00 00 00 00 00 00 00 00 00 00 08 00 00 00 00 00 | ................ |
00000010: de c0 ef be 01 00 00 00 00 00 00 00 00 00 00 00 | ................ |
00000020: 00 00 00 00 02 00 00 00 00 10 00 00 08 40 01 00 | .............@.. |
00000030: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................ |
00000040: 00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 | ................ |
00000050: 3f 01 00 00 00 00 00 00 00 00 00 00 00 00 03 00 | ?............... |
00000060: 04 00 00 00 00 00 00 00 73 00 00 00 00 00 00 00 | ........s....... |
00000070: 13 01 00 00 00 00 00 00 36 06 00 00 00 00 00 00 | ........6....... |
00000080: 03 00 00 00 00 00 00 00 bf 01 00 00 00 00 00 00 | ................ |
00000090: 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
000000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
*
00001000: 01 00 00 00 00 00 00 00 00 00 08 00 00 00 00 00 | ................ |
00001010: de c0 ef be 01 00 00 00 00 00 00 00 00 00 00 00 | ................ |
00001020: 00 00 00 00 02 00 00 00 00 10 00 00 08 40 01 00 | .............@.. |
00001030: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................ |
00001040: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 | ................ |
00001050: bf 01 00 00 00 00 00 00 00 00 00 00 00 00 03 00 | ................ |
00001060: 04 00 00 00 00 00 00 00 73 00 00 00 00 00 00 00 | ........s....... |
00001070: 14 01 00 00 00 00 00 00 3d 06 00 00 00 00 00 00 | ........=....... |
00001080: 5d 01 00 00 00 00 00 00 bf 01 00 00 00 00 00 00 | ]............... |
00001090: 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
000010a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
*
00010000: 10 00 00 00 00 00 00 00 00 00 04 00 01 00 00 00 | ................ |
00010010: 68 19 01 26 10 00 00 00 34 00 00 00 43 4e 3d 43 | h..&....4...CN=C |
00010020: 6f 6d 70 75 74 65 72 73 2c 44 43 3d 62 61 63 6b | omputers,DC=back |
00010030: 75 70 64 6f 6d 2c 44 43 3d 73 61 6d 62 61 2c 44 | updom,DC=samba,D |
00010040: 43 3d 65 78 61 6d 70 6c 65 2c 44 43 3d 63 6f 6d | C=example,DC=com |
00010050: 00 25 00 00 00 62 61 63 6b 75 70 64 6f 6d 2e 73 | .%...backupdom.s |
00010060: 61 6d 62 61 2e 65 78 61 6d 70 6c 65 2e 63 6f 6d | amba.example.com |
00010070: 2f 43 6f 6d 70 75 74 65 72 73 00 80 01 00 00 0b | /Computers...... |
00010080: 00 00 00 6f 62 6a 65 63 74 43 6c 61 73 73 00 02 | ...objectClass.. |
00010090: 00 00 00 01 03 09 02 00 00 00 63 6e 00 01 00 00 | ..........cn.... |
000100a0: 00 01 09 0c 00 00 00 69 6e 73 74 61 6e 63 65 54 | .......instanceT |
000100b0: 79 70 65 00 01 00 00 00 01 01 0b 00 00 00 77 68 | ype...........wh |
000100c0: 65 6e 43 72 65 61 74 65 64 00 01 00 00 00 01 11 | enCreated....... |
000100d0: 0b 00 00 00 77 68 65 6e 43 68 61 6e 67 65 64 00 | ....whenChanged. |
000100e0: 01 00 00 00 01 11 0a 00 00 00 75 53 4e 43 72 65 | ..........uSNCre |
000100f0: 61 74 65 64 00 01 00 00 00 01 04 14 00 00 00 6e | ated...........n |
00010100: 54 53 65 63 75 72 69 74 79 44 65 73 63 72 69 70 | TSecurityDescrip |
00010110: 74 6f 72 00 01 00 00 00 02 f4 05 04 00 00 00 6e | tor............n |
00010120: 61 6d 65 00 01 00 00 00 01 09 0a 00 00 00 6f 62 | ame...........ob |

Haven't debugged yet to see why, but that explains the file/hash differences.
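Rather than eyeballing a hex dump, the same finding can be checked by comparing the two files page by page. The sketch below is illustrative only. It assumes a 4096-byte page size (matching the 0x1000 stride between the two meta pages in the dump), and `zeroed_pages` is a hypothetical helper name, not part of LMDB.

```python
PAGE_SIZE = 4096  # assumption: matches the 0x1000 page stride seen in the dump


def zeroed_pages(src_path, copy_path, page_size=PAGE_SIZE):
    """Return indices of pages that differ from the source and contain only
    zero bytes (or are missing entirely) in the copy."""
    bad = []
    with open(src_path, "rb") as src, open(copy_path, "rb") as copy:
        index = 0
        while True:
            a = src.read(page_size)
            if not a:
                break  # end of the source file
            b = copy.read(page_size)
            # any(b) is False when every byte is zero (or b is empty).
            if a != b and not any(b):
                bad.append(index)
            index += 1
    return bad
```

Running `zeroed_pages("samba-dc.ldb", "samba-dc.ldb.backup")` on the files from the reproducer would list the zero-based page numbers that were lost; an empty list means no page was zeroed out.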
Created attachment 1944292 [details]
gdb session single-stepping

As you can see in the attached gdb session transcript, mdb_copy does exactly 2 write() calls here: first for 8192 bytes to copy the first two (meta) pages, and then a single write for all of the remaining data of the DB from the mmap to the destination fd. There can be no LMDB bug here; all the action happens in a single write() syscall. It looks like btrfs is failing to page-in the contents of the source file into the mmap before writing to the destination file.
Also a reminder, even if it works, btrfs is a poor choice of filesystem for use with LMDB. Best performance is with JFS with its journal on a separate device. Otherwise use a non-journaling filesystem like ext2, or just use a raw partition. All journaling filesystems are doing redundant work with LMDB's copy-on-write design. http://www.lmdb.tech/bench/microbench/july/#sec11
Thanks for the analysis Howard. btrfs is the default filesystem for new Fedora installations, so it is really bad if something like this can happen.
By the way, this may have nothing to do with mmap or page-ins at all. A friend pointed me to a report of vaguely similar corruption in a file backup utility: https://www.reddit.com/r/btrfs/comments/zq44ib/file_corruption_with_minio_dev_blames_btrfs/

Also for completeness' sake, I ran your reproducer on Ubuntu 22.04 with a mainline 5.17 kernel. I'd guess this bug has been around for a while, not just in the 6.x kernel series.
(In reply to Howard Chu from comment #6)
> By the way, this may have nothing to do with mmap or page-ins at all. A
> friend pointed me to a report of vaguely similar corruption in a file backup
> utility:
> https://www.reddit.com/r/btrfs/comments/zq44ib/file_corruption_with_minio_dev_blames_btrfs/

It mentions O_DIRECT as well, which is rather suggestive that something is wrong with O_DIRECT handling.
Are there any kernel messages at the time of either the write to the database or its verification?

I'm wondering if with O_DIRECT there's an in-flight change of data resulting in wrong btrfs data checksums, so that a subsequent read hits a checksum failure. Btrfs would then report EIO to user space without handing over the blocks it thinks are corrupt. In other words, there may not be corruption of the file, but because btrfs withholds data blocks it thinks are corrupt, the resulting incomplete read causes the database verification to report corruption.

Any blocks btrfs thinks are corrupt will typically produce prolific kernel messages, so just `dmesg | grep -i btrfs` will let us know if the above hypothesis is the right track.

Two ways to test the hypothesis:

A. `mount -o rescue=ignoredatacsums` and then try to reproduce the problem; OR
B. create a new database file with nodatacow enabled, e.g. use `chattr +C` on a zero length file, or on the enclosing (empty) directory prior to creating the database file.

We do see this problem with VMs when the qemu cache mode uses O_DIRECT, and the recommended workaround is to use nodatacow on such VM images. Libvirt enables nodatacow on enclosing directories on Btrfs when a pool is first activated.
I don't see any btrfs message at all in the kernel log.
Just to clarify: LMDB doesn't use O_DIRECT. That's only relevant to the issue discussed in that reddit post. mdb_copy uses vanilla write(). The source of the write is a read-only mmap of the original database, and the destination is just a plain fd for the new copy of the file. Since we're writing directly from the mmap, the address being written from will always be page-aligned, if that matters.
Yep, in fact I can reproduce the mismatch whether the enclosing dir (and test files) are datacow or nodatacow.

I started an upstream thread: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/T/#u

There is another difference between the files shown by `ls -ls`:

1792 -rw-r--r--. 1 chris chris 1835008 Feb 15 15:08 samba-dc.ldb
1736 -rw-r--r--. 1 chris chris 1835008 Feb 15 15:09 samba-dc.ldb.backup

It's somehow become a sparse file?
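The `ls -ls` numbers are consistent with holes: both files are 1835008 bytes (1792 KiB), but the backup has only 1736 KiB allocated, a gap of 56 KiB, which is 14 pages of 4 KiB. Where the holes sit can be confirmed with `lseek()` and `SEEK_HOLE`/`SEEK_DATA`. The following is a minimal sketch with hypothetical helper names (`allocated_bytes`, `find_holes`); on filesystems that do not report holes, the whole file is simply treated as data.

```python
import errno
import os


def allocated_bytes(path):
    """Bytes actually allocated on disk (st_blocks is in 512-byte units)."""
    return os.stat(path).st_blocks * 512


def find_holes(path):
    """Return (start, end) byte ranges that the filesystem reports as holes."""
    holes = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        pos = 0
        while pos < size:
            hole = os.lseek(fd, pos, os.SEEK_HOLE)
            if hole >= size:
                break  # only the implicit hole at end of file remains
            try:
                data = os.lseek(fd, hole, os.SEEK_DATA)
            except OSError as exc:
                if exc.errno != errno.ENXIO:
                    raise
                data = size  # the hole extends to the end of the file
            holes.append((hole, data))
            pos = data
    finally:
        os.close(fd)
    return holes
```

An empty result from `find_holes("samba-dc.ldb")` together with a non-empty one for `samba-dc.ldb.backup` would confirm that the copy became sparse rather than being filled with written zeroes.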
(In reply to Howard Chu from comment #10)
> Just to clarify: LMDB doesn't use O_DIRECT. That's only relevant to the
> issue discussed in that reddit post.

I see this under strace:

1482685 openat(AT_FDCWD, "samba-dc.ldb.backup", O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 5
1482685 fcntl(5, F_GETFL) = 0x8001 (flags O_WRONLY|O_LARGEFILE)
1482685 fcntl(5, F_SETFL, O_WRONLY|O_DIRECT|O_LARGEFILE) = 0

That's before the writing starts.
(In reply to Florian Weimer from comment #12)
> (In reply to Howard Chu from comment #10)
> > Just to clarify: LMDB doesn't use O_DIRECT. That's only relevant to the
> > issue discussed in that reddit post.
>
> I see this under strace:
>
> 1482685 openat(AT_FDCWD, "samba-dc.ldb.backup", O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 5
> 1482685 fcntl(5, F_GETFL) = 0x8001 (flags O_WRONLY|O_LARGEFILE)
> 1482685 fcntl(5, F_SETFL, O_WRONLY|O_DIRECT|O_LARGEFILE) = 0
>
> That's before the writing starts.

Ah that's right, I'd forgotten. mdb_copy uses O_DIRECT because we don't want these writes to pollute the page cache or evict the pages of the DB we're reading from.
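Putting the pieces from this thread together, the I/O pattern is: a read-only mmap of the source, a destination fd switched to O_DIRECT via fcntl, one 8192-byte write for the two meta pages, then one write for the rest straight from the mapping. The sketch below mimics that pattern outside of LMDB. It is not mdb_copy's actual code; `copy_like_mdb_copy` and `write_all` are hypothetical names, the short-write loop and the fallback to buffered writes (for filesystems that refuse O_DIRECT) are additions of this sketch, and it assumes the source size is a multiple of the page size, as an LMDB file is.

```python
import fcntl
import mmap
import os

META_BYTES = 8192  # two 4096-byte meta pages, per the gdb transcript


def write_all(fd, view):
    """Write the whole buffer, looping in case of short writes."""
    done = 0
    while done < len(view):
        done += os.write(fd, view[done:])


def copy_like_mdb_copy(src_path, dst_path, meta_bytes=META_BYTES):
    """Copy src to dst using the write pattern described in this thread.

    Returns True if O_DIRECT was set on the destination, False if the
    filesystem refused it and the copy used buffered writes instead.
    """
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
        try:
            # Same sequence as in the strace: F_GETFL, then F_SETFL with O_DIRECT.
            flags = fcntl.fcntl(dst_fd, fcntl.F_GETFL)
            try:
                fcntl.fcntl(dst_fd, fcntl.F_SETFL, flags | os.O_DIRECT)
                direct = True
            except OSError:
                direct = False  # sketch-only fallback, not mdb_copy behavior
            # The source buffer is the read-only mapping itself, so the
            # addresses handed to write() should be page-aligned.
            with mmap.mmap(src_fd, 0, access=mmap.ACCESS_READ) as mm:
                with memoryview(mm) as view:
                    write_all(dst_fd, view[:meta_bytes])  # meta pages
                    write_all(dst_fd, view[meta_bytes:])  # remaining data
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
    return direct
```

On a filesystem that handles this pattern correctly the copy is byte-identical to the source. Whether this sketch triggers the corruption on an affected btrfs kernel has not been verified here.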
Is mdb_copy mixing direct and buffered writes on one file? As I understand it this results in undefined behavior.
(In reply to Chris Murphy from comment #14)
> Is mdb_copy mixing direct and buffered writes on one file? As I understand
> it this results in undefined behavior.

No. The file is opened, flags set with fcntl, then writes are done. No mixing of modes.
OK a few Btrfs devs are looking at it. Stay tuned.
Fix posted upstream: https://lore.kernel.org/linux-btrfs/ae81e48b0e954bae1c3451c0da1a24ae7146606c.1676684984.git.boris@bur.io/T/#u
Does anyone know which kernel release will include the fix?
It's still being worked on upstream. See the link in c17 for that discussion. A patch will go into mainline first, and will then be backported to stable branches.
This message is a reminder that Fedora Linux 37 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a 'version' of '37'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, change the 'version' to a later Fedora Linux version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see it. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora Linux 37 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora Linux, you are encouraged to change the 'version' to a later version prior to this bug being closed.
Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05. Fedora Linux 37 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora Linux please feel free to reopen this bug against that version. Note that the version field may be hidden. Click the "Show advanced fields" button if you do not see the version field. If you are unable to reopen this bug, please file a new report against an active release. Thank you for reporting this bug and we are sorry it could not be fixed.