2169947 – mdb_copy produces a corrupted database on btrfs

Status:	CLOSED EOL
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	37
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	fedora-kernel-btrfs
QA Contact:	Fedora Extras Quality Assurance

Reported:	2023-02-15 08:07 UTC by Andreas Schneider
Modified:	2024-01-12 22:45 UTC
CC List:	33 users
Last Closed:	2024-01-12 22:45:13 UTC
Type:	Bug

Attachments
mdb_copy reproducer (106.95 KB, application/x-xz) 2023-02-15 08:07 UTC, Andreas Schneider
gdb session single-stepping (4.23 KB, text/plain) 2023-02-15 12:31 UTC, Howard Chu

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Linux Kernel	217042	0	P1	NEW	mdb_copy produces a corrupted database on btrfs	2023-02-15 08:41:57 UTC

Description Andreas Schneider 2023-02-15 08:07:13 UTC

Created attachment 1944230
mdb_copy reproducer

1. Please describe the problem:

The mdb_copy operation on btrfs produces a corrupted database. It works just fine on ext4!

Requirements:
sudo dnf install lmdb


## BTRFS

$ tar xf mdb_copy_reproducer.tar.xz            
$ ./reproducer.sh              
Current directory: /home/asn/mdb

Filesystem information for current directory:
Filesystem                                            Type  1024-blocks      Used Available Capacity Mounted on
/dev/mapper/luks-feea5c33-1033-4e33-a507-4ea83cbaf610 btrfs   248378368 209424480  37336592      85% /home

Checksum of samba-dc.ldb: 2307b6aef35ee923c9f9c2f02541ca72760116a39cad62d8455e972cddcabd0a

Backup samba-dc.ldb with mdb_copy to samba-dc.ldb.backup
LMDB 0.9.29: (March 16, 2021)


Checksum of samba-dc.ldb.backup: e9f8ede21831aec0da3897b5880b3008761fb8942d0f4fe086c637319fde4c61

FATAL: The checksums don't match!


## EXT4


$ ./reproducer.sh 
Current directory: /home/asn/mdb

Filesystem information for current directory:
Filesystem         Type 1024-blocks       Used Available Capacity Mounted on
/dev/mapper/cr_md1 ext4  1921722432 1752912108  71118376      97% /home

Checksum of samba-dc.ldb: 2307b6aef35ee923c9f9c2f02541ca72760116a39cad62d8455e972cddcabd0a

Backup samba-dc.ldb with mdb_copy to samba-dc.ldb.backup
LMDB 0.9.29: (March 16, 2021)


Checksum of samba-dc.ldb.backup: 2307b6aef35ee923c9f9c2f02541ca72760116a39cad62d8455e972cddcabd0a

The checksums match
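
The reproducer script is only available as an attachment, so its contents are not shown here. As a rough sketch of the comparison its output suggests, the following assumes the 64-hex-digit checksums are SHA-256; `sha256_of` and `checksums_match` are hypothetical helper names, not taken from the script.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_match(original, backup):
    """True if both files have identical content according to SHA-256."""
    return sha256_of(original) == sha256_of(backup)
```

A call such as `checksums_match("samba-dc.ldb", "samba-dc.ldb.backup")` would then distinguish the btrfs result from the ext4 one.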



2. What is the Version-Release number of the kernel:

Linux krikkit 6.1.9-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb  2 00:21:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

It also fails on:

Linux samba-cli01 6.0.5-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Oct 26 15:55:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

See attachment.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Yes, it does.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

Nope.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Not relevant. Reproducer attached.

Comment 1 Howard Chu 2023-02-15 11:58:17 UTC

Will take me a while to scare up space to create a btrfs partition. But just in general, mdb_copy doesn't do anything special - it just writes in multiples of OS pagesize in a loop until it's done. Does btrfs do anything weird like ZFS, that tries to configure its I/O blocksize based on the size of the first I/O to a file?

Comment 2 Howard Chu 2023-02-15 12:21:37 UTC

Ah, had a spare SSD handy. A hex dump of the backup file shows that the contents of pages 2 thru 15 (counting from 0, i.e. offsets 0x2000 thru 0xffff) are missing, all zeroes.

Dump: samba-dc.ldb.backup

Offset:    0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f   0123456789abcdef

00000000: 00 00 00 00 00 00 00 00 00 00 08 00 00 00 00 00 | ................ |
00000010: de c0 ef be 01 00 00 00 00 00 00 00 00 00 00 00 | ................ |
00000020: 00 00 00 00 02 00 00 00 00 10 00 00 08 40 01 00 | .............@.. |
00000030: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................ |
00000040: 00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 | ................ |
00000050: 3f 01 00 00 00 00 00 00 00 00 00 00 00 00 03 00 | ?............... |
00000060: 04 00 00 00 00 00 00 00 73 00 00 00 00 00 00 00 | ........s....... |
00000070: 13 01 00 00 00 00 00 00 36 06 00 00 00 00 00 00 | ........6....... |
00000080: 03 00 00 00 00 00 00 00 bf 01 00 00 00 00 00 00 | ................ |
00000090: 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
000000a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
*
00001000: 01 00 00 00 00 00 00 00 00 00 08 00 00 00 00 00 | ................ |
00001010: de c0 ef be 01 00 00 00 00 00 00 00 00 00 00 00 | ................ |
00001020: 00 00 00 00 02 00 00 00 00 10 00 00 08 40 01 00 | .............@.. |
00001030: 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 | ................ |
00001040: 00 00 00 00 00 00 00 00 02 00 00 00 00 00 00 00 | ................ |
00001050: bf 01 00 00 00 00 00 00 00 00 00 00 00 00 03 00 | ................ |
00001060: 04 00 00 00 00 00 00 00 73 00 00 00 00 00 00 00 | ........s....... |
00001070: 14 01 00 00 00 00 00 00 3d 06 00 00 00 00 00 00 | ........=....... |
00001080: 5d 01 00 00 00 00 00 00 bf 01 00 00 00 00 00 00 | ]............... |
00001090: 1d 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
000010a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................ |
*
00010000: 10 00 00 00 00 00 00 00 00 00 04 00 01 00 00 00 | ................ |
00010010: 68 19 01 26 10 00 00 00 34 00 00 00 43 4e 3d 43 | h..&....4...CN=C |
00010020: 6f 6d 70 75 74 65 72 73 2c 44 43 3d 62 61 63 6b | omputers,DC=back |
00010030: 75 70 64 6f 6d 2c 44 43 3d 73 61 6d 62 61 2c 44 | updom,DC=samba,D |
00010040: 43 3d 65 78 61 6d 70 6c 65 2c 44 43 3d 63 6f 6d | C=example,DC=com |
00010050: 00 25 00 00 00 62 61 63 6b 75 70 64 6f 6d 2e 73 | .%...backupdom.s |
00010060: 61 6d 62 61 2e 65 78 61 6d 70 6c 65 2e 63 6f 6d | amba.example.com |
00010070: 2f 43 6f 6d 70 75 74 65 72 73 00 80 01 00 00 0b | /Computers...... |
00010080: 00 00 00 6f 62 6a 65 63 74 43 6c 61 73 73 00 02 | ...objectClass.. |
00010090: 00 00 00 01 03 09 02 00 00 00 63 6e 00 01 00 00 | ..........cn.... |
000100a0: 00 01 09 0c 00 00 00 69 6e 73 74 61 6e 63 65 54 | .......instanceT |
000100b0: 79 70 65 00 01 00 00 00 01 01 0b 00 00 00 77 68 | ype...........wh |
000100c0: 65 6e 43 72 65 61 74 65 64 00 01 00 00 00 01 11 | enCreated....... |
000100d0: 0b 00 00 00 77 68 65 6e 43 68 61 6e 67 65 64 00 | ....whenChanged. |
000100e0: 01 00 00 00 01 11 0a 00 00 00 75 53 4e 43 72 65 | ..........uSNCre |
000100f0: 61 74 65 64 00 01 00 00 00 01 04 14 00 00 00 6e | ated...........n |
00010100: 54 53 65 63 75 72 69 74 79 44 65 73 63 72 69 70 | TSecurityDescrip |
00010110: 74 6f 72 00 01 00 00 00 02 f4 05 04 00 00 00 6e | tor............n |
00010120: 61 6d 65 00 01 00 00 00 01 09 0a 00 00 00 6f 62 | ame...........ob |

Haven't debugged yet to see why, but that explains the file/hash differences.
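
The same check can be done without reading a hex dump by scanning both files page by page and listing pages that are all zero in the copy but not in the original. This is only a sketch: the 4096-byte page size is an assumption based on the 0x1000 stride between the two meta pages in the dump, and `zero_pages` and `lost_pages` are hypothetical helpers, not part of LMDB.

```python
def zero_pages(path, page_size=4096):
    """Return zero-based indices of pages that contain only zero bytes."""
    zero = bytes(page_size)
    result = []
    with open(path, "rb") as f:
        index = 0
        while True:
            page = f.read(page_size)
            if not page:
                break
            # A short final page is compared against zeroes of equal length.
            if page == zero[:len(page)]:
                result.append(index)
            index += 1
    return result


def lost_pages(original, copy, page_size=4096):
    """Pages that are all zero in the copy but hold data in the original."""
    already_zero = set(zero_pages(original, page_size))
    return [p for p in zero_pages(copy, page_size) if p not in already_zero]
```

Comparing against the original matters because a database file can legitimately contain zero-filled pages.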

Comment 3 Howard Chu 2023-02-15 12:31:42 UTC

Created attachment 1944292
gdb session single-stepping

As you can see in the attached gdb session transcript, mdb_copy does exactly 2 write() calls here, first for 8192 bytes to copy the first two (meta) pages, and then a single write for all of the remaining data of the DB from the mmap to the destination fd. There can be no LMDB bug here: all the action happens in a single write() syscall.

It looks like btrfs is failing to page-in the contents of the source file into the mmap before writing to the destination file.
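
The I/O pattern described in this comment can be approximated in a few lines of Python. This is an illustration, not LMDB's code: `copy_from_mmap` and `write_all` are hypothetical names, and the 8192-byte meta size is taken from the gdb transcript. The sketch maps the source read-only and passes slices of the mapping straight to write(), so the kernel reads the data out of the mapped pages.

```python
import mmap
import os

META_BYTES = 8192  # two 4096-byte meta pages, per the gdb transcript


def write_all(fd, view):
    """Call write() until the whole buffer is written (handles short writes)."""
    written = 0
    while written < len(view):
        written += os.write(fd, view[written:])
    return written


def copy_from_mmap(src_path, dst_path):
    """Copy src to dst: one write for the meta pages, one for the rest."""
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o666)
        try:
            with mmap.mmap(src_fd, 0, access=mmap.ACCESS_READ) as mapping:
                # The memoryview avoids copying the mapping into a bytes object.
                with memoryview(mapping) as view:
                    total = write_all(dst_fd, view[:META_BYTES])
                    total += write_all(dst_fd, view[META_BYTES:])
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
    return total
```

The loop in `write_all` is a defensive addition; the transcript shows the real tool completing the data copy in a single write().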

Comment 4 Howard Chu 2023-02-15 12:43:38 UTC

Also a reminder, even if it works, btrfs is a poor choice of filesystem for use with LMDB. Best performance is with JFS with its journal on a separate device. Otherwise use a non-journaling filesystem like ext2, or just use a raw partition. All journaling filesystems are doing redundant work with LMDB's copy-on-write design.

http://www.lmdb.tech/bench/microbench/july/#sec11

Comment 5 Andreas Schneider 2023-02-15 14:22:00 UTC

Thanks for the analysis Howard. btrfs is the default filesystem for new Fedora installations, so it is really bad if something like this can happen.

Comment 6 Howard Chu 2023-02-15 15:25:14 UTC

By the way, this may have nothing to do with mmap or page-ins at all. A friend pointed me to a report of vaguely similar corruption in a file backup utility: https://www.reddit.com/r/btrfs/comments/zq44ib/file_corruption_with_minio_dev_blames_btrfs/

Also for completeness' sake, I ran your reproducer on Ubuntu 22.04 with a mainline 5.17 kernel and it fails there too. I'd guess this bug has been around for a while, not just in the 6.x kernel series.

Comment 7 Florian Weimer 2023-02-15 16:59:44 UTC

(In reply to Howard Chu from comment #6)
> By the way, this may have nothing to do with mmap or page-ins at all. A
> friend pointed me to a report of vaguely similar corruption in a file backup
> utility:
> https://www.reddit.com/r/btrfs/comments/zq44ib/file_corruption_with_minio_dev_blames_btrfs/

It mentions O_DIRECT as well, which is rather suggestive that something is wrong with O_DIRECT handling.

Comment 8 Chris Murphy 2023-02-15 18:50:48 UTC

Are there any kernel messages at the time of either the write to database or its verification? I'm wondering if with O_DIRECT there's an in-flight change of data resulting in wrong btrfs data checksums, and subsequent read results in checksum failure, which would cause btrfs to report EIO to user space, and without handing over the blocks btrfs thinks are corrupt. In other words, there may not be corruption of the file, but because btrfs withholds data blocks it thinks are corrupt, the resulting incomplete read causes the database verification to report corruption.

Any blocks btrfs thinks are corrupt will typically produce prolific kernel messages, so just `dmesg | grep -i btrfs` will let us know if the above hypothesis is the right track.

Two ways to test the hypothesis:

A. `mount -o rescue=ignoredatacsums` and then try to reproduce the problem; OR

B. create a new database file with nodatacow enabled. e.g. use `chattr +C` on a zero length file; OR on the enclosing (empty) directory prior to creating the database file.

We do see this problem with VMs when the qemu cache mode uses O_DIRECT, and the recommended workaround is to use nodatacow on such VM images. Libvirt enables nodatacow on enclosing directories on Btrfs when a pool is first activated.
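
A userspace complement to checking dmesg is to read the file back in page-sized chunks and record where the kernel returns EIO, which is the error this hypothesis predicts for blocks that fail checksum verification. This is a minimal sketch, assuming 4096-byte chunks; `unreadable_offsets` is a hypothetical helper.

```python
import errno
import os


def unreadable_offsets(path, chunk=4096):
    """Return the offsets of chunks whose read fails with EIO."""
    bad = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        for offset in range(0, size, chunk):
            try:
                os.pread(fd, chunk, offset)
            except OSError as exc:
                # Only EIO is of interest here; anything else is unexpected.
                if exc.errno != errno.EIO:
                    raise
                bad.append(offset)
    finally:
        os.close(fd)
    return bad
```

An empty result means every chunk was readable, which would point away from the checksum-failure explanation and toward data that was never written.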

Comment 9 Andreas Schneider 2023-02-15 19:54:54 UTC

I don't see any btrfs message at all in the kernel log.

Comment 10 Howard Chu 2023-02-15 20:05:12 UTC

Just to clarify: LMDB doesn't use O_DIRECT. That's only relevant to the issue discussed in that reddit post.

The mdb_copy uses vanilla write(). The source of the write is a read-only mmap of the original database, and
the destination is just a plain fd for the new copy of the file. Since we're writing directly from the mmap,
the address being written from will always be page-aligned, if that matters.

Comment 11 Chris Murphy 2023-02-15 20:17:13 UTC

Yep, in fact I can reproduce the mismatch whether the enclosing dir (and test files) are datacow or nodatacow.

I started an upstream thread:
https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/T/#u

There is another difference between the files shown by `ls -ls`

1792 -rw-r--r--. 1 chris chris 1835008 Feb 15 15:08 samba-dc.ldb
1736 -rw-r--r--. 1 chris chris 1835008 Feb 15 15:09 samba-dc.ldb.backup

It's somehow become a sparse file?
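
If the first column of `ls -ls` is in 1 KiB units (the GNU default, and consistent with 1792 KiB equalling the 1835008-byte size), the backup is 56 KiB short, which is exactly 14 pages of 4096 bytes. The same comparison can be made from `stat` data. The sketch below is a heuristic under that assumption: `allocation_summary` and `looks_sparse` are hypothetical helpers, and compression or inline extents can also make the allocated size smaller than the apparent size.

```python
import os


def allocation_summary(path):
    """Return (apparent_size, allocated_bytes) for path.

    st_blocks is counted in 512-byte units, independent of the
    filesystem block size.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


def looks_sparse(path):
    """Heuristic: fewer bytes allocated than the file's apparent size."""
    size, allocated = allocation_summary(path)
    return allocated < size
```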

Comment 12 Florian Weimer 2023-02-15 20:18:47 UTC

(In reply to Howard Chu from comment #10)
> Just to clarify: LMDB doesn't use O_DIRECT. That's only relevant to the
> issue discussed in that reddit post.

I see this under strace:

1482685 openat(AT_FDCWD, "samba-dc.ldb.backup", O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 5
1482685 fcntl(5, F_GETFL)               = 0x8001 (flags O_WRONLY|O_LARGEFILE)
1482685 fcntl(5, F_SETFL, O_WRONLY|O_DIRECT|O_LARGEFILE) = 0

That's before the writing starts.
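
The same information is available from inside a process without strace: F_GETFL returns the current file status flags, and O_DIRECT can be tested as a bit in that value. This is a small sketch, and `has_o_direct` is a hypothetical helper name.

```python
import fcntl
import os


def has_o_direct(fd):
    """Return True if O_DIRECT is set in the file status flags of fd."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    return bool(flags & os.O_DIRECT)
```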

Comment 13 Howard Chu 2023-02-15 20:31:27 UTC

(In reply to Florian Weimer from comment #12)
> (In reply to Howard Chu from comment #10)
> > Just to clarify: LMDB doesn't use O_DIRECT. That's only relevant to the
> > issue discussed in that reddit post.
> 
> I see this under strace:
> 
> 1482685 openat(AT_FDCWD, "samba-dc.ldb.backup", O_WRONLY|O_CREAT|O_EXCL|O_CLOEXEC, 0666) = 5
> 1482685 fcntl(5, F_GETFL)               = 0x8001 (flags O_WRONLY|O_LARGEFILE)
> 1482685 fcntl(5, F_SETFL, O_WRONLY|O_DIRECT|O_LARGEFILE) = 0
> 
> That's before the writing starts.

Ah that's right, I'd forgotten. mdb_copy uses O_DIRECT because we don't want these writes to pollute the page cache or evict the pages of the DB we're reading from.

Comment 14 Chris Murphy 2023-02-15 21:04:23 UTC

Is mdb_copy mixing direct and buffered writes on one file? As I understand it this results in undefined behavior.

Comment 15 Howard Chu 2023-02-15 21:21:27 UTC

(In reply to Chris Murphy from comment #14)
> Is mdb_copy mixing direct and buffered writes on one file? As I understand
> it this results in undefined behavior.

No. The file is opened, flags set with fcntl, then writes are done. No mixing of modes.
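
The open, F_GETFL, F_SETFL sequence from the strace output can be sketched as follows. This is an approximation rather than LMDB's code: `open_then_set_direct` is a hypothetical name, and treating EINVAL as "O_DIRECT not supported here" is an assumption about how a caller might want to fall back.

```python
import errno
import fcntl
import os


def open_then_set_direct(path):
    """Create path, then try to add O_DIRECT to the open descriptor.

    Returns (fd, direct_enabled). direct_enabled is False when the
    filesystem rejects O_DIRECT with EINVAL.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_CLOEXEC, 0o666)
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    try:
        # Set the flag before any write, so no buffered I/O is mixed in.
        fcntl.fcntl(fd, fcntl.F_SETFL, flags | os.O_DIRECT)
    except OSError as exc:
        if exc.errno != errno.EINVAL:
            os.close(fd)
            raise
        return fd, False
    return fd, True
```

Because the flag is set before the first write, every write on the descriptor uses the same mode. Writes on an O_DIRECT descriptor generally need suitably aligned buffers, offsets, and lengths; a page-aligned mmap source with page-multiple lengths satisfies that.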

Comment 16 Chris Murphy 2023-02-15 21:52:37 UTC

OK a few Btrfs devs are looking at it. Stay tuned.

Comment 17 Chris Murphy 2023-02-18 05:39:08 UTC

Fix posted upstream: https://lore.kernel.org/linux-btrfs/ae81e48b0e954bae1c3451c0da1a24ae7146606c.1676684984.git.boris@bur.io/T/#u

Comment 18 Andreas Schneider 2023-03-15 06:59:38 UTC

Does anyone know in which kernel version this will be fixed?

Comment 19 Chris Murphy 2023-03-16 14:22:10 UTC

It's still being worked on upstream. See the link in c17 for that discussion. A patch will go into mainline first, and will then be backported to the stable branches.

Comment 20 Aoife Moloney 2023-11-23 01:15:15 UTC

This message is a reminder that Fedora Linux 37 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 37 on 2023-12-05.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '37'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 37 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 21 Aoife Moloney 2024-01-12 22:45:13 UTC

Fedora Linux 37 entered end-of-life (EOL) status on 2023-12-05.

Fedora Linux 37 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.
