Bug 1914433 - btrfs csum errors with cow VM images and using O_DIRECT
Summary: btrfs csum errors with cow VM images and using O_DIRECT
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 33
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: fedora-kernel-btrfs
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-01-08 20:22 UTC by David Woodhouse
Modified: 2023-07-02 16:07 UTC
CC List: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-30 16:10:43 UTC
Type: Bug
Embargoed:


Attachments
casper.xml (7.00 KB, text/plain)
2021-01-14 11:03 UTC, David Woodhouse

Description David Woodhouse 2021-01-08 20:22:00 UTC
# btrfs fi show /
Label: 'root'  uuid: ca682c0c-67ae-4723-8337-2bb8816fe37e
	Total devices 2 FS bytes used 1.44TiB
	devid    1 size 7.27TiB used 1.44TiB path /dev/sda3
	devid    2 size 7.27TiB used 1.44TiB path /dev/sdb3


I had a corruption in my qcow file used for hosting a VM. So I salvaged it into a shiny new qcow file, inode #24387997.

Then I rebooted, and I'm seeing crc failures with *both* mirrors in my RAID-1 having the same apparently wrong checksum:


[ 6827.513630] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 1
[ 6827.517448] BTRFS error (device sda3): bdev /dev/sdb3 errs: wr 0, rd 0, flush 0, corrupt 8286, gen 0
[ 6827.527281] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 2
[ 6827.530817] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9115, gen 0
[ 6892.036519] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 200421351424 csum 0x8289e91f expected csum 0xaaadc53d mirror 2
[ 6892.040498] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9116, gen 0
[ 6892.060099] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 200421351424 csum 0x8289e91f expected csum 0xaaadc53d mirror 1
[ 6892.063553] BTRFS error (device sda3): bdev /dev/sdb3 errs: wr 0, rd 0, flush 0, corrupt 8287, gen 0
[ 6892.119530] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 200421351424 csum 0x8289e91f expected csum 0xaaadc53d mirror 2
[ 6892.122922] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9117, gen 0
[ 6893.170668] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 200425775104 csum 0xb7c69c4d expected csum 0x461bc8df mirror 2
[ 6893.174223] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9118, gen 0
[ 6893.178277] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 200425795584 csum 0xc21ad299 expected csum 0x72be42b3 mirror 2
[ 6893.181566] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9119, gen 0
[ 6893.192448] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 200425775104 csum 0xb7c69c4d expected csum 0x461bc8df mirror 1
[ 6893.195668] BTRFS error (device sda3): bdev /dev/sdb3 errs: wr 0, rd 0, flush 0, corrupt 8288, gen 0
[ 6893.201786] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 200425795584 csum 0xc21ad299 expected csum 0x72be42b3 mirror 1
[ 6893.204929] BTRFS error (device sda3): bdev /dev/sdb3 errs: wr 0, rd 0, flush 0, corrupt 8289, gen 0
[ 6893.220024] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 200425775104 csum 0xb7c69c4d expected csum 0x461bc8df mirror 2
[ 6893.223091] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9120, gen 0
[ 6893.228423] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 200425795584 csum 0xc21ad299 expected csum 0x72be42b3 mirror 2
[ 6893.231413] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9121, gen 0

Comment 1 David Woodhouse 2021-01-08 20:22:30 UTC
This is 5.9.16-200.fc33.x86_64

Comment 2 David Woodhouse 2021-01-09 00:17:12 UTC
I think this might be a recurrence of https://bugzilla.redhat.com/show_bug.cgi?id=1204569

The qcow2 was being used in directsync mode.

Comment 3 David Woodhouse 2021-01-09 10:42:09 UTC
BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 1

I assume the data (with crc 0x81529887) in both mirrors are correct, but btrfs won't actually *return* the data since it thinks the crc is wrong.

Is there a way to make btrfs return the data anyway, despite the mismatch?

Or should I attempt to recover the data from the disk and rewrite it to the same location in the file?

Comment 4 Chris Murphy 2021-01-10 22:29:04 UTC
It's recommended to use nodatacow (chattr +C) for VM images. This is the default on Fedora when using libvirt-created pools.
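
As a rough sketch (the pool path here is just the typical libvirt default, adjust as needed):

mkdir -p /var/lib/libvirt/images/nocow
chattr +C /var/lib/libvirt/images/nocow
lsattr -d /var/lib/libvirt/images/nocow       # should show the 'C' attribute
cp --reflink=never vm.qcow2 /var/lib/libvirt/images/nocow/    # a fresh copy into the directory inherits nodatacow

Note that +C only takes effect for files created in that directory afterwards; it does nothing for data already written to an existing file.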

There are three ways to get this file out.

1. Install kernel 5.11
https://koji.fedoraproject.org/koji/packageinfo?packageID=8

mount -o ro,rescue=ignoredatacsums

The file can now be copied out normally.

2. 'btrfs restore' is an offline scrape tool that will recover the file while ignoring csums
https://btrfs.wiki.kernel.org/index.php/Restore

You'll probably need to use --path-regex or else it copies everything, so figure out that path in advance (while the fs is mounted), e.g.

--path-regex "/(|home(|/chris(|/.weechat(|/logs(|/irc.freenode.#btrfs.weechatlog)))))$"
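
For the file in this bug, a rough invocation would look something like this (the image path and target directory are placeholders, /dev/sda3 is one of the two mirrors; note that restore normally refuses to run against a mounted filesystem):

btrfs restore -v --path-regex '/(|images(|/vm\.qcow2))$' /dev/sda3 /mnt/recovery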

3. 'btrfs check --init-csum-tree' will create a new checksum tree. Depending on the size of the file system, this could take a while, because all data must be read in order to create all new checksums for the new csum tree. I'd opt for 1 or 2.
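
If you do go with 3, it's run against the unmounted filesystem, roughly:

btrfs check --init-csum-tree /dev/sda3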

Comment 5 David Woodhouse 2021-01-11 10:45:54 UTC
At a bare minimum, libvirt should be refusing to use O_DIRECT on btrfs file systems since it's known to lead to corruption, shouldn't it?

Comment 6 Chris Murphy 2021-01-12 02:33:09 UTC
O_DIRECT is fine on Btrfs if the file is nodatacow. And in the datacow case, I'm pretty sure the csum error is a false positive, i.e. the data is OK, as evidenced by the data in the guest being consistently intact, at least in my limited (and possibly stale) testing.

If the guest data is shown to be corrupt, or not returned, e.g. EIO to qemu, then that's quite a different matter. However, since libvirt defaults to nodatacow with new pools on Btrfs, and the problem doesn't happen there, I'm reluctant to agree that libvirt must do something.

Comment 7 David Woodhouse 2021-01-12 09:47:30 UTC
Hm, can I make 'qemu-img convert' (or just dd) read the data back out with O_DIRECT so that it bypasses the checksum false positive too?

Comment 8 David Woodhouse 2021-01-12 10:30:09 UTC
# dd if=old-casper.cqow2 of=/twosheds/old-casper.qcow2 bs=262144 iflag=direct status=progress
3645374464 bytes (3.6 GB, 3.4 GiB) copied, 103 s, 35.4 MB/s
dd: error reading 'old-casper.cqow2': Input/output error
13906+0 records in
13906+0 records out
3645374464 bytes (3.6 GB, 3.4 GiB) copied, 103.363 s, 35.3 MB/s

Comment 9 David Woodhouse 2021-01-12 10:39:22 UTC
# qemu-img convert  old-casper.cqow2 /twosheds/old-casper.qcow2  -O qcow2 -p -n -T directsync
qemu-img: error while reading at byte 3638755328: Input/output error

Comment 10 Chris Murphy 2021-01-12 18:09:50 UTC
With older kernels it was possible, but it looks like that was considered a bug and fixed some time ago.

My advice is to copy the file out with 5.11 using the ro,rescue=ignoredatacsums option; set chattr +C (nodatacow) on the containing directory, then copy the image to that directory.

Comment 11 David Woodhouse 2021-01-12 20:22:04 UTC
I'd have to reboot the host for that. Instead I'll copy the file system onto one of the many spare 8TB drives I have lying around (don't ask; I may cry) and shove it in a different box (or assign it to a VM) for the rescue=ignoredatacsums part.

Can I expand the btrfs RAID-1 to cover three disks "temporarily" and then take one of them away? Or should I just dd the drive while it's mounted, on the grounds that I don't *care* about the inconsistencies that'll result because I'm not currently writing to the files I care about?

Comment 12 Chris Murphy 2021-01-12 21:54:48 UTC
I expect a dd image of a rw-mounted Btrfs won't mount. But you might be able to extract the file from such an image using 'btrfs restore'.

A 3x raid1 is not 3 copies of the data; it's still 2 copies, spread across 3 drives. You'd need to convert data and metadata to raid1c3 to have three copies, but that's expensive: it reads all data and writes it out to all three drives. I'm not sure what happens if you have a 3x raid1c3 with a removed drive; I don't think it will let you remove 'missing' or convert back to raid1 while that drive is missing, and it'll pretty surely insist that the volume is repaired first by replacing the missing 3rd drive. Only then will you be able to convert back to raid1. I don't recommend it.
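
For reference, the conversion being described is a rebalance to the raid1c3 profiles, and then back again once all the drives are present, roughly:

btrfs balance start -dconvert=raid1c3 -mconvert=raid1c3 /    # reads and rewrites all data and metadata
btrfs balance start -dconvert=raid1 -mconvert=raid1 /        # convert back afterwards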

It's risky, in that I haven't tested this lately, and it requires hotplug drives: pull one of the mirror drives, mount -o degraded,ro,rescue=ignoredatacsums on the other box, pull out the file, return the drive to the server, start a scrub. What I can't tell you for sure is how udev and btrfs device scan interact here, and whether the reappearance of the drive will just cause it to get added back in, or whether you're going to have to umount and then mount them while they are both present. In any case, you will have to scrub to get them both synced up.
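
Sketched out, with the device node, mount point and image path as placeholders (the other box needs a 5.11+ kernel for the rescue option):

# on the other box, with just the pulled mirror attached
mount -o degraded,ro,rescue=ignoredatacsums /dev/sdX3 /mnt/rescue
cp /mnt/rescue/images/vm.qcow2 /some/safe/place/
umount /mnt/rescue
# back on the server, once both mirrors are mounted together again
btrfs scrub start /
btrfs scrub status /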

Comment 13 David Woodhouse 2021-01-13 21:00:58 UTC
'btrfs restore' says it doesn't modify the image. So it might as well be run on the live one... if I just hack out this check for it being mounted... that works :)

Comment 14 David Woodhouse 2021-01-13 21:03:00 UTC
Given that I am left with a file which cannot be read or backed up without 'heroic measures' (even if it *did* still work for the guest), can I perhaps ask you to reconsider your reply in comment 6:

> (In reply to Chris Murphy from comment #6)
> O_DIRECT is fine on Btrfs if the file is nodatacow. And in the datacow case,
> I'm pretty sure the csum error is a false positive, i.e. the data is OK, as
> evidenced by the data in the guest being consistently intact, at least in my
> limited (and possibly stale) testing.
> 
> If the guest data is shown to be corrupt, or not returned, e.g. EIO to qemu,
> then that's quite a different matter. However, since libvirt defaults to
> nodatacow with new pools on Btrfs, and the problem doesn't happen there, I'm
> reluctant to agree that libvirt must do something.

I think refusing to use O_DIRECT on files without nodatacow on btrfs might be a good idea.

Comment 15 Chris Murphy 2021-01-14 02:11:02 UTC
What qemu cache mode were you using? none, directsync, or writethrough?

Comment 16 Chris Murphy 2021-01-14 02:37:18 UTC
Also, what is the guest in the VM? If this is qemu-kvm, could you attach the reproducing configuration xml file, i.e.

virsh dumpxml $VMNAME > vmname-dump.xml

I'm not really sure where this should be fixed to prevent it from happening in the first place. I just read the notes for the patch that introduces the rescue=ignoredatacsums.

https://lore.kernel.org/linux-btrfs/c3cc0815c5756d07201c57063f3759250f662c77.1600961206.git.josef@toxicpanda.com

"There are cases where you can end up with bad data csums because of
misbehaving applications.  This happens when an application modifies a
buffer in-flight when doing an O_DIRECT write.  In order to recover the
file we need a way to turn off data checksums so you can copy the file
off, and then you can delete the file and restore it properly later."

Comment 17 David Woodhouse 2021-01-14 11:03:49 UTC
Created attachment 1747349 [details]
casper.xml

It was directsync. The guest is Fedora 33 x86_64, the file system is ext4 (on bare partitions /dev/vda[13]):

[root@casper ~] # mount | grep vda
/dev/vda3 on / type ext4 (rw,relatime,seclabel)
/dev/vda1 on /boot type ext4 (rw,relatime,seclabel)

I had seen assertions there that the problem only happens with rare guest configurations that modify data in flight. I'm not sure I believe them.

    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/home/casper/casper.qcow2' index='2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>

Comment 18 Daniel Berrangé 2021-01-14 11:07:36 UTC
(In reply to David Woodhouse from comment #5)
> At a bare minimum, libvirt should be refusing to use O_DIRECT on btrfs file
> systems since it's known to lead to corruption, shouldn't it?

Libvirt doesn't want to keep track of whether particular kernel version + filesystem combinations are broken.
We would expect the kernel to reject the use of open(..., O_DIRECT) with errno=EINVAL in any scenarios where its usage is known to be broken...

(In reply to David Woodhouse from comment #14)
> Given that I am left with a file which cannot be read or backed up without
> 'heroic measures' (even if it *did* still work for the guest), can I perhaps
> ask you to reconsider your reply in comment 6:

I'm puzzled how data could still be read successfully by the guest OS via QEMU, and yet fail for dd or qemu-img. qemu-img uses the exact same code as QEMU will use for guest OS reads (assuming you've set matching cache options for qemu-img vs QEMU).

Comment 19 Chris Murphy 2021-01-14 22:54:39 UTC
The issue is that ext4 and xfs don't enforce stable pages. It's reported to be even more problematic with Windows guests. Is it a bug? *shrug* I don't know how to characterize it; that exceeds my knowledge. The ext4 and XFS devs might call it an optimization, but then checksumming on the host catches it. I haven't tested this, but I expect the same problem would happen on a host using dm-integrity or ZFS. And they have no opt-out like Btrfs.

The cache mode pop-up menu in virt-manager is an advanced area. Maybe a triangular warning icon could appear any time none or directsync is selected, and this could just be documented accordingly?

Comment 20 Chris Murphy 2021-01-14 23:12:06 UTC
(In reply to Daniel Berrangé from comment #18)
> I'm puzzelled how data could still be read successfully by the guest OS via
> QEMU, and yet fail for dd or qemu-img.

This exceeds my knowledge. My understanding is that a csum error results in EIO to the application. I have no idea how reads work between host and guest, though. Is there always a 1:1 correlation between a guest kernel block range read/write and a host kernel block range read/write? If not, I could see the host kernel reading a bigger range than the guest requests, and the EIO for some of those blocks just not mattering to the guest because those blocks weren't even requested, while still showing up as an error on the host. I also don't know how EIO on the host propagates from qemu to the guest kernel and how it would manifest in the guest. Maybe it should just be EIO there too? In which case, where does that show up? dmesg, or does the application have to do its own error handling and reporting?

Comment 21 David Woodhouse 2021-01-15 09:00:08 UTC
(In reply to Daniel Berrangé from comment #18)
> (In reply to David Woodhouse from comment #5)
> > At a bare minimum, libvirt should be refusing to use O_DIRECT on btrfs file
> > systems since it's known to lead to corruption, shouldn't it?
> 
> Libvirt doesn't want to keep track of whether particular kernel version +
> filesystem combinations are broken.
> We would expect the kernel to reject the use of open(..., O_DIRECT) with
> errno=EINVAL in any scenarios where its usage is known to be broken...

I understand why libvirt upstream might not keep track of that for various arbitrary kernels and filesystems.

But *Fedora* has shipped *this* particular kernel version and file system as its default, which is a slightly different prospect :)

> (In reply to David Woodhouse from comment #14)
> > Given that I am left with a file which cannot be read or backed up without
> > 'heroic measures' (even if it *did* still work for the guest), can I perhaps
> > ask you to reconsider your reply in comment 6:
> 
> I'm puzzled how data could still be read successfully by the guest OS via
> QEMU, and yet fail for dd or qemu-img. qemu-img uses the exact same code as
> QEMU will use for guest OS reads (assuming you've set matching cache options
> for qemu-img vs QEMU).

Indeed, I'm puzzled by that too. I expected my attempts in comment 8 and/or comment 9 to work, and specifically included the command lines when they didn't, so that someone could tell me how I was Doing It Wrong.

(When I deliberately did it Wrong in comment 13, that failed too in the end, with a complaint about inconsistent transids. That will probably happen if I mirror the full drive while it's live, too; it looks like I will actually have to reboot the host if I can't get the data out again with O_DIRECT.)

Comment 22 Chris Murphy 2021-01-15 09:50:30 UTC
(In reply to David Woodhouse from comment #21)

> But *Fedora* has shipped *this* particular kernel version and file system as
> its default, which is a slightly different prospect :)

nodatacow is the libvirt default; and the qemu default cache mode is writeback. The problem doesn't happen with either of them.

Comment 23 Ben Cotton 2021-11-04 16:03:23 UTC
This message is a reminder that Fedora 33 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 33 on 2021-11-30.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '33'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 33 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 24 Ben Cotton 2021-11-30 16:10:43 UTC
Fedora 33 changed to end-of-life (EOL) status on 2021-11-30. Fedora 33 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 25 Simon Skoda 2023-06-29 08:06:08 UTC
This issue still happens on Fedora 38 with kernel 6.3.8. I know cache=none and disabling nodatacow on VM images are both non-default configurations. However, it's also not an unreasonable combination: cache=none is common to leverage direct I/O, and nodatacow disables compression, so users who think they can have both will run into this (as I did).

The docs currently say this:
"Compression is done using the COW mechanism so it’s incompatible with nodatacow. Direct IO works on compressed files but will fall back to buffered writes and leads to recompression. Currently nodatasum and compression don’t work together."

So from this I thought the fallback would be graceful, without errors. Had I read the following gotcha from the old btrfs wiki, it would've been clear:

"Direct IO writes to Btrfs files can result in checksum warnings. This can happen with other filesystems, but most don't have checksums, so a mismatch between (updated) data and (out-of-date) checksum cannot arise."

Is this currently mentioned in the docs? I can't seem to find this warning.

Comment 26 Chris Murphy 2023-07-02 16:07:32 UTC
I think this should be asked on the linux-btrfs@ list (http://vger.kernel.org/vger-lists.html#linux-btrfs), since upstream Btrfs development happens there and they don't monitor downstream bugs. Thanks.

