Bug 2132483

Summary: BTRFS not noticing hardware failure, not registering and counting errors, dev stats shows 0 after disk failure
Product: Fedora
Reporter: Basic Six <drbasic6>
Component: btrfs-progs
Assignee: Josef Bacik <josef>
Status: NEW
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 38
CC: bugzilla, esandeen, igor.raits, josef, ngompa13
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description Basic Six 2022-10-05 21:25:49 UTC
Description of problem:

Like ZFS, BTRFS can notice checksum and other errors both online and offline, but unlike in ZFS, this feature has never worked reliably in BTRFS for more than a year straight.

Here's one more example, it's a bad drive:

nvme0n1: I/O Cmd(0x2) @ LBA 99961344, 2560 blocks, I/O Error (sct 0x2 / sc 0x81)
critical medium error, dev nvme0n1, sector 99961344 op 0x0:(READ) flags 0x84700 phys_seg 1 prio class 0

Yet BTRFS pretends everything is normal:

$ sudo btrfs dev stats /
[/dev/mapper/luks-e792c5a9-672a-4246-b75d-f64015ce573a].write_io_errs    0
[/dev/mapper/luks-e792c5a9-672a-4246-b75d-f64015ce573a].read_io_errs     0
[/dev/mapper/luks-e792c5a9-672a-4246-b75d-f64015ce573a].flush_io_errs    0
[/dev/mapper/luks-e792c5a9-672a-4246-b75d-f64015ce573a].corruption_errs  0
[/dev/mapper/luks-e792c5a9-672a-4246-b75d-f64015ce573a].generation_errs  0
$ sudo btrfs dev stats / | grep -vwc 0
0
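The `grep -vwc 0` pipeline above counts how many counters are non-zero (`-w` matches `0` only as a whole word, `-v` inverts, `-c` counts). A minimal sketch of the same check against a sample of `btrfs dev stats` output (device name is hypothetical; real use would pipe from `sudo btrfs dev stats /`):

```shell
# Hypothetical `btrfs dev stats` output with one non-zero counter:
stats='[/dev/nvme0n1p3].write_io_errs    0
[/dev/nvme0n1p3].read_io_errs     12
[/dev/nvme0n1p3].flush_io_errs    0
[/dev/nvme0n1p3].corruption_errs  0
[/dev/nvme0n1p3].generation_errs  0'

# Count lines whose value is NOT the whole word "0", i.e. the number
# of counters that have recorded at least one error.
errors=$(printf '%s\n' "$stats" | grep -vwc 0)
echo "$errors"   # 1: only read_io_errs is non-zero
```

With the all-zero output shown in this report, the same pipeline prints `0`, which is what makes the one-liner handy as a monitoring check.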

This is while the bad disk is already causing the system to misbehave so obviously that even the most careless user would notice something is wrong, yet BTRFS hasn't registered anything unusual.



Version-Release number of selected component (if applicable):

Fedora 36
btrfs-progs v5.18



How reproducible:

Always.



Steps to Reproduce:
1. Try to use BTRFS for valuable data.
2. It works as long as the hardware doesn't fail.
3. As soon as you would actually need the data protection provided by BTRFS, it doesn't work.



Actual results:

$ sudo btrfs dev stats / | grep -vwc 0
0



Expected results:

dev stats should show errors when errors have occurred.



Additional info:

Similar: Bug 2005987

Comment 1 Chris Murphy 2022-10-06 03:29:00 UTC
Filing another bug here isn't going to get it fixed any faster. Upstream is aware of it.
https://bugzilla.redhat.com/show_bug.cgi?id=2005987
https://lore.kernel.org/linux-btrfs/CAJCQCtRbktnZ5NxRTZL9UKvTr1TaFtkCbeCS2pVnf2SPg8O3-w@mail.gmail.com/

The lack of a Btrfs error suggests the bad sector was read in the course of readahead logic, and ended up not being needed by Btrfs. This is mentioned in the upstream thread.

If it had been expected data or metadata, the read error would definitely have triggered a Btrfs complaint, a read of any available redundant copy of that block, and an attempt to fix the bad one. So it's really just a device problem, not a Btrfs problem.

Comment 2 Basic Six 2022-10-10 21:07:03 UTC
Thanks for your response! Well, right, I could've added a comment to the other bug. Either way, the comments in the mailing list (upstream) are a year old and I thought it might have been forgotten.

I'd just like to point out one thing here, since my description above isn't very clear:

> The lack of a Btrfs error suggests the bad sector was read in the course of readahead logic, and ended up not being needed by Btrfs.

I have my doubts about [whatever the system failed to read] "ended up not being needed by BTRFS" because the system actually started to show erratic behavior. Most of what happened couldn't be linked to BTRFS but was possibly caused by it (e.g., parts of the desktop environment kept crashing). At some point, something wrote an error message to the terminal saying that some (config) file could not be updated because the disk is full (it was not full). So although I cannot be completely sure, it seems to contradict the assumption that BTRFS did not need the requested data in the end. Furthermore, some operations were very slow, like copying a 10M file got stuck for several minutes (after several attempts, it went through within seconds).

> This is mentioned in the upstream thread.

Yes: readahead errors are things like "out of memory" ...

Although I agree that a failed readahead due to an oom situation is not critical, I think a clear distinction should be made:
Why did it fail, really? If oom, it may be safe to assume that other parts of the system will inform the user; no disk issue so no reason to increase error counts for one of the disks. However, if some sort of disk i/o error happened, it should be reported in the same way that other errors are reported even if they can be corrected.

Comment 3 Chris Murphy 2022-10-13 18:54:32 UTC
>I have my doubts about [whatever the system failed to read] "ended up not being needed by BTRFS" because the system actually started to show erratic behavior.

Correlation isn't causation.

The Btrfs design, as developers have described it to me, is that Btrfs always rejects bad metadata and data with EIO (input/output error) when there's a checksum mismatch. Bad metadata and data are never used by kernel code or handed to user space; instead, EIO is returned, and the requester receiving EIO is expected to handle the error with some kind of graceful failure.


>Most of what happened couldn't be linked to BTRFS but was possibly caused by it (e.g., parts of the desktop environment kept crashing).

This is speculation. And it's also inconsistent with the design. It could be true, but you'd have to find the code path to prove such an assertion. Or some kind of test where the desktop environment receives EIO, and fails to gracefully handle the condition.


>At some point, something wrote an error message to the terminal saying that some (config) file could not be updated because the disk is full (it was not full).

This is the first time this is mentioned in either bug report. It's also too vague. We need logs to ascertain the problem before it's possible to talk about a fix.


>So although I cannot be completely sure, it seems to contradict the assumption that BTRFS did not need the requested data in the end.

The information would not have been available to Btrfs because the drive reported an unrecoverable read error for this sector, i.e. it is not able to read the sector, and has not handed over any of the data in the sector to the kernel. If Btrfs needed this information, but the drive can't/won't hand over the information - Btrfs would complain significantly. There'd be all kinds of kernel errors in the log, and they'd be unambiguous.
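One way to separate the two failure layers is to count device-level medium errors and Btrfs-level complaints independently in the kernel log. A minimal sketch against a hypothetical log excerpt (in practice you would run something like `journalctl -k | grep -iE 'btrfs|medium error'`; the `BTRFS error … errs: wr …` line format is the kernel's periodic per-device error summary, shown here with an invented device):

```shell
# Hypothetical saved kernel log excerpt (first two lines are from this
# report; the third is an illustrative Btrfs error-counter message):
log='nvme0n1: I/O Cmd(0x2) @ LBA 99961344, 2560 blocks, I/O Error (sct 0x2 / sc 0x81)
critical medium error, dev nvme0n1, sector 99961344 op 0x0:(READ) flags 0x84700
BTRFS error (device dm-0): bdev /dev/dm-0 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0'

# Device-level medium errors vs. Btrfs-level complaints:
medium=$(printf '%s\n' "$log" | grep -c 'medium error')
btrfs=$(printf '%s\n' "$log" | grep -c '^BTRFS error')
echo "medium=$medium btrfs=$btrfs"
```

If the medium-error count rises while the Btrfs count stays at zero, that matches the readahead explanation above: the drive failed a read that Btrfs never ended up needing.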

>Furthermore, some operations were very slow, like copying a 10M file got stuck for several minutes (after several attempts, it went through within seconds).

This is consistent with a drive that has bad media and is trying to do error correction on those bad sectors, which takes quite a lot of time, so it appears as a device slowdown. The manufacturer might say this is normal behavior: the firmware is correctly handling media defects by re-reading bad sectors and reconstructing the data using ECC. But from a consumer standpoint, this is a defective drive, hands down. Get it replaced under warranty if you can.
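The drive's own accounting of media defects can be read from its SMART data. A sketch parsing a hypothetical `nvme smart-log` excerpt (field names as printed by nvme-cli; real use would run `sudo nvme smart-log /dev/nvme0`, or `sudo smartctl -a` for SATA drives):

```shell
# Hypothetical excerpt of `nvme smart-log /dev/nvme0` output; a failing
# drive typically shows a non-zero media_errors count.
smart='critical_warning      : 0
media_errors          : 37
num_err_log_entries   : 41'

# Extract the media_errors value (split on ":", strip spaces):
media=$(printf '%s\n' "$smart" | awk -F: '/^media_errors/ {gsub(/ /,"",$2); print $2}')
echo "$media"
```

A non-zero and growing `media_errors` count is the drive itself admitting unrecoverable media defects, independent of anything the filesystem sees.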


>However, if some sort of disk i/o error happened, it should be reported in the same way that other errors are reported even if they can be corrected.

Not if Btrfs wasn't the requester. If Btrfs requested a sector, and the drive reports it can't read that sector, it'll tell Btrfs that sector can't be read, and then Btrfs will complain. We have no indication Btrfs is even aware of this problem, so near as I can tell from the information provided, it's just dying hardware and it should be replaced. Uncorrectable/Unrecoverable read errors are always media defects of some kind. If it's a HDD, they can sometimes be fixed by overwriting the bad sectors, but you have to overwrite those sectors with a block size equal to physical sector size. But no matter what, if it's under warranty, get it replaced.
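On the "overwrite with the physical sector size" point: the kernel logs sector numbers in 512-byte units regardless of the drive's physical sector size, so the logged sector has to be converted and aligned first. A sketch of the arithmetic, using the sector from the dmesg line in this report (the `dd` command is left in a comment because it is destructive, and the device name is a placeholder):

```shell
logged_sector=99961344   # 512-byte units, from the dmesg line above
phys=4096                # assumed physical sector size

# Number of 512-byte sectors per physical sector, then round the
# logged sector down to a physical-sector boundary:
per_phys=$(( phys / 512 ))
aligned=$(( logged_sector / per_phys * per_phys ))
echo "$aligned"

# DANGEROUS -- destroys the data in that sector; shown only as a sketch:
#   dd if=/dev/zero of=/dev/sdX bs=4096 seek=$(( aligned / per_phys )) \
#      count=1 oflag=direct
```

`oflag=direct` bypasses the page cache so the write actually reaches the device in one full physical-sector-sized request, which is what gives the firmware a chance to remap the defective sector.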

Comment 4 Ben Cotton 2023-02-07 14:57:01 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 38 development cycle.
Changing version to 38.