Bug 2114677 - system fails to boot from RAID1 btrfs when a disk has failed
Summary: system fails to boot from RAID1 btrfs when a disk has failed
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: distribution
Version: 36
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Aoife Moloney
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-03 05:41 UTC by hw
Modified: 2023-05-09 13:56 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-16 18:37:33 UTC
Type: Bug
Embargoed:




Links
Fedora Pagure: fedora-btrfs/project issue 59 (last updated 2022-08-16 21:02:07 UTC)
GitHub: dracutdevs/dracut issue 1922, "enable unattended degraded boot for btrfs", open (last updated 2022-08-16 21:02:07 UTC)

Description hw 2022-08-03 05:41:49 UTC
Description of problem:

When the default btrfs file system is used in RAID1 mode with two disks as the boot file system, the machine won't boot once one of the disks has failed :(

Expected results:

This is a severe issue because I'm using RAID1 to protect against disk failures just like this one.  I had to go to great lengths to connect a monitor and a keyboard to the server --- which fortunately happens not to be at some remote place --- just to figure out what was going on.  Then I had to boot from a USB stick with a live system on it, which I was very fortunate to have at hand, so I could replace the failed disk, which I was also fortunate to have at hand.  I had to mount the file system in degraded mode to be able to add the new disk.  Only then could I remove the failed one, which finally(!) led to the array being rebuilt.
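
Roughly the sequence I had to run from the live system (device names and partition numbers here are just placeholders, the real ones differed):

    mount -o degraded /dev/sda3 /mnt      # mount the surviving member read-write in degraded mode
    btrfs device add /dev/sdb3 /mnt       # add the replacement disk to the filesystem
    btrfs device remove missing /mnt      # drop the failed disk; this kicks off the rebuild onto the new one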

I expect the machine to at least boot as usual, so that I don't run into several hours of downtime to begin with and so that I can log in remotely.  I expect to get a warning (by email) when a disk has failed.  I expect that replacing the failed disk with a new one is all that needs to be done, so that even an arbitrary person at a remote site can be instructed to replace the disk.

Hardware RAID does that.  mdadm RAID1 recovers from a failure by running the array in degraded mode, and the machine still boots, so you can add the new disk manually after logging in remotely.

Btrfs just fails.  This is a huge no-no!

What are we supposed to do?  Regress to mdadm for boot drives because btrfs fails so miserably?

Or is it a problem with systemd, which keeps waiting indefinitely for the root drive to become available?  I can understand that it may seem generally advisable not to start a degraded array.  But when that array is required to boot, not starting it at all is not an option, at least not by default.

Comment 1 Zbigniew Jędrzejewski-Szmek 2022-08-16 17:41:45 UTC
Systemd is just executing the configuration that was provided externally (by the installer, I guess).
If that configuration does not specify that degraded mounts should be performed, then the boot will go into emergency mode.

Comment 2 hw 2022-08-16 18:14:39 UTC
It was configured by the installer.  I didn't know you could configure this and don't know how to.

It should be the default that the computer boots if at all possible, even if the array is degraded, and somehow gives a warning.  I can imagine that a lot of users won't be able to figure out what happened and how to fix the problem, and when their computer doesn't even boot, how are they supposed to search for answers?

Besides, IMHO the purpose of RAID with redundancy is to keep things working even when a disk has failed, and making the computer refuse to boot when that happens somewhat defeats the purpose and advantages of RAID with redundancy.  So that shouldn't be the default; it has never been like that.

Comment 3 Chris Murphy 2022-08-16 18:37:33 UTC
For what it's worth, I've previously taken this same position.  The problem with it, though, is that this really isn't a bug; it's a feature request.  I'll try to explain...

Short version: if you need unattended degraded boot when a device is faulty or missing, you need to use mdadm.  (Offhand I don't know if LVM raid can do it; I haven't tested it.)  Btrfs is the default for single-disk installations; we do not configure raid1 by default for any Fedora variant.

Long version: for mdadm, this use case is handled entirely by dracut scripts, not by mdadm itself.  Dracut won't initially assemble an array if a drive is faulty or missing; instead it starts a roughly 300-second countdown waiting for the drive to appear, because aggressively assembling degraded when a drive might just be slow to show up is not good.  After the 300 seconds, dracut orders mdadm to assemble the array degraded.
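
Conceptually the logic looks something like this (an illustrative shell sketch, not the actual dracut module code):

    # sketch only: wait up to ~300s for all raid members, then force a degraded start
    deadline=$(( $(date +%s) + 300 ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        mdadm --assemble --scan && exit 0   # succeeds only once enough members have shown up
        sleep 5
    done
    mdadm --assemble --scan --run           # deadline passed: presume the drive dead, start degraded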

There's no equivalent in dracut to handle btrfs in the same way, so one part of the feature request needs to happen with dracut upstream.  I'm not sure if anyone has discussed it with them.  I think for servers this might be enough.  On desktops, the issue is that users sometimes just forget to plug in a drive, and in that case we need some extra protections in Btrfs to avoid a split-brain problem if the user were to have two absent-minded incidents in a row.  So someone has to think through this logic and write the code to handle it; it's not something intrinsic to Btrfs right now.

Side note: this can't be a higher priority than the UEFI multiple-ESP syncing requirement, and right now we don't have a solution for that either.  There's a bad hack in some places to use software raid for the ESP, but this is (a) expressly rejected by upstream mdadm developers, (b) undermined by the fact that the firmware can write to the ESP, making any software raid go out of sync in an ambiguous way that can't be unambiguously repaired, (c) fragile, and (d) unable to support the dual-boot use case.

A partial workaround for the btrfs case is looking at the problem of failed drives differently.  In Btrfs land there's much less importance on considering drives with some errors as faulty, because every read is verified against checksums.  Therefore there isn't this extreme need to "eject" a drive once it's spitting out errors.  mdadm has no idea which reads are good or bad other than what the drive says, so at the point where the drive complains about a few read errors in a row, it's reasonable to just ignore the whole drive and consider it faulty.  Btrfs, by contrast, knows data is bad even if the drive doesn't return an error.  This means we're better off keeping failing drives in place, because we might still get good data from them a lot of the time.

What we need to do until there are improvements in this area is *replace* the drive before rebooting, i.e. with the `btrfs replace` command, which does a live replace of a failing or failed drive, and thereby sidestep the problem of degraded boot entirely.  It's not always possible, of course, so it's an imperfect workaround.  But it's definitely viable more often than not, because Btrfs only fails to boot if the drive is missing, i.e. completely failed.  If the drive at least shows up on the bus and can read well enough to report itself as a btrfs volume, then even if most other reads fail, btrfs can handle that automatically, without even needing a degraded mount.
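
For example (the devid and device node here are just placeholders):

    # live-replace devid 2 with the new device; -r reads from the failing
    # source only if no other good copy exists
    btrfs replace start -r 2 /dev/sdc3 /
    btrfs replace status /      # watch progress; the old device is released when it finishes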

Anyway, closing this as notabug.  Upstream btrfs knows the implied work they need to do, and the work the dracut devs would need to do, if they're willing and able, should be filed in an upstream bug report with them.  And probably for the desktop use case we need some way for the desktop to determine drive-failure states and report that to the user.

Comment 4 hw 2022-08-16 19:55:17 UTC
It's OK if this is a feature request.  But do you really think that defaulting to failure is a good default just because yet another drive has failed?  Drives fail all the time; the only question is when a drive fails, not if.

Who would install anything without redundancy?  I can see that for temporary testing installations, but not for anything you're actually going to use.

So what you're saying is that btrfs is basically dead because you can't really use it.

Having mdadm wait 300 seconds for drives to appear seems excessive.  If a drive takes that long to appear, it's probably broken --- or what kind of drives take that long?

Are you suggesting that I resort to hardware RAID (which has its own issues) to handle all potential failures?  Or should we resort to ZFS, which is its own nightmare and which to me seems like the worst solution one could go for?  Or should we use mdadm and then put btrfs on top of it --- that seems like a pretty weird idea.

Desktops usually don't have hot-pluggable drives; they usually don't even have the pluggable drive bays for that.  And if someone has all that and forgets to plug in a drive, then what would be the reason to default to failure?  That someone probably wants their computer to just boot.

For non-boot drives, I find it arguable whether the computer should still boot, but not for boot drives.

Someone who tends to plug in drives and forgets to plug them in might just as well unplug a drive.  How is that being handled?  Shut down the computer because the array is degraded, just because, and because that someone could unplug some more drives?

I don't think it's valid to argue here with people plugging and unplugging their drives.  They could just as well unplug the power, delete all their data, or burn up their computer with a flamethrower.  You can argue that it's too difficult to create software to prevent that and that therefore everything should default to failure before anything can happen, but if you take that line, you can never use a computer or anything else.

I don't know what the "UEFI multiple ESP syncing requirement" is.  I simply want the computer to boot even when a disk has failed, and it used to be that way.

You cannot prevent a disk from failing, and you cannot replace a disk when there is no hot spare available to replace it with.  Do you seriously want me to double the number of disks (which would also mean more than twice as many failures can occur) so that there is one spare for every disk that may fail?  That's not a solution, for a lot of reasons.  You may be able to detect errors sooner with btrfs, but what's the point when it defaults to failure when that happens?  In this case, the disk just didn't come back after rebooting, so there wasn't anything to detect other than that.  That was a really simple case and it defaulted to total failure, and that's definitely not an option.

You are kind of suggesting that btrfs cannot be used with SSDs.  In all cases of SSDs failing which I've seen so far, the disk failed completely, i.e. it didn't show up anymore.  So when you're using btrfs with SSDs, it will always default to failure.  That isn't an option.  With spinning disks, they still show up and show signs of failure, so should we go back to spinning disks?

Well, it remains a bug, no matter whether you mark it closed or not.  Why is btrfs the default in Fedora when it's still so immature?  Defaulting to total failure is no good at all.

Comment 5 Chris Murphy 2022-08-16 20:35:56 UTC
>But do you really think that defaulting to failure is a good default just because yet another drive has failed?

It is not good.  But it is a good default because it's the safest option we have available.  There's some bad advice out there to disable /usr/lib/udev/rules.d/64-btrfs.rules so the hang doesn't happen, and then tack `degraded` onto the existing rootflags boot parameter.  The problem is that this then means you get a degraded mount any time one device is merely half a second slower to become available.  If the devices go back and forth, you get a split brain and a corrupted file system.  Right now, degraded operation on Btrfs requires this painful method of notifying the user that something is wrong, so that they have to troubleshoot what's wrong.  It is intentionally requiring user intervention.
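
Concretely, that bad advice amounts to something like the following (shown only so it's recognizable, not as a recommendation):

    # DON'T do this: mask the udev rule that waits for all btrfs devices...
    ln -s /dev/null /etc/udev/rules.d/64-btrfs.rules
    # ...and then append ",degraded" to the existing rootflags= kernel parameter, e.g.
    #   rootflags=subvol=root,degraded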

>Who would install anything without redundancy?  I can see that for temporary testing installations, but not for anything you're actually going to use.

I don't understand the question.

>Having mdadm wait 300 seconds for drives to appear seems excessive.  If a drive takes that long to appear, it's probably broken --- or what kind of drives take that long?

It's a probability function. We can only assume a drive is broken if it takes too long to show up. What's too long? The agreed upon time for many years is 5 minutes. At 5 minutes we can say it's probably broken, not just late, and thus it's OK to do a degraded assemble.

>Are you suggesting that I resort to hardware RAID (which has its own issues) to handle all potential failures?

No. I gave my suggestion already exactly as I intended it, you don't have to extrapolate.

> Or should we resort to ZFS, which is its own nightmare and which to me seems like the worst solution one could go for?

I don't know anything about OpenZFS's multiple device behavior.

>Or should we use mdadm and then put btrfs on top of it --- that seems like a pretty weird idea.

It's a set of tradeoffs.  That configuration will boot automatically and unattended (eventually) if a drive failure happens.  And it will unambiguously complain if any metadata or data corruption is detected, and it can self-heal metadata as long as you're using the (default) DUP profile.  For single-copy data it can only detect corruption, so you are missing the unique btrfs raid1 feature of being able to self-heal data.
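
A minimal sketch of that layout (device names are placeholders):

    # btrfs on top of an mdadm raid1: md provides the device redundancy,
    # DUP metadata gives btrfs a second copy to self-heal metadata from
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
    mkfs.btrfs -m dup -d single /dev/md0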

>Desktops usually don't have hot-pluggable drives; they usually don't even have the pluggable drive bays for that.  And if someone has all that and forgets to plug in a drive, then what would be the reason to default to failure?  That someone probably wants their computer to just boot.

There's no alternative to not booting: the code to handle this intelligently and automatically doesn't exist.

>I don't know what the "UEFI multiple ESP syncing requirement" is.  I simply want the computer to boot even when a disk has failed, and it used to be that way.

Sure, and it's a reasonable want.  The problem is that the computer doesn't care what we want; it only cares about what it's coded to do, and this code doesn't exist.  It's thus far not been a priority, I guess.  Somebody would need to do it; code doesn't write itself.  The UEFI multiple-ESP sync requirement refers to where the bootloader lives: you need an ESP on two drives in order to boot if one drive fails, and the problem is that we don't do this on UEFI.  You get one ESP.  There is a hacky workaround to make the ESP use software raid1, but like I said, it's bad news, not really supportable, and it won't always work, so it's not a viable solution.  I.e., in the face of a disk failure you have maybe a 50/50 chance of booting anyway, aside from the btrfs raid1 issue.
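
For reference, the hack being referred to looks roughly like this; metadata version 1.0 puts the md superblock at the end of the partition so the firmware still sees a plain FAT filesystem, which is exactly why firmware writes can silently take the mirror out of sync:

    # the discouraged ESP-on-raid1 hack (device names are placeholders)
    mdadm --create /dev/md/esp --level=1 --metadata=1.0 --raid-devices=2 /dev/sda1 /dev/sdb1
    mkfs.fat -F 32 /dev/md/esp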

>That was a really simple case and it defaulted to total failure, and that's definitely not an option.

It's the only option.  Plus this is a desktop use case: the user has physical access and it's straightforward to intervene.  You can get to a shell and manually invoke `mount -o degraded`, and then you can boot just fine.  I don't think there is a valid use case for automatic unattended degraded boot of a desktop system.  That's for servers, in which case, yes, you'd need a spare if you're going to repair it and get back to normal operation without a human having physical access to the server.
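
For example, from the dracut emergency shell, something like (device name is illustrative):

    mount -o degraded /dev/sda3 /sysroot   # mount the surviving raid1 member degraded
    exit                                   # leave the shell and let the boot continue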

>You are kind of suggesting that btrfs cannot be used with SSDs.  In all cases of SSDs failing which I've seen so far, the disk failed completely, i.e. it didn't show up anymore.  So when you're using btrfs with SSDs, it will always default to failure.  That isn't an option.  With spinning disks, they still show up and show signs of failure, so should we go back to spinning disks?

Again, if your requirement is unattended automatic degraded boot, then Btrfs raid1 is not for you; you'll need to use mdadm raid1.  If you accept that in the case of a disk failure you'll have to manually intervene to mount the fs degraded, then you can use btrfs raid1.  And note that the failure only happens during boot: if the SSD vanishes while in use, then as long as you don't reboot, btrfs continues to work OK, albeit with very noisy kernel messages due to all the read and write failures for the failed drive.

Comment 6 hw 2022-08-16 21:47:06 UTC
It does remain a bug that needs to be worked on.  Closing the bug report doesn't change that.

I don't know what makes you think this is a "desktop use case", and thinking that failure is acceptable in this case is mistaken, desktop or not.

In fact, it was a server, with Fedora Server installed on it.  I was only fortunate that it wasn't a remote server.

You're explaining that btrfs in RAID1 is unable to handle a disk failure because it hasn't matured that far yet.  Since you know this, you can fix the bug by not making btrfs the default file system for Fedora and using mdadm or other options that give better results until btrfs has come far enough.  That goes just as well for the "desktop use case", whatever that is.

This is an issue that shouldn't be taken as lightly as you seem to take it.  Servers usually boot even when a RAID array is degraded, and when the default file system of Fedora Server is unsuitable for such a common requirement, it should never have been made the default.  "You'll need to use mdadm raid1" ...

Fedora needs to become much more responsible with its decisions if it really wants to assume the leading role its mission statement suggests it should have.

Comment 7 Chris Murphy 2022-08-16 22:32:44 UTC
OK, you're confused and just becoming argumentative at this point.  This is a bug report, not a discussion forum.  If you want to have a conversation about this topic, start one on the Fedora devel list; I'm happy to discuss it further there.

It's appropriate to close the bug as I did, UPSTREAM, because upstream bug reports have been filed as noted, so please don't change the bug status again.  Fedora can't change this independently of the upstream projects.

Comment 8 hw 2022-09-24 09:45:12 UTC Comment hidden (abuse)
