Bug 1733388
Summary: | deadlock caused by missing memory barrier causes btrfs installs to hang with kernel-5.3.0-0.rc0.git7.1.fc31 and later | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Adam Williamson <awilliam> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED RAWHIDE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | rawhide | CC: | airlied, bskeggs, bugzilla, fedora-kernel-btrfs, gmarr, hdegoede, ichavero, itamar, jarodwilson, jeremy, jforbes, jglisse, john.j5live, jonathan, josef, kernel-maint, labbott, linville, mchehab, mjg59, robatino, steved |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | openqa | ||
Last Closed: | 2019-08-16 17:08:54 UTC | Type: | Bug |
Bug Blocks: | 1644937 |
Description
Adam Williamson
2019-07-25 22:33:20 UTC
Proposing as a Beta blocker per criterion "When using both the installer-native and the blivet-gui-based custom partitioning flow, the installer must be able to: Correctly interpret, and modify as described below, any disk with a valid ms-dos or gpt disk label and partition table containing ext4 partitions, LVM and/or btrfs volumes, and/or software RAID arrays at RAID levels 0, 1 and 5 containing ext4 partitions..." It doesn't actually say that we should then be able to successfully complete an install to devices of those types, but I'm gonna say it's kinda implied.

I'm strongly against anything with btrfs being a blocker. If that's in the criteria I think we should see about removing btrfs, simply because we don't have the resources to actually deal with btrfs besides reporting bugs upstream.

Agreed, btrfs has been a gamble pretty much always. See previous discussion around proposals to make btrfs default. Ext4 and xfs should be the only release-blocking filesystems.

We can revisit that, sure. The storage criteria have always been...fun. The basic principle, though, is that stuff the installer offers prominently ought to work: so perhaps this could also be a reason to revisit dropping btrfs from the installer...

I wouldn't be opposed to that, though it is probably a bit late for F31. Either way, there is nothing about btrfs that should be release blocking. We have pretty much always discussed it as "use at your own risk".

This isn't a new lockdep splat. I've seen it in debug kernels since 5.0.0 and haven't seen it actually cause any problems in hundreds of VM and bare-metal installations, and multiple production systems. Upstream is aware of it, and based on this explanation I'm not sure that it's actually a Btrfs problem or if it's just exposed by Btrfs. An actual deadlock attributed to this would get it a lot more attention.
https://lore.kernel.org/linux-btrfs/20190703211210.GJ16275@worktop.programming.kicks-ass.net/

Well, after filing the bug report, I found several other cases where the bug happened but that circular locking backtrace isn't in the logs. So I'm not sure now if it's really related to the bug. But the bug definitely seems real. After a long history of the btrfs tests either passing or failing for some other obviously identifiable reason, they suddenly started getting flaky on both prod and staging from the Fedora-Rawhide-20190720.n.1 compose onwards. In each compose since then, at least one of the four tests (we test UEFI and BIOS installs via both 'custom partitioning' and 'advanced custom partitioning') has failed.

OK, it looks like this has been fixed already.
https://lore.kernel.org/linux-btrfs/35b5e6a8-8e9b-037d-b248-36fee9da8717@suse.com/

OK, I'll see if the actual install-hangy-bug goes away over the next few composes, then (assuming that patch will land in our kernel builds soonish).

It's been merged for rc2.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4792ba1f1ff0db30369f7016c1611fda3f84b895

It's not been pulled in for 5.3.0-0.rc1.git4.1, but might be if there's a git5.1.

Discussed during the 2019-07-29 blocker review meeting: [1] The decision to delay the classification of this as a blocker bug was made because, while this seems fairly likely to be a blocker under current policy, the Anaconda team believes our current policy casts too wide a net. We will start a discussion of storage criteria on the mailing lists and reconvene on this bug next meeting.
[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2019-07-29/f31-blocker-review.2019-07-29-16.02.txt

I was able to reproduce this in 1 of 2 attempts with Fedora-Workstation-Live-x86_64-Rawhide-20190730.n.0.iso (kernel 5.3.0-0.rc1.git3.1), and not at all since rc2 landed in Fedora-Workstation-Live-x86_64-Rawhide-20190731.n.0.iso. Four tests in openQA on 20190802 also succeeded.
I think this can be set to CLOSED RAWHIDE.

Yeah, this failure has not happened in openQA for some time. Let's call it fixed.
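
For what it's worth, here is a rough sanity check on why a short run of clean installs counts as evidence that the hang is gone. It is only a sketch: it assumes each install attempt is independent and that the "1 in 2" rate seen in manual reproduction would also apply to the openQA runs, neither of which this report establishes. The function name is made up for illustration.

```python
def chance_all_pass(hang_rate: float, attempts: int) -> float:
    """Probability that `attempts` independent installs all complete,
    given a per-install hang probability of `hang_rate`."""
    if not 0.0 <= hang_rate <= 1.0:
        raise ValueError("hang_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return (1.0 - hang_rate) ** attempts


# Assumption: the pre-fix hang rate is the "1 in 2 attempts" quoted above.
PRE_FIX_HANG_RATE = 0.5
# The four openQA btrfs tests (UEFI/BIOS x custom/advanced custom partitioning).
OPENQA_TESTS_PER_COMPOSE = 4

if __name__ == "__main__":
    p = chance_all_pass(PRE_FIX_HANG_RATE, OPENQA_TESTS_PER_COMPOSE)
    print(f"Chance of {OPENQA_TESTS_PER_COMPOSE} clean runs if still broken: {p:.4f}")
```

Under those assumptions a single clean compose would still happen by luck about 6% of the time with the bug present, which is why waiting for several clean composes before closing is the safer call.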