Bug 1878596
| Summary: | Linux kernel 5.8.x fails to boot on 2015 MacBook Air | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Brandon Jones <brandon.gustav> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED EOL | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 32 | CC: | acaringi, airlied, brandon.gustav, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonah, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, mjg59, m.v.b, pany, steved |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | --- |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-05-25 17:28:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Brandon Jones
2020-09-14 06:00:59 UTC
This issue is still present in kernel version 5.8.8.

Am seeing similar issue. 2014 MBP boots and runs well with 5.6.6 and fails with 5.8.8 and 5.8.11. However, in this case, after grub kernel version selection, there is no further reported progress. No boot log text is displayed to the terminal and nothing about the boot is available in journalctl. While I'm waiting the capslock key works and the machine is warm, so something seems to be happening, but nothing is logged for several minutes until I give up.
This is also Fedora 32, XFCE spin, just installed from a USB drive. I then installed wifi drivers against 5.6.6 and ran dnf update, which brought in the 5.8.11 kernel.

Rebooted under 5.8.11, removing quiet from the kernel command line. Boot proceeds past the Welcome to Fedora message, then freezes at
fb0: switching to inteldrmfb from EFI VGA
Following https://bbs.archlinux.org/viewtopic.php?id=256520, adding nomodeset to the 5.8.11 command line allows the machine to complete boot.

Adding nomodeset to my kernel boot options on 5.8.11 does not fix the issue for me. On my machine this does not seem to be an issue with the graphics chipset as referenced in the Arch forums, but an issue with the NVMe controller, since I'm getting filesystem and NVMe controller errors on boot with these kernel versions. My suspicion is that this is a regression of this issue: https://lore.kernel.org/linux-nvme/m3a707rd8y.fsf@web.de/
On earlier kernel versions I was unable to even install Linux on this laptop, because the kernel on the live image did not recognize the SSD. The kernel that ships with Debian Buster has this issue, and I suspect 5.8.x may have reintroduced an issue with certain SSDs.

Shot in the dark, since the above report doesn't mention the MacBook version: before installing an NVMe drive in my 2014 MBP I had to do a Mac OS upgrade on the stock AHCI drive. I had been running 10.11 something and had to upgrade to 10.13, which, when upgrading the OS on the AHCI drive, also does an MBP firmware upgrade that allows the machine firmware to support NVMe. There seems to be no other way to upgrade the machine firmware, and if the machine was not running 10.13+, it will not be able to use NVMe drives. After the 10.13 upgrade, I removed the stock AHCI drive and installed a 3rd party NVMe drive, and as above am able to boot and run Fedora from NVMe. Perhaps that helps, ignore if not relevant.

The above report says the model was a 2014 MacBook Pro. That particular model was experiencing video driver issues according to the linked discussion on the Arch forums. My 2015 MacBook Air (MacBook 7,1) is experiencing file system issues, none of which fsck seems to clear (running fsck on all partitions shows there are no file system issues). Running the integrated Apple Hardware Test also shows no hardware issues.
In response to upgrading the NVMe firmware, the last Mac OS versions on this laptop were 10.15 (Catalina) and the Big Sur beta. The firmware should be, according to what you are saying, up to date. In any case, this laptop shipped with an NVMe drive, so the firmware should have been present and upgraded throughout its life as I upgraded Mac OS.

So in my continued search for a solution to this issue, I ran across this on the Ask Fedora forum: https://ask.fedoraproject.org/t/f32-kernel-5-8-9-fails-to-boot-on-macbook-pro/9299/13
Same issue, same era machine. I'm not sure if this bug would be considered a duplicate of the one listed in the thread here: https://bugzilla.redhat.com/show_bug.cgi?id=1878347
It seems a potential fix was pushed in >5.8.14 according to this: https://marc.info/?l=linux-usb&m=160296664914184&w=2
I will continue keeping tabs on this.

Just a quick follow up to my previous comment: this is not the same issue, it would seem. My machine still fails to boot on 5.8.14 and 5.8.15.
I think that this bug is a duplicate of bug 1878347.

I've been paying attention to that bug. I will test out any possible solutions or patches there when they are reported, to confirm if it is a duplicate.

I took a look at the thread for the patch, and tried the workaround mentioned there: appending module_blacklist=apple_mfi_fastcharge to the boot options. It unfortunately does not allow me to boot on 5.8.15. Further, I get an "nvme nvme0: controller is down" before systemd tries to load services. I am still of the belief that this is an issue with the NVMe controller and a possible regression of the previous issue I mentioned above, and not related to the USB bug referenced.

This bug I believe is related: https://bugzilla.redhat.com/show_bug.cgi?id=1601196

(In reply to Brandon Jones from comment #0)
...
> 3. Did it work previously in Fedora? If so, what kernel version did the issue
> *first* appear? Old kernels are available for download at
> https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
>
> Kernel versions 5.6.6 (the original kernel shipped on the installation ISO)
> through 5.7.17 continue to function and boot properly. The issue first
> occurred after updating to kernel 5.8.4.

Hi Brandon, did you try kernel 5.8.2 [1] or so? Maybe it will narrow the range of the first appearance of this issue.

[1] https://koji.fedoraproject.org/koji/buildinfo?buildID=1595917

(In reply to Pany from comment #15)
> (In reply to Brandon Jones from comment #0)
> ...
> > 3. Did it work previously in Fedora? If so, what kernel version did the issue
> > *first* appear? Old kernels are available for download at
> > https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
> >
> > Kernel versions 5.6.6 (the original kernel shipped on the installation ISO)
> > through 5.7.17 continue to function and boot properly. The issue first
> > occurred after updating to kernel 5.8.4.
>
> Hi Brandon, did you try kernel 5.8.2 [1] or so? Maybe it will narrow the
> range of the first appearance of this issue.
>
> [1] https://koji.fedoraproject.org/koji/buildinfo?buildID=1595917

I installed 5.8.2 and I also observed the same issue with that version. 5.7.17 seems to be the most recent version that is bootable on my machine.

Brandon, thanks for ruling out the potential of this being a duplicate of bug 1878347, and my apologies for making a premature statement.

No apology necessary. We are just trying to narrow down what the issue may be. I will try the patched version from the kernel mailing list if it becomes available on koji (I'm not sure how I would try the patched version otherwise). I've also tried seeing if 5.9.1 would boot, and it did not, so it does seem that this issue first appeared in the 5.8 series. I recall that reporting on this kernel version mentioned changes to the ext4 drivers or implementation, so, with a grain of speculation, this may be related to those changes.

I also wanted to add that I did attempt to check the SMART status of the SSD via the Drives app in GNOME, but the option was greyed out both when booting from the SSD and from a live USB. However, as previously mentioned, Apple Hardware Test does not show any issues with the laptop.

Created attachment 1724956 [details]
Emergency shell log
Error log generated before dropping in to root emergency shell.
I've attached a log generated from the patched kernel provided in https://bugzilla.redhat.com/show_bug.cgi?id=1878347. This log finally shows what dmesg was not able to show, since the dmesg output was truncated before the issue. It seems fsck fails at the end of the log, and the boot stops since the kernel cannot mount the filesystems. Running fsck from a live USB, however, shows no errors to correct on any partition. Hopefully this provides more of the needed information about this issue.

Hello Brandon,
I had a look at the logs, and one thing that caught my eye is that the NVMe controller is initialized successfully, but then "something" happens, which causes the controller to stop responding, resulting in the "controller is down; will reset" message after about 30 seconds:
[ 3.071646] f-air kernel: nvme nvme0: pci function 0000:04:00.0
[ 3.072065] f-air kernel: nvme nvme0: detected Apple NVMe controller, set queue depth=2 to work around controller resets
[ 3.073162] f-air kernel: nvme nvme0: 1/0/0 default/read/poll queues
[ 3.082290] f-air kernel: nvme0n1: p1 p2 p3 p4
...
[ 34.890739] f-air kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
[ 34.917820] f-air kernel: nvme nvme0: detected Apple NVMe controller, set queue depth=2 to work around controller resets
[ 50.419835] f-air kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
[ 50.419839] f-air kernel: nvme nvme0: Removing after probe failure status: -19
[ 50.425822] f-air kernel: blk_update_request: I/O error, dev nvme0n1, sector 161476608 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
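As a quick check of the "about 30 seconds" figure, the gaps between the key events can be computed from the timestamps in the excerpt above. This is only a small Python sketch: the three log lines are copied from the excerpt, while the helper names (`parse`, `first_time`) are made up for illustration and are not part of any kernel tooling.

```python
import re

# Three lines copied from the log excerpt above; the rest are omitted.
LOG = """\
[ 3.071646] f-air kernel: nvme nvme0: pci function 0000:04:00.0
[ 34.890739] f-air kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
[ 50.419835] f-air kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
"""

# "[ seconds.micros] message" as printed in the journal excerpt.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse(log):
    """Return a list of (seconds_since_boot, message) tuples."""
    events = []
    for line in log.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            events.append((float(match.group(1)), match.group(2)))
    return events


def first_time(events, needle):
    """Return the timestamp of the first message containing `needle`."""
    for seconds, message in events:
        if needle in message:
            return seconds
    raise ValueError(f"no log line contains {needle!r}")


events = parse(LOG)
probe = first_time(events, "pci function")
down = first_time(events, "controller is down")
aborted = first_time(events, "aborting reset")

print(f"probe -> controller down: {down - probe:.1f} s")
print(f"controller down -> reset aborted: {aborted - down:.1f} s")
```

With these timestamps the controller is reported down roughly 31.8 seconds after the probe, and the reset is aborted about 15.5 seconds later.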
I also noticed that, because this is an Apple NVMe controller, the driver sets the queue depth to two (2), so this appears to be an edge case specific to Apple NVMe controllers.
Here is the relevant part of the code printing out this message. Note that the queue depth variable's name is dev->q_depth, and it is set to the value two (2):
=== drivers/nvme/host/pci.c ===
	/*
	 * Temporary fix for the Apple controller found in the MacBook8,1 and
	 * some MacBook7,1 to avoid controller resets and data loss.
	 */
	if (pdev->vendor == PCI_VENDOR_ID_APPLE && pdev->device == 0x2001) {
		dev->q_depth = 2;
		dev_warn(dev->ctrl.device, "detected Apple NVMe controller, "
			"set queue depth=%u to work around controller resets\n",
			dev->q_depth);
=== drivers/nvme/host/pci.c ===
Given that I do not have the hardware to bisect this issue, I quickly/casually went through all of the commits touching two relevant files between v5.7 (assumed good based on your report) and v5.8 (assumed broken, also based on your report) using the following command:
git log --full-diff -p --no-merges v5.7..v5.8 -- drivers/nvme/host/pci.c drivers/nvme/host/core.c
One interesting commit that caught my eye is the following one, but please note that I am not pointing fingers here. I am only speculating based on the available data.
===
commit 54b2fcee1db041a83b52b51752dade6090cf952f
Author: Keith Busch <kbusch>
Date: Mon Apr 27 11:54:46 2020 -0700

    nvme-pci: remove last_sq_tail

    The nvme driver does not have enough tags to wrap the queue, and blk-mq
    will no longer call commit_rqs() when there are no new submissions to
    notify.

    Signed-off-by: Keith Busch <kbusch>
    Reviewed-by: Sagi Grimberg <sagi>
    Signed-off-by: Christoph Hellwig <hch>
    Signed-off-by: Jens Axboe <axboe>
===
The main/interesting hunk in this patch is as follows:
===
@@ -446,24 +445,11 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 	return 0;
 }

-/*
- * Write sq tail if we are asked to, or if the next command would wrap.
- */
-static inline void nvme_write_sq_db(struct nvme_queue *nvmeq, bool write_sq)
+static inline void nvme_write_sq_db(struct nvme_queue *nvmeq)
 {
-	if (!write_sq) {
-		u16 next_tail = nvmeq->sq_tail + 1;
-
-		if (next_tail == nvmeq->q_depth)
-			next_tail = 0;
-		if (next_tail != nvmeq->last_sq_tail)
-			return;
-	}
-
 	if (nvme_dbbuf_update_and_check_event(nvmeq->sq_tail,
 			nvmeq->dbbuf_sq_db, nvmeq->dbbuf_sq_ei))
 		writel(nvmeq->sq_tail, nvmeq->q_db);
-	nvmeq->last_sq_tail = nvmeq->sq_tail;
 }
===
As can be seen in the patch hunk quoted above, the variable next_tail is derived from the submission queue tail position, and then next_tail is compared against the q_depth variable to determine whether a wraparound has occurred in the submission queue.
The interesting bit of information is that q_depth is set to 2 with the Apple NVMe controller on your system, so I am guessing that the wrap-around is much more likely to occur with the NVMe controller in your laptop. However, from what I can tell, this commit removed the wrap-around check.
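To make the guess about q_depth = 2 more concrete, here is a small Python model of the removed check. It is only a sketch under stated assumptions, not the driver itself: the class and function names are invented for illustration, the shadow-doorbell (dbbuf) test is left out, and the step that advances sq_tail by one slot per command and wraps it at q_depth is an assumption about the caller that does not appear in the quoted hunk. The depth of 8 is an arbitrary comparison value.

```python
class OldSubmissionQueue:
    """Simplified model of the doorbell logic removed by the quoted hunk."""

    def __init__(self, q_depth):
        self.q_depth = q_depth
        self.sq_tail = 0
        self.last_sq_tail = 0
        self.doorbell_writes = 0

    def submit(self, write_sq=False):
        # ASSUMPTION: the caller advances the tail by one slot per command
        # and wraps it at q_depth; this step is not part of the quoted hunk.
        self.sq_tail += 1
        if self.sq_tail == self.q_depth:
            self.sq_tail = 0
        return self._write_sq_db(write_sq)

    def _write_sq_db(self, write_sq):
        # Mirrors the removed lines: skip the doorbell unless asked to write
        # it, or unless the next command would wrap onto last_sq_tail.
        if not write_sq:
            next_tail = self.sq_tail + 1
            if next_tail == self.q_depth:
                next_tail = 0
            if next_tail != self.last_sq_tail:
                return False
        # The dbbuf check from the real function is omitted in this model.
        self.doorbell_writes += 1
        self.last_sq_tail = self.sq_tail
        return True


def forced_writes(q_depth, submissions):
    """Count doorbell writes when every submission passes write_sq=False."""
    queue = OldSubmissionQueue(q_depth)
    for _ in range(submissions):
        queue.submit(write_sq=False)
    return queue.doorbell_writes


print("q_depth=2:", forced_writes(2, 14), "doorbell writes for 14 commands")
print("q_depth=8:", forced_writes(8, 14), "doorbell writes for 14 commands")
```

In this model a depth of 2 makes the "would wrap" condition true on every deferred submission, so the old code rang the doorbell each time, whereas a depth of 8 triggers it only occasionally. That fits the guess that removing the check matters far more for this controller, but the model says nothing about how the hardware actually reacts.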
Given the above discussion, it might be worthwhile to attempt to take Linux kernel release 5.8.16, revert the aforementioned commit 54b2fcee1db0 ("nvme-pci: remove last_sq_tail") and give the resulting kernel a try on your system?
Once again, I do not know if this will help, and this is a guess/speculation based on casually looking at the code.
If this suggestion does not help, then another option would be to use git bisect as follows and then build and test the kernel at each step, which admittedly will take some time:
git bisect start v5.8 v5.7 -- drivers/nvme/host/
For this command, I specified the drivers/nvme/host/ directory instead of only the two files I had used with the git log command above. This is so that the problematic commit will be found even if my initial hunch, that pci.c and core.c are the only relevant files changed between v5.7..v5.8, turns out to be incorrect.
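For readers unfamiliar with bisection, the idea can be sketched in Python using the boot results reported earlier in this thread. Note that git bisect walks individual commits rather than released kernel versions, so this is only a model of the search strategy; the `first_bad` helper and the `boots` callback are hypothetical names, and the good/bad table simply restates what was reported above.

```python
# Boot results reported in this thread, oldest kernel first.
REPORTED = [
    ("5.6.6", True),
    ("5.7.17", True),
    ("5.8.2", False),
    ("5.8.4", False),
    ("5.8.8", False),
    ("5.8.11", False),
    ("5.8.14", False),
    ("5.8.15", False),
    ("5.9.1", False),
]


def first_bad(versions, boots):
    """Binary search for the first failing version.

    Assumes the list is ordered, its first entry is good, its last entry is
    bad, and every version after the first bad one is also bad.
    """
    low, high = 0, len(versions) - 1
    tested = []
    while high - low > 1:
        mid = (low + high) // 2
        tested.append(versions[mid])
        if boots(versions[mid]):
            low = mid   # still good: the regression is later
        else:
            high = mid  # already bad: the regression is here or earlier
    return versions[high], tested


results = dict(REPORTED)
versions = [version for version, _ in REPORTED]
culprit, tested = first_bad(versions, lambda version: results[version])

print("first bad version:", culprit)
print("versions tested:", ", ".join(tested))
```

Each test halves the remaining range, which is why bisecting a few hundred commits takes under ten build-and-boot cycles rather than hundreds.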
I hope that this will help in some way!
(In reply to M. Vefa Bicakci from comment #22)
...
> Given the above discussion, it might be worthwhile to attempt to take Linux
> kernel release 5.8.16, revert the aforementioned commit 54b2fcee1db0
> ("nvme-pci: remove last_sq_tail") and give the resulting kernel a try on your
> system?
...

Hi, Brandon,

I made a build, which is based on the official kernel 5.8.16 with the commit 54b2fcee1db0 reverted:

https://copr.fedorainfracloud.org/coprs/pany/kernel-macbook/build/1727741/

You can easily try this:

$ sudo dnf copr enable pany/kernel-macbook
$ sudo dnf makecache
$ sudo dnf --disablerepo=* --enablerepo= copr:copr.fedorainfracloud.org:pany:kernel-macbook install kernel-5.8.16-200.revert54b2fcee1db0.fc32.rpm

(In reply to Pany from comment #23)
> (In reply to M. Vefa Bicakci from comment #22)
> ...
> > Given the above discussion, it might be worthwhile to attempt to take Linux
> > kernel release 5.8.16, revert the aforementioned commit 54b2fcee1db0
> > ("nvme-pci: remove last_sq_tail") and give the resulting kernel a try on your
> > system?
> ...
>
> Hi, Brandon,
>
> I made a build, which is based on the official kernel 5.8.16 with the commit
> 54b2fcee1db0 reverted:
>
> https://copr.fedorainfracloud.org/coprs/pany/kernel-macbook/build/1727741/
>
> You can easily try this:
>
> $ sudo dnf copr enable pany/kernel-macbook
> $ sudo dnf makecache
> $ sudo dnf --disablerepo=* --enablerepo=
> copr:copr.fedorainfracloud.org:pany:kernel-macbook install
> kernel-5.8.16-200.revert54b2fcee1db0.fc32.rpm

Sorry for the typo, the last line should be:

$ sudo dnf --disablerepo=* --enablerepo=copr:copr.fedorainfracloud.org:pany:kernel-macbook install kernel-5.8.16-200.revert54b2fcee1db0.fc32

Created attachment 1725199 [details]
dmesg successful boot on reverted 5.8.16
Hi Pany, I actually cloned the Fedora git repo for the kernel based on the instructions here: https://fedoraproject.org/wiki/Building_a_custom_kernel and did a git revert [commit-hash] of the commit you suspected was the issue after checking out 5.8.16. I allowed the kernel and modules to compile, and I am very happy to report that the kernel I built using your suggestion booted successfully to the GNOME desktop! I uploaded the dmesg output in case it contains any useful information.
I did this before seeing the build you created. Thank you for doing that, I will try out your build as well and report back. I'm fairly certain I built things correctly, but I am not as experienced with git and kernel compiling as I would like to be, so it will be good to get a second confirmation.

Pany, I downloaded and installed your kernel build and am happy to report I successfully booted in to your build and am typing this on that build right now! :)

(In reply to Brandon Jones from comment #26)
> Hi Pany, I actually cloned the fedora git repo for the kernel based on the
> instructions here: https://fedoraproject.org/wiki/Building_a_custom_kernel
> and did a git revert [commit-hash] of the commit you suspected was the issue
> after checking out 5.8.16. I allowed the kernel and modules to compile, and
> I am very happy to report that the kernel I built using your suggestion
> booted successfully to the GNOME desktop! I uploaded the dmesg output in
> case it contains any useful information.
...

Glad to hear that!

To clarify, M. Vefa Bicakci helped by looking through the journal and gave the suggestion at comment #22, so thanks to Vefa.

It turns out that Vefa was right: the commit 54b2fcee1db0 ("nvme-pci: remove last_sq_tail") was the root cause. Maybe you want to report this issue to the upstream linux-nvme mailing list: http://lists.infradead.org/mailman/listinfo/linux-nvme

My apologies, I did not intend to mis-credit who identified the issue. Thank you! This will be my first time emailing a kernel mailing list, but I will link the conversation here and do my best to provide useful information for the kernel devs.

Here is a link to the mail thread. I will keep this thread posted if I get any useful updates. http://lists.infradead.org/pipermail/linux-nvme/2020-October/020569.html

The fix for this issue will be added in 5.10: http://lists.infradead.org/pipermail/linux-nvme/2020-November/020580.html

I can confirm that this issue is fixed in the available kernel release 5.9.10 in the Fedora repos. Thank you to all who helped! :)

This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'.
Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 32 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.
Thank you for reporting this bug and we are sorry it could not be fixed.