Bug 1878596

Summary: Linux kernel 5.8.x fails to boot on 2015 MacBook Air
Product: [Fedora] Fedora Reporter: Brandon Jones <brandon.gustav>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED EOL QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: unspecified
Version: 32 CC: acaringi, airlied, brandon.gustav, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonah, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, mjg59, m.v.b, pany, steved
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: ---
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-25 17:28:54 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
dmesg output none
Emergency shell log none
dmesg successful boot on reverted 5.8.16 none

Description Brandon Jones 2020-09-14 06:00:59 UTC
Created attachment 1714722 [details]
dmesg output

1. Please describe the problem:
Kernels 5.8.x fail to boot on my MacBook Air (2015 model). The screen hangs with several failed attempts by systemd to start services. The boot sequence also shows complaints about the filesystem. Neither running fsck nor reformatting the drive with a different filesystem (btrfs instead of ext4) seems to help.

2. What is the Version-Release number of the kernel:
This issue has been observed in kernel versions 5.8.2, 5.8.4, 5.8.6, 5.8.7, 5.8.8, 5.8.9, 5.8.10, 5.8.11, 5.8.12, 5.8.14, 5.8.15, and 5.9.1. 

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Kernel versions 5.6.6 (the original kernel shipped on the installation ISO) through 5.7.17 continue to function and boot properly. The issue first occurred after updating to kernel 5.8.4.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

1. Install Fedora 32 on a 2015 MacBook Air or other affected system.
2. Run sudo dnf upgrade to pull in the newest kernel version.
3. Reboot the system and select the new kernel version from the GRUB boot loader prompt. 



5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

This issue occurs in the latest rawhide kernel.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
I am using broadcom-wl for wifi driver support, but the issue was observed prior to wireless driver installation with a wifi dongle supported by the vanilla install.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

The dmesg attached seems to truncate prior to the messages I observed on-screen. Photos of those messages can be seen on my post on the Fedora subreddit if they are helpful. I can also provide updated pictures if necessary. 
https://www.reddit.com/r/Fedora/comments/iophr8/kernel_584_and_586_do_not_boot_with_several/

Comment 1 Brandon Jones 2020-09-15 11:32:18 UTC
This issue is still present in kernel version 5.8.8.

Comment 2 Jonah Benton 2020-09-26 23:04:22 UTC
Am seeing similar issue. 2014 MBP boots and runs well with 5.6.6 and fails with 5.8.8 and 5.8.11. 

However, in this case, after grub kernel version selection, there is no further reported progress. No boot log text is displayed to the terminal and nothing about the boot is available in journalctl. While I'm waiting, the capslock key works and the machine is warm, so something seems to be happening, but nothing is logged for several minutes until I give up.

Comment 3 Jonah Benton 2020-09-26 23:17:03 UTC
This is also fedora 32. XFCE spin, just installed from USB drive, then installed wifi drivers against 5.6.6, and ran dnf update, which brought in the 5.8.11 kernel.

Comment 4 Jonah Benton 2020-09-26 23:32:47 UTC
Rebooted under 5.8.11 removing quiet from the kernel command line. Boot proceeds past Welcome to Fedora message, freezes at

fb0: switching to inteldrmfb from EFI VGA

Comment 5 Jonah Benton 2020-09-26 23:38:28 UTC
Following

https://bbs.archlinux.org/viewtopic.php?id=256520

Adding 

nomodeset

to the 5.8.11 command line allows the machine to complete boot.

Comment 6 Brandon Jones 2020-09-27 06:59:04 UTC
Adding nomodeset to my kernel boot options on 5.8.11 does not fix the issue for me. On my machine this does not seem to be an issue with the graphics chipset as referenced in the Arch forums, but an issue with the NVME controller, since I'm getting filesystem and NVME controller errors on boot with these kernel versions. My suspicion is that this is a regression of this issue: https://lore.kernel.org/linux-nvme/m3a707rd8y.fsf@web.de/

On earlier kernel versions I was previously unable to even install Linux on this laptop, because the kernel on the live image did not recognize the SSD. The kernel that ships with Debian Buster has this issue, and I suspect 5.8.x may have reintroduced an issue with certain SSDs.

Comment 7 Jonah Benton 2020-10-02 03:07:47 UTC
Shot in the dark, the above report doesn't mention the Macbook version- 

Before installing an NVMe drive in my 2014 MBP I had to do a Mac OS upgrade on the stock AHCI drive. I had been running 10.11 something, had to upgrade to 10.13, which when upgrading the OS on the AHCI drive also does a MBP firmware upgrade that allows the machine firmware to support NVMe. There seems to be no other way to upgrade the machine firmware, and if the machine was not running 10.13+, it will not be able to use NVMe drives. 

After the 10.13 upgrade, I removed the stock AHCI drive and installed a 3rd party NVMe drive, and as above am able to boot and run Fedora from NVMe. Perhaps that helps, ignore if not relevant.

Comment 8 Brandon Jones 2020-10-08 03:25:01 UTC
The above report says the model was a 2014 MacBook Pro. That particular model was experiencing video driver issues according to the linked discussion on the Arch forums. My 2015 Macbook Air (MacBook 7,1) is experiencing file system issues, none of which fsck seems to clear (running fsck on all partitions shows there are no file system issues). Running the integrated Apple Hardware test also shows no hardware issues.

In response to upgrading the NVME firmware, the last Mac OS versions on this laptop were 10.15 (Catalina) and the Big Sur beta. The firmware should be, according to what you are saying, up to date. In any case, this laptop shipped with an NVME drive, so the firmware should have been present and upgraded throughout its life as I upgraded Mac OS.

Comment 9 Brandon Jones 2020-10-19 02:32:44 UTC
So in my continued search for a solution to this issue, I ran across this on the Ask Fedora forum: https://ask.fedoraproject.org/t/f32-kernel-5-8-9-fails-to-boot-on-macbook-pro/9299/13

Same issue, same era machine. I'm not sure if this bug would be considered a duplicate of the one listed in the thread here: https://bugzilla.redhat.com/show_bug.cgi?id=1878347

It seems a potential fix was pushed in >5.8.14 according to this: https://marc.info/?l=linux-usb&m=160296664914184&w=2

I will continue keeping tabs on this.

Comment 10 Brandon Jones 2020-10-19 02:50:04 UTC
Just a quick follow up to my previous comment, this is not the same issue it would seem. My machine still fails to boot on 5.8.14 and 5.8.15.

Comment 11 M. Vefa Bicakci 2020-10-22 02:35:54 UTC
I think that this bug is a duplicate of bug 1878347.

Comment 12 Brandon Jones 2020-10-22 04:10:53 UTC
I've been paying attention to that bug, I will test out any possible solutions or patches there when they are reported to confirm if it is a duplicate.

Comment 13 Brandon Jones 2020-10-22 23:44:21 UTC
I took a look at the thread for the patch, and tried the workaround mentioned there: appending module_blacklist=apple_mfi_fastcharge to the boot options. It unfortunately does not allow me to boot on 5.8.15. Further, I get an "nvme nvme0: controller is down" before systemd tries to load services. I am still of the belief that this is an issue with the nvme controller and a possible regression of the previous issue I mentioned above and not related to the USB bug referenced.

Comment 14 Brandon Jones 2020-10-22 23:51:58 UTC
This bug I believe is related: https://bugzilla.redhat.com/show_bug.cgi?id=1601196

Comment 15 Pany 2020-10-23 04:25:41 UTC
(In reply to Brandon Jones from comment #0)
...
> 3. Did it work previously in Fedora? If so, what kernel version did the issue
>    *first* appear?  Old kernels are available for download at
>    https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
> 
> Kernel versions 5.6.6 (the original kernel shipped on the installation ISO)
> through 5.7.17 continue to function and boot properly. The issue first
> occurred after updating to kernel 5.8.4.

Hi Brandon, did you try kernel 5.8.2 [1] or so? Maybe it will narrow the range of the first appearance of this issue.

[1] https://koji.fedoraproject.org/koji/buildinfo?buildID=1595917

Comment 16 Brandon Jones 2020-10-24 12:03:57 UTC
(In reply to Pany from comment #15)
> (In reply to Brandon Jones from comment #0)
> ...
> > 3. Did it work previously in Fedora? If so, what kernel version did the issue
> >    *first* appear?  Old kernels are available for download at
> >    https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
> > 
> > Kernel versions 5.6.6 (the original kernel shipped on the installation ISO)
> > through 5.7.17 continue to function and boot properly. The issue first
> > occurred after updating to kernel 5.8.4.
> 
> Hi Brandon, did you try kernel 5.8.2 [1] or so? Maybe it will narrow the
> range of the first appearance of this issue.
> 
> [1] https://koji.fedoraproject.org/koji/buildinfo?buildID=1595917

I installed 5.8.2 and I also observed the same issue with that version. 5.7.17 seems to be the most recent version that is bootable on my machine.

Comment 17 M. Vefa Bicakci 2020-10-24 15:25:47 UTC
Brandon, thanks for ruling out the potential of this being a duplicate of bug 1878347, and my apologies for making a premature statement.

Comment 18 Brandon Jones 2020-10-25 03:52:45 UTC
No apology necessary. We are just trying to narrow down what the issue may be.

I will try the patched version in the kernel mail list if it becomes available on koji (I'm not sure how I would try the patched version otherwise). 

I've also tried seeing if 5.9.1 would boot, and it did not, so it does seem that this issue first appeared in the 5.8 series. I recall that reporting on this kernel version mentioned changes to the ext4 driver or implementation so, with a grain of speculation, this may be related to those changes.

Comment 19 Brandon Jones 2020-10-25 03:55:53 UTC
I also wanted to add, I did attempt to check the smart status of the SSD via the Drives app in GNOME, but the option was greyed out both when booting from the SSD and from a live USB. However, as previously mentioned, Apple Hardware test does not show any issues with the laptop.

Comment 20 Brandon Jones 2020-10-28 23:23:58 UTC
Created attachment 1724956 [details]
Emergency shell log

Error log generated before dropping in to root emergency shell.

Comment 21 Brandon Jones 2020-10-28 23:28:12 UTC
I've attached a log generated from the patched kernel provided in https://bugzilla.redhat.com/show_bug.cgi?id=1878347. This log finally shows what the dmesg was not able to show since it truncated before the issue. It seems fsck fails at the end of the log and the boot stops since the kernel cannot mount the filesystems. Running fsck from a live USB however shows no errors to correct on all partitions. Hopefully this provides more needed information about this issue.

Comment 22 M. Vefa Bicakci 2020-10-29 07:19:51 UTC
Hello Brandon,

I had a look at the logs, and one thing that caught my eye is that the NVMe controller is initialized successfully, but then "something" happens, which causes the controller to stop responding, resulting in the "controller is down; will reset" message after about 30 seconds:

[    3.071646] f-air kernel: nvme nvme0: pci function 0000:04:00.0
[    3.072065] f-air kernel: nvme nvme0: detected Apple NVMe controller, set queue depth=2 to work around controller resets
[    3.073162] f-air kernel: nvme nvme0: 1/0/0 default/read/poll queues
[    3.082290] f-air kernel:  nvme0n1: p1 p2 p3 p4
...
[   34.890739] f-air kernel: nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
[   34.917820] f-air kernel: nvme nvme0: detected Apple NVMe controller, set queue depth=2 to work around controller resets
[   50.419835] f-air kernel: nvme nvme0: Device not ready; aborting reset, CSTS=0x3
[   50.419839] f-air kernel: nvme nvme0: Removing after probe failure status: -19
[   50.425822] f-air kernel: blk_update_request: I/O error, dev nvme0n1, sector 161476608 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0

I also noticed that the fact that this is an Apple NVMe controller causes the queue depth to be set to two (2), so this appears to be an Apple NVMe controller-specific edge case.

Here is the relevant part of the code printing out this message. Note that the queue depth variable's name is dev->q_depth, and it is set to the value two (2):

=== drivers/nvme/host/pci.c ===
	/*
	 * Temporary fix for the Apple controller found in the MacBook8,1 and
	 * some MacBook7,1 to avoid controller resets and data loss.
	 */
	if (pdev->vendor == PCI_VENDOR_ID_APPLE && pdev->device == 0x2001) {
		dev->q_depth = 2;
		dev_warn(dev->ctrl.device, "detected Apple NVMe controller, "
			"set queue depth=%u to work around controller resets\n",
			dev->q_depth);
=== drivers/nvme/host/pci.c ===

Given that I do not have the hardware to bisect this issue, I quickly/casually went through all of the commits touching two relevant files between v5.7 (assumed good based on your report) and v5.8 (assumed broken, also based on your report) using the following command:
  git log --full-diff  -p --no-merges  v5.7..v5.8 -- drivers/nvme/host/pci.c drivers/nvme/host/core.c

One interesting commit that caught my eye is the following one, but please note that I am not pointing fingers here. I am only making a speculation based on the available data.

===
commit 54b2fcee1db041a83b52b51752dade6090cf952f
Author: Keith Busch <kbusch>
Date:   Mon Apr 27 11:54:46 2020 -0700

    nvme-pci: remove last_sq_tail
    
    The nvme driver does not have enough tags to wrap the queue, and blk-mq
    will no longer call commit_rqs() when there are no new submissions to
    notify.
    
    Signed-off-by: Keith Busch <kbusch>
    Reviewed-by: Sagi Grimberg <sagi>
    Signed-off-by: Christoph Hellwig <hch>
    Signed-off-by: Jens Axboe <axboe>
===

The main/interesting hunk in this patch is as follows:

===
@@ -446,24 +445,11 @@ static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
 	return 0;
 }
 
-/*
- * Write sq tail if we are asked to, or if the next command would wrap.
- */
-static inline void nvme_write_sq_db(struct nvme_queue *nvmeq, bool write_sq)
+static inline void nvme_write_sq_db(struct nvme_queue *nvmeq)
 {
-	if (!write_sq) {
-		u16 next_tail = nvmeq->sq_tail + 1;
-
-		if (next_tail == nvmeq->q_depth)
-			next_tail = 0;
-		if (next_tail != nvmeq->last_sq_tail)
-			return;
-	}
-
 	if (nvme_dbbuf_update_and_check_event(nvmeq->sq_tail,
 			nvmeq->dbbuf_sq_db, nvmeq->dbbuf_sq_ei))
 		writel(nvmeq->sq_tail, nvmeq->q_db);
-	nvmeq->last_sq_tail = nvmeq->sq_tail;
 }
===

As can be seen in the patch hunk quoted above, the variable next_tail is derived from the submission queue tail position, and then next_tail is compared against the q_depth variable to determine whether a wraparound has occurred in the submission queue.

The interesting bit of information is that q_depth is set to 2 with the Apple NVMe controller on your system, so I am guessing that the wrap-around is much more likely to occur with the NVMe controller in your laptop. However, from what I can tell, this commit removed the wrap-around check.

Given the above discussion, it might be worthwhile to attempt to take Linux kernel release 5.8.16, revert the aforementioned commit 54b2fcee1db0 ("nvme-pci: remove last_sq_tail") and give the resulting kernel a try on your system?

Once again, I do not know if this will help, and this is a guess/speculation based on casually looking at the code.

If this suggestion does not help, then another option would be to use git bisect as follows and then build and test the kernel at each step, which admittedly will take some time:

  git bisect start v5.8 v5.7 -- drivers/nvme/host/

For this command, I specified the drivers/nvme/host/ directory instead of only the two files I had used with the git log command above. This is so that the problematic commit will be found even if my initial hunch, that pci.c and core.c are the only relevant files changed between v5.7..v5.8, is incorrect.

I hope that this will help in some way!

Comment 23 Pany 2020-10-30 00:32:51 UTC
(In reply to M. Vefa Bicakci from comment #22)
...
> Given the above discussion, it might be worthwhile to attempt to take Linux
> kernel release 5.8.16, revert the aforementioned commit 54b2fcee1db0
> ("nvme-pci: remove last_sq_tail") and give the resulting kernel a try on your
> system?
...

Hi, Brandon,

I made a build, which is based on the official kernel 5.8.16 with the commit 54b2fcee1db0 reverted:

https://copr.fedorainfracloud.org/coprs/pany/kernel-macbook/build/1727741/

You can easily try this:

$ sudo dnf copr enable pany/kernel-macbook
$ sudo dnf makecache
$ sudo dnf --disablerepo=* --enablerepo= copr:copr.fedorainfracloud.org:pany:kernel-macbook install kernel-5.8.16-200.revert54b2fcee1db0.fc32.rpm

Comment 24 Pany 2020-10-30 01:33:32 UTC
(In reply to Pany from comment #23)
> (In reply to M. Vefa Bicakci from comment #22)
> ...
> > Given the above discussion, it might be worthwhile to attempt to take Linux
> kernel release 5.8.16, revert the aforementioned commit 54b2fcee1db0
> ("nvme-pci: remove last_sq_tail") and give the resulting kernel a try on your
> system?
> ...
> 
> Hi, Brandon,
> 
> I made a build, which is based on the official kernel 5.8.16 with the commit
> 54b2fcee1db0 reverted:
> 
> https://copr.fedorainfracloud.org/coprs/pany/kernel-macbook/build/1727741/
> 
> You can easily try this:
> 
> $ sudo dnf copr enable pany/kernel-macbook
> $ sudo dnf makecache
> $ sudo dnf --disablerepo=* --enablerepo=
> copr:copr.fedorainfracloud.org:pany:kernel-macbook install
> kernel-5.8.16-200.revert54b2fcee1db0.fc32.rpm

Sorry for the typo, the last line should be:

$ sudo dnf --disablerepo=* --enablerepo=copr:copr.fedorainfracloud.org:pany:kernel-macbook install kernel-5.8.16-200.revert54b2fcee1db0.fc32

Comment 25 Brandon Jones 2020-10-30 02:06:21 UTC
Created attachment 1725199 [details]
dmesg successful boot on reverted 5.8.16

Comment 26 Brandon Jones 2020-10-30 02:14:25 UTC
Hi Pany, I actually cloned the fedora git repo for the kernel based on the instructions here: https://fedoraproject.org/wiki/Building_a_custom_kernel
and did a git revert [commit-hash] of the commit you suspected was the issue after checking out 5.8.16. I allowed the kernel and modules to compile, and I am very happy to report that the kernel I built using your suggestion booted successfully to the GNOME desktop! I uploaded the dmesg output in case it contains any useful information.

I did this before seeing the build you created. Thank you for doing that, I will try out your build as well and report back. I'm fairly certain I built things correctly, but I am not as experienced with git and kernel compiling as I would like to be, so it will be good to get a second confirmation.

Comment 27 Brandon Jones 2020-10-30 02:36:17 UTC
Pany, I downloaded and installed your kernel build and am happy to report I successfully booted in to your build and am typing this on that build right now! :)

Comment 28 Pany 2020-10-30 03:44:58 UTC
(In reply to Brandon Jones from comment #26)
> Hi Pany, I actually cloned the fedora git repo for the kernel based on the
> instructions here: https://fedoraproject.org/wiki/Building_a_custom_kernel
> and did a git revert [commit-hash] of the commit you suspected was the issue
> after checking out 5.8.16. I allowed the kernel and modules to compile, and
> I am very happy to report that the kernel I built using your suggestion
> booted successfully to the GNOME desktop! I uploaded the dmesg output in
> case it contains any useful information.
...

Glad to hear that! To clarify, M. Vefa Bicakci helped by looking through the journal and gave the suggestion at comment #22, so thanks to Vefa.

It turns out that Vefa was right, the commit 54b2fcee1db0 ("nvme-pci: remove last_sq_tail") was the root cause.

Maybe you want to report this issue to the upstream linux-nvme mailing list:

http://lists.infradead.org/mailman/listinfo/linux-nvme

Comment 29 Brandon Jones 2020-10-30 04:03:44 UTC
My apologies, I did not intend to mis-credit who identified the issue. Thank you!

This will be my first time emailing a kernel mailing list, but I will link the conversation here and do my best to provide useful information for the kernel devs.

Comment 30 Brandon Jones 2020-10-30 04:35:11 UTC
Here is a link to the mail thread. I will keep this thread posted if I get any useful updates. 

http://lists.infradead.org/pipermail/linux-nvme/2020-October/020569.html

Comment 31 Brandon Jones 2020-11-03 09:28:53 UTC
The fix for this issue will be added in 5.10: http://lists.infradead.org/pipermail/linux-nvme/2020-November/020580.html

Comment 32 Brandon Jones 2020-11-27 01:59:50 UTC
I can confirm that this issue is fixed in the available kernel release 5.9.10 in the Fedora repos. Thank you to all who helped! :)

Comment 33 Fedora Program Management 2021-04-29 16:56:56 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 34 Ben Cotton 2021-05-25 17:28:54 UTC
Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.