Bug 2111555

Summary: Kernels newer than kernel-5.18.13-200.fc36 stop output if you use LUKS on the root partition with NVIDIA drivers
Product: [Fedora] Fedora
Reporter: Joe Doss <joe>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED EOL
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 36
CC: acaringi, adscvr, airlied, alciregi, bskeggs, dev, hdegoede, homann.philipp, hpa, jarodwilson, jglisse, jonathan, josef, kernel-maint, lgoncalv, linville, masami256, mchehab, ptalbert, steved
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-05-25 17:10:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                              Flags
Hung boot of kernel-5.18.13-200.fc36     none
LUKS password prompt over IPMI SOL       none

Description Joe Doss 2022-07-27 13:23:14 UTC
Created attachment 1899661
Hung boot of kernel-5.18.13-200.fc36

1. Please describe the problem:

Booting kernel-5.18.13-200.fc36 hangs. It always seems to hang right after the nvme modules initialize in the initramfs. See the attached screenshot.

2. What is the Version-Release number of the kernel:

kernel-5.18.13-200.fc36

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

I can boot 5.18.11-200.fc36.x86_64 just fine.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Yes. I try to boot kernel-5.18.13-200.fc36 with the following configuration.

* AMD Ryzen Threadripper PRO 3975WX
* NVIDIA GeForce GTX 1080 Ti with kmod-nvidia-5.18.13-200.fc36.x86_64-515.57-1.fc36.x86_64
* LUKS encrypted root partition on a Samsung SSD 980 PRO 1TB NVMe drive


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Unknown. I will try this out with kernel-5.19.0-0.rc7.20220722git68e77ffbfd06.56.fc37.x86_64.rpm and report back.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

kmod-nvidia-5.18.13-200.fc36.x86_64-515.57-1.fc36.x86_64

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Since I can't fully boot with kernel-5.18.13-200.fc36, I am not sure I can get any logs.

Thanks!
Joe

Comment 1 Joe Doss 2022-07-27 13:51:54 UTC
kernel-5.19.0-0.rc7.20220722git68e77ffbfd06.56.fc37.x86_64.rpm fails to boot for me; it just hangs at the "EFI stub: UEFI Secure Boot is enabled." message.

Comment 2 Joe Doss 2022-08-03 03:12:50 UTC
Same results with kernel-5.18.15-200.fc36. https://bbs.archlinux.org/viewtopic.php?id=278535 might be related to this issue. However, setting spectre_v2=off or retbleed=off, as detailed in the Arch thread, does not seem to help.
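
For anyone retrying the mitigation switches from the Arch thread, it is worth confirming that the parameter really reached the kernel before concluding it has no effect. The sketch below checks the kernel command line for a given parameter. The helper name has_karg is made up for illustration. The grubby call is shown only as a comment, because it changes the boot configuration and needs root.

```shell
#!/bin/sh
# has_karg PARAM [CMDLINE]
# Succeeds if PARAM appears as a whole word on the kernel command line.
# CMDLINE defaults to the running kernel's /proc/cmdline.
# (has_karg is a hypothetical helper, not an existing tool.)
has_karg() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *)            return 1 ;;
    esac
}

# To add a parameter to every installed kernel on Fedora (needs root):
#   sudo grubby --update-kernel=ALL --args="retbleed=off"

# Report whether the two switches tried in this bug are active right now.
for p in spectre_v2=off retbleed=off; do
    if has_karg "$p"; then
        echo "$p is set on this boot"
    else
        echo "$p is NOT set on this boot"
    fi
done
```

If a parameter shows as not set after editing the boot entry, the test was run against the wrong entry and says nothing about the hang itself.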

Comment 3 Joe Doss 2022-08-03 23:56:32 UTC
I am still seeing the same hang with 5.18.16-200.fc36.

Comment 4 homann.philipp 2022-08-04 12:51:21 UTC
Hi Joe,
I'm also experiencing boot issues with the latest kernel on a UEFI system.
I also filed a bug for that: https://bugzilla.redhat.com/show_bug.cgi?id=2104883
Maybe you can comment there too, and hopefully we will catch the attention of the maintainers.

Comment 5 Joe Doss 2022-08-07 04:20:50 UTC
Hey Philipp,

My issue is on a workstation and yours is on a server, correct? I see that you were able to get some info out of a serial console. That is great. I wish I had a serial port on this workstation. I might be able to get something off its IPMI. If you have any pointers on setting that up, I'd love a shove in the right direction. It has been many years since I have had to do something like that.

I got some time to further debug this issue this weekend. I installed a fresh Fedora 36 to a secondary hard drive, updated it to 5.18.16-200.fc36, and it boots just fine. I then installed akmod-nvidia and rebooted, and the problem surfaced again on the fresh install. Arrg!

I then removed the NVIDIA drivers (sudo dnf remove xorg-x11-drv-nvidia\*) on my main install and tried booting 5.18.16-200.fc36, but the hang persists. I am unsure whether the NVIDIA akmod install leaves anything behind after removal. If anyone has any ideas on why this might be happening or other thoughts on this issue, please let me know.

Comment 6 dev 2022-08-08 17:33:11 UTC
I'm having the same issue on very similar hardware:

- AMD Epyc 7313
- Nvidia GTX 1080 Ti
- LUKS encrypted root partition

The issue also started with kernel-5.18.13-200.fc36 and persists through kernel 5.18.16 :(
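
Going only by the reports in this bug (5.18.11 boots, 5.18.13 and later hang), a quick way to tell whether a given kernel falls in the reported range is a version comparison. This is a minimal sketch that assumes GNU sort with -V support. The function name kernel_at_least is hypothetical, and the 5.18.13 threshold is just the first version reported here, not a confirmed regression point.

```shell
#!/bin/sh
# kernel_at_least MIN [VERSION]
# Succeeds if VERSION (default: the running kernel) is >= MIN.
# Only the upstream part is compared; the Fedora release suffix
# ("-200.fc36.x86_64") is dropped first.
# (kernel_at_least is a hypothetical helper; requires GNU sort -V.)
kernel_at_least() {
    min=$1
    ver=${2-$(uname -r)}
    ver=${ver%%-*}
    first=$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)
    [ "$first" = "$min" ]
}

# Kernel versions mentioned in this bug report.
for v in 5.18.11-200.fc36.x86_64 5.18.13-200.fc36 \
         5.18.16-200.fc36.x86_64 6.1.5-200.fc37.x86_64; do
    if kernel_at_least 5.18.13 "$v"; then
        echo "$v: in the range reported as affected"
    else
        echo "$v: older than the first kernel reported as affected"
    fi
done
```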

Comment 7 homann.philipp 2022-08-09 10:38:31 UTC
Do you also have the same stacktrace?

Comment 8 Joe Doss 2022-08-09 12:51:32 UTC
Created attachment 1904482
LUKS password prompt over IPMI SOL

I was able to get IPMI SOL working, and it shows that the system is prompting for my LUKS password, but the prompt is only visible over serial. Whatever redirects the prompt to plymouth has not been working since kernel-5.18.13-200.fc36.

Knowing it is prompting for my LUKS password, I was able to type it in and get myself booted into 5.18.16-200.fc36.x86_64:

# uname -a
Linux sw-0608 5.18.16-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Aug 3 15:44:49 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
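
Since the prompt only showed up on the serial side, it may help to check which consoles the kernel was told to use. When several console= parameters are given, the kernel uses the last one for /dev/console, which is where a text password prompt tends to land if the graphical one is not shown. The sketch below lists the console= entries from the command line. The function name list_consoles is made up for illustration, and nothing here is confirmed as the cause of this bug.

```shell
#!/bin/sh
# list_consoles [CMDLINE]
# Prints one console= target per line, in command-line order.
# CMDLINE defaults to the running kernel's /proc/cmdline.
# (list_consoles is a hypothetical helper, not an existing tool.)
list_consoles() {
    cmdline=${1-$(cat /proc/cmdline 2>/dev/null)}
    printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n 's/^console=//p'
}

# The last console= entry is the one backing /dev/console.
primary=$(list_consoles | tail -n 1)
if [ -n "$primary" ]; then
    echo "consoles: $(list_consoles | tr '\n' ' ')"
    echo "/dev/console maps to: $primary"
else
    echo "no console= parameter set; the kernel default is in use"
fi
```

If the last entry is a serial port such as ttyS1, listing console=tty0 last instead would be one thing to try, though that is only a guess for this setup.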

Comment 9 dev 2022-08-11 19:18:15 UTC
+1 -- thanks for this joe, exactly the same issue here.

of course it's nvidia. it's always nvidia...

went back to nouveau for now and it is working

Comment 10 dev 2022-08-11 19:19:35 UTC
joe, this is how i nuked the existing driver:

dnf remove xorg-x11-drv-nvidia\*
rm -f /usr/lib{,64}/libGL.so.* /usr/lib{,64}/libEGL.so.*
rm -f /usr/lib{,64}/xorg/modules/extensions/libglx.so
dnf reinstall xorg-x11-server-Xorg mesa-libGL mesa-libEGL libglvnd\*
mv /etc/X11/xorg.conf /etc/X11/xorg.conf.saved

Comment 11 Joe Doss 2022-11-06 14:01:36 UTC
This is still an issue in 6.0.6-300.fc37.x86_64.

Comment 12 Joe Doss 2023-01-14 01:15:15 UTC
This still happens on 6.1.5-200.fc37.x86_64. It is actually worse now. It shows zero output except for the first message about the EFI bootloader. If you blindly type in your LUKS password a few times, you can get things to boot.

Comment 13 dev 2023-01-24 05:12:38 UTC
fixed in 6.1.7 :)

Comment 14 dev 2023-01-24 17:16:59 UTC
^ assuming this is a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=2161104

Comment 15 Joe Doss 2023-01-24 18:24:29 UTC
This is not fixed for me in 6.1.7-200.fc37.x86_64 with NVIDIA drivers and LUKS. It is back to my original state as shown in https://bugzilla.redhat.com/attachment.cgi?id=1899661 :(

Comment 16 Ben Cotton 2023-04-25 17:40:06 UTC
This message is a reminder that Fedora Linux 36 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 36 on 2023-05-16.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '36'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version. Note that the version field may be hidden.
Click the "Show advanced fields" button if you do not see it.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 36 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 17 Ludek Smid 2023-05-25 17:10:01 UTC
Fedora Linux 36 entered end-of-life (EOL) status on 2023-05-16.

Fedora Linux 36 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora Linux
please feel free to reopen this bug against that version. Note that the version
field may be hidden. Click the "Show advanced fields" button if you do not see
the version field.

If you are unable to reopen this bug, please file a new report against an
active release.

Thank you for reporting this bug and we are sorry it could not be fixed.