Bug 2048093 - Computer always crashes because of regression with AMDGPU since Kernel 5.16.3
Summary: Computer always crashes because of regression with AMDGPU since Kernel 5.16.3
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 35
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-01-29 12:19 UTC by Marcel
Modified: 2022-04-21 12:45 UTC (History)
CC List: 22 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-15 19:29:19 UTC
Type: Bug
Embargoed:


Attachments
Kernel 5.16.3 dmesg of the failed boot attempt (96.28 KB, text/plain)
2022-01-29 12:19 UTC, Marcel
Kernel 5.16.4 dmesg of the failed boot attempt (94.03 KB, text/plain)
2022-01-30 22:14 UTC, Marcel
Kernel 5.16.5 dmesg of the failed boot attempt (97.02 KB, text/plain)
2022-02-02 17:40 UTC, Marcel
Kernel 5.17.0-rc2 dmesg of the failed boot attempt (95.97 KB, text/plain)
2022-02-05 09:14 UTC, Marcel
Kernel 5.17.0-rc4 dmesg of the failed boot attempt (94.83 KB, text/plain)
2022-02-16 09:54 UTC, Marcel

Description Marcel 2022-01-29 12:19:29 UTC
Created attachment 1857568 [details]
Kernel 5.16.3 dmesg of the failed boot attempt

1. Please describe the problem:
Computer crashes ~2 seconds after login every time, 100% reproducible.
I am using the AMD Radeon RX 6500 XT as GPU, which was just released last week.

2. What is the Version-Release number of the kernel:
Defect with 5.16.3
Works with 5.16.2

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
5.16.3

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
- Boot the computer
- Login (I am using KDE Plasma 5.23.5)


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
-

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
File is attached and shows some errors about amdgpu

Comment 1 Marcel 2022-01-30 22:14:38 UTC
Created attachment 1857907 [details]
Kernel 5.16.4 dmesg of the failed boot attempt

Just tested kernel 5.16.4; I still can't get the computer to boot properly. But I think the error changed: "[drm:amdgpu_dm_init.isra.0.cold [amdgpu]] *ERROR* Failed to register vline0 irq 30!"

Comment 2 Marcel 2022-02-02 17:40:28 UTC
Created attachment 1858728 [details]
Kernel 5.16.5 dmesg of the failed boot attempt

For Kernel 5.16.5 the error slightly changed again:
1. error message: "Feb 02 18:29:36 kernel: __common_interrupt: 8.55 No irq handler for vector"
2. error message: "Feb 02 18:29:39 kernel: [drm:amdgpu_dm_init.isra.0.cold [amdgpu]] *ERROR* Failed to register vline0 irq 30!"
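
For anyone comparing the attached logs, the amdgpu error quoted above can be pulled out of a saved dmesg.txt with a small filter. This is only a minimal sketch: the function name and the regular expression are assumptions based on the one message format shown in this report, not something taken from the kernel or from Fedora tooling.

```python
import re

# Assumed pattern, derived from the single message quoted in this report:
# "[drm:<function> [amdgpu]] *ERROR* <message>"
AMDGPU_ERROR = re.compile(r"\[drm:.*\[amdgpu\]\]\s+\*ERROR\*\s+(?P<msg>.*)")


def amdgpu_errors(lines):
    """Return the message text of every amdgpu *ERROR* line (hypothetical helper)."""
    found = []
    for line in lines:
        match = AMDGPU_ERROR.search(line)
        if match:
            found.append(match.group("msg").strip())
    return found
```

Passing an open file works because the function only iterates over lines, for example `amdgpu_errors(open("dmesg.txt", errors="replace"))`. Lines such as the `__common_interrupt` message are not matched, since they do not come from the amdgpu driver.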

Comment 3 Justin M. Forbes 2022-02-04 23:08:14 UTC
Does this work with 5.17-rc2 from rawhide? Just curious if the patch added to 5.16.3 was incomplete or if it was a bad patch altogether. Unfortunately, 5.16.3 was over 1000 patches, 29 of those being AMD-specific, so I am trying to narrow it down a bit. I am not seeing it on my rx580 system, which is the only AMD card I have at the moment.

Comment 4 Marcel 2022-02-05 09:14:16 UTC
Created attachment 1859208 [details]
Kernel 5.17.0-rc2 dmesg of the failed boot attempt

Just tried to boot with kernel-5.17.0-0.rc2.83.fc36.x86_64 , but my system is still crashing.

Comment 5 David Yang 2022-02-07 11:20:02 UTC
Unsure if strictly related, but also encountering a crash regression on boot. My gap was larger, 5.15.16 to 5.15.5. From fetch:

CPU: AMD Ryzen 5 5600H with Radeon Graphics @ 12x 3.3GHz
GPU: NVIDIA GeForce RTX 3060 Laptop GPU

I came searching after the first reboot into the previously installed kernel to see if anyone else had the same issue. I can try to check for an error message or try an rc kernel on future restarts.

Comment 6 David Yang 2022-02-08 15:58:16 UTC
Retracting my last comment. I took the opportunity to capture a boot log on a restart today, and my issue was merely that more waiting was needed at boot time; the machine only appeared to be hung when the spinning circle stopped.

Comment 7 Justin M. Forbes 2022-02-08 17:03:16 UTC
Want to give https://koji.fedoraproject.org/koji/taskinfo?taskID=82513039 a try? Being a test kernel, it is not Secure Boot signed, but it might help.

Comment 8 Marcel 2022-02-08 21:52:27 UTC
Tried Kernel 5.16.8rc2, but the computer still crashes a few seconds after login.

Comment 9 Ian Kumlien 2022-02-09 11:02:32 UTC
I have a similar issue with a work laptop. Try booting with ACPI off (acpi=off on the kernel command line).

I suspect that a BIOS update should solve this...

Comment 10 Marcel 2022-02-09 16:00:57 UTC
Booting with acpi=off did not solve the crash.
Did the crash you describe happen inside the AMDGPU driver code or was it a different error message?

I think I'll try to compile the kernel "commit for commit" myself and check which change introduced the regression. I guess this is currently the best way to find the bug (?).

Comment 11 Ian Kumlien 2022-02-09 21:16:25 UTC
No, I only have one machine with an amdgpu and it works just fine with mesa...

The symptoms are similar to my work laptop (Intel, HP laptop) that freezes at login (basically a deadlock) when booted with newer kernels... But it's ACPI-related in that case.

Comment 12 Marcel 2022-02-10 08:15:43 UTC
But then your crash has nothing to do with this bug report.
Please file another bug and only post here if it's related to the AMDGPU driver that throws that message "[drm:amdgpu_dm_init.isra.0.cold [amdgpu]] *ERROR* Failed to register vline0 irq 30!".

Comment 13 Marcel 2022-02-16 09:54:59 UTC
Created attachment 1861436 [details]
Kernel 5.17.0-rc4 dmesg of the failed boot attempt

Kernel 5.17.0-rc4 crashes while producing (probably) the most detailed dmesg output (the last 8 lines)

Comment 14 Marcel 2022-03-04 15:00:26 UTC
It is commit 620c32a9af98fec55f9f22e2dbeff10824c909e6 "drm/amdgpu/display: set vblank_disable_immediate for DC".
When I compile the Kernel using 2f0fd2f941e88cfb56f1aa5e5ec7bd396576d2f3, everything works fine, but one commit later (620c32a9af98fec55f9f22e2dbeff10824c909e6) I can't boot anymore.

This is all I can do for now, hope this helps to find/fix the bug.
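
For readers repeating this kind of search: testing commit by commit works, but a binary search over the commit range needs far fewer builds, and `git bisect` automates exactly that. The idea can be sketched as below. It assumes the commit list is ordered oldest to newest, that the first entry is known good and the last known bad, and that every commit after the regression is also bad; `first_bad_commit` and the `is_good` callback are hypothetical names for illustration only.

```python
def first_bad_commit(commits, is_good):
    """Binary-search an ordered commit list for the first bad commit.

    commits: oldest to newest; commits[0] is known good, commits[-1] known bad.
    is_good: hypothetical callback that builds and boots one commit and
             returns True if the system survives login.
    """
    low, high = 0, len(commits) - 1  # invariant: low is good, high is bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(commits[mid]):
            low = mid   # the regression was introduced after mid
        else:
            high = mid  # the regression is at mid or earlier
    return commits[high]
```

With the two commits named above as the whole range, the function returns 620c32a9af98 without calling `is_good` at all, since the endpoints are already known. For a range of about 1000 patches, roughly ten build-and-boot cycles are enough.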

Comment 15 Marcel 2022-03-15 19:29:19 UTC
As this is a problem with the upstream kernel, I am closing this one and will create an issue there.

Comment 16 Marcel 2022-04-21 12:45:05 UTC
Just for reference:
Here is the bug report on AMD's side: https://gitlab.freedesktop.org/drm/amd/-/issues/1933

