Bug 1956571 - Since Kernel 5.11, AMD gpu driver crashes on occasion. (No suspend involved. No power management at all)
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-amdgpu
Version: 34
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Christopher Atherton
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-03 23:42 UTC by Yasuo Ohgaki
Modified: 2021-05-03 23:42 UTC

Fixed In Version:
Doc Type: ---
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug


Attachments
"journalctl -b -1 -t kernel " (86.91 KB, application/octet-stream)
2021-05-03 23:42 UTC, Yasuo Ohgaki

Description Yasuo Ohgaki 2021-05-03 23:42:53 UTC
Created attachment 1779159
"journalctl -b -1 -t kernel "

Description of problem:

Since kernel 5.11 on Fedora 33, the kernel occasionally crashes at

RIP: 0010:kernel_queue_uninit+0xd/0xe0 [amdgpu]

No suspend or power management is in use. (This PC is always powered on.)
It works for a few hours, then crashes. I'm not sure what triggers the crash.
The same problem exists in Fedora 34.

-------------------------
 May 02 17:44:01 localhost kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
 May 02 17:44:12 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 May 02 17:44:15 localhost kernel: amdgpu: 
                                             failed to send message 252 ret is 0 
 May 02 17:44:20 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 May 02 17:44:23 localhost kernel: amdgpu: 
                                             failed to send message 253 ret is 0 
 May 02 17:44:28 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 May 02 17:44:31 localhost kernel: amdgpu: 
                                             failed to send message 250 ret is 0 
 May 02 17:44:36 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 May 02 17:44:39 localhost kernel: amdgpu: 
                                             failed to send message 251 ret is 0 
 May 02 17:44:44 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 May 02 17:44:47 localhost kernel: amdgpu: 
                                             failed to send message 254 ret is 0 
 May 02 17:44:49 localhost kernel: amdgpu: SMU load firmware failed
 May 02 17:44:49 localhost kernel: amdgpu: fw load failed
 May 02 17:44:49 localhost kernel: amdgpu: smu firmware loading failed
 May 02 17:44:49 localhost kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
 May 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 May 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 May 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 May 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 May 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 May 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 May 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 May 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 May 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 May 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 May 02 17:44:49 localhost kernel: snd_hda_intel 0000:01:00.1: CORB reset timeout#1, CORBRP = 0
 May 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 May 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 May 02 17:44:56 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered blocking state
 May 02 17:44:56 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered disabled state
 May 02 17:44:56 localhost kernel: device veth98f0b5c entered promiscuous mode
 May 02 17:44:57 localhost kernel: eth0: renamed from veth941cf4a
 May 02 17:44:57 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth98f0b5c: link becomes ready
 May 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered blocking state
 May 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered forwarding state
 May 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered disabled state
 May 02 17:44:57 localhost kernel: veth941cf4a: renamed from eth0
 May 02 17:44:57 localhost kernel: userif-2: sent link down event.
 May 02 17:44:57 localhost kernel: userif-2: sent link up event.
 May 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered disabled state
 May 02 17:44:57 localhost kernel: device veth98f0b5c left promiscuous mode
 May 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered disabled state
 May 02 17:44:57 localhost kernel: userif-2: sent link down event.
 May 02 17:44:57 localhost kernel: userif-2: sent link up event.
 May 02 17:45:00 localhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=345369, emitted seq=345371
 May 02 17:45:00 localhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
 May 02 17:45:00 localhost kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
 May 02 17:45:00 localhost kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038
-----------------------------
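
For anyone comparing their own log against the excerpt above, the following is a minimal sketch (not a diagnostic tool) that counts how often each failure message from this log appears in a saved "journalctl -b -1 -t kernel" dump. The pattern names and the summarize() helper are made up for illustration; the matched strings are copied from the excerpt.

```python
import re
from collections import Counter

# Message fragments copied from the log excerpt in this report.
# The dictionary keys are hypothetical labels chosen for this sketch.
PATTERNS = {
    "smu_message_failed": re.compile(r"failed to send message \d+ ret is 0"),
    "smu_firmware_load_failed": re.compile(r"SMU load firmware failed"),
    "ip_resume_failed": re.compile(r"amdgpu_device_ip_resume failed"),
    "cs_ioctl_error": re.compile(r"amdgpu_cs_ioctl .*Failed to process the buffer list"),
    "ring_timeout": re.compile(r"ring \w+ timeout"),
    "gpu_reset_begin": re.compile(r"GPU reset begin"),
    "null_pointer_deref": re.compile(r"BUG: kernel NULL pointer dereference"),
}


def summarize(lines):
    """Count the lines that match each known failure pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Assuming the journal output was saved to a file (the name kernel.log is only an example), summarize(open("kernel.log", encoding="utf-8")) returns the per-pattern counts. Matching is done on the message text only, so the timestamp locale does not matter.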

It appears to be a NULL pointer dereference bug.
I did not have this issue with kernel 5.10 or earlier.


Version-Release number of selected component (if applicable):

All kernel 5.11 versions I've tried crash.

    kernel-5.11.16-300.fc34.x86_64
    kernel-5.11.16-200.fc33.x86_64
    kernel-5.11.15-200.fc33.x86_64
    kernel-5.11.14-200.fc33.x86_64
    kernel-5.11.12-200.fc33.x86_64
    kernel-5.11.10-200.fc33.x86_64
    kernel-5.11.7-200.fc33.x86_64
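
As a rough aid for checking whether an installed kernel falls in the same series as the packages listed above, here is a small sketch that parses the package names. The helper names kernel_version() and in_reported_series() are hypothetical, and the check only reflects what this report observed (crashes on 5.11.x, none on 5.10 or earlier); it says nothing about later kernels.

```python
import re

# Package names copied from the list in this report.
KERNELS = [
    "kernel-5.11.16-300.fc34.x86_64",
    "kernel-5.11.16-200.fc33.x86_64",
    "kernel-5.11.15-200.fc33.x86_64",
    "kernel-5.11.14-200.fc33.x86_64",
    "kernel-5.11.12-200.fc33.x86_64",
    "kernel-5.11.10-200.fc33.x86_64",
    "kernel-5.11.7-200.fc33.x86_64",
]


def kernel_version(package):
    """Return (major, minor, patch) from a name like kernel-5.11.16-300.fc34.x86_64."""
    match = re.match(r"kernel-(\d+)\.(\d+)\.(\d+)-", package)
    if match is None:
        raise ValueError(f"unrecognized kernel package name: {package!r}")
    return tuple(int(part) for part in match.groups())


def in_reported_series(package):
    """True if the package belongs to the 5.11 series this report covers."""
    return kernel_version(package)[:2] == (5, 11)
```

Sorting the parsed tuples, for example sorted(kernel_version(k) for k in KERNELS), orders the versions numerically, which a plain string sort would not do (5.11.7 would sort after 5.11.16).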


How reproducible:

Not sure.

Additional info:

The tail of the "journalctl -b -1 -t kernel" output is attached.

Reported separately because this seems to be a different type of bug from those caused by suspend, such as
https://bugzilla.redhat.com/show_bug.cgi?id=1884180

