Bug 1956571 - Since Kernel 5.11, AMD gpu driver crashes on occasion. (No suspend involved. No power management at all)
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: xorg-x11-drv-amdgpu
Version: 34
Hardware: x86_64
OS: Unspecified
Assignee: Christopher Atherton
QA Contact: Fedora Extras Quality Assurance
Reported: 2021-05-03 23:42 UTC by Yasuo Ohgaki
Modified: 2022-06-07 20:07 UTC (History)
Last Closed: 2022-06-07 20:07:25 UTC
Type: Bug

Attachments
"journalctl -b -1 -t kernel " (86.91 KB, application/octet-stream)
2021-05-03 23:42 UTC, Yasuo Ohgaki

Description Yasuo Ohgaki 2021-05-03 23:42:53 UTC
Created attachment 1779159 [details]
"journalctl -b -1 -t kernel "

Description of problem:

Since kernel 5.11 on Fedora 33, the kernel occasionally crashes at

RIP: 0010:kernel_queue_uninit+0xd/0xe0 [amdgpu]

No suspend or power management is in use. (This PC is always turned on.)
It works for a few hours and then crashes. I'm not sure what triggers the crash.
The same problem exists in Fedora 34.
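
Even when system suspend is never used, the kernel can still runtime-suspend an individual PCI device. The following is a minimal sketch for checking that state, assuming the standard sysfs attributes power/control and power/runtime_status and the GPU address 0000:01:00.0 seen in the kernel log. The helper name read_runtime_pm is hypothetical.

```python
from pathlib import Path


def read_runtime_pm(device_dir):
    """Return the runtime PM 'control' and 'runtime_status' values of a PCI device.

    device_dir is a sysfs device directory such as
    /sys/bus/pci/devices/0000:01:00.0 (address taken from the log in this report).
    """
    power = Path(device_dir) / "power"
    result = {}
    for name in ("control", "runtime_status"):
        attr = power / name
        # A missing attribute is reported as None instead of raising.
        result[name] = attr.read_text().strip() if attr.is_file() else None
    return result


if __name__ == "__main__":
    print(read_runtime_pm("/sys/bus/pci/devices/0000:01:00.0"))
```

A control value of "auto" means the kernel is allowed to runtime-suspend the device, while "on" keeps it powered.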

-------------------------
 5月 02 17:44:01 localhost kernel: [drm] PCIE GART of 256M enabled (table at 0x000000F400000000).
 5月 02 17:44:12 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 5月 02 17:44:15 localhost kernel: amdgpu: 
                                             failed to send message 252 ret is 0 
 5月 02 17:44:20 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 5月 02 17:44:23 localhost kernel: amdgpu: 
                                             failed to send message 253 ret is 0 
 5月 02 17:44:28 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 5月 02 17:44:31 localhost kernel: amdgpu: 
                                             failed to send message 250 ret is 0 
 5月 02 17:44:36 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 5月 02 17:44:39 localhost kernel: amdgpu: 
                                             failed to send message 251 ret is 0 
 5月 02 17:44:44 localhost kernel: amdgpu: 
                                             last message was failed ret is 0
 5月 02 17:44:47 localhost kernel: amdgpu: 
                                             failed to send message 254 ret is 0 
 5月 02 17:44:49 localhost kernel: amdgpu: SMU load firmware failed
 5月 02 17:44:49 localhost kernel: amdgpu: fw load failed
 5月 02 17:44:49 localhost kernel: amdgpu: smu firmware loading failed
 5月 02 17:44:49 localhost kernel: amdgpu 0000:01:00.0: amdgpu: amdgpu_device_ip_resume failed (-22).
 5月 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 5月 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 5月 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 5月 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 5月 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 5月 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 5月 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 5月 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 5月 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 5月 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 5月 02 17:44:49 localhost kernel: snd_hda_intel 0000:01:00.1: CORB reset timeout#1, CORBRP = 0
 5月 02 17:44:49 localhost kernel: amdgpu: Move buffer fallback to memcpy unavailable
 5月 02 17:44:49 localhost kernel: [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to process the buffer list -19!
 5月 02 17:44:56 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered blocking state
 5月 02 17:44:56 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered disabled state
 5月 02 17:44:56 localhost kernel: device veth98f0b5c entered promiscuous mode
 5月 02 17:44:57 localhost kernel: eth0: renamed from veth941cf4a
 5月 02 17:44:57 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth98f0b5c: link becomes ready
 5月 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered blocking state
 5月 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered forwarding state
 5月 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered disabled state
 5月 02 17:44:57 localhost kernel: veth941cf4a: renamed from eth0
 5月 02 17:44:57 localhost kernel: userif-2: sent link down event.
 5月 02 17:44:57 localhost kernel: userif-2: sent link up event.
 5月 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered disabled state
 5月 02 17:44:57 localhost kernel: device veth98f0b5c left promiscuous mode
 5月 02 17:44:57 localhost kernel: br-9affc843f45a: port 1(veth98f0b5c) entered disabled state
 5月 02 17:44:57 localhost kernel: userif-2: sent link down event.
 5月 02 17:44:57 localhost kernel: userif-2: sent link up event.
 5月 02 17:45:00 localhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=345369, emitted seq=345371
 5月 02 17:45:00 localhost kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
 5月 02 17:45:00 localhost kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
 5月 02 17:45:00 localhost kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038
-----------------------------

This appears to be a NULL pointer dereference bug.
I did not have this issue with kernel 5.10 or earlier.
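
The sequence in the excerpt above (SMU message failures, a ring timeout, a GPU reset, then the BUG line) can be pulled out of a long kernel log with a small filter. This is only a sketch: the patterns are copied from the messages in this report and may not match other driver versions, and the names PATTERNS and summarize are hypothetical.

```python
import re

# Patterns copied from the log excerpt in this report (assumption: other
# kernel versions may word these messages differently).
PATTERNS = {
    "smu_message_failed": re.compile(r"failed to send message (\d+)"),
    "ring_timeout": re.compile(r"ring (\S+) timeout"),
    "gpu_reset": re.compile(r"GPU reset begin"),
    "kernel_bug": re.compile(r"BUG: (.+)"),
}


def summarize(lines):
    """Group interesting kernel log lines by category.

    For patterns with a capture group the captured text is stored,
    otherwise the whole stripped line is stored.
    """
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                value = match.group(1) if pattern.groups else line.strip()
                hits[name].append(value)
    return hits
```

For example, summarize(open("kernel.log")) can be run on a file saved from "journalctl -b -1 -t kernel". The wrapped "amdgpu:" messages are still found because the message text sits on the continuation line.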


Version-Release number of selected component (if applicable):

All kernel 5.11 versions I've tried crash.

    kernel-5.11.16-300.fc34.x86_64
    kernel-5.11.16-200.fc33.x86_64
    kernel-5.11.15-200.fc33.x86_64
    kernel-5.11.14-200.fc33.x86_64
    kernel-5.11.12-200.fc33.x86_64
    kernel-5.11.10-200.fc33.x86_64
    kernel-5.11.7-200.fc33.x86_64


How reproducible:

Not sure.

Additional info:

The tail of the "journalctl -b -1 -t kernel" output is attached.

I am reporting this separately because it seems to be a different type of bug from those caused by suspend, such as
https://bugzilla.redhat.com/show_bug.cgi?id=1884180

Comment 1 Yasuo Ohgaki 2021-05-13 21:43:25 UTC
I'm not sure whether
libdrm-2.4.105-1.fc34.x86_64
fixed the crash, but it does not crash with 5.11.18-300.fc34.x86_64. Previously, playing several YouTube videos for a few hours crashed the kernel, but now it has run for more than 10 hours without crashing.

(libdrm-2.4.103-2.fc34.x86_64 was used previously.)

Comment 2 Yasuo Ohgaki 2021-05-14 01:22:24 UTC
The problem still exists. The OS didn't crash completely, but it became unusable: no display and no ssh. It still replies to ping, and the journal indicates that processes are still running.

[root@dev ~]# uname -a
Linux localhost 5.10.22-200.fc33.x86_64 #1 SMP Tue Mar 9 22:05:08 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
[root@dev ~]# lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07)
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) (rev 07)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:16.3 Serial controller: Intel Corporation 100 Series/C230 Series Chipset Family KT Redirection (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #1 (rev f1)
00:1c.2 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #3 (rev f1)
00:1c.4 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1c.5 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #6 (rev f1)
00:1c.6 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #7 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation C236 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev cf)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
02:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01)
03:00.0 USB controller: ASMedia Technology Inc. ASM1142 USB 3.1 Host Controller
04:00.0 PCI bridge: Tundra Semiconductor Corp. Device 8113 (rev 01)
06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
07:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
08:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
09:00.0 Non-Volatile memory controller: Phison Electronics Corporation E12 NVMe Controller (rev 01)

=============================== crash log ===============================================
 5月 14 09:48:43 localhost kernel: amdgpu: 
                                             failed to send message 261 ret is 0 
 5月 14 09:48:43 localhost kernel: ------------[ cut here ]------------
 5月 14 09:48:43 localhost kernel: WARNING: CPU: 5 PID: 41916 at drivers/gpu/drm/amd/amdgpu/../display/amdgpu_dm/amdgpu_dm.c:1792 dm_suspend+0x178/0x190 [amdgpu]
 5月 14 09:48:43 localhost kernel: Modules linked in: veth xt_nat nf_conntrack_netlink xt_addrtype br_netfilter snd_seq_dummy snd_hrtimer xt_CHECKSUM nf_nat_tftp nf_conntrack_tftp bridge stp llc ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_set ipt_REJECT nf_reject_ipv4 xt_conntrack xt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip_set_hash_net ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security rfkill ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) ppdev vmnet(OE) parport_pc parport vmw_vsock_vmci_transport vsock vmw_vmci vmmon(OE) sunrpc vfat fat snd_hda_codec_realtek intel_rapl_msr intel_rapl_common snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel x86_pkg_temp_thermal snd_intel_dspcfg intel_powerclamp soundwire_intel coretemp soundwire_generic_allocation
 5月 14 09:48:43 localhost kernel:  kvm_intel zfs(POE) snd_soc_core iTCO_wdt kvm intel_pmc_bxt iTCO_vendor_support ee1004 ipmi_ssif mei_hdcp mei_wdt snd_compress snd_pcm_dmaengine zunicode(POE) soundwire_cadence zzstd(OE) irqbypass rapl snd_hda_codec zlua(OE) intel_cstate zavl(POE) intel_uncore icp(POE) snd_hda_core pcspkr ac97_bus snd_hwdep i2c_i801 i2c_smbus snd_seq snd_seq_device zcommon(POE) snd_pcm mei_me znvpair(POE) joydev snd_timer mei spl(OE) intel_pch_thermal snd acpi_ipmi ipmi_si soundcore ie31200_edac ipmi_devintf ipmi_msghandler acpi_pad zram ip_tables amdgpu ast drm_vram_helper drm_ttm_helper ttm iommu_v2 crct10dif_pclmul gpu_sched drm_kms_helper crc32_pclmul crc32c_intel cec drm e1000e igb ghash_clmulni_intel nvme dca i2c_algo_bit nvme_core video uas usb_storage fuse
 5月 14 09:48:43 localhost kernel: CPU: 5 PID: 41916 Comm: kworker/5:3 Tainted: P           OE     5.11.19-300.fc34.x86_64 #1
 5月 14 09:48:43 localhost kernel: Hardware name: Supermicro Super Server/X11SAE-F, BIOS 1.0a 01/19/2016
 5月 14 09:48:43 localhost kernel: Workqueue: pm pm_runtime_work
 5月 14 09:48:43 localhost kernel: RIP: 0010:dm_suspend+0x178/0x190 [amdgpu]
 5月 14 09:48:43 localhost kernel: Code: c3 31 d2 4c 89 e6 4c 89 ef e8 84 c7 12 00 83 f8 01 74 1e 89 c2 48 c7 c6 40 74 8f c0 48 c7 c7 68 16 9a c0 e8 ea 22 d3 ff eb b8 <0f> 0b e9 ad fe ff ff 4c 89 e6 4c 89 ef e8 a6 0b 12 00 eb a4 0f 1f
 5月 14 09:48:43 localhost kernel: RSP: 0018:ffffb7e00631fc68 EFLAGS: 00010282
 5月 14 09:48:43 localhost kernel: RAX: 0000000000000000 RBX: ffff8d1a0cd76920 RCX: 0000000000000000
 5月 14 09:48:43 localhost kernel: RDX: 0000000000000009 RSI: ffff8d2935b58ac0 RDI: ffff8d1a0cd60000
 5月 14 09:48:43 localhost kernel: RBP: ffff8d1a0cd60000 R08: 0000000000000000 R09: ffffb7e00631fa38
 5月 14 09:48:43 localhost kernel: R10: ffffb7e00631fa30 R11: ffffffffa7b44f08 R12: ffff8d1a0cd60000
 5月 14 09:48:43 localhost kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
 5月 14 09:48:43 localhost kernel: FS:  0000000000000000(0000) GS:ffff8d2935b40000(0000) knlGS:0000000000000000
 5月 14 09:48:43 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 5月 14 09:48:43 localhost kernel: CR2: 000004c50ce87000 CR3: 0000000110f62006 CR4: 00000000003706e0
 5月 14 09:48:43 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 5月 14 09:48:43 localhost kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 5月 14 09:48:43 localhost kernel: Call Trace:
 5月 14 09:48:43 localhost kernel:  ? vi_common_set_clockgating_state+0x229/0x2f0 [amdgpu]
 5月 14 09:48:43 localhost kernel:  amdgpu_device_ip_suspend_phase1+0x79/0xe0 [amdgpu]
 5月 14 09:48:43 localhost kernel:  amdgpu_device_suspend+0x6f/0x2b0 [amdgpu]
 5月 14 09:48:43 localhost kernel:  amdgpu_pmops_runtime_suspend+0x9d/0x130 [amdgpu]
 5月 14 09:48:43 localhost kernel:  pci_pm_runtime_suspend+0x5e/0x170
 5月 14 09:48:43 localhost kernel:  ? pci_dev_put+0x20/0x20
 5月 14 09:48:43 localhost kernel:  ? pci_dev_put+0x20/0x20
 5月 14 09:48:43 localhost kernel:  __rpm_callback+0x81/0x140
 5月 14 09:48:43 localhost kernel:  ? pci_dev_put+0x20/0x20
 5月 14 09:48:43 localhost kernel:  rpm_callback+0x1f/0x70
 5月 14 09:48:43 localhost kernel:  ? pci_dev_put+0x20/0x20
 5月 14 09:48:43 localhost kernel:  rpm_suspend+0x137/0x6c0
 5月 14 09:48:43 localhost kernel:  ? __switch_to_asm+0x42/0x70
 5月 14 09:48:43 localhost kernel:  ? __switch_to+0x11b/0x460
 5月 14 09:48:43 localhost kernel:  pm_runtime_work+0x8e/0x90
 5月 14 09:48:43 localhost kernel:  process_one_work+0x1ec/0x380
 5月 14 09:48:43 localhost kernel:  worker_thread+0x53/0x3e0
 5月 14 09:48:43 localhost kernel:  ? process_one_work+0x380/0x380
 5月 14 09:48:43 localhost kernel:  kthread+0x11b/0x140
 5月 14 09:48:43 localhost kernel:  ? kthread_associate_blkcg+0xa0/0xa0
 5月 14 09:48:43 localhost kernel:  ret_from_fork+0x22/0x30
 5月 14 09:48:43 localhost kernel: ---[ end trace c4f37f60b39674fa ]---
 5月 14 09:48:45 localhost kernel: BUG: unable to handle page fault for address: ffff8d1a24406000
 5月 14 09:48:45 localhost kernel: #PF: supervisor write access in kernel mode
 5月 14 09:48:45 localhost kernel: #PF: error_code(0x0003) - permissions violation
 5月 14 09:48:45 localhost kernel: PGD 18ca01067 P4D 18ca01067 PUD 100fa7063 PMD 124552063 PTE 8000000124406161
 5月 14 09:48:45 localhost kernel: Oops: 0003 [#1] SMP PTI
 5月 14 09:48:45 localhost kernel: CPU: 5 PID: 41916 Comm: kworker/5:3 Tainted: P        W  OE     5.11.19-300.fc34.x86_64 #1
 5月 14 09:48:45 localhost kernel: Hardware name: Supermicro Super Server/X11SAE-F, BIOS 1.0a 01/19/2016
 5月 14 09:48:45 localhost kernel: Workqueue: pm pm_runtime_work
 5月 14 09:48:45 localhost kernel: RIP: 0010:kfd_gtt_sa_free+0x39/0x80 [amdgpu]
 5月 14 09:48:45 localhost kernel: Code: f5 53 48 89 fb 0f 1f 44 00 00 4c 8d a3 70 01 00 00 4c 89 e7 e8 38 6b 56 e6 8b 45 00 3b 45 04 77 16 48 8b 93 68 01 00 00 89 c1 <f0> 48 0f b3 0a 83 c0 01 39 45 04 73 ea 4c 89 e7 e8 02 5d 56 e6 48
 5月 14 09:48:45 localhost kernel: RSP: 0018:ffffb7e00631fca0 EFLAGS: 00010206
 5月 14 09:48:45 localhost kernel: RAX: 00000000d0146000 RBX: ffff8d1a0a310c00 RCX: 00000000d0146000
 5月 14 09:48:45 localhost kernel: RDX: ffff8d1a0a3dd400 RSI: ffff8d1a088b2b80 RDI: ffff8d1a0a310d70
 5月 14 09:48:45 localhost kernel: RBP: ffff8d1a088b2b80 R08: ffff8d2975bd5f70 R09: ffff8d2975bd6030
 5月 14 09:48:45 localhost kernel: R10: 00000000003fdb40 R11: 0000000000000000 R12: ffff8d1a0a310d70
 5月 14 09:48:45 localhost kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
 5月 14 09:48:45 localhost kernel: FS:  0000000000000000(0000) GS:ffff8d2935b40000(0000) knlGS:0000000000000000
 5月 14 09:48:45 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 5月 14 09:48:45 localhost kernel: CR2: ffff8d1a24406000 CR3: 0000000110f62006 CR4: 00000000003706e0
 5月 14 09:48:45 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 5月 14 09:48:45 localhost kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 5月 14 09:48:45 localhost kernel: Call Trace:
 5月 14 09:48:45 localhost kernel:  stop_cpsch+0x94/0xc0 [amdgpu]
 5月 14 09:48:45 localhost kernel:  kgd2kfd_suspend.part.0+0x2f/0x40 [amdgpu]
 5月 14 09:48:45 localhost kernel:  amdgpu_device_suspend+0x7b/0x2b0 [amdgpu]
 5月 14 09:48:45 localhost kernel:  amdgpu_pmops_runtime_suspend+0x9d/0x130 [amdgpu]
 5月 14 09:48:45 localhost kernel:  pci_pm_runtime_suspend+0x5e/0x170
 5月 14 09:48:45 localhost kernel:  ? pci_dev_put+0x20/0x20
 5月 14 09:48:45 localhost kernel:  ? pci_dev_put+0x20/0x20
 5月 14 09:48:45 localhost kernel:  __rpm_callback+0x81/0x140
 5月 14 09:48:45 localhost kernel:  ? pci_dev_put+0x20/0x20
 5月 14 09:48:45 localhost kernel:  rpm_callback+0x1f/0x70
 5月 14 09:48:45 localhost kernel:  ? pci_dev_put+0x20/0x20
 5月 14 09:48:45 localhost kernel:  rpm_suspend+0x137/0x6c0
 5月 14 09:48:45 localhost kernel:  ? __switch_to_asm+0x42/0x70
 5月 14 09:48:45 localhost kernel:  ? __switch_to+0x11b/0x460
 5月 14 09:48:45 localhost kernel:  pm_runtime_work+0x8e/0x90
 5月 14 09:48:45 localhost kernel:  process_one_work+0x1ec/0x380
 5月 14 09:48:45 localhost kernel:  worker_thread+0x53/0x3e0
 5月 14 09:48:45 localhost kernel:  ? process_one_work+0x380/0x380
 5月 14 09:48:45 localhost kernel:  kthread+0x11b/0x140
 5月 14 09:48:45 localhost kernel:  ? kthread_associate_blkcg+0xa0/0xa0
 5月 14 09:48:45 localhost kernel:  ret_from_fork+0x22/0x30
 5月 14 09:48:45 localhost kernel: Modules linked in: veth xt_nat nf_conntrack_netlink xt_addrtype br_netfilter snd_seq_dummy snd_hrtimer xt_CHECKSUM nf_nat_tftp nf_conntrack_tftp bridge stp llc ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter xt_set ipt_REJECT nf_reject_ipv4 xt_conntrack xt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip_set_hash_net ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security rfkill ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) ppdev vmnet(OE) parport_pc parport vmw_vsock_vmci_transport vsock vmw_vmci vmmon(OE) sunrpc vfat fat snd_hda_codec_realtek intel_rapl_msr intel_rapl_common snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel x86_pkg_temp_thermal snd_intel_dspcfg intel_powerclamp soundwire_intel coretemp soundwire_generic_allocation
 5月 14 09:48:45 localhost kernel:  kvm_intel zfs(POE) snd_soc_core iTCO_wdt kvm intel_pmc_bxt iTCO_vendor_support ee1004 ipmi_ssif mei_hdcp mei_wdt snd_compress snd_pcm_dmaengine zunicode(POE) soundwire_cadence zzstd(OE) irqbypass rapl snd_hda_codec zlua(OE) intel_cstate zavl(POE) intel_uncore icp(POE) snd_hda_core pcspkr ac97_bus snd_hwdep i2c_i801 i2c_smbus snd_seq snd_seq_device zcommon(POE) snd_pcm mei_me znvpair(POE) joydev snd_timer mei spl(OE) intel_pch_thermal snd acpi_ipmi ipmi_si soundcore ie31200_edac ipmi_devintf ipmi_msghandler acpi_pad zram ip_tables amdgpu ast drm_vram_helper drm_ttm_helper ttm iommu_v2 crct10dif_pclmul gpu_sched drm_kms_helper crc32_pclmul crc32c_intel cec drm e1000e igb ghash_clmulni_intel nvme dca i2c_algo_bit nvme_core video uas usb_storage fuse
 5月 14 09:48:45 localhost kernel: CR2: ffff8d1a24406000
 5月 14 09:48:45 localhost kernel: ---[ end trace c4f37f60b39674fb ]---
 5月 14 09:48:45 localhost kernel: sssd_nss[3117]: segfault at 0 ip 00007fdf8e913960 sp 00007ffd86b6a878 error 6
 5月 14 09:48:45 localhost kernel: sssd_be[3093]: segfault at 0 ip 00007f0c53d98960 sp 00007ffd1a1e39b8 error 6 in libsss_util.so[7f0c53d56000+54000]
 5月 14 09:48:45 localhost kernel: Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 5月 14 09:48:45 localhost kernel:  in libsss_util.so[7fdf8e8d1000+54000]
 5月 14 09:48:45 localhost kernel: Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 5月 14 09:48:45 localhost kernel: RIP: 0010:kfd_gtt_sa_free+0x39/0x80 [amdgpu]
 5月 14 09:48:45 localhost kernel: Code: f5 53 48 89 fb 0f 1f 44 00 00 4c 8d a3 70 01 00 00 4c 89 e7 e8 38 6b 56 e6 8b 45 00 3b 45 04 77 16 48 8b 93 68 01 00 00 89 c1 <f0> 48 0f b3 0a 83 c0 01 39 45 04 73 ea 4c 89 e7 e8 02 5d 56 e6 48
 5月 14 09:48:45 localhost kernel: RSP: 0018:ffffb7e00631fca0 EFLAGS: 00010206
 5月 14 09:48:45 localhost kernel: RAX: 00000000d0146000 RBX: ffff8d1a0a310c00 RCX: 00000000d0146000
 5月 14 09:48:45 localhost kernel: RDX: ffff8d1a0a3dd400 RSI: ffff8d1a088b2b80 RDI: ffff8d1a0a310d70
 5月 14 09:48:45 localhost kernel: RBP: ffff8d1a088b2b80 R08: ffff8d2975bd5f70 R09: ffff8d2975bd6030
 5月 14 09:48:45 localhost kernel: R10: 00000000003fdb40 R11: 0000000000000000 R12: ffff8d1a0a310d70
 5月 14 09:48:45 localhost kernel: R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
 5月 14 09:48:45 localhost kernel: FS:  0000000000000000(0000) GS:ffff8d2935b40000(0000) knlGS:0000000000000000
 5月 14 09:48:45 localhost kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 5月 14 09:48:45 localhost kernel: CR2: ffff8d1a24406000 CR3: 0000000110f62006 CR4: 00000000003706e0
 5月 14 09:48:45 localhost kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 5月 14 09:48:45 localhost kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
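
Both call traces in the log above run under pm_runtime_work and amdgpu_pmops_runtime_suspend. A small sketch for extracting the reliable frames from such traces follows. It assumes the journalctl line layout shown in this comment, and the names FRAME and reliable_frames are hypothetical.

```python
import re

# Matches "kernel:  func+0x12/0x34" and "kernel:  ? func+0x12/0x34".
# Group 1 is set for frames marked '?', which the kernel considers unreliable.
FRAME = re.compile(r"kernel:\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def reliable_frames(lines):
    """Return one list of function names per 'Call Trace:' block, skipping '?' frames."""
    traces = []
    current = None
    for line in lines:
        if "Call Trace:" in line:
            current = []
            traces.append(current)
            continue
        if current is None:
            continue  # not inside a trace
        match = FRAME.search(line)
        if match is None:
            current = None  # the first non-frame line ends the trace
        elif not match.group(1):
            current.append(match.group(2))
    return traces
```

Checking any("pm_runtime_work" in trace for trace in reliable_frames(lines)) then shows whether a crash happened on the runtime power management path.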

Comment 3 Yasuo Ohgaki 2021-05-14 01:27:42 UTC
The previous uname output was not useful because I had booted an older, healthy kernel.
The affected kernel is kernel-5.11.19-300.fc34.x86_64.

Comment 4 Yasuo Ohgaki 2021-05-22 22:21:00 UTC
5.11.21-300.fc34.x86_64 seems to be working so far on the AMD GPU machine. It has been running for more than a day without a crash.

However, 5.11.21-300.fc34.x86_64 crashes during boot on an Intel HD Graphics machine. I see a crash related to DRM, but journald could not log the error because the kernel crashed too early.

Comment 5 Ben Cotton 2022-05-12 14:58:34 UTC
This message is a reminder that Fedora Linux 34 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora Linux 34 on 2022-06-07.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
'version' of '34'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, change the 'version' 
to a later Fedora Linux version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora Linux 34 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora Linux, you are encouraged to change the 'version' to a later version
prior to this bug being closed.

Comment 6 Ben Cotton 2022-06-07 20:07:25 UTC
Fedora Linux 34 entered end-of-life (EOL) status on 2022-06-07.

Fedora Linux 34 is no longer maintained, which means that it
will not receive any further security or bug fix updates. As a result we
are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release.

Thank you for reporting this bug and we are sorry it could not be fixed.

