1911009 – Black screen and unresponsive system involving amdgpu starting when booting 5.10 kernels

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-12-26 23:15 UTC by Matt Fagnani
Modified:	2021-01-10 01:34 UTC
CC List:	18 users
Fixed In Version:
Doc Type:	---
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-01-10 01:34:35 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
boot messages video with 5.10.0-0.rc6.20201204git34816d20f173.92.fc34 (4.39 MB, video/mp4) 2020-12-26 23:15 UTC, Matt Fagnani
trace image from boot with 5.10.0-0.rc6.20201204git34816d20f173.92.fc34 (123.64 KB, image/png) 2020-12-26 23:18 UTC, Matt Fagnani
journalctl output for boot with default kernel parameters ending with a black screen and unresponsive system (163.35 KB, text/plain) 2020-12-30 03:53 UTC, Matt Fagnani

Description Matt Fagnani 2020-12-26 23:15:51 UTC

Created attachment 1742163 [details]
boot messages video with 5.10.0-0.rc6.20201204git34816d20f173.92.fc34

1. Please describe the problem:

When I booted the F34 KDE Plasma spin images Fedora-KDE-Live-x86_64-Rawhide-20201108.n.0 (kernel-5.10.0-0.rc2.20201105git4ef8451b3326.64.fc34) and Fedora-KDE-Live-x86_64-Rawhide-20201224.n.0 (kernel-5.10.0-0.rc6.20201204git34816d20f173.92.fc34), the screen went black around the time the amdgpu driver started and stayed black. The system became unresponsive, including to sysrq+alt+* and to switching VTs with ctrl-alt-f2 etc. I had to shut the system off by holding the power button for a few seconds each time this crash happened. The system journals from boots with the crash weren't retained on the next boots.

It's difficult to get the kernel logs normally because the system becomes unresponsive. I'll attach a video of the boot messages from booting Fedora-KDE-Live-x86_64-Rawhide-20201224.n.0 with quiet removed from and debug added to the kernel command line. A stack trace was shown 20 s into the video, but the messages went by so quickly and the resolution is so low that it's difficult to read the text. The system became unresponsive at 52 s, after these drm amdgpu messages showing that kms was starting:
[drm] amdgpu kernel modesetting enabled
amdgpu: Topology: Add APU nodes [0x0-0x0]
checking generic (e0000000 420000) vs hw (e0000000 10000000)
fb0: switching to amdgpudrmfb from EFI VGA

With 5.9.16 these messages are followed by amdgpu starting normally. The 5.9 kernels up to 5.9.16 aren't affected by this problem.

The black screen and unresponsive system didn't occur when I booted with Troubleshooting > Start Fedora 34 KDE Plasma in basic graphics mode, which uses nomodeset on the kernel command line. In that mode Plasma on Wayland didn't start due to a segmentation fault, and the system went back to sddm. When I used amdgpu.dc=0 on the kernel command line, the boot completed and Plasma on Wayland started. The problem is therefore likely in amdgpu, particularly its display core (dc). kernel-5.10.2-200.fc33 also has this problem when installed in an F33 KDE Plasma installation.
https://koji.fedoraproject.org/koji/buildinfo?buildID=1660969 
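The three boot configurations compared above differ only in the kernel command line. As a minimal sketch (not part of the original report), the following Python classifies a command line into one of those three cases. The function names and the sample string are hypothetical; on a running system the text would come from /proc/cmdline. Quoted values containing spaces are not handled.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Bare flags such as "quiet" or "nomodeset" map to None.
    Simplification: values quoted with embedded spaces are not supported.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def describe_display_path(params):
    """Name which of the three boot configurations from this report applies."""
    if "nomodeset" in params:
        return "nomodeset: kernel modesetting disabled (basic graphics mode)"
    if params.get("amdgpu.dc") == "0":
        return "amdgpu.dc=0: amdgpu loaded with its display core (dc) disabled"
    return "default: amdgpu loaded with default display settings"


# Hypothetical sample; on a live system use open("/proc/cmdline").read().
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro amdgpu.dc=0"
print(describe_display_path(parse_cmdline(sample)))
```

Running this on each test boot makes it easy to record which configuration a given journal belongs to.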

The system is an HP laptop with an AMD A10-9620P CPU with an integrated Radeon R5 GPU, shown by lspci as 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wani [Radeon R5/R6/R7 Graphics] (rev ca)

2. What is the Version-Release number of the kernel:
5.10.0-0.rc2.20201105git4ef8451b3326.64.fc34 through 5.10.2

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
Yes, 5.9.16 and earlier kernels are unaffected by this problem. The first 5.10 kernel I tested, 5.10.0-0.rc2.20201105git4ef8451b3326.64.fc34, was affected, so the problem was likely introduced at or before that point in the 5.10 branch.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
1. Boot Fedora-KDE-Live-x86_64-Rawhide-20201108.n.0 or Fedora-KDE-Live-x86_64-Rawhide-20201224.n.0 with the default kernel command line on a system with an AMD GPU affected by this problem
or
1. Install kernel-5.10.2-200.fc33 in an F33 KDE Plasma spin installation
2. Reboot into 5.10.2 with the default kernel command line on a system with an AMD GPU affected by this problem

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
Yes. The problem happened with the latest successful Rawhide build kernel-5.10.0-0.rc6.20201204git34816d20f173.92.fc34

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
I'll attach the boot video with 5.10.0-0.rc6.20201204git34816d20f173.92.fc34 and image of the trace mentioned above. 

I'm open to suggestions on better ways to get the kernel messages and trace. I tried to use the pstore block oops/panic logger as described at https://www.kernel.org/doc/html/latest/admin-guide/pstore-blk.html but nothing was saved in /sys/fs/pstore on the following boots.

The following reports describe similar problems with a black screen and unresponsive system with amdgpu in 5.10.1-5.10.2. A proposed patch is at the kernel.org report. The traces in those reports look different from the one I saw, though it's hard to tell.
https://bugzilla.kernel.org/show_bug.cgi?id=210739
https://bbs.archlinux.org/viewtopic.php?id=261745

Comment 1 Matt Fagnani 2020-12-26 23:18:15 UTC

Created attachment 1742164 [details]
trace image from boot with 5.10.0-0.rc6.20201204git34816d20f173.92.fc34

Comment 2 Matt Fagnani 2020-12-30 03:27:43 UTC

5.10.3 is affected by the same problem with the default kernel command line. When I booted 5.10.3 with amdgpu.dc=0, a null pointer dereference in dc_commit_state in amdgpu happened while amdgpu was starting. The boot completed with amdgpu.dc=0.

Dec 29 15:21:08 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Dec 29 15:21:08 kernel: #PF: supervisor instruction fetch in kernel mode
Dec 29 15:21:08 kernel: #PF: error_code(0x0010) - not-present page
Dec 29 15:21:08 kernel: PGD 0 P4D 0 
Dec 29 15:21:08 kernel: Oops: 0010 [#1] SMP NOPTI
Dec 29 15:21:08 kernel: CPU: 2 PID: 356 Comm: plymouthd Not tainted 5.10.3-200.fc33.x86_64 #1
Dec 29 15:21:08 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
Dec 29 15:21:08 kernel: RIP: 0010:0x0
Dec 29 15:21:08 kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
Dec 29 15:21:08 kernel: RSP: 0018:ffffa0cdc05938c8 EFLAGS: 00010286
Dec 29 15:21:08 kernel: RAX: 0000000000000000 RBX: ffff8d638f2c01b8 RCX: ffff8d638a86a000
Dec 29 15:21:08 kernel: RDX: 0000000000000000 RSI: 00000000000005cf RDI: ffff8d6388d99420
Dec 29 15:21:08 kernel: RBP: ffff8d638f2c0000 R08: ffffa0cdc05938c4 R09: 0000000000000001
Dec 29 15:21:08 kernel: R10: 0000000000000004 R11: 0000000000000003 R12: 0000000000000000
Dec 29 15:21:08 kernel: R13: 0000000000000000 R14: ffff8d638b16ec00 R15: ffff8d638e870000
Dec 29 15:21:08 kernel: FS:  00007f0defc53f40(0000) GS:ffff8d6477500000(0000) knlGS:0000000000000000
Dec 29 15:21:08 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 29 15:21:08 kernel: CR2: ffffffffffffffd6 CR3: 00000001097ca000 CR4: 00000000001506e0
Dec 29 15:21:08 kernel: Call Trace:
Dec 29 15:21:08 kernel:  dc_commit_state+0x823/0xa20 [amdgpu]
Dec 29 15:21:08 kernel:  ? drm_calc_timestamping_constants+0x195/0x1f0 [drm]
Dec 29 15:21:08 kernel:  amdgpu_dm_atomic_commit_tail+0x527/0x2420 [amdgpu]
Dec 29 15:21:08 kernel:  ? amdgpu_move_blit+0xbc/0x200 [amdgpu]
Dec 29 15:21:08 kernel:  ? amdgpu_bo_move+0x9f/0x290 [amdgpu]
Dec 29 15:21:08 kernel:  ? ttm_bo_handle_move_mem+0xb4/0x460 [ttm]
Dec 29 15:21:08 kernel:  ? ttm_bo_validate+0x121/0x130 [ttm]
Dec 29 15:21:08 kernel:  ? dm_plane_helper_prepare_fb+0x18b/0x220 [amdgpu]
Dec 29 15:21:08 kernel:  ? _cond_resched+0x16/0x40
Dec 29 15:21:08 kernel:  ? _cond_resched+0x16/0x40
Dec 29 15:21:08 kernel:  ? __wait_for_common+0x2b/0x130
Dec 29 15:21:08 kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
Dec 29 15:21:08 kernel:  drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
Dec 29 15:21:08 kernel:  drm_atomic_helper_set_config+0x70/0xb0 [drm_kms_helper]
Dec 29 15:21:08 kernel:  drm_mode_setcrtc+0x1d3/0x6f0 [drm]
Dec 29 15:21:08 kernel:  ? avc_has_extended_perms+0x18d/0x3e0
Dec 29 15:21:08 kernel:  ? drm_mode_getcrtc+0x180/0x180 [drm]
Dec 29 15:21:08 kernel:  drm_ioctl_kernel+0x86/0xd0 [drm]
Dec 29 15:21:08 kernel:  drm_ioctl+0x20f/0x3a0 [drm]
Dec 29 15:21:08 kernel:  ? drm_mode_getcrtc+0x180/0x180 [drm]
Dec 29 15:21:08 kernel:  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
Dec 29 15:21:08 kernel:  __x64_sys_ioctl+0x83/0xb0
Dec 29 15:21:08 kernel:  do_syscall_64+0x33/0x40
Dec 29 15:21:08 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 29 15:21:08 kernel: RIP: 0033:0x7f0defb3538b
Dec 29 15:21:08 kernel: Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d bd ba 0c 00 f7 d8 64 89 01 48
Dec 29 15:21:08 kernel: RSP: 002b:00007fff1fdf1898 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Dec 29 15:21:08 kernel: RAX: ffffffffffffffda RBX: 00007fff1fdf18d0 RCX: 00007f0defb3538b
Dec 29 15:21:08 kernel: RDX: 00007fff1fdf18d0 RSI: 00000000c06864a2 RDI: 0000000000000009
Dec 29 15:21:08 kernel: RBP: 00000000c06864a2 R08: 0000000000000000 R09: 0000563833617a10
Dec 29 15:21:08 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000049
Dec 29 15:21:08 kernel: R13: 0000000000000009 R14: 0000563833617960 R15: 00005638336179a0
Dec 29 15:21:08 kernel: Modules linked in: hid_logitech_hidpp hid_logitech_dj amdgpu crct10dif_pclmul crc32_pclmul crc32c_intel iommu_v2 ghash_clmulni_intel gpu_sched ttm i2c_algo_bit drm_kms_helper serio_raw cec drm r8169 xhci_pci xhci_pci_renesas wmi video hid_multitouch fuse
Dec 29 15:21:08 kernel: CR2: 0000000000000000
Dec 29 15:21:08 kernel: ---[ end trace 744138fdca27bd9c ]---
Dec 29 15:21:08 kernel: RIP: 0010:0x0
Dec 29 15:21:08 kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
Dec 29 15:21:08 kernel: RSP: 0018:ffffa0cdc05938c8 EFLAGS: 00010286
Dec 29 15:21:08 kernel: RAX: 0000000000000000 RBX: ffff8d638f2c01b8 RCX: ffff8d638a86a000
Dec 29 15:21:08 kernel: RDX: 0000000000000000 RSI: 00000000000005cf RDI: ffff8d6388d99420
Dec 29 15:21:08 kernel: RBP: ffff8d638f2c0000 R08: ffffa0cdc05938c4 R09: 0000000000000001
Dec 29 15:21:08 kernel: R10: 0000000000000004 R11: 0000000000000003 R12: 0000000000000000
Dec 29 15:21:08 kernel: R13: 0000000000000000 R14: ffff8d638b16ec00 R15: ffff8d638e870000
Dec 29 15:21:08 kernel: FS:  00007f0defc53f40(0000) GS:ffff8d6477500000(0000) knlGS:0000000000000000
Dec 29 15:21:08 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 29 15:21:08 kernel: CR2: ffffffffffffffd6 CR3: 00000001097ca000 CR4: 00000000001506e0
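When reading a trace like the one above, the entries prefixed with "?" are addresses the unwinder found on the stack but could not confirm as part of the call chain, so the unmarked entries are the ones to follow. The RIP of 0x0 together with "supervisor instruction fetch" is consistent with a call through a NULL function pointer made from dc_commit_state. As a rough sketch (the regular expression and function name are illustrative, not from any tool), the unmarked frames can be pulled out of journal text like this:

```python
import re

# Matches journal lines such as
#   "... kernel:  dc_commit_state+0x823/0xa20 [amdgpu]"
#   "... kernel:  ? _cond_resched+0x16/0x40"
FRAME_RE = re.compile(
    r"kernel:\s+(?P<guess>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)


def reliable_frames(lines):
    """Return (function, module) pairs for frames not marked with '?'.

    module is None for functions built into the kernel image.
    """
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match and not match.group("guess"):
            frames.append((match.group("func"), match.group("module")))
    return frames


# A few lines copied from the trace in this comment.
SAMPLE = """\
Dec 29 15:21:08 kernel: Call Trace:
Dec 29 15:21:08 kernel:  dc_commit_state+0x823/0xa20 [amdgpu]
Dec 29 15:21:08 kernel:  ? drm_calc_timestamping_constants+0x195/0x1f0 [drm]
Dec 29 15:21:08 kernel:  amdgpu_dm_atomic_commit_tail+0x527/0x2420 [amdgpu]
Dec 29 15:21:08 kernel:  commit_tail+0x94/0x130 [drm_kms_helper]
""".splitlines()

for func, module in reliable_frames(SAMPLE):
    print(func, f"[{module}]" if module else "")
```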

A warning involving amdgpu and systemd-backlight happened shortly after that. I've seen this warning before when booting 5.9 kernels where systemd-backlight failed to start, so I'm unsure if it's related to the black screen and unresponsive system problem.

Dec 29 15:22:16 kernel: ------------[ cut here ]------------
Dec 29 15:22:16 kernel: WARNING: CPU: 2 PID: 619 at drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_link.c:2548 dc_link_set_backlight_level+0x8a/0xf0 [amdgpu]
Dec 29 15:22:16 kernel: Modules linked in: soundcore fjes(-) i2c_scmi hp_wireless acpi_cpufreq zram ip_tables hid_logitech_hidpp hid_logitech_dj amdgpu crct10dif_pclmul crc32_pclmul crc32c_intel iommu_v2 ghash_clmulni_intel gpu_sched ttm i2c_algo_bit drm_kms_helper serio_raw cec drm r8169 xhci_pci xhci_pci_renesas wmi video hid_multitouch fuse
Dec 29 15:22:16 kernel: CPU: 2 PID: 619 Comm: systemd-backlig Tainted: G      D           5.10.3-200.fc33.x86_64 #1
Dec 29 15:22:16 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
Dec 29 15:22:16 kernel: RIP: 0010:dc_link_set_backlight_level+0x8a/0xf0 [amdgpu]
Dec 29 15:22:16 kernel: Code: 70 03 00 00 31 c0 48 8d 96 c0 01 00 00 48 8b 0a 48 85 c9 74 06 48 3b 59 08 74 20 83 c0 01 48 81 c2 d8 04 00 00 83 f8 06 75 e3 <0f> 0b 45 31 e4 5b 44 89 e0 5d 41 5c 41 5d 41 5e c3 48 98 48 69 c0
Dec 29 15:22:16 kernel: RSP: 0018:ffffa0cdc0a77e08 EFLAGS: 00010246
Dec 29 15:22:16 kernel: RAX: 0000000000000006 RBX: ffff8d638b16ec00 RCX: 0000000000000000
Dec 29 15:22:16 kernel: RDX: ffff8d638e8c1ed0 RSI: ffff8d638e8c0000 RDI: 0000000000000000
Dec 29 15:22:16 kernel: RBP: ffff8d638e870000 R08: 0000000000000032 R09: 000000000000000a
Dec 29 15:22:16 kernel: R10: 000000000000000a R11: f000000000000000 R12: 0000000000003b01
Dec 29 15:22:16 kernel: R13: 0000000000000000 R14: 0000000000003be1 R15: ffff8d6380f550e0
Dec 29 15:22:16 kernel: FS:  00007f151b4bc000(0000) GS:ffff8d6477500000(0000) knlGS:0000000000000000
Dec 29 15:22:16 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 29 15:22:16 kernel: CR2: 000055dfffaf18f8 CR3: 0000000102068000 CR4: 00000000001506e0
Dec 29 15:22:16 kernel: Call Trace:
Dec 29 15:22:16 kernel:  amdgpu_dm_backlight_update_status+0xb4/0xc0 [amdgpu]
Dec 29 15:22:16 kernel:  backlight_device_set_brightness+0x6e/0x110
Dec 29 15:22:16 kernel:  brightness_store+0x3b/0x50
Dec 29 15:22:16 kernel:  kernfs_fop_write+0xce/0x1b0
Dec 29 15:22:16 kernel:  vfs_write+0xc3/0x270
Dec 29 15:22:16 kernel:  ksys_write+0x4f/0xc0
Dec 29 15:22:16 kernel:  do_syscall_64+0x33/0x40
Dec 29 15:22:16 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Dec 29 15:22:16 kernel: RIP: 0033:0x7f151b5bf297
Dec 29 15:22:16 kernel: Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
Dec 29 15:22:16 kernel: RSP: 002b:00007fff7481c928 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
Dec 29 15:22:16 kernel: RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f151b5bf297
Dec 29 15:22:16 kernel: RDX: 0000000000000003 RSI: 00007fff7481ca10 RDI: 0000000000000004
Dec 29 15:22:16 kernel: RBP: 00007fff7481ca10 R08: 0000000000000000 R09: 0000000000000000
Dec 29 15:22:16 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
Dec 29 15:22:16 kernel: R13: 000055dfffadb650 R14: 0000000000000003 R15: 00007f151b692720
Dec 29 15:22:16 kernel: ---[ end trace 744138fdca27bd9d ]---

Comment 3 Matt Fagnani 2020-12-30 03:53:36 UTC

Created attachment 1743062 [details]
journalctl output for boot with default kernel parameters ending with a black screen and unresponsive system

The traces with the null pointer dereference and warning in amdgpu in my previous comment were for the first boot of 5.10.3 with the default kernel command line parameters which ended with the black screen and unresponsive system. I just didn't see them until the next successful boot with 5.10.3 and amdgpu.dc=0. I'm attaching the journal for the first boot with the null pointer dereference and black screen problem.
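Because the oops was only noticed on a later boot, it can help to scan a saved journal for the lines that open a kernel report. The sketch below assumes the journal of the failed boot was saved as text, for example with ``journalctl --no-hostname -k -b -1 > dmesg.txt`` on a system with a persistent journal; the marker list and function name are illustrative and cover only the report types seen in this bug.

```python
import re

# Lines that open or belong to the header of a kernel oops or warning report.
MARKER_RE = re.compile(r"kernel: (BUG|Oops|WARNING):")


def find_kernel_reports(lines):
    """Return (line_number, text) for each journal line carrying a report marker."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(lines, start=1)
        if MARKER_RE.search(line)
    ]


# Lines copied from the journal excerpts in this bug.
SAMPLE = [
    "Dec 29 15:21:08 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000",
    "Dec 29 15:21:08 kernel: Oops: 0010 [#1] SMP NOPTI",
    "Dec 29 15:21:08 kernel:  dc_commit_state+0x823/0xa20 [amdgpu]",
    "Dec 29 15:22:16 kernel: WARNING: CPU: 2 PID: 619 at drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_link.c:2548 dc_link_set_backlight_level+0x8a/0xf0 [amdgpu]",
]

for number, text in find_kernel_reports(SAMPLE):
    print(f"{number}: {text}")
```

To scan a real file, pass ``open("dmesg.txt")`` instead of the sample list.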

Comment 4 Matt Fagnani 2021-01-10 01:34:35 UTC

5.10.4 was still affected by this problem, but it appears to have been fixed in 5.10.5, which has booted normally each time with the default kernel parameters. I reported this problem on 12/30 at https://gitlab.freedesktop.org/drm/amd/-/issues/1421 Thanks.
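The version observations in this bug can be summarised as a small check. This is only a sketch of what was reported for this one laptop: builds from 5.10.0-0.rc2 through 5.10.4 showed the problem, while 5.9.16 and 5.10.5 did not. Builds earlier than rc2 in the 5.10 cycle were not tested, so treating all 5.10.0 pre-releases as affected is an assumption, and the names below are hypothetical.

```python
import re

# Bounds taken from the comments in this bug; see the caveats above.
FIRST_AFFECTED = (5, 10, 0)  # earliest tested build was 5.10.0-0.rc2 (assumption for earlier rc builds)
FIRST_FIXED = (5, 10, 5)     # first release reported to boot normally


def base_version(release):
    """Extract the leading X.Y.Z of a kernel release string as a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def reported_affected(release):
    """True if the release falls in the range reported as affected in this bug."""
    return FIRST_AFFECTED <= base_version(release) < FIRST_FIXED


# platform.release() would give the running kernel's release string.
for release in ("5.9.16", "5.10.3-200.fc33.x86_64", "5.10.5"):
    print(release, reported_affected(release))
```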
