1. Please describe the problem:

I booted the Fedora Rawhide live image Fedora-KDE-Live-x86_64-Rawhide-20240602.n.0.iso, which has kernel-6.10.0-0.rc1.20240531git4a4be1ad3a6e.21.fc41, on an HP laptop with an AMD A10-9620P CPU and integrated Radeon R5 GPU. The screen froze when amdgpu started during boot. I could see an unusual dotted white line near the bottom of the screen when the screen froze. I booted with quiet removed from the kernel command line. The screen froze with the last line shown as

kernel: amdgpu 0000:00:01.0: amdgpu: Fetched VBIOS from VFCT

The Plasma startup sound played about a minute after the screen froze, so I think the boot continued and Plasma started. This problem happened on 3 of 3 boots with this image. This problem didn't happen with 6.9.3 and earlier. I'll see if I can bisect this problem, though it might take me a while.

This problem didn't happen when I booted the same image in Basic graphics mode, which put nomodeset on the kernel command line and used the simpledrm kernel driver and the llvmpipe mesa driver.

2. What is the Version-Release number of the kernel:

kernel-6.10.0-0.rc1.20240531git4a4be1ad3a6e.21.fc41

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Yes, 6.9.3 and earlier weren't affected by this problem. I first saw the problem with kernel-6.10.0-0.rc1.20240531git4a4be1ad3a6e.21.fc41.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

Download the Fedora Rawhide live image Fedora-KDE-Live-x86_64-Rawhide-20240602.n.0.iso from https://koji.fedoraproject.org/koji/buildinfo?buildID=2459739
Install Fedora Media Writer in Fedora with sudo dnf install mediawriter
Start Fedora Media Writer
Write Fedora-KDE-Live-x86_64-Rawhide-20240602.n.0.iso to a USB flash drive with Fedora Media Writer
Reboot into Fedora-KDE-Live-x86_64-Rawhide-20240602.n.0.iso on a system with an AMD GPU affected by this problem

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

Yes.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.

The logs weren't saved on the following boots because it was a live image. I'll try to get logs by installing kernel-6.10.0-0.rc1.20240531git4a4be1ad3a6e.21.fc41 in my Fedora 40 installation and reproducing the problem.

I reported this problem at https://gitlab.freedesktop.org/drm/amd/-/issues/3417

Reproducible: Always
Created attachment 2036127
The kernel log for a boot of kernel-6.10.0-0.rc1.20240531git4a4be1ad3a6e.21.fc41 where the screen froze

I installed kernel-6.10.0-0.rc1.20240531git4a4be1ad3a6e.21.fc41 in my Fedora 40 installation and reproduced the problem. I used sysrq+alt+s,u,b after a minute. The journal showed many AMD IOMMU errors like

Jun 03 00:21:49 kernel: iommu ivhd0: AMD-Vi: Event logged [ILLEGAL_DEV_TABLE_ENTRY device=0000:00:01.0 pasid=0x00000 address=0x10b680000 flags=0x0080]
Jun 03 00:21:49 kernel: AMD-Vi: DTE[0]: 6d90000000000003
Jun 03 00:21:49 kernel: AMD-Vi: DTE[1]: 0000100101a10002
Jun 03 00:21:49 kernel: AMD-Vi: DTE[2]: 200000010462c013
Jun 03 00:21:49 kernel: AMD-Vi: DTE[3]: 0000000000000000

After that there were more amdgpu errors leading to amdgpu: Fatal error during GPU init and repeated warnings in amdgpu_irq_put.

Jun 03 00:21:49 kernel: amdgpu 0000:00:01.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring gfx test failed (-110)
Jun 03 00:21:49 kernel: [drm:amdgpu_device_init.cold [amdgpu]] *ERROR* hw_init of IP block <gfx_v8_0> failed -110
Jun 03 00:21:49 kernel: amdgpu 0000:00:01.0: amdgpu: amdgpu_device_ip_init failed
Jun 03 00:21:49 kernel: amdgpu 0000:00:01.0: amdgpu: Fatal error during GPU init
Jun 03 00:21:49 kernel: amdgpu 0000:00:01.0: amdgpu: amdgpu: finishing device.
Jun 03 00:21:49 kernel: ------------[ cut here ]------------
Jun 03 00:21:49 kernel: WARNING: CPU: 3 PID: 403 at drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c:630 amdgpu_irq_put+0x46/0x70 [amdgpu]
Jun 03 00:21:49 kernel: Modules linked in: amdgpu(+) hid_logitech_hidpp crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni amdxcp polyval_generic i2c_algo_bit drm_ttm_helper ttm drm_exec ghash_clmulni_intel gpu_sched sha512_ssse3 drm_suballoc_helper sha256_ssse3 sp5100_tco drm_buddy sha1_ssse3 drm_display_helper wdat_wdt cec video wmi hid_logitech_dj serio_raw hid_multitouch scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse i2c_dev
Jun 03 00:21:49 kernel: CPU: 3 PID: 403 Comm: (udev-worker) Not tainted 6.10.0-0.rc1.20240531git4a4be1ad3a6e.21.fc41.x86_64 #1
Jun 03 00:21:49 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
Jun 03 00:21:49 kernel: RIP: 0010:amdgpu_irq_put+0x46/0x70 [amdgpu]
Jun 03 00:21:49 kernel: Code: c0 74 33 48 8b 4e 10 48 83 39 00 74 29 89 d1 48 8d 04 88 8b 08 85 c9 74 11 f0 ff 08 74 07 31 c0 e9 2f 46 c6 d3 e9 1a fd ff ff <0f> 0b b8 ea ff ff ff e9 1e 46 c6 d3 b8 ea ff ff ff e9 14 46 c6 d3
Jun 03 00:21:49 kernel: RSP: 0018:ffffb241006d3778 EFLAGS: 00010246
Jun 03 00:21:49 kernel: RAX: ffffa07ac0d67d40 RBX: ffffa07ad3598888 RCX: 0000000000000000
Jun 03 00:21:49 kernel: RDX: 0000000000000000 RSI: ffffa07ad35a54c8 RDI: ffffa07ad3580000
Jun 03 00:21:49 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0720072007200720
Jun 03 00:21:49 kernel: R10: 072007200720072e R11: 0765076307690776 R12: ffffa07ad3580000
Jun 03 00:21:49 kernel: R13: ffffa07ad3580010 R14: ffffa07ad35a54c8 R15: ffffa07ad3580010
Jun 03 00:21:49 kernel: FS: 00007f966d292980(0000) GS:ffffa07bb7580000(0000) knlGS:0000000000000000
Jun 03 00:21:49 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 03 00:21:49 kernel: CR2: 00007f49afcba3e8 CR3: 0000000109c94000 CR4: 00000000001506f0
Jun 03 00:21:49 kernel: Call Trace:
Jun 03 00:21:49 kernel: <TASK>
Jun 03 00:21:49 kernel: ? amdgpu_irq_put+0x46/0x70 [amdgpu]
Jun 03 00:21:49 kernel: ? __warn.cold+0x8e/0xe8
Jun 03 00:21:49 kernel: ? amdgpu_irq_put+0x46/0x70 [amdgpu]
Jun 03 00:21:49 kernel: ? report_bug+0xff/0x140
Jun 03 00:21:49 kernel: ? handle_bug+0x3c/0x80
Jun 03 00:21:49 kernel: ? exc_invalid_op+0x17/0x70
Jun 03 00:21:49 kernel: ? asm_exc_invalid_op+0x1a/0x20
Jun 03 00:21:49 kernel: ? amdgpu_irq_put+0x46/0x70 [amdgpu]
Jun 03 00:21:49 kernel: amdgpu_fence_driver_hw_fini+0x116/0x160 [amdgpu]
Jun 03 00:21:49 kernel: amdgpu_device_fini_hw+0x9b/0x45a [amdgpu]
Jun 03 00:21:49 kernel: amdgpu_driver_load_kms.cold+0x18/0x2e [amdgpu]
Jun 03 00:21:49 kernel: amdgpu_pci_probe+0x1a7/0x4b0 [amdgpu]
Jun 03 00:21:49 kernel: local_pci_probe+0x45/0x90
Jun 03 00:21:49 kernel: pci_device_probe+0xc1/0x2a0
Jun 03 00:21:49 kernel: really_probe+0xde/0x340
Jun 03 00:21:49 kernel: ? pm_runtime_barrier+0x54/0x90
Jun 03 00:21:49 kernel: ? __pfx___driver_attach+0x10/0x10
Jun 03 00:21:49 kernel: __driver_probe_device+0x78/0x110
Jun 03 00:21:49 kernel: driver_probe_device+0x1f/0xa0
Jun 03 00:21:49 kernel: __driver_attach+0xba/0x1c0
Jun 03 00:21:49 kernel: bus_for_each_dev+0x8f/0xe0
Jun 03 00:21:49 kernel: bus_add_driver+0x142/0x220
Jun 03 00:21:49 kernel: driver_register+0x72/0xd0
Jun 03 00:21:49 kernel: ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
Jun 03 00:21:49 kernel: do_one_initcall+0x5b/0x310
Jun 03 00:21:49 kernel: do_init_module+0x90/0x250
Jun 03 00:21:49 kernel: __do_sys_init_module+0x17a/0x1b0
Jun 03 00:21:49 kernel: do_syscall_64+0x82/0x160
Jun 03 00:21:49 kernel: ? get_page_from_freelist+0x5d1/0x1c80
Jun 03 00:21:49 kernel: ? mas_update_gap.part.0+0xa3/0x1d0
Jun 03 00:21:49 kernel: ? mas_wr_slot_store+0xd1/0x170
Jun 03 00:21:49 kernel: ? __alloc_pages_noprof+0x182/0x350
Jun 03 00:21:49 kernel: ? __mod_memcg_lruvec_state+0xe5/0x1e0
Jun 03 00:21:49 kernel: ? __lruvec_stat_mod_folio+0x68/0xa0
Jun 03 00:21:49 kernel: ? set_ptes.isra.0+0x28/0x90
Jun 03 00:21:49 kernel: ? do_anonymous_page+0xf8/0x8a0
Jun 03 00:21:49 kernel: ? __pte_offset_map+0x1b/0x180
Jun 03 00:21:49 kernel: ? __handle_mm_fault+0xc06/0x1040
Jun 03 00:21:49 kernel: ? __count_memcg_events+0x75/0x130
Jun 03 00:21:49 kernel: ? count_memcg_events.constprop.0+0x1a/0x30
Jun 03 00:21:49 kernel: ? handle_mm_fault+0x1f0/0x300
Jun 03 00:21:49 kernel: ? do_user_addr_fault+0x36c/0x620
Jun 03 00:21:49 kernel: ? exc_page_fault+0x7e/0x180
Jun 03 00:21:49 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jun 03 00:21:49 kernel: RIP: 0033:0x7f966d16e57e
Jun 03 00:21:49 kernel: Code: 48 8b 0d 9d 98 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6a 98 0c 00 f7 d8 64 89 01 48
Jun 03 00:21:49 kernel: RSP: 002b:00007ffdfa3a64d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
Jun 03 00:21:49 kernel: RAX: ffffffffffffffda RBX: 000055ddca4acd70 RCX: 00007f966d16e57e
Jun 03 00:21:49 kernel: RDX: 00007f966d28c07d RSI: 00000000024df4f6 RDI: 00007f9669800010
Jun 03 00:21:49 kernel: RBP: 00007ffdfa3a6590 R08: 000055ddca46b010 R09: 0000000000000007
Jun 03 00:21:49 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 00007f966d28c07d
Jun 03 00:21:49 kernel: R13: 0000000000020000 R14: 000055ddca49f6f0 R15: 000055ddca4ae8d0
Jun 03 00:21:49 kernel: </TASK>
Jun 03 00:21:49 kernel: ---[ end trace 0000000000000000 ]---

There were further amdgpu errors and warnings in amdgpu_bo_release_notify.

Jun 03 00:21:49 kernel: amdgpu 0000:00:01.0: probe with driver amdgpu failed with error -110
Jun 03 00:21:49 kernel: [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.
Jun 03 00:21:49 kernel: ------------[ cut here ]------------
Jun 03 00:21:49 kernel: WARNING: CPU: 3 PID: 403 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:1382 amdgpu_bo_release_notify+0x1ff/0x220 [amdgpu]
Jun 03 00:21:49 kernel: Modules linked in: amdgpu(+) hid_logitech_hidpp crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni amdxcp polyval_generic i2c_algo_bit drm_ttm_helper ttm drm_exec ghash_clmulni_intel gpu_sched sha512_ssse3 drm_suballoc_helper sha256_ssse3 sp5100_tco drm_buddy sha1_ssse3 drm_display_helper wdat_wdt cec video wmi hid_logitech_dj serio_raw hid_multitouch scsi_dh_rdac scsi_dh_emc scsi_dh_alua ip6_tables ip_tables fuse i2c_dev
Jun 03 00:21:49 kernel: CPU: 3 PID: 403 Comm: (udev-worker) Tainted: G W ------- --- 6.10.0-0.rc1.20240531git4a4be1ad3a6e.21.fc41.x86_64 #1
Jun 03 00:21:49 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
Jun 03 00:21:49 kernel: RIP: 0010:amdgpu_bo_release_notify+0x1ff/0x220 [amdgpu]
Jun 03 00:21:49 kernel: Code: 0b e9 af fe ff ff 48 ba ff ff ff ff ff ff ff 7f 31 f6 4c 89 e7 e8 b1 44 76 d3 eb 98 e8 6a 3b 76 d3 eb b2 0f 0b e9 58 fe ff ff <0f> 0b eb a7 be 03 00 00 00 e8 c3 a7 41 d3 eb 9b e8 6c 4a d1 d3 66
Jun 03 00:21:49 kernel: RSP: 0018:ffffb241006d3700 EFLAGS: 00010282
Jun 03 00:21:49 kernel: RAX: 00000000ffffffea RBX: ffffa07ac47be448 RCX: 0000000000000000
Jun 03 00:21:49 kernel: RDX: 0000000000000000 RSI: ffffa07bb75a18c0 RDI: ffffa07bb75a18c0
Jun 03 00:21:49 kernel: RBP: ffffa07ad358ef58 R08: 0000000000000000 R09: 0720072007200720
Jun 03 00:21:49 kernel: R10: 0720072007200720 R11: 0720072007200720 R12: ffffa07ac47be400
Jun 03 00:21:49 kernel: R13: ffffa07ac47be548 R14: ffffa07ad358ef58 R15: ffffa07ac146b36c
Jun 03 00:21:49 kernel: FS: 00007f966d292980(0000) GS:ffffa07bb7580000(0000) knlGS:0000000000000000
Jun 03 00:21:49 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 03 00:21:49 kernel: CR2: 00007f49afcba3e8 CR3: 0000000109c94000 CR4: 00000000001506f0
Jun 03 00:21:49 kernel: Call Trace:
Jun 03 00:21:49 kernel: <TASK>
Jun 03 00:21:49 kernel: ? amdgpu_bo_release_notify+0x1ff/0x220 [amdgpu]
Jun 03 00:21:49 kernel: ? __warn.cold+0x8e/0xe8
Jun 03 00:21:49 kernel: ? amdgpu_bo_release_notify+0x1ff/0x220 [amdgpu]
Jun 03 00:21:49 kernel: ? report_bug+0xff/0x140
Jun 03 00:21:49 kernel: ? handle_bug+0x3c/0x80
Jun 03 00:21:49 kernel: ? exc_invalid_op+0x17/0x70
Jun 03 00:21:49 kernel: ? asm_exc_invalid_op+0x1a/0x20
Jun 03 00:21:49 kernel: ? amdgpu_bo_release_notify+0x1ff/0x220 [amdgpu]
Jun 03 00:21:49 kernel: ttm_bo_release+0x100/0x2e0 [ttm]
Jun 03 00:21:49 kernel: ? ttm_resource_move_to_lru_tail+0x166/0x260 [ttm]
Jun 03 00:21:49 kernel: amdgpu_bo_free_kernel+0xcb/0x110 [amdgpu]
Jun 03 00:21:49 kernel: amdgpu_vce_sw_fini+0x47/0xb0 [amdgpu]
Jun 03 00:21:49 kernel: amdgpu_device_fini_sw+0x100/0x540 [amdgpu]
Jun 03 00:21:49 kernel: amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
Jun 03 00:21:49 kernel: devm_drm_dev_init_release+0x51/0x70
Jun 03 00:21:49 kernel: release_nodes+0x38/0xb0
Jun 03 00:21:49 kernel: devres_release_all+0x90/0xd0
Jun 03 00:21:49 kernel: device_unbind_cleanup+0xe/0x70
Jun 03 00:21:49 kernel: really_probe+0x221/0x340
Jun 03 00:21:49 kernel: ? pm_runtime_barrier+0x54/0x90
Jun 03 00:21:49 kernel: ? __pfx___driver_attach+0x10/0x10
Jun 03 00:21:49 kernel: __driver_probe_device+0x78/0x110
Jun 03 00:21:49 kernel: driver_probe_device+0x1f/0xa0
Jun 03 00:21:49 kernel: __driver_attach+0xba/0x1c0
Jun 03 00:21:49 kernel: bus_for_each_dev+0x8f/0xe0
Jun 03 00:21:49 kernel: bus_add_driver+0x142/0x220
Jun 03 00:21:49 kernel: driver_register+0x72/0xd0
Jun 03 00:21:49 kernel: ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
Jun 03 00:21:49 kernel: do_one_initcall+0x5b/0x310
Jun 03 00:21:49 kernel: do_init_module+0x90/0x250
Jun 03 00:21:49 kernel: __do_sys_init_module+0x17a/0x1b0
Jun 03 00:21:49 kernel: do_syscall_64+0x82/0x160
Jun 03 00:21:49 kernel: ? get_page_from_freelist+0x5d1/0x1c80
Jun 03 00:21:49 kernel: ? mas_update_gap.part.0+0xa3/0x1d0
Jun 03 00:21:49 kernel: ? mas_wr_slot_store+0xd1/0x170
Jun 03 00:21:49 kernel: ? __alloc_pages_noprof+0x182/0x350
Jun 03 00:21:49 kernel: ? __mod_memcg_lruvec_state+0xe5/0x1e0
Jun 03 00:21:49 kernel: ? __lruvec_stat_mod_folio+0x68/0xa0
Jun 03 00:21:49 kernel: ? set_ptes.isra.0+0x28/0x90
Jun 03 00:21:49 kernel: ? do_anonymous_page+0xf8/0x8a0
Jun 03 00:21:49 kernel: ? __pte_offset_map+0x1b/0x180
Jun 03 00:21:49 kernel: ? __handle_mm_fault+0xc06/0x1040
Jun 03 00:21:49 kernel: ? __count_memcg_events+0x75/0x130
Jun 03 00:21:49 kernel: ? count_memcg_events.constprop.0+0x1a/0x30
Jun 03 00:21:49 kernel: ? handle_mm_fault+0x1f0/0x300
Jun 03 00:21:49 kernel: ? do_user_addr_fault+0x36c/0x620
Jun 03 00:21:49 kernel: ? exc_page_fault+0x7e/0x180
Jun 03 00:21:49 kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jun 03 00:21:49 kernel: RIP: 0033:0x7f966d16e57e
Jun 03 00:21:49 kernel: Code: 48 8b 0d 9d 98 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6a 98 0c 00 f7 d8 64 89 01 48
Jun 03 00:21:49 kernel: RSP: 002b:00007ffdfa3a64d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
Jun 03 00:21:49 kernel: RAX: ffffffffffffffda RBX: 000055ddca4acd70 RCX: 00007f966d16e57e
Jun 03 00:21:49 kernel: RDX: 00007f966d28c07d RSI: 00000000024df4f6 RDI: 00007f9669800010
Jun 03 00:21:49 kernel: RBP: 00007ffdfa3a6590 R08: 000055ddca46b010 R09: 0000000000000007
Jun 03 00:21:49 kernel: R10: 0000000000000002 R11: 0000000000000246 R12: 00007f966d28c07d
Jun 03 00:21:49 kernel: R13: 0000000000020000 R14: 000055ddca49f6f0 R15: 000055ddca4ae8d0
Jun 03 00:21:49 kernel: </TASK>
Jun 03 00:21:49 kernel: ---[ end trace 0000000000000000 ]---
Jun 03 00:21:49 kernel: [drm:amdgpu_fill_buffer [amdgpu]] *ERROR* Trying to clear memory with ring turned off.

I'm attaching the kernel log.
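The errors quoted above can be picked out of a full kernel log with a small filter. This is only a rough sketch: the function name filter_gpu_iommu_errors and the grep pattern are my own choices for illustration, matched against the messages seen in this report, and they may miss differently worded errors.

```shell
# Hypothetical helper (not part of any Fedora tool): print only the AMD-Vi
# IOMMU events and the amdgpu error/warning lines from a kernel log that is
# read on standard input.
filter_gpu_iommu_errors() {
    grep -E 'AMD-Vi:|amdgpu.*(\*ERROR\*|Fatal error|failed)|WARNING: .*amdgpu'
}
```

Since the affected boot ended with a sysrq reboot, its log belongs to the previous boot, so something like ``journalctl --no-hostname -k -b -1 | filter_gpu_iommu_errors`` should show the relevant lines, assuming the journal is stored persistently.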
When I rebooted into 6.10-rc1 with amd_iommu=off added to the kernel command line, the problem didn't happen. So the AMD IOMMU, which is enabled on my system, appears to be involved.
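To confirm that a boot really ran with the workaround, the kernel command line can be checked. A minimal sketch, assuming a POSIX shell; has_kernel_arg is a hypothetical helper name, and the optional second argument exists only so the check can be tried against a file other than /proc/cmdline.

```shell
# Hypothetical helper: succeed if the exact argument $1 is present on the
# kernel command line. $2 optionally names a file to read instead of
# /proc/cmdline.
has_kernel_arg() {
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -Fqx -- "$1"
}

# Example: report whether the running kernel was booted with amd_iommu=off.
if has_kernel_arg amd_iommu=off; then
    echo "amd_iommu=off is set"
else
    echo "amd_iommu=off is not set"
fi
```

I added the argument by hand at boot. On an installed Fedora system, ``sudo grubby --update-kernel=ALL --args="amd_iommu=off"`` should make it persistent and ``--remove-args`` should undo it, though I did not use grubby for the test above.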
Upstream reports of this problem are at
https://bugzilla.kernel.org/show_bug.cgi?id=218900 (which has a patch)
https://bugzilla.kernel.org/show_bug.cgi?id=218921
https://lore.kernel.org/all/20240527192159.GEZlTdV7OoOuJrHmI0@fat_crate.local/
This problem was fixed in 6.10-rc3 by the patch from https://bugzilla.kernel.org/show_bug.cgi?id=218900#c5
The corresponding commit in v6.10-rc3 is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.10-rc3&id=48dc345a23b984c457d1c5878168d026c500618f