1. Please describe the problem:

I booted the Fedora Rawhide KDE Plasma live image Fedora-KDE-Live-x86_64-Rawhide-20221227.n.0.iso
https://koji.fedoraproject.org/koji/buildinfo?buildID=2104562
from a USB flash drive written with Fedora Media Writer on an HP laptop with an integrated Radeon R5 GPU. The system froze with a black screen when amdgpu started during the 6.2-rc1 kernel boot. When I booted with quiet rhgb removed from the kernel command line, the last line shown before the black screen was
kernel: [drm] amdgpu kernel modesetting enabled.
This problem happened on each of several boots when using the amdgpu driver (the default).

This problem didn't happen when I booted the same image using Troubleshooting > Boot Fedora-KDE-Plasma-live in basic graphics mode, which used the simpledrm driver and started Plasma on X normally. This problem also didn't happen when I booted the image in a QEMU/KVM VM in GNOME Boxes with 3 GB RAM using the virtio-gpu driver.

Journal data from previous boots of a live image isn't saved by default, so as far as I knew I couldn't get the journal that way. I installed kernel-6.2.0-0.rc1.14.fc38 in my Fedora 37 KDE Plasma installation and reproduced the problem 3 times with quiet rhgb removed from the kernel command line and sysrq_always_enabled drm.debug=14 added to it. I used sysrq+alt+r,s,u,b, which rebooted the system, so the kernel wasn't completely frozen. The journals from the boots with the problem weren't shown in journalctl.

When I booted with amdgpu.dc=0 on the kernel command line, the screen froze showing the last line
kernel: [drm] amdgpu kernel modesetting enabled.
but the black screen didn't happen. When I booted with drm.debug=94 on the kernel command line, the screen's drm settings were shown repeatedly until I rebooted after 2-3 minutes.

I reported this problem at https://gitlab.freedesktop.org/drm/amd/-/issues/2319

2. What is the Version-Release number of the kernel:

6.2.0-0.rc1.14.fc38

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Yes. This problem didn't happen with kernel-6.1.0-65.fc38 or earlier in the Fedora Rawhide live image Fedora-KDE-Live-x86_64-Rawhide-20221217.n.0.iso. I first saw the problem with 6.2.0-0.rc1.14.fc38. The problem was likely introduced in the 6.2 merge window. I haven't tried any of the 6.2 merge window kernels, but I might try to narrow down the problem using the Fedora Rawhide 6.2 merge window builds and then bisect using the narrowed range.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

1. Download the Fedora Rawhide KDE Plasma live image Fedora-KDE-Live-x86_64-Rawhide-20221227.n.0.iso from https://koji.fedoraproject.org/koji/buildinfo?buildID=2104562
2. Install Fedora Media Writer, if it isn't already installed, with sudo dnf install mediawriter in Fedora
3. Start Fedora Media Writer
4. Write Fedora-KDE-Live-x86_64-Rawhide-20221227.n.0.iso to a USB flash drive in Fedora Media Writer
5. Boot Fedora-KDE-Live-x86_64-Rawhide-20221227.n.0.iso from the USB flash drive with the default boot option using the amdgpu driver on a laptop with an integrated AMD Radeon R5 GPU

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

Yes.

6. Are you running any modules that not shipped with directly Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.

I haven't got a full kernel log from when the problem happened due to the nature of the problem, as I described above.
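To make the kernel command-line changes from the problem description easier to repeat, here is a minimal Python sketch that removes quiet and rhgb and appends sysrq_always_enabled and drm.debug=14. The function name adjust_cmdline and the example command line are hypothetical; this only edits a string and is not how GRUB or the live image handles its options.

```python
def adjust_cmdline(cmdline,
                   remove=("quiet", "rhgb"),
                   add=("sysrq_always_enabled", "drm.debug=14")):
    """Return cmdline without the `remove` options and with `add` appended once."""
    # Kernel command-line options are separated by whitespace.
    tokens = [tok for tok in cmdline.split() if tok not in remove]
    for extra in add:
        if extra not in tokens:
            tokens.append(extra)
    return " ".join(tokens)


# Hypothetical example command line, not copied from the live image.
example = "BOOT_IMAGE=/vmlinuz root=live:CDLABEL=Example rd.live.image quiet rhgb"
print(adjust_cmdline(example))
```

The same helper could be called with add=("amd_iommu=off",) or add=("amdgpu.dc=0",) to build the other command lines mentioned in this report.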
The first Fedora Rawhide kernel with this problem was 6.2.0-0.rc0.20221215git041fae9c105a.5.fc38, while 6.2.0-0.rc0.20221214gite2ca6ba6ba01.3.fc38 was the last one without the problem. I bisected the mainline kernel between e2ca6ba6ba01 and 041fae9c105a. The first bad commit was the following, involving PCI and IOMMUs.

201007ef707a8bb5592cd07dd46fc9222c48e0b9 is the first bad commit
commit 201007ef707a8bb5592cd07dd46fc9222c48e0b9
Author: Lu Baolu <baolu.lu.com>
Date: Mon Oct 31 08:59:08 2022 +0800

    PCI: Enable PASID only when ACS RR & UF enabled on upstream path

    The Requester ID/Process Address Space ID (PASID) combination identifies an
    address space distinct from the PCI bus address space, e.g., an address
    space defined by an IOMMU.

    But the PCIe fabric routes Memory Requests based on the TLP address,
    ignoring any PASID (PCIe r6.0, sec 2.2.10.4), so a TLP with PASID that
    SHOULD go upstream to the IOMMU may instead be routed as a P2P Request if
    its address falls in a bridge window.

    To ensure that all Memory Requests with PASID are routed upstream, only
    enable PASID if ACS P2P Request Redirect and Upstream Forwarding are
    enabled for the path leading to the device.

    Suggested-by: Jason Gunthorpe <jgg>
    Suggested-by: Kevin Tian <kevin.tian>
    Signed-off-by: Lu Baolu <baolu.lu.com>
    Acked-by: Bjorn Helgaas <bhelgaas>
    Reviewed-by: Jason Gunthorpe <jgg>
    Tested-by: Tony Zhu <tony.zhu>
    Link: https://lore.kernel.org/r/20221031005917.45690-5-baolu.lu@linux.intel.com
    Signed-off-by: Joerg Roedel <jroedel>

 drivers/pci/ats.c | 3 +++
 1 file changed, 3 insertions(+)

My system has an AMD IOMMU enabled. When I booted 6.2-rc1 with amd_iommu=off on the kernel command line, the problem didn't happen and the boot completed. There were IOMMU-related errors when amdgpu started with amd_iommu=off. So the problem appears to involve amdgpu's usage of the IOMMU.
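For anyone unfamiliar with bisecting, the idea behind narrowing the range between the last good and first bad builds can be sketched in Python as a binary search. This is a simplified illustration over a linear list of commits, not what git bisect actually does on a branching history. The function first_bad, the is_bad callback, and the placeholder commit names between the endpoints are hypothetical; the position of 201007ef707a in the list is made up for the example.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit in a list ordered from oldest to newest.

    Assumes commits[0] is known good and commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # the first bad commit is at mid or earlier
        else:
            good = mid   # the first bad commit is after mid
    return commits[bad]


# Endpoints are from this report; the commits in between are placeholders.
commits = ["e2ca6ba6ba01", "placeholder-1", "placeholder-2",
           "201007ef707a", "placeholder-3", "041fae9c105a"]
bad_from = commits.index("201007ef707a")
print(first_bad(commits, lambda c: commits.index(c) >= bad_from))
```

In practice each is_bad call means building and booting a kernel, so halving the range at each step keeps the number of test boots small.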
When I booted with quiet rhgb removed from the kernel command line, I noted that the AMD IOMMU started, with a line like
kernel: AMD-Vi: AMD IOMMUv2 loaded and initialized
about 3 seconds before the problem happened when amdgpu started.
I reported this problem upstream to the Drivers > IOMMU component at https://bugzilla.kernel.org/show_bug.cgi?id=216865 since Alex Deucher wrote "Please report this upstream to the IOMMU subsystem: https://bugzilla.kernel.org/" at https://gitlab.freedesktop.org/drm/amd/-/issues/2319#note_1699814
Created attachment 1935076
dmesg from 6.2-rc1 boot with early kdump enabled

I reproduced the problem with 6.2-rc1 in a Fedora 37 installation with early kdump enabled as described at
https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
https://github.com/k-hagio/fedora-kexec-tools/blob/master/early-kdump-howto.txt
I panicked the kernel with sysrq+alt+c. The dmesg saved with kdump showed warnings at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80 and at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50 involving AMD IOMMU and amdgpu functions in the trace. A null pointer dereference occurred in amd_iommu_int_thread afterwards.

[ 13.132368] [drm] amdgpu kernel modesetting enabled.
[ 13.133766] amdgpu: Topology: Add APU node [0x0:0x0]
[ 13.137596] Console: switching to colour dummy device 80x25
[ 13.143717] amdgpu 0000:00:01.0: vgaarb: deactivate vga console
[ 13.143970] [drm] initializing kernel modesetting (CARRIZO 0x1002:0x9874 0x103C:0x8332 0xCA).
[ 13.144205] [drm] register mmio base: 0xF0400000
[ 13.144209] [drm] register mmio size: 262144
[ 13.144310] [drm] add ip block number 0 <vi_common>
[ 13.144316] [drm] add ip block number 1 <gmc_v8_0>
[ 13.144320] [drm] add ip block number 2 <cz_ih>
[ 13.144324] [drm] add ip block number 3 <gfx_v8_0>
[ 13.144328] [drm] add ip block number 4 <sdma_v3_0>
[ 13.144332] [drm] add ip block number 5 <powerplay>
[ 13.144336] [drm] add ip block number 6 <dm>
[ 13.144340] [drm] add ip block number 7 <uvd_v6_0>
[ 13.144343] [drm] add ip block number 8 <vce_v3_0>
[ 13.144347] [drm] add ip block number 9 <acp_ip>
[ 13.144388] amdgpu 0000:00:01.0: amdgpu: Fetched VBIOS from VFCT
[ 13.144397] amdgpu: ATOM BIOS: 113-C75100-031
[ 13.144425] [drm] UVD is enabled in physical mode
[ 13.144431] [drm] VCE enabled in physical mode
[ 13.144435] amdgpu 0000:00:01.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[ 13.144491] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
[ 13.144503] amdgpu 0000:00:01.0: amdgpu: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
[ 13.144511] amdgpu 0000:00:01.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
[ 13.144524] [drm] Detected VRAM RAM=512M, BAR=512M
[ 13.144529] [drm] RAM width 64bits UNKNOWN
[ 13.144623] [drm] amdgpu: 512M of VRAM memory ready
[ 13.144630] [drm] amdgpu: 3572M of GTT memory ready.
[ 13.144653] [drm] GART: num cpu pages 262144, num gpu pages 262144
[ 13.144705] [drm] PCIE GART of 1024M enabled (table at 0x000000F400600000).
[ 13.158820] amdgpu: hwmgr_sw_init smu backed is smu8_smu
[ 13.175036] [drm] Found UVD firmware Version: 1.91 Family ID: 11
[ 13.175097] [drm] UVD ENC is disabled
[ 13.186675] [drm] Found VCE firmware Version: 52.4 Binary ID: 3
[ 13.187879] amdgpu: smu version 27.18.00
[ 13.193760] [drm] DM_PPLIB: values for Engine clock
[ 13.193773] [drm] DM_PPLIB: 300000
[ 13.193776] [drm] DM_PPLIB: 480000
[ 13.193779] [drm] DM_PPLIB: 533340
[ 13.193781] [drm] DM_PPLIB: 576000
[ 13.193784] [drm] DM_PPLIB: 626090
[ 13.193786] [drm] DM_PPLIB: 685720
[ 13.193788] [drm] DM_PPLIB: 720000
[ 13.193791] [drm] DM_PPLIB: 757900
[ 13.193793] [drm] DM_PPLIB: Validation clocks:
[ 13.193796] [drm] DM_PPLIB: engine_max_clock: 75790
[ 13.193799] [drm] DM_PPLIB: memory_max_clock: 93300
[ 13.193802] [drm] DM_PPLIB: level : 8
[ 13.193806] [drm] DM_PPLIB: values for Display clock
[ 13.193809] [drm] DM_PPLIB: 300000
[ 13.193811] [drm] DM_PPLIB: 400000
[ 13.193814] [drm] DM_PPLIB: 496560
[ 13.193816] [drm] DM_PPLIB: 626090
[ 13.193819] [drm] DM_PPLIB: 685720
[ 13.193821] [drm] DM_PPLIB: 757900
[ 13.193823] [drm] DM_PPLIB: 800000
[ 13.193825] [drm] DM_PPLIB: 847060
[ 13.193828] [drm] DM_PPLIB: Validation clocks:
[ 13.193830] [drm] DM_PPLIB: engine_max_clock: 75790
[ 13.193833] [drm] DM_PPLIB: memory_max_clock: 93300
[ 13.193836] [drm] DM_PPLIB: level : 8
[ 13.193839] [drm] DM_PPLIB: values for Memory clock
[ 13.193842] [drm] DM_PPLIB: 667000
[ 13.193844] [drm] DM_PPLIB: 933000
[ 13.193847] [drm] DM_PPLIB: Validation clocks:
[ 13.193849] [drm] DM_PPLIB: engine_max_clock: 75790
[ 13.193852] [drm] DM_PPLIB: memory_max_clock: 93300
[ 13.193854] [drm] DM_PPLIB: level : 8
[ 13.193973] [drm] Display Core initialized with v3.2.215!
[ 13.309967] [drm] UVD initialized successfully.
[ 13.511031] [drm] VCE initialized successfully.
[ 13.515217] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[ 13.515442] amdgpu: sdma_bitmap: f
[ 13.515549] ------------[ cut here ]------------
[ 13.515555] WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80
[ 13.515571] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse dm_multipath
[ 13.515620] CPU: 0 PID: 477 Comm: systemd-udevd Kdump: loaded Not tainted 6.2.0-0.rc1.14.fc38.x86_64 #1
[ 13.515628] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
[ 13.515634] RIP: 0010:pci_disable_pri+0x75/0x80
[ 13.515642] Code: 54 24 06 89 ee 48 89 df 83 e2 fe 66 89 54 24 06 0f b7 d2 e8 1d e1 fc ff 80 a3 4b 08 00 00 fd 48 83 c4 08 5b 5d e9 2b 8b 69 00 <0f> 0b eb b6 0f 1f 80 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90
[ 13.515651] RSP: 0018:ffffbaf4407ab8e8 EFLAGS: 00010046
[ 13.515658] RAX: 0000000000000000 RBX: ffff90aa00ac4000 RCX: 0000000000000009
[ 13.515663] RDX: 0000000000000000 RSI: 0000000000000014 RDI: ffff90aa00ac4000
[ 13.515668] RBP: ffff90aa0e0c3810 R08: 0000000000000002 R09: 0000000000000000
[ 13.515673] R10: 0000000000000000 R11: ffffffffade4e430 R12: ffff90aa011a8800
[ 13.515678] R13: ffff90aa0e0c3800 R14: ffff90aa011a8800 R15: ffff90aa0e0c3960
[ 13.515683] FS: 00007fabd67feb40(0000) GS:ffff90aaf7400000(0000) knlGS:0000000000000000
[ 13.515689] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 13.515695] CR2: 00007f5689ff54c0 CR3: 0000000100f16000 CR4: 00000000001506f0
[ 13.515700] Call Trace:
[ 13.515704] <TASK>
[ 13.515710] amd_iommu_attach_device+0x2e0/0x300
[ 13.515719] __iommu_attach_device+0x1b/0x90
[ 13.515727] iommu_attach_group+0x65/0xa0
[ 13.515735] amd_iommu_init_device+0x16b/0x250 [iommu_v2]
[ 13.515747] kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
[ 13.517094] kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
[ 13.518419] kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
[ 13.519699] amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
[ 13.520877] amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
[ 13.522118] ? _raw_spin_lock_irqsave+0x23/0x50
[ 13.522126] amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
[ 13.523386] amdgpu_pci_probe+0x161/0x370 [amdgpu]
[ 13.524516] local_pci_probe+0x41/0x80
[ 13.524525] pci_device_probe+0xb3/0x220
[ 13.524533] really_probe+0xde/0x380
[ 13.524540] ? pm_runtime_barrier+0x50/0x90
[ 13.524546] __driver_probe_device+0x78/0x170
[ 13.524555] driver_probe_device+0x1f/0x90
[ 13.524560] __driver_attach+0xce/0x1c0
[ 13.524565] ? __pfx___driver_attach+0x10/0x10
[ 13.524570] bus_for_each_dev+0x73/0xa0
[ 13.524575] bus_add_driver+0x1ae/0x200
[ 13.524580] driver_register+0x89/0xe0
[ 13.524586] ? __pfx_init_module+0x10/0x10 [amdgpu]
[ 13.525819] do_one_initcall+0x59/0x230
[ 13.525828] do_init_module+0x4a/0x200
[ 13.525834] __do_sys_init_module+0x157/0x180
[ 13.525839] do_syscall_64+0x5b/0x80
[ 13.525845] ? handle_mm_fault+0xff/0x2f0
[ 13.525850] ? do_user_addr_fault+0x1ef/0x690
[ 13.525856] ? exc_page_fault+0x70/0x170
[ 13.525860] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 13.525867] RIP: 0033:0x7fabd66cde4e
[ 13.525872] Code: 48 8b 0d e5 5f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 5f 0c 00 f7 d8 64 89 01 48
[ 13.525878] RSP: 002b:00007ffdd89bc6a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[ 13.525884] RAX: ffffffffffffffda RBX: 0000563e4d23f0a0 RCX: 00007fabd66cde4e
[ 13.525887] RDX: 00007fabd6817453 RSI: 000000000174fb66 RDI: 00007fabd3bd4010
[ 13.525890] RBP: 00007fabd6817453 R08: 0000563e4d237c70 R09: 00007fabd672f900
[ 13.525893] R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000020000
[ 13.525896] R13: 0000563e4d239060 R14: 0000000000000000 R15: 0000563e4d23e450
[ 13.525900] </TASK>
[ 13.525902] ---[ end trace 0000000000000000 ]---
[ 13.525964] ------------[ cut here ]------------
[ 13.525966] WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50
[ 13.525974] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse dm_multipath
[ 13.526006] CPU: 0 PID: 477 Comm: systemd-udevd Kdump: loaded Tainted: G W ------- --- 6.2.0-0.rc1.14.fc38.x86_64 #1
[ 13.526012] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
[ 13.526015] RIP: 0010:pci_disable_pasid+0x45/0x50
[ 13.526020] Code: 53 48 89 fb 85 f6 75 06 5b e9 67 8c 69 00 83 c6 06 31 d2 e8 3d e2 fc ff 80 a3 4b 08 00 00 fe 5b e9 50 8c 69 00 e9 4b 8c 69 00 <0f> 0b e9 44 8c 69 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
[ 13.526025] RSP: 0018:ffffbaf4407ab900 EFLAGS: 00010046
[ 13.526028] RAX: 0000000000000000 RBX: ffff90aa00ac4000 RCX: 0000000000000009
[ 13.526031] RDX: 0000000000000000 RSI: 0000000000000014 RDI: ffff90aa00ac4000
[ 13.526034] RBP: ffff90aa0e0c3810 R08: 0000000000000002 R09: 0000000000000000
[ 13.526037] R10: 0000000000000000 R11: ffffffffade4e430 R12: ffff90aa011a8800
[ 13.526040] R13: ffff90aa0e0c3800 R14: ffff90aa011a8800 R15: ffff90aa0e0c3960
[ 13.526043] FS: 00007fabd67feb40(0000) GS:ffff90aaf7400000(0000) knlGS:0000000000000000
[ 13.526047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 13.526050] CR2: 00007f5689ff54c0 CR3: 0000000100f16000 CR4: 00000000001506f0
[ 13.526053] Call Trace:
[ 13.526056] <TASK>
[ 13.526058] amd_iommu_attach_device+0x2e8/0x300
[ 13.526064] __iommu_attach_device+0x1b/0x90
[ 13.526070] iommu_attach_group+0x65/0xa0
[ 13.526075] amd_iommu_init_device+0x16b/0x250 [iommu_v2]
[ 13.526083] kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
[ 13.527397] kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
[ 13.528709] kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
[ 13.529877] amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
[ 13.531039] amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
[ 13.532322] ? _raw_spin_lock_irqsave+0x23/0x50
[ 13.532333] amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
[ 13.533642] amdgpu_pci_probe+0x161/0x370 [amdgpu]
[ 13.534758] local_pci_probe+0x41/0x80
[ 13.534766] pci_device_probe+0xb3/0x220
[ 13.534771] really_probe+0xde/0x380
[ 13.534776] ? pm_runtime_barrier+0x50/0x90
[ 13.534781] __driver_probe_device+0x78/0x170
[ 13.534785] driver_probe_device+0x1f/0x90
[ 13.534789] __driver_attach+0xce/0x1c0
[ 13.534793] ? __pfx___driver_attach+0x10/0x10
[ 13.534797] bus_for_each_dev+0x73/0xa0
[ 13.534801] bus_add_driver+0x1ae/0x200
[ 13.534805] driver_register+0x89/0xe0
[ 13.534809] ? __pfx_init_module+0x10/0x10 [amdgpu]
[ 13.536000] do_one_initcall+0x59/0x230
[ 13.536010] do_init_module+0x4a/0x200
[ 13.536015] __do_sys_init_module+0x157/0x180
[ 13.536020] do_syscall_64+0x5b/0x80
[ 13.536025] ? handle_mm_fault+0xff/0x2f0
[ 13.536030] ? do_user_addr_fault+0x1ef/0x690
[ 13.536036] ? exc_page_fault+0x70/0x170
[ 13.536040] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 13.536047] RIP: 0033:0x7fabd66cde4e
[ 13.536051] Code: 48 8b 0d e5 5f 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 5f 0c 00 f7 d8 64 89 01 48
[ 13.536057] RSP: 002b:00007ffdd89bc6a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[ 13.536063] RAX: ffffffffffffffda RBX: 0000563e4d23f0a0 RCX: 00007fabd66cde4e
[ 13.536066] RDX: 00007fabd6817453 RSI: 000000000174fb66 RDI: 00007fabd3bd4010
[ 13.536069] RBP: 00007fabd6817453 R08: 0000563e4d237c70 R09: 00007fabd672f900
[ 13.536072] R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000020000
[ 13.536075] R13: 0000563e4d239060 R14: 0000000000000000 R15: 0000563e4d23e450
[ 13.536079] </TASK>
[ 13.536081] ---[ end trace 0000000000000000 ]---
[ 13.536117] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
[ 13.537198] kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
[ 13.537218] amdgpu 0000:00:01.0: amdgpu: SE 1, SH per SE 1, CU per SH 8, active_cu_number 6
[ 13.537481] BUG: kernel NULL pointer dereference, address: 0000000000000058
[ 13.537499] #PF: supervisor read access in kernel mode
[ 13.537504] #PF: error_code(0x0000) - not-present page
[ 13.537509] PGD 0 P4D 0
[ 13.537515] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 13.537522] CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Kdump: loaded Tainted: G W ------- --- 6.2.0-0.rc1.14.fc38.x86_64 #1
[ 13.537530] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
[ 13.537534] RIP: 0010:report_iommu_fault+0x11/0x90
[ 13.537548] Code: 0f 0b eb cd 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 55 41 54 41 89 cc 55 48 89 d5 53 <48> 8b 47 48 48 89 f3 48 85 c0 74 64 4c 8b 47 50 e8 da 3f 57 00 41
[ 13.537557] RSP: 0018:ffffbaf4403ebe08 EFLAGS: 00010246
[ 13.537562] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 13.537567] RDX: 000000010e9b0400 RSI: ffff90aa00ac40d0 RDI: 0000000000000010
[ 13.537572] RBP: 000000010e9b0400 R08: ffff90aa011a8800 R09: 0000000000000050
[ 13.537576] R10: ffff90aa00244000 R11: 0000000000000000 R12: 0000000000000000
[ 13.537581] R13: ffff90aa0005b000 R14: 0000000000000008 R15: 0000000000000000
[ 13.537585] FS: 0000000000000000(0000) GS:ffff90aaf7500000(0000) knlGS:0000000000000000
[ 13.537591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 13.537596] CR2: 0000000000000058 CR3: 000000010e22c000 CR4: 00000000001506e0
[ 13.537601] Call Trace:
[ 13.537607] <TASK>
[ 13.537612] amd_iommu_int_thread+0x60c/0x760
[ 13.537620] ? __pfx_irq_thread_fn+0x10/0x10
[ 13.537627] irq_thread_fn+0x1f/0x60
[ 13.537633] irq_thread+0xea/0x1a0
[ 13.537638] ? preempt_count_add+0x6a/0xa0
[ 13.537647] ? __pfx_irq_thread_dtor+0x10/0x10
[ 13.537652] ? __pfx_irq_thread+0x10/0x10
[ 13.537657] kthread+0xe9/0x110
[ 13.537662] ? __pfx_kthread+0x10/0x10
[ 13.537667] ret_from_fork+0x2c/0x50
[ 13.537676] </TASK>
[ 13.537678] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse dm_multipath
[ 13.537723] CR2: 0000000000000058
[ 13.537727] ---[ end trace 0000000000000000 ]---
[ 13.537731] RIP: 0010:report_iommu_fault+0x11/0x90
[ 13.537737] Code: 0f 0b eb cd 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 55 41 54 41 89 cc 55 48 89 d5 53 <48> 8b 47 48 48 89 f3 48 85 c0 74 64 4c 8b 47 50 e8 da 3f 57 00 41
[ 13.537746] RSP: 0018:ffffbaf4403ebe08 EFLAGS: 00010246
[ 13.537751] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 13.537755] RDX: 000000010e9b0400 RSI: ffff90aa00ac40d0 RDI: 0000000000000010
[ 13.537759] RBP: 000000010e9b0400 R08: ffff90aa011a8800 R09: 0000000000000050
[ 13.537764] R10: ffff90aa00244000 R11: 0000000000000000 R12: 0000000000000000
[ 13.537768] R13: ffff90aa0005b000 R14: 0000000000000008 R15: 0000000000000000
[ 13.537773] FS: 0000000000000000(0000) GS:ffff90aaf7500000(0000) knlGS:0000000000000000
[ 13.537779] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 13.537783] CR2: 0000000000000058 CR3: 000000010e22c000 CR4: 00000000001506e0
[ 13.537795] genirq: exiting task "irq/24-AMD-Vi" (56) is an active IRQ thread (irq 24)
[ 13.537808] general protection fault, probably for non-canonical address 0x1ee201e8df8948: 0000 [#2] PREEMPT SMP NOPTI
[ 13.537815] CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Kdump: loaded Tainted: G D W ------- --- 6.2.0-0.rc1.14.fc38.x86_64 #1
[ 13.537822] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
[ 13.537825] RIP: 0010:__x86_return_thunk+0x0/0x40
[ 13.537833] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc f6 <c3> cc 0f ae e8 eb f9 cc 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 2e
[ 13.537840] RSP: 0018:ffffbaf4403ebeb0 EFLAGS: 00010282
[ 13.537844] RAX: 001ee201e8df8948 RBX: fff38839e8df8948 RCX: 0000000000000000
[ 13.537848] RDX: 0000000080000000 RSI: ffff90aa00400b68 RDI: ffffffffad106b7f
[ 13.537852] RBP: ffff90aa00aa0000 R08: ffff90aa00400c50 R09: ffffffffaf143f00
[ 13.537856] R10: 0000000000000000 R11: 0000000000000000 R12: ffff90aa00aa0cac
[ 13.537859] R13: ffff90aa00938001 R14: 0000000000000000 R15: 0000000000000000
[ 13.537863] FS: 0000000000000000(0000) GS:ffff90aaf7500000(0000) knlGS:0000000000000000
[ 13.537868] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 13.537872] CR2: 0000000000000058 CR3: 000000010e22c000 CR4: 00000000001506e0
[ 13.537876] Call Trace:
[ 13.537879] <TASK>
[ 13.537882] ? task_work_run+0x59/0x90
[ 13.537888] ? do_exit+0x31f/0xaf0
[ 13.537894] ? __pfx_irq_thread_dtor+0x10/0x10
[ 13.537900] ? make_task_dead+0x7a/0x80
[ 13.537905] ? rewind_stack_and_make_dead+0x17/0x20
[ 13.537912] </TASK>
[ 13.537914] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse dm_multipath
[ 13.537946] ---[ end trace 0000000000000000 ]---
[ 13.537950] RIP: 0010:report_iommu_fault+0x11/0x90
[ 13.537955] Code: 0f 0b eb cd 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 41 55 41 54 41 89 cc 55 48 89 d5 53 <48> 8b 47 48 48 89 f3 48 85 c0 74 64 4c 8b 47 50 e8 da 3f 57 00 41
[ 13.537962] RSP: 0018:ffffbaf4403ebe08 EFLAGS: 00010246
[ 13.537967] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 13.537971] RDX: 000000010e9b0400 RSI: ffff90aa00ac40d0 RDI: 0000000000000010
[ 13.537974] RBP: 000000010e9b0400 R08: ffff90aa011a8800 R09: 0000000000000050
[ 13.537978] R10: ffff90aa00244000 R11: 0000000000000000 R12: 0000000000000000
[ 13.537982] R13: ffff90aa0005b000 R14: 0000000000000008 R15: 0000000000000000
[ 13.537986] FS: 0000000000000000(0000) GS:ffff90aaf7500000(0000) knlGS:0000000000000000
[ 13.537991] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 13.537995] CR2: 0000000000000058 CR3: 000000010e22c000 CR4: 00000000001506e0
[ 13.537999] Fixing recursive fault but reboot is needed!
[ 13.538003] check_preemption_disabled: 6 callbacks suppressed
[ 13.538005] BUG: using smp_processor_id() in preemptible [00000000] code: irq/24-AMD-Vi/56
[ 13.538012] caller is __schedule+0x30/0x1390
[ 13.538017] CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Kdump: loaded Tainted: G D W ------- --- 6.2.0-0.rc1.14.fc38.x86_64 #1
[ 13.538023] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
[ 13.538027] Call Trace:
[ 13.538030] <TASK>
[ 13.538032] dump_stack_lvl+0x44/0x5c
[ 13.538039] check_preemption_disabled+0xe1/0xf0
[ 13.538045] __schedule+0x30/0x1390
[ 13.538049] ? __wake_up_klogd.part.0+0x56/0x80
[ 13.538055] ? vprintk_emit+0x11d/0x290
[ 13.538061] ? _printk+0x5a/0x60
[ 13.538068] do_task_dead+0x3f/0x50
[ 13.538074] make_task_dead.cold+0x51/0xba
[ 13.538080] rewind_stack_and_make_dead+0x17/0x20
[ 13.538085] RIP: 0000:0x0
[ 13.538092] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 13.538096] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
[ 13.538101] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 13.538105] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 13.538108] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 13.538112] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 13.538116] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 13.538121] </TASK>
[ 13.538124] BUG: scheduling while atomic: irq/24-AMD-Vi/56/0x00000000
[ 13.538128] Modules linked in: amdgpu(+) drm_ttm_helper ttm iommu_v2 hid_logitech_hidpp crct10dif_pclmul drm_buddy crc32_pclmul gpu_sched crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 drm_display_helper wdat_wdt serio_raw hid_multitouch sp5100_tco hid_logitech_dj r8169 cec video wmi scsi_dh_rdac scsi_dh_emc scsi_dh_alua fuse dm_multipath
[ 13.538159] Preemption disabled at:
[ 13.538160] [<0000000000000000>] 0x0
[ 13.538166] CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Kdump: loaded Tainted: G D W ------- --- 6.2.0-0.rc1.14.fc38.x86_64 #1
[ 13.538172] Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
[ 13.538175] Call Trace:
[ 13.538178] <TASK>
[ 13.538180] dump_stack_lvl+0x44/0x5c
[ 13.538185] __schedule_bug.cold+0x80/0x8d
[ 13.538191] __schedule+0xf5c/0x1390
[ 13.538195] ? __wake_up_klogd.part.0+0x56/0x80
[ 13.538200] ? vprintk_emit+0x11d/0x290
[ 13.538206] ? _printk+0x5a/0x60
[ 13.538211] do_task_dead+0x3f/0x50
[ 13.538216] make_task_dead.cold+0x51/0xba
[ 13.538221] rewind_stack_and_make_dead+0x17/0x20
[ 13.538226] RIP: 0000:0x0
[ 13.538231] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 13.538234] RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
[ 13.538240] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 13.538243] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 13.538247] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 13.538251] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 13.538254] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 13.538260] </TASK>

I tried to use the crash program on the core dump but it stopped with an error
crash: page excluded: kernel virtual address: ffff90aa0044db60 type: "xa_node shift"

I'm attaching the full dmesg file vmcore-dmesg.txt.
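To pick the warnings and oopses out of a long dmesg like the one above, a small Python filter can help. This is only a sketch: the function name find_problems and the list of markers are my own choices, and the sample lines are a few entries copied from the log in this comment rather than the full vmcore-dmesg.txt.

```python
import re

# "[ 13.515555] message" -> timestamp and message text.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# Message prefixes treated as problems in this sketch (not an exhaustive list).
MARKERS = ("WARNING:", "BUG:", "Oops:", "general protection fault")


def find_problems(log_text):
    """Return (timestamp, message) pairs for log lines that start with a marker."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        timestamp, message = float(match.group(1)), match.group(2)
        if message.startswith(MARKERS):
            hits.append((timestamp, message))
    return hits


sample = """\
[ 13.515549] ------------[ cut here ]------------
[ 13.515555] WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:251 pci_disable_pri+0x75/0x80
[ 13.525966] WARNING: CPU: 0 PID: 477 at drivers/pci/ats.c:419 pci_disable_pasid+0x45/0x50
[ 13.537481] BUG: kernel NULL pointer dereference, address: 0000000000000058
[ 13.537515] Oops: 0000 [#1] PREEMPT SMP NOPTI
"""

for timestamp, message in find_problems(sample):
    print(f"{timestamp:.6f}  {message}")
```

Run against the full log, this would list the two ats.c warnings first and the NULL pointer dereference after them, which matches the order described above.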
This problem was reported to the upstream kernel mailing lists by regzbot at https://lore.kernel.org/all/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
Vasant Hegde wrote four patches that fix this problem, which were posted at
https://lore.kernel.org/all/20230111121503.5931-1-vasant.hegde@amd.com/
and
https://lore.kernel.org/linux-iommu/20230215052642.6016-1-vasant.hegde@amd.com/

Those patches were pulled into the mainline branch at
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=080920e52148b4fbbf9360d5345fdcd7846e4841
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f451c7a5a3b818ecfeba2ba258570769998baf3a
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=master&id=996d120b4de2b0d6b592bd9fbbe6e244b81ab3cc
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=master&id=2cc73c5712f97de98c38c2fafc1f288354a9f3c3

Fedora-KDE-Live-x86_64-Rawhide-20230301.n.0.iso with kernel-6.3.0-0.rc0.20230228gitae3419fbac84.9.fc39 didn't have this problem when booting.

The patches were queued for the 6.2 branch. 6.2.0 still had the problem when booting F38 live images. The patches didn't appear to be in 6.2.1.
https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.2.1
Proposed as a Blocker and Freeze Exception for 38-beta by Fedora user mattf using the blocker tracking app because:

A system with an integrated AMD Radeon R5 GPU got stuck on a black screen when amdgpu started during boot with 6.2 kernels. A null pointer dereference in an AMD IOMMU function led to an amdgpu crash on each boot.

Based on the upstream discussion
https://lore.kernel.org/all/15d0f9ff-2a56-b3e9-5b45-e6b23300ae3b@leemhuis.info/
and the first bad commit
https://bugzilla.redhat.com/show_bug.cgi?id=2156691#c1
this problem might affect systems with AMD GPUs using the amdgpu kernel driver which have an AMD IOMMU enabled and don't show the Access Control Services (ACS) capability of PCIe as available according to sudo lspci -vvvv.

This problem still affects Fedora-KDE-Live-x86_64-38-20230302.n.0.iso with the 6.2.0 kernel. Patches which fix this problem were pulled into the mainline kernel during the 6.3 merge window
https://bugzilla.redhat.com/show_bug.cgi?id=2156691#c5
The Rawhide image Fedora-KDE-Live-x86_64-Rawhide-20230301.n.0.iso with kernel-6.3.0-0.rc0.20230228gitae3419fbac84.9.fc39 didn't have this problem.

Adding amd_iommu=off to the kernel command line is a workaround, as is booting in basic graphics mode, which uses nomodeset on the kernel command line.

The Fedora 38 Beta blocker criterion "All release-blocking images must boot in their supported configurations." might be violated.
https://fedoraproject.org/wiki/Fedora_38_Beta_Release_Criteria#Release-blocking_images_must_boot
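For checking whether lspci lists the ACS capability, the following Python sketch scans saved lspci -vvvv output for the "Access Control Services" capability name under each device. The function devices_with_acs is hypothetical, and the sample text is an abbreviated, made-up illustration of the output layout rather than output from my laptop. It only reports which devices list the capability; it doesn't check which ACS controls are enabled on the upstream path, which is what the first bad commit tests.

```python
def devices_with_acs(lspci_text):
    """Map each device address to True if an ACS capability line follows it."""
    result = {}
    current = None
    for line in lspci_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Device header lines start in the first column, e.g. "00:01.0 VGA ...".
            current = line.split(" ", 1)[0]
            result[current] = False
        elif current is not None and "Access Control Services" in line:
            result[current] = True
    return result


# Made-up, abbreviated sample in the shape of `sudo lspci -vvvv` output.
sample = (
    "00:01.0 VGA compatible controller: Example GPU (hypothetical)\n"
    "\tCapabilities: [48] Vendor Specific Information: Len=08\n"
    "\n"
    "00:02.2 PCI bridge: Example bridge (hypothetical)\n"
    "\tCapabilities: [2a0 v1] Access Control Services\n"
)

for address, has_acs in devices_with_acs(sample).items():
    print(address, "ACS listed" if has_acs else "ACS not listed")
```

The real output could be saved with sudo lspci -vvvv > lspci.txt and passed to the function as a string.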
The commits are queued up against 6.2.y, should be in 6.2.2.
(In reply to Mario Limonciello from comment #7)
> The commits are queued up against 6.2.y, should be in 6.2.2.

Thanks. The patches don't appear to be in 6.2.2
https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.2.2
I think they are queued for 6.2.3 based on
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/commit/?id=caf843f8b59d0e54cd579938ae38aa6e1d4985af
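As a rough way to sort kernel builds by whether they fall in the suspect range, here is a Python sketch that compares the base version of a Fedora kernel version string. The function names are hypothetical. The lower bound 6.2.0 comes from this report; the upper bound 6.2.3 is only my guess from the stable queue link above and is an assumption, not a confirmed fix version. Snapshot builds that share a base version, such as the good and bad 6.2.0-0.rc0 builds, can't be told apart this way.

```python
import re


def parse_base_version(release):
    """Return (major, minor, patch) from strings like 'kernel-6.2.0-0.rc1.14.fc38'."""
    match = re.match(r"(?:kernel-)?(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel version: {release!r}")
    return tuple(int(part) for part in match.groups())


def in_suspect_range(release, first_bad=(6, 2, 0), fixed_in=(6, 2, 3)):
    """True if the base version is at least first_bad and below fixed_in.

    fixed_in=(6, 2, 3) is an assumption based on the stable queue, not verified.
    """
    return first_bad <= parse_base_version(release) < fixed_in


for release in ("kernel-6.1.0-65.fc38",
                "6.2.0-0.rc1.14.fc38",
                "6.3.0-0.rc0.20230228gitae3419fbac84.9.fc39"):
    print(release, in_suspect_range(release))
```

If the fix lands in a different stable release than I expect, only the fixed_in default needs to change.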
+3 FE in https://pagure.io/fedora-qa/blocker-review/issue/1068 , marking accepted FE.
Discussed during the 2023-03-06 blocker review meeting: [0] The decision to classify this bug as a "RejectedBlocker (Beta)" was made as current indications suggest this probably isn't widespread enough to block the Beta for. Note it is already accepted as a freeze exception issue. [0] https://meetbot.fedoraproject.org/fedora-blocker-review/2023-03-06/f38-blocker-review.2023-03-06-17.00.txt
FEDORA-2023-772dc52d53 has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-772dc52d53
FEDORA-2023-772dc52d53 has been pushed to the Fedora 38 testing repository. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2023-772dc52d53 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2023-772dc52d53 has been pushed to the Fedora 38 stable repository. If problem still persists, please make note of it in this bug report.