Bug 2177111 - amdgpu failed to resume involving AMD IOMMU with 6.2.2-301 kernel resulting in a black screen
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 38
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-03-10 03:23 UTC by Matt Fagnani
Modified: 2023-03-21 05:02 UTC
CC List: 17 users

Fixed In Version:
Doc Type: ---
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug


Attachments
The kernel log for a boot when I clicked Sleep in sddm, tried to resume the system, and the problem happened. (108.24 KB, text/plain)
2023-03-10 03:23 UTC, Matt Fagnani


Links
System ID Private Priority Status Summary Last Updated
Linux Kernel 217170 0 P1 RESOLVED amdgpu failed to resume involving AMD IOMMU with 6.2.2-301 kernel resulting in a black screen 2023-03-21 05:09:01 UTC
freedesktop.org Gitlab drm amd issues 2454 0 None opened amdgpu failed to resume involving AMD IOMMU with 6.2.2-301 kernel resulting in a black screen 2023-03-21 05:09:04 UTC

Description Matt Fagnani 2023-03-10 03:23:49 UTC
Created attachment 1949458
The kernel log for a boot when I clicked Sleep in sddm, tried to resume the system, and the problem happened.

1. Please describe the problem:

I booted a Fedora 38 KDE Plasma installation on an HP laptop with an AMD A10-9620P CPU and an integrated Radeon R5 GPU. I selected Sleep in either the Application Launcher menu in Plasma 5.27.2 on Wayland or sddm on Wayland. The system went to sleep. I moved the mouse to wake the system. The screen remained black, but the LEDs on the side of the laptop flickered, indicating drive activity, and the fan started making noise again. I pressed sysrq+alt+s,u,b to do an emergency sync, remount read-only, and reboot. The system rebooted. The journal showed that amdgpu failed to resume with errors including "amdgpu: amdgpu_device_ip_resume failed (-6)." These errors started right after the kernel failed to resume the AMD IOMMU:

Mar 09 20:27:55 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
Mar 09 20:27:55 kernel: amdgpu 0000:00:01.0: amdgpu: amdgpu_device_ip_resume failed (-6).
Mar 09 20:27:55 kernel: amdgpu 0000:00:01.0: PM: dpm_run_callback(): pci_pm_resume+0x0/0xe0 returns -6
Mar 09 20:27:55 kernel: amdgpu 0000:00:01.0: PM: failed to resume async: error -6
Mar 09 20:27:55 kernel: sd 0:0:0:0: [sda] Starting disk
Mar 09 20:27:55 kernel: usb 2-1.4: reset full-speed USB device number 4 using ehci-pci
Mar 09 20:27:55 kernel: usb 2-1.3: reset full-speed USB device number 3 using ehci-pci
Mar 09 20:27:55 kernel: psmouse serio1: synaptics: queried max coordinates: x [..5648], y [..4826]
Mar 09 20:27:55 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Mar 09 20:27:55 kernel: psmouse serio1: synaptics: queried min coordinates: x [1292..], y [1026..]
Mar 09 20:27:55 kernel: ata1.00: configured for UDMA/133
Mar 09 20:27:55 kernel: PM: resume devices took 2.703 seconds
Mar 09 20:27:55 kernel: OOM killer enabled.
Mar 09 20:27:55 kernel: Restarting tasks ... done.
Mar 09 20:27:55 kernel: random: crng reseeded on system resumption
Mar 09 20:27:55 kernel: thermal thermal_zone2: failed to read out thermal zone (-61)
Mar 09 20:27:55 kernel: Bluetooth: hci0: Legacy ROM 2.x revision 5.0 build 25 week 20 2015
Mar 09 20:27:55 kernel: Bluetooth: hci0: Intel Bluetooth firmware file: intel/ibt-hw-37.8.10-fw-22.50.19.14.f.bseq
Mar 09 20:27:55 kernel: PM: suspend exit
Mar 09 20:27:55 kernel: Generic FE-GE Realtek PHY r8169-0-100:00: attached PHY driver (mii_bus:phy_addr=r8169-0-100:00, irq=MAC)
Mar 09 20:27:55 kernel: r8169 0000:01:00.0 enp1s0: Link is Down
Mar 09 20:27:56 kernel: Bluetooth: hci0: Intel BT fw patch 0x43 completed & activated
Mar 09 20:28:00 kernel: r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control off
Mar 09 20:28:00 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready
Mar 09 20:28:01 kernel: r8169 0000:01:00.0 enp1s0: Link is Down
Mar 09 20:28:02 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC= SRC=fe80:0000:0000:0000:265c:5b24:c7aa:102b DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=185 TC=0 HOPLIMIT=255 FLOWLBL=110208 PROTO=UDP SPT=5353 DPT=5353 LEN=145 
Mar 09 20:28:04 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC= SRC=fe80:0000:0000:0000:265c:5b24:c7aa:102b DST=ff02:0000:0000:0000:0000:0000:0000:00fb LEN=185 TC=0 HOPLIMIT=255 FLOWLBL=110208 PROTO=UDP SPT=5353 DPT=5353 LEN=145 
Mar 09 20:28:05 kernel: r8169 0000:01:00.0 enp1s0: Link is Up - 1Gbps/Full - flow control off
Mar 09 20:28:06 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=49904, emitted seq=49906
Mar 09 20:28:06 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Mar 09 20:28:06 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset begin!
Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: amdgpu: ib ring test failed (-110).
Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Mar 09 20:28:07 kernel: [drm:gfx_v8_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
Mar 09 20:28:07 kernel: amdgpu: cp is busy, skip halt cp
Mar 09 20:28:07 kernel: amdgpu: rlc is busy, skip halt rlc
Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset succeeded, trying to resume
Mar 09 20:28:07 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset(1) failed
Mar 09 20:28:07 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Mar 09 20:28:07 kernel: amdgpu: sdma_bitmap: f
Mar 09 20:28:07 kernel: kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:9874
Mar 09 20:28:07 kernel: kfd kfd: amdgpu: device 1002:9874 NOT added due to errors
Mar 09 20:28:07 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset end with ret = -6
Mar 09 20:28:07 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -6
Mar 09 20:28:10 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC= SRC=192.168.2.10 DST=224.0.0.251 LEN=234 TOS=0x00 PREC=0x00 TTL=255 ID=40777 DF PROTO=UDP SPT=5353 DPT=5353 LEN=214 
Mar 09 20:28:10 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC= SRC=192.168.2.10 DST=224.0.0.251 LEN=234 TOS=0x00 PREC=0x00 TTL=255 ID=40988 DF PROTO=UDP SPT=5353 DPT=5353 LEN=214 
Mar 09 20:28:10 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC= SRC=192.168.2.10 DST=224.0.0.251 LEN=234 TOS=0x00 PREC=0x00 TTL=255 ID=41207 DF PROTO=UDP SPT=5353 DPT=5353 LEN=214 
Mar 09 20:28:11 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC= SRC=192.168.2.10 DST=224.0.0.251 LEN=216 TOS=0x00 PREC=0x00 TTL=255 ID=41247 DF PROTO=UDP SPT=5353 DPT=5353 LEN=196 
Mar 09 20:28:12 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC= SRC=192.168.2.10 DST=224.0.0.251 LEN=216 TOS=0x00 PREC=0x00 TTL=255 ID=41784 DF PROTO=UDP SPT=5353 DPT=5353 LEN=196 
Mar 09 20:28:14 kernel: filter_IN_drop_DROP: IN=enp1s0 OUT= MAC= SRC=192.168.2.10 DST=224.0.0.251 LEN=216 TOS=0x00 PREC=0x00 TTL=255 ID=42530 DF PROTO=UDP SPT=5353 DPT=5353 LEN=196 
Mar 09 20:28:18 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=49906, emitted seq=49908
Mar 09 20:28:18 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Mar 09 20:28:18 kernel: amdgpu 0000:00:01.0: amdgpu: GPU reset begin!
Mar 09 20:28:18 kernel: amdgpu 0000:00:01.0: amdgpu: IP block:gfx_v8_0 is hung!
Mar 09 20:28:18 kernel: amdgpu 0000:00:01.0: amdgpu: soft reset failed, will fallback to full reset!
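
To pick the relevant lines out of a long kernel log like the excerpt above, a small filter can help. The following is only a sketch: the substrings in SIGNATURES are copied from this log and are an assumption about what is worth matching, not an official list of amdgpu or IOMMU messages. The helper names (describe_errno, scan) are made up for illustration. It also decodes the negative return values using Python's errno table, which on Linux maps 6 to ENXIO.

```python
import errno
import os
import re

# Substrings copied from the log excerpt in this report (heuristic, not exhaustive).
SIGNATURES = (
    "Failed to resume IOMMU",
    "amdgpu_device_ip_resume failed",
    "GPU Recovery Failed",
)

# A negative return value written either as "(-6)" or after a keyword such as "error -6".
ERRNO_RE = re.compile(r"\((-\d+)\)|(?:returns|error|ret =|Failed:) (-\d+)")


def describe_errno(code):
    """Map a negative kernel return value to its symbolic errno name and message."""
    number = abs(code)
    name = errno.errorcode.get(number, "unknown")
    return f"{name} ({os.strerror(number)})"


def scan(lines):
    """Yield (line, decoded errno or None) for each line containing a signature."""
    for line in lines:
        if not any(signature in line for signature in SIGNATURES):
            continue
        match = ERRNO_RE.search(line)
        decoded = None
        if match:
            # Exactly one of the two groups is set, depending on which form matched.
            decoded = describe_errno(int(match.group(1) or match.group(2)))
        yield line.rstrip(), decoded
```

The input could be the lines of the dmesg.txt file produced by ``journalctl --no-hostname -k > dmesg.txt``, for example ``for line, err in scan(open("dmesg.txt")): print(err, line)``. Lines without a return value, such as the "Failed to resume IOMMU" message itself, are reported with None.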

This problem happened 3/3 times with the 6.2.2-301 kernel. That kernel contained patches that fixed the black screen which occurred when amdgpu started during boot with all previous 6.2 branch kernels on this system, as reported at https://bugzilla.redhat.com/show_bug.cgi?id=2156691. I booted with amd_iommu=off on the kernel command line, which was a workaround for that previous problem, and the failure to resume didn't happen when I put the system to sleep 5 times, so the AMD IOMMU is likely involved in this problem. I reported this problem at https://gitlab.freedesktop.org/drm/amd/-/issues/2454 and https://bugzilla.kernel.org/show_bug.cgi?id=217170. This problem didn't happen with 6.1.15 or earlier. Bisecting this problem might be difficult because previous 6.2 kernels had the black screen problem on boot with the default kernel command line parameters, and the failure to resume didn't happen with amd_iommu=off.
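
When comparing runs with and without the amd_iommu=off workaround, it can be useful to confirm which setting the running kernel was actually booted with. The sketch below reads the kernel command line from /proc/cmdline and lists every amd_iommu= value it finds. The function names are made up for illustration, and the example command line in the comment is hypothetical, not taken from the affected system.

```python
def amd_iommu_values(cmdline):
    """Return every value given to amd_iommu= on a kernel command line, in order."""
    values = []
    for token in cmdline.split():
        if token.startswith("amd_iommu="):
            values.append(token.split("=", 1)[1])
    return values


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


# Hypothetical example input:
#   amd_iommu_values("root=/dev/sda1 ro amd_iommu=off quiet")
# On a live system:
#   "off" in amd_iommu_values(read_cmdline())
```

An empty list means the parameter was not given, which corresponds to the default configuration in which the resume failure was seen.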

2. What is the Version-Release number of the kernel:
6.2.2-301.fc38

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
Yes, 6.1.15 and earlier resumed normally. The problem first appeared with 6.2.2-301.fc38.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
1. Boot a Fedora 38 KDE Plasma installation updated to 2023-3-9 with updates-testing enabled on a laptop with an AMD A10-9620P CPU, an integrated Radeon R5 GPU, and an AMD IOMMU enabled
2. Select Virtual Keyboard at the bottom left of sddm if the Sleep, Reboot, Shut down buttons don't appear
3. Select Sleep in sddm
4. Resume the system by moving the mouse or pressing a key

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
I haven't tested the latest Rawhide kernel yet.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
I'm attaching the kernel log for a boot when I clicked Sleep in sddm, tried to resume the system, and the problem happened.

Comment 1 Matt Fagnani 2023-03-15 06:57:38 UTC
kernel-6.3.0-0.rc1.20230309git6a98c9cae232.18.fc39.x86_64 has this resume problem. kernel-6.3.0-0.rc0.20230227gitf3a2439f20d9.9.fc39.x86_64 is the first Rawhide kernel without the black screen during boot problem (https://gitlab.freedesktop.org/drm/amd/-/issues/2319), and it also has this failure to resume problem. The previous build, kernel-6.3.0-0.rc0.20230223gita5c95ca18a98.4.fc39.x86_64, had the black screen during boot.

I reported this problem to the IOMMU subsystem mailing list at https://lore.kernel.org/all/4a3b225c-2ffd-e758-4de1-447375e34cad@bell.net/T/#u and Vasant Hegde and Felix Kuehling explained the details of the problem in amdgpu there. Thorsten Leemhuis added the problem to regzbot: https://lore.kernel.org/all/4a3b225c-2ffd-e758-4de1-447375e34cad@bell.net/T/#m52dfb8f457727ce725aad66e5e7db4e8afa46fad and https://linux-regtracking.leemhuis.info/regzbot/regression/217170/

I built 6.3-rc2 after applying Felix's patch from https://lore.kernel.org/stable/20230314175359.1747662-1-Felix.Kuehling@amd.com/ and amdgpu resumed normally 5/5 times with 6.3-rc2 + the patch. Felix's patch fixed the problem. Thanks.

Comment 2 Matt Fagnani 2023-03-21 05:02:59 UTC
Felix Kuehling's patch to fix this problem was pulled into the mainline branch on 2023-3-17 and is in 6.3-rc3: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=master&id=f3921a9a641483784448fb982b2eb738b383d9b9

6.3.0-0.rc3.30.fc39 didn't have this problem when I resumed it a few times. Felix's patch is queued for the 6.2 branch at https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/commit/?id=d68ccb83abd757877de8c7f344fa43c05b81760f

