2262577 – kernel-6.7.4 broken suspend (QCNFA765 ath11k)

Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	39
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Duplicates (1):	2264875

Reported:	2024-02-04 02:27 UTC by Warren Togami
Modified:	2024-03-02 10:22 UTC
CC List:	33 users
Last Closed:	2024-02-22 02:19:26 UTC

Links
System	ID	Priority	Status	Summary	Last Updated
Gitlab	kernel-firmware/linux-firmware/-/commit/5217b76bed90ae86d5f3fe9a5f4e2301868cdd02	None	None	None	2024-02-08 17:24:39 UTC
Linux Kernel	217239	P1	RESOLVED	ath11k: WCN6855: firmware -3.6510.23 (and later) breaks suspend on certain setups	2024-02-08 05:47:07 UTC
Red Hat Bugzilla	2192082	unspecified	CLOSED	ath11k firmware breaks suspend	2024-02-29 01:58:01 UTC

Description Warren Togami 2024-02-04 02:27:26 UTC

Thinkpad T14s Gen 3 AMD
AMD Ryzen 7 PRO 6850U with Radeon Graphics

kernel-6.7.3-200.fc39.x86_64
Suspend causes a deadlock. The screen goes black but does not turn back on. The keyboard lights stay on. Caps Lock does not respond, suggesting a deadlock. Nothing is logged to the journal.

Working Versions
kernel-6.5.*
kernel-6.6.*

Reproducible: Always

Comment 1 Martin Wolf 2024-02-04 06:24:21 UTC

I have the same problem on my HP 845 G9
(Same CPU)

Comment 2 Peter Robinson 2024-02-05 13:11:56 UTC

What WiFi modules do these have out of interest?

Comment 3 Martin Wolf 2024-02-05 13:15:37 UTC

01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 Wireless Network Adapter (rev 01)
	Subsystem: Foxconn International, Inc. Device e0c4
	Flags: bus master, fast devsel, latency 0, IRQ 120, IOMMU group 11
	Memory at b4000000 (64-bit, non-prefetchable) [size=2M]
	Capabilities: [40] Power Management version 3
	Capabilities: [50] MSI: Enable+ Count=32/32 Maskable+ 64bit-
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [148] Secondary PCI Express
	Capabilities: [158] Transaction Processing Hints
	Capabilities: [1e4] Latency Tolerance Reporting
	Capabilities: [1ec] L1 PM Substates
	Kernel driver in use: ath11k_pci
	Kernel modules: ath11k_pci

I plan to do a bisect, might take a while.

Comment 4 Warren Togami 2024-02-05 13:22:36 UTC

01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 Wireless Network Adapter (rev 01)
        Subsystem: Lenovo Device 9309
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin ? routed to IRQ 92
        IOMMU group: 11
        Region 0: Memory at 98000000 (64-bit, non-prefetchable) [size=2M]
        Capabilities: <access denied>
        Kernel driver in use: ath11k_pci
        Kernel modules: ath11k_pci


1. Disable Wifi
2. modprobe -r ath11k_pci ath11k
3. suspend and resume works
4. modprobe ath11k_pci causes deadlock
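
The first three steps above can be sketched as a small dry-run helper. This is only an illustration under assumptions: `nmcli radio wifi off` and `systemctl suspend` are my guesses for "Disable Wifi" and "suspend" (the steps do not say how either was done), and the function names are hypothetical. Step 4 is left out on purpose because reloading the module is what deadlocks.

```python
import subprocess


def workaround_plan():
    """Return the commands for steps 1-3 as argv lists.

    Assumptions (not stated in the steps): NetworkManager's nmcli is
    used to disable WiFi and systemd is used to suspend.
    """
    return [
        ["nmcli", "radio", "wifi", "off"],           # step 1 (assumed tool)
        ["modprobe", "-r", "ath11k_pci", "ath11k"],  # step 2, as written above
        ["systemctl", "suspend"],                    # step 3 (assumed tool)
    ]


def run_plan(plan, dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    shown = []
    for argv in plan:
        line = " ".join(argv)
        shown.append(line)
        print(line)
        if not dry_run:
            # Needs root for modprobe; stops at the first failing command.
            subprocess.run(argv, check=True)
    return shown


if __name__ == "__main__":
    run_plan(workaround_plan(), dry_run=True)
```

Running it as-is only prints the commands; pass `dry_run=False` as root to actually execute them.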

Comment 5 Martin Wolf 2024-02-05 13:24:24 UTC

This is helpful for the bisect.

Comment 6 Peter Robinson 2024-02-05 17:00:04 UTC

So I suspect it might be this bug:
https://bugzilla.kernel.org/show_bug.cgi?id=218364

Fixed upstream with:
556857aa1d0855aba02b1c63bc52b91ec63fc2cc

A fix should be heading to a 6.7 stable release soon.

Comment 7 Warren Togami 2024-02-06 09:11:38 UTC

556857aa1d0855aba02b1c63bc52b91ec63fc2cc was already included in kernel-6.7.3 yet we experience this suspend crash.
The crash seems to be gone from kernel-6.7.4 though. It seems they fixed something else?

Comment 8 Gilbert Fernandes 2024-02-06 14:03:58 UTC

Hello everyone. I just upgraded my Fedora 39 kernel to 6.7.3-200.fc39.x86_64, and I am using the Qualcomm Technologies, Inc QCNFA765
(Lenovo P16s AMD Gen2).
I tested modern standby after the kernel upgrade and a reboot, and I am currently not seeing the issue on the 6.7.3 kernel I just received.

Available for any tests, any kernel you want me to try if you need help.

Comment 9 Gilbert Fernandes 2024-02-06 14:06:06 UTC

My WiFi card:

gf@aesir:~$ lspci -vv -s 01:00.0
01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 Wireless Network Adapter (rev 01)
	Subsystem: Lenovo Device 9309
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin ? routed to IRQ 91
	IOMMU group: 12
	Region 0: Memory at 78600000 (64-bit, non-prefetchable) [size=2M]
	Capabilities: <access denied>
	Kernel driver in use: ath11k_pci
	Kernel modules: ath11k_pci

Comment 10 Dennis 2024-02-06 15:01:59 UTC

Same issue here in Fedora 39 since installing kernel 6.7.3. Up to and including kernel 6.6.13, everything worked fine.

There seems to be an issue involving kernel 6.7.x (and above) and an AMD RX 7800 XT graphics card, and it isn’t Fedora-specific. When rebooting, the monitor goes into sleep mode, and there is no way to wake it up again by pressing keys. In the background the OS seems to boot up, so it would be possible to log into the system without seeing anything. Pressing the power button shuts down the computer. After turning it on again, the boot screen is visible and the login screen appears. Everything works until the next reboot.

So, the only solution to prevent the monitor going to sleep is shutting down the computer when a reboot is needed.

The same issue appeared in Nobara 39 when kernel 6.7.0-200 was available and installed. The developer was able to fix this issue by himself. With kernel 6.7.0-204 installed, the issue is fixed in Nobara 39. Looks like this commit was to blame here and was reverted by the developer of Nobara.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5f38ac54e60562323ea4abb1bfb37d043ee23357

Now I switched back to kernel 6.6.13 in Grub menu to be able to reboot my system if needed.

I'm not using a Wi-Fi card! Here is the output of inxi -Fz:

System:
  Kernel: 6.6.13-200.fc39.x86_64 arch: x86_64 bits: 64 Desktop: KDE Plasma
    v: 5.27.10 Distro: Fedora release 39 (Thirty Nine)
Machine:
  Type: Desktop Mobo: Micro-Star model: MAG X570 TOMAHAWK WIFI (MS-7C84)
    v: 1.0 serial: <superuser required> UEFI: American Megatrends LLC. v: 1.F0
    date: 10/12/2023
CPU:
  Info: 12-core model: AMD Ryzen 9 3900X bits: 64 type: MT MCP cache:
    L2: 6 MiB
  Speed (MHz): avg: 2336 min/max: 2200/4672 cores: 1: 2054 2: 2049 3: 2200
    4: 2200 5: 2200 6: 3800 7: 2200 8: 2200 9: 2200 10: 2200 11: 2200 12: 2200
    13: 2038 14: 2031 15: 4515 16: 2200 17: 2200 18: 2200 19: 2200 20: 2200
    21: 2200 22: 2199 23: 2200 24: 2200
Graphics:
  Device-1: AMD Navi 32 [Radeon RX 7700 XT / 7800 XT] driver: amdgpu v: kernel
  Display: wayland server: X.org v: 1.20.14 with: Xwayland v: 23.2.4
    compositor: kwin_wayland driver: X: loaded: amdgpu
    unloaded: fbdev,modesetting,radeon,vesa dri: radeonsi gpu: amdgpu
    resolution: 3440x1440
  API: EGL v: 1.5 drivers: radeonsi,swrast
    platforms: wayland,x11,surfaceless,device
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: amd mesa v: 23.3.3 renderer: AMD
    Radeon RX 7800 XT (radeonsi navi32 LLVM 17.0.6 DRM 3.54
    6.6.13-200.fc39.x86_64)
  API: Vulkan v: 1.3.268 drivers: radv,llvmpipe surfaces: xcb,xlib,wayland
Audio:
  Device-1: AMD Navi 31 HDMI/DP Audio driver: snd_hda_intel
  Device-2: AMD Starship/Matisse HD Audio driver: snd_hda_intel
  API: ALSA v: k6.6.13-200.fc39.x86_64 status: kernel-api
  Server-1: PipeWire v: 1.0.3 status: active
Network:
  Device-1: Realtek RTL8125 2.5GbE driver: r8169
  IF: enp38s0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Bluetooth:
  Device-1: Intel AX200 Bluetooth driver: btusb type: USB
  Report: btmgmt ID: hci0 state: up address: <filter> bt-v: 5.2
Drives:
  Local Storage: total: 7.51 TiB used: 2.65 TiB (35.3%)
  ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 EVO 1TB size: 931.51 GiB
  ID-2: /dev/nvme1n1 vendor: Samsung model: SSD 970 EVO Plus 2TB
    size: 1.82 TiB
  ID-3: /dev/sda vendor: Samsung model: SSD 850 EVO 1TB size: 931.51 GiB
  ID-4: /dev/sdb vendor: Seagate model: ST2000DM001-1ER164 size: 1.82 TiB
  ID-5: /dev/sdc vendor: Seagate model: ST2000DM001-1CH164 size: 1.82 TiB
  ID-6: /dev/sdd vendor: Samsung model: SSD 850 PRO 256GB size: 238.47 GiB
Partition:
  ID-1: / size: 929.93 GiB used: 30.85 GiB (3.3%) fs: btrfs
    dev: /dev/nvme0n1p3
  ID-2: /boot size: 973.4 MiB used: 352 MiB (36.2%) fs: ext4
    dev: /dev/nvme0n1p2
  ID-3: /boot/efi size: 598.8 MiB used: 19 MiB (3.2%) fs: vfat
    dev: /dev/nvme0n1p1
  ID-4: /home size: 929.93 GiB used: 30.85 GiB (3.3%) fs: btrfs
    dev: /dev/nvme0n1p3
Swap:
  ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) dev: /dev/zram0
Sensors:
  System Temperatures: cpu: 41.0 C mobo: N/A gpu: amdgpu temp: 37.0 C
  Fan Speeds (rpm): N/A gpu: amdgpu fan: 1
Info:
  Processes: 715 Uptime: 18m Memory: total: 32 GiB available: 31.26 GiB
  used: 4.39 GiB (14.1%) Shell: Bash inxi: 3.3.31

Comment 11 Dennis 2024-02-06 15:12:18 UTC

I forgot to mention that the same issue also exists with the Fedora 40 Rawhide kernel 6.8.0 RC!

Comment 12 Gilbert Fernandes 2024-02-06 15:18:44 UTC

Yes. This seems very plausible to me.
Ever since I got my laptop, the screen has gone black from time to time. I found that the laptop was still working fine; it is just the display that goes black.
Putting the machine to sleep and waking it turns the display on again.
In the system log I usually see this when it happens:

[ 5260.723233] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 5260.723557] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait

I'm using the integrated GPU that comes with the 7840U: a Radeon 780M.
And from the Lenovo forums it seems quite a few of us have issues with the recent Radeon GPUs :(

Mark Pearson from Lenovo's team told me to add this to my kernel options:
amdgpu.dcdebugmask=0x10

For two months I have been running all my Fedora 39 kernels with that option, and the issue now happens very rarely (maybe once every 2 weeks or so, when watching YouTube or doing anything graphically intensive).

My laptop came with a low-power version of the LCD, so I wonder if that's the issue: the kernel tries to tell the graphics card to do something related to power use, and since I'm already using a low-power display, it fails to do something that works on standard displays with high-power/low-power modes (mine is permanently in low-power mode to reduce power use; it's an option when you order the display).

Sadly, I don't have a 7800 XT, so I cannot give useful information on whether the issue is tied to that hardware.

Comment 13 Warren Togami 2024-02-07 01:56:40 UTC

I was mistaken. The ath11k suspend crash is not fixed in 6.7.4. Investigating...

Comment 14 Warren Togami 2024-02-07 13:13:43 UTC

6.7.4 ath11k crashes on suspend if Bluetooth is enabled.
Disable Bluetooth and it doesn't crash.

It does have a separate problem where data transfer becomes very slow after resume. Removing and loading the ath11k_pci kernel module again seems to be the only fix without a reboot.

Comment 15 Peter Robinson 2024-02-07 14:18:34 UTC

Also there looks to be a GPU suspend regression reported here:
https://gitlab.freedesktop.org/drm/amd/-/issues/3132

Comment 16 moe_jo 2024-02-08 03:34:04 UTC

Blank screen when resuming from suspend. I guess the keyboard becomes non-responsive, since Caps Lock does not respond. I have to do a hard reboot. My system information is below.

Laptop model: Lenovo Slim 7 ProX 14ARH
CPU: AMD Ryzen™ 9 6900HS Creator Edition × 16
Graphics: AMD Radeon™ Graphics / NVIDIA GeForce RTX™ 3050 Laptop GPU
Network controller: Intel Corporation Wi-Fi 6 AX210/AX211/AX411 160MHz (rev 1a)

OS: Fedora 39
Kernel: Linux 6.7.4-200.fc39.x86_64

Comment 17 Mario Limonciello 2024-02-08 05:43:29 UTC

The regression for ath11k (WCN6855) is actually in the linux-firmware.

Here is the fixed binary:
https://gitlab.com/kernel-firmware/linux-firmware/-/commit/5217b76bed90ae86d5f3fe9a5f4e2301868cdd02

Here is the broken version string:
fw_version 0x1109996e fw_build_timestamp 2023-12-19 11:11 fw_build_id WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.36

Here is the fixed version string:
fw_version 0x1106196e fw_build_timestamp 2024-01-12 11:30 fw_build_id WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.37
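
To check which of the two builds a machine has loaded, the `fw_build_id` field of a log line like the ones above can be compared against the two suffixes. A minimal sketch, assuming the line has the same layout as the strings quoted in this comment; the function names are hypothetical, and "broken"/"fixed" only restate what is reported here for `-3.6510.36` and `-3.6510.37`.

```python
import re

# Suffixes copied from the two version strings quoted above.
BROKEN_BUILD = "3.6510.36"  # reported broken
FIXED_BUILD = "3.6510.37"   # reported fixed

_FW_BUILD_ID = re.compile(r"fw_build_id\s+(\S+)")


def firmware_build_suffix(log_line):
    """Return the part of fw_build_id after the last '-', or None."""
    match = _FW_BUILD_ID.search(log_line)
    if match is None:
        return None
    return match.group(1).rsplit("-", 1)[-1]


def classify_firmware(log_line):
    """Map a log line to 'broken', 'fixed' or 'unknown'."""
    suffix = firmware_build_suffix(log_line)
    if suffix == BROKEN_BUILD:
        return "broken"
    if suffix == FIXED_BUILD:
        return "fixed"
    return "unknown"


if __name__ == "__main__":
    line = ("fw_version 0x1106196e fw_build_timestamp 2024-01-12 11:30 "
            "fw_build_id WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.37")
    print(classify_firmware(line))
```

Any other build, or a line without `fw_build_id`, is deliberately reported as "unknown" rather than guessed.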

Comment 18 Stéphane Klein 2024-02-08 22:15:58 UTC

I opened this thread: https://discussion.fedoraproject.org/t/random-resume-after-suspend-issue-on-thinkpad-t14s-amd-gen3-radeon-680m-ryzen-7/103452/7

I think I have the bug described in the current issue.

> Here is the fixed version string: fw_version 0x1106196e fw_build_timestamp 2024-01-12 11:30 fw_build_id WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.37

@mario.limonciello How can I apply this fix?

Comment 19 Mario Limonciello 2024-02-08 22:20:48 UTC

From that link I posted above https://gitlab.com/kernel-firmware/linux-firmware/-/commit/5217b76bed90ae86d5f3fe9a5f4e2301868cdd02 there is a view file button.  You'd need to download that firmware binary, compress it with the same compression matched in Fedora (xz w/ crc32 IIRC) and then replace the file in the right structure in /lib/firmware.

I hope that Fedora can get an updated firmware binary into updates ASAP though.
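
The compression step can be sketched with Python's standard `lzma` module. This is an illustration, not a tested Fedora procedure: it assumes xz with a CRC32 check really is what Fedora uses (stated above only as "IIRC"), the file names `amss.bin` and `amss.bin.xz` are placeholders, and copying the result into the right place under /lib/firmware is left to the reader.

```python
import lzma
from pathlib import Path


def xz_crc32(data):
    """Compress bytes as an .xz stream using a CRC32 integrity check."""
    return lzma.compress(data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC32)


def compress_firmware(src, dst):
    """Read the downloaded binary at src and write the .xz file to dst.

    Both paths are chosen by the caller; nothing is installed here.
    """
    compressed = xz_crc32(Path(src).read_bytes())
    Path(dst).write_bytes(compressed)
    return len(compressed)


if __name__ == "__main__":
    # Placeholder names: the downloaded binary and the compressed output.
    size = compress_firmware("amss.bin", "amss.bin.xz")
    print(f"wrote {size} bytes")
```

Keep a copy of the file you replace so the packaged firmware can be restored.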

Comment 20 Warren Togami 2024-02-09 13:18:53 UTC

I've extensively tested linux-firmware/ath11k/WCN6855/hw2.0/amss.bin versions 23, 36 and 37 with kernels 6.6.14, 6.7.4 and 6.8.0-rc3.

* It was claimed that 37 fixes suspend broken by 36 but it does not. It seems a little better than firmware 36 but it still frequently deadlocks in 6.7.4 and 6.8.0-rc3 during suspend. Sometimes it fails the first time. Sometimes it works the first time then fails the second time. Sometimes it succeeds at suspending 10 times in a row.
* 6.6.14 doesn't fail to suspend but firmware 37 makes data throughput very slow after suspend. Firmware 37 is also broken in that regard with 6.7.4 and 6.8.0-rc3. See notes below.
* 6.8.0-rc3 fails more than 6.7.4 but is otherwise similar in behavior.
* Firmware version 23 from 2023-02-15 does not suffer from slow data throughput after suspend like firmware 37. This behavior happens with 6.6.14, 6.7.4, and 6.8.0-rc3.
* Suspend deadlock behavior in 6.7.4 seems to behave the same with firmware 23 and 37.

Bluetooth and power_save toggle
===============================
Bluetooth disable or power_save mode off seems to have some effect on the likelihood of suspend deadlock. Unclear.

Regarding power_save quirky behavior
====================================
https://forums.lenovo.com/t5/Other-Linux-Discussions/QCNFA765-Linux-ath11k-wifi-crippled-high-latency-packet-loss-frequent-disassociations-T14s-AMD/m-p/5252399
Since kernel 6.4.* and earlier, many of us have been struggling with this flaky driver while power_save=on. With many but not all access points it would exhibit extreme packet loss. Many of us have been turning off power_save, which used to work around the problem.

You can see the current status with: iw dev wlp1s0 get power_save
With fresh boot it starts with power_save=on.
More recent kernels switch to power_save=off automatically on suspend. I'm guessing it's because somebody realized power_save was problematic.

Behavior is different pre and post-suspend
========================================== 
kernel-6.6.14 with firmware 23 or 37
kernel-6.7.4  with firmware 23 or 37
Both the above exhibit slow data throughput immediately after boot while power_save=on. I see a maximum of 1MB/sec.
(I am not seeing packet loss like with kernel-6.4 but I am on a different network so can't compare at the moment. The slow data throughput is despite the lack of packet loss.)
# iw dev wlp1s0 set power_save off
Immediately after turning power_save off, I am able to achieve 15MB/sec.

kernel-6.6.14 and 6.7.4 with firmware 23
After suspend data throughput is still 15MB/sec while power_save=off.

kernel-6.6.14 and 6.7.4 with firmware 36 or 37
After suspend data throughput becomes CRIPPLED like 11KB/sec with maximum 50KB/sec. Toggling power_save after this point doesn't fix it.
Half of the time unloading and reloading the ath11k_pci kernel module brings it back to the original state of max 1MB/sec where power_save=off can reach 15MB/sec.
The other half of the time reloading the kernel module deadlocks the machine.
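
For anyone who wants to script the power_save check used in this comment, here is a minimal sketch. It assumes `iw dev <interface> get power_save` prints a line of the form `Power save: on` or `Power save: off`; the helper names are hypothetical and the interface name is whatever your system uses (wlp1s0 here).

```python
import subprocess


def parse_power_save(iw_output):
    """Return True for on, False for off, None if the line is missing."""
    for line in iw_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "power save":
            state = value.strip().lower()
            if state == "on":
                return True
            if state == "off":
                return False
    return None


def set_power_save_argv(interface, enabled):
    """Build the iw command shown above for switching power_save."""
    return ["iw", "dev", interface, "set", "power_save",
            "on" if enabled else "off"]


def current_power_save(interface):
    """Query iw and parse the result (needs iw installed)."""
    result = subprocess.run(
        ["iw", "dev", interface, "get", "power_save"],
        capture_output=True, text=True, check=True,
    )
    return parse_power_save(result.stdout)


if __name__ == "__main__":
    if current_power_save("wlp1s0"):
        # Setting power_save requires root.
        subprocess.run(set_power_save_argv("wlp1s0", False), check=True)
```

This only automates the manual toggle; it does not address the post-suspend slowdown described above, which toggling power_save did not fix.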

Comment 21 Warren Togami 2024-02-09 13:29:58 UTC

kernel-6.6.14
linux-firmware/ath11k/WCN6855/hw2.0/amss.bin
power_save=off

Only this combination of kernel, firmware, and settings has been crash-free and fast for me. I have experienced zero problems with this combination.

Comment 22 Warren Togami 2024-02-09 13:35:14 UTC

kernel-6.6.14
linux-firmware/ath11k/WCN6855/hw2.0/amss.bin version 23
power_save=off

Only this combination of kernel, firmware, and settings has been crash-free and fast for me. I have experienced zero problems with this combination.

Comment 23 Warren Togami 2024-02-09 14:38:47 UTC

I'm going with the assumption that the continued suspend problems are in fact the amdgpu regression.

I will separately file the "suspend with firmware 37 makes ath11k slow" problem upstream.

Comment 24 Martin Wolf 2024-02-10 10:35:40 UTC

I did the same "tests" @wtogami postulated and I came to the same conclusion on my HP 845 G9.

It is not fixed (yet). 

Even downgrading to the Firmware from 2022 does not help.

@wtogami in the kernel bugzilla someone suggested to file a separate bug. I think that is a good idea and mark it with "regression" so that Thorsten Leemhuis gets involved.

Comment 25 Thomas Moschny 2024-02-10 15:52:51 UTC

Same problem here, very annoying. Is there a workaround? rmmod'ing ath11k_pci before suspend doesn't seem to help. The second suspend at the latest deadlocks the machine.

System:
  Kernel: 6.7.3-200.fc39.x86_64 arch: x86_64 bits: 64 Desktop: GNOME v: 45.3
    Distro: Fedora release 39 (Thirty Nine)
Machine:
  Type: Laptop System: LENOVO product: 21CQCTO1WW v: ThinkPad T14s Gen 3
  Mobo: LENOVO model: 21CQCTO1WW v: SDK0T76530 WIN
    UEFI: LENOVO v: R22ET65W (1.35 )
    date: 08/08/2023
CPU:
  Info: 8-core model: AMD Ryzen 7 PRO 6850U with Radeon Graphics bits: 64
    type: MT MCP cache: L2: 4 MiB
  Speed (MHz): avg: 616 min/max: 400/4768 cores: 1: 1397 2: 1862 3: 400
    4: 400 5: 400 6: 400 7: 400 8: 400 9: 400 10: 400 11: 400 12: 400 13: 400
    14: 400 15: 1397 16: 400
Graphics:
  Device-1: AMD Rembrandt [Radeon 680M] driver: amdgpu v: kernel
Network:
  Device-1: Qualcomm QCNFA765 Wireless Network Adapter driver: ath11k_pci

Comment 26 Gilbert Fernandes 2024-02-10 16:06:49 UTC

Please upgrade your kernel to 6.7.4-200. Seems to fix the atheros issue as one of the comments says :
https://bodhi.fedoraproject.org/updates/FEDORA-2024-3ca09cc1a0

Please create an account on bodhi.fedoraproject.org
And run the kernel regression tests as explained here :
https://fedoramagazine.org/running-fedora-kernel-regression-tests/
If you configure it properly it will upload results.
And report how it goes there :
https://bodhi.fedoraproject.org/updates/FEDORA-2024-3ca09cc1a0

Comment 27 Mario Limonciello 2024-02-11 04:33:15 UTC

As mentioned above there are two separate regressions. 

1) The ath11k issue is fixed by the upgraded firmware binary.  -36 is definitely broken for many but not all people.  -37 fixes it.  This needs to be updated in Fedora.

2) There is a GPU driver regression: https://gitlab.freedesktop.org/drm/amd/-/issues/3132.  This is only triggered when there is activity specifically at suspend time, such as triggering the lock screen from a lid close event.  It's fixed by this series https://lore.kernel.org/amd-gfx/20240208055256.130917-1-mario.limonciello@amd.com/ of which patches 1 and 2 should be sent out in the 6.8-rc5 fixes pull request.

Comment 28 Thomas Moschny 2024-02-11 15:21:36 UTC

(In reply to Gilbert Fernandes from comment #26)
> Please upgrade your kernel to 6.7.4-200. Seems to fix the atheros issue as
> one of the comments says :
> https://bodhi.fedoraproject.org/updates/FEDORA-2024-3ca09cc1a0

kernel 6.7.4-200 does *not* fix the suspend issues. Neither disabling Bluetooth nor unloading ath11k_pci helps. The deadlock can happen at resume or suspend time.

Comment 29 Justin M. Forbes 2024-02-13 16:02:15 UTC

(In reply to Mario Limonciello from comment #27)
> As mentioned above there are two separate regressions. 
> 2) There is a GPU driver regression:
> https://gitlab.freedesktop.org/drm/amd/-/issues/3132.  This is only
> triggered when there is activity specifically at suspend time such as
> triggering the lock screen from a lid close event.  It's fixed by this
> series
> https://lore.kernel.org/amd-gfx/20240208055256.130917-1-mario.limonciello@amd.com/
> which patches 1 and 2 should be sent out to the 6.8-rc5
> fixes pull request.

These fixes are in linux-next now, and I have pulled them back so that they will be in the 6.7.5 stable update when it is released.

Comment 30 bruno c 2024-02-15 13:19:56 UTC

I seem to have the same problem on kernels >6.6.14 (e.g., 6.7.3, 6.7.4) on my Dell XPS 9320. Interestingly, this machine does not have a non-integrated GPU, and this problem started exactly when the IPU camera stopped working, so initially I thought these were related.

```
System:
  Kernel: 6.6.14-200.fc39.x86_64 arch: x86_64 bits: 64
  Desktop: GNOME v: 45.4 Distro: Fedora Linux 39 (Workstation Edition)
Machine:
  Type: Laptop System: Dell product: XPS 9320 v: N/A
    serial: <superuser required>
  Mobo: Dell model: 0JPN6G v: A00 serial: <superuser required> UEFI: Dell
    v: 1.9.0 date: 09/23/2022
CPU:
  Info: 12-core (4-mt/8-st) model: 12th Gen Intel Core i7-1260P bits: 64
    type: MST AMCP cache: L2: 9 MiB
Graphics:
  Device-1: Intel Alder Lake-P GT2 [Iris Xe Graphics] driver: i915 v: kernel
  Display: wayland server: X.Org v: 23.2.4 with: Xwayland v: 23.2.4
    compositor: gnome-shell driver: dri: iris gpu: i915
    resolution: 1920x1200~60Hz
  API: OpenGL v: 4.6 vendor: intel mesa v: 23.3.5 renderer: Mesa Intel
    Graphics (ADL GT2)
  API: EGL Message: EGL data requires eglinfo. Check --recommends.
Audio:
  Device-1: Intel Alder Lake Imaging Signal Processor driver: intel-ipu6
  Device-2: Intel Alder Lake PCH-P High Definition Audio
    driver: sof-audio-pci-intel-tgl
  API: ALSA v: k6.6.14-200.fc39.x86_64 status: kernel-api
  Server-1: PipeWire v: 1.0.3 status: active
Network:
  Device-1: Intel Alder Lake-P PCH CNVi WiFi driver: iwlwifi
  IF: wlp0s20f3 state: up mac: <filter>
Bluetooth:
  Device-1: Intel AX211 Bluetooth driver: btusb type: USB
  Report: btmgmt ID: hci0 rfk-id: 0 state: down bt-service: enabled,running
    rfk-block: hardware: no software: yes address: <filter> bt-v: 5.3
Drives:
  Local Storage: total: 953.87 GiB used: 66.18 GiB (6.9%)
  ID-1: /dev/nvme0n1 vendor: SK Hynix model: PC801 NVMe 1TB
    size: 953.87 GiB
Swap:
  ID-1: swap-1 type: zram size: 8 GiB used: 0 KiB (0.0%) dev: /dev/zram0
Info:
  Memory: total: 16 GiB note: est. available: 15.23 GiB
    used: 3.02 GiB (19.8%)
  Processes: 420 Uptime: 25m Shell: fish inxi: 3.3.32
```

Comment 31 Peter Robinson 2024-02-19 14:37:02 UTC

*** Bug 2264875 has been marked as a duplicate of this bug. ***

Comment 32 Dennis 2024-02-19 14:43:48 UTC

(In reply to Justin M. Forbes from comment #29)
> These fixes are in linux-next now, and I have pulled them back so that they
> will be in the 6.7.5 stable update when it release.

My issue (Comment 10) is fixed after installing kernel 6.7.5 (https://bodhi.fedoraproject.org/updates/FEDORA-2024-88847bc77a) in Fedora 39 KDE. I'm now able to reboot my system without getting a black screen.

Comment 33 simpre 2024-02-19 16:00:12 UTC

While I had occasional problems with suspend in the past, since a recent update I can now reproduce this issue every single time I enter the suspend mode. If my Thinkpad is plugged into an external power source and if the lid is closed for entering the suspend mode, the screen just stays black after attempting a wake up. If it is not plugged in, it wakes up just fine most of the time.

I tried booting with an older kernel that previously worked, with the current kernel, and with the 6.8.0-0.rc4 rawhide kernel - all show the same problem.

I am using a P14s Gen3 AMD with AMD Ryzen 7 PRO 6850U with Radeon Graphics. Fedora 39.

Comment 34 Fedora Update System 2024-02-20 21:46:14 UTC

FEDORA-2024-0e9661ca97 (linux-firmware-20240220-1.fc39) has been submitted as an update to Fedora 39.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-0e9661ca97

Comment 35 Fedora Update System 2024-02-20 21:46:19 UTC

FEDORA-2024-355c0ca9d3 (linux-firmware-20240220-1.fc38) has been submitted as an update to Fedora 38.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-355c0ca9d3

Comment 36 kattendorf 2024-02-20 21:46:58 UTC

I experience exactly the same issue with a T14s Gen 3 AMD

Comment 37 Fedora Update System 2024-02-21 02:03:44 UTC

FEDORA-2024-355c0ca9d3 has been pushed to the Fedora 38 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-355c0ca9d3`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-355c0ca9d3

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 38 Fedora Update System 2024-02-21 02:36:02 UTC

FEDORA-2024-0e9661ca97 has been pushed to the Fedora 39 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-0e9661ca97`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-0e9661ca97

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 39 Fedora Update System 2024-02-22 02:19:26 UTC

FEDORA-2024-0e9661ca97 (linux-firmware-20240220-1.fc39) has been pushed to the Fedora 39 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 40 simpre 2024-02-22 10:31:23 UTC

The problem still persists with the latest linux-firmware 20240220-1.fc39. And with kernel 6.7.5, suspend mode (and the network + some of the Fn keys) is not working any more.

Comment 41 Mario Limonciello 2024-02-22 16:45:29 UTC

As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2262577#c27 there are two separate suspend related problems that happened at about the same time.
There are two kernel patches that still need to be backported.

Comment 42 Justin M. Forbes 2024-02-22 17:06:35 UTC

(In reply to Mario Limonciello from comment #41)
> As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2262577#c27
> there are two separate suspend related problems that happened at about the
> same time.
> There are two kernel patches that still need to be backported.

If you are talking about:

- drm/amd: Stop evicting resources on APUs in suspend (Mario Limonciello)
- Revert "drm/amd: flush any delayed gfxoff on suspend entry" (Mario Limonciello)

I pulled those back as soon as they hit linux-next, and they are included in the 6.7.5 fedora kernels.

Comment 43 Mario Limonciello 2024-02-22 17:28:43 UTC

In that case we must still have another problem :/

Comment 44 Mario Limonciello 2024-02-22 17:37:17 UTC

I can't trip it, but I think we're looking at a race condition problem.

This isn't a solution; but at least to confirm that hypothesis can you build a Fedora test kernel for people to try that reverts 6b1adc1bd3fe38c7af00aed18086b86d13f5db8b but is otherwise the same?

Comment 45 Justin M. Forbes 2024-02-22 18:34:54 UTC

(In reply to Mario Limonciello from comment #44)
> I can't trip it, but I think we're looking at a race condition problem.
> 
> This isn't a solution; but at least to confirm that hypothesis can you build
> a Fedora test kernel for people to try that reverts
> 6b1adc1bd3fe38c7af00aed18086b86d13f5db8b but is otherwise the same?

What tree is that commit ID from? It doesn't exist in Linus's tree, stable 6.7.y, or the Fedora tree.

Comment 46 Mario Limonciello 2024-02-22 19:01:21 UTC

Must have been a bad copy paste somehow; sorry I can't even find it locally.  Here's the hash I meant:

94b1e028e15c94362420f9f3f711fafbf9d52996

Comment 47 Justin M. Forbes 2024-02-22 23:16:33 UTC

https://koji.fedoraproject.org/koji/taskinfo?taskID=113897958 should finish soon, it has been building for a bit, this is the current fedora 6.7.5 with the revert if people will test.

Comment 48 Alessandro 2024-02-26 17:55:31 UTC

Hello,

I have a 

 LENOVO 21D2CTO1WW (ThinkPad Z13 Gen 1) running BIOS 1.64 (N3GET64W (1.64 ))
 AMD Ryzen 7 PRO 6860Z with Radeon Graphics (family 19 model 44)
 WCN6855 WLAN (fw build id WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.37)
 Fedora Linux 39 (Workstation Edition)

which is affected by the issue described in this report (machine hangs/crash on suspend when the lid is closed).
FWIW, I rebuilt 6.8-rc5 without 94b1e028e15c94362420f9f3f711fafbf9d52996 and I no longer see any crashes: I can consistently close the lid and the machine suspends correctly.

I'm willing to try 6.7.5 but I can't find it on koji.

Comment 49 Justin M. Forbes 2024-02-26 18:05:02 UTC

https://koji.fedoraproject.org/koji/taskinfo?taskID=113897958 is the build in koji of 6.7.5 with that patch reverted.

Comment 50 Alessandro 2024-02-26 18:07:53 UTC

(In reply to Justin M. Forbes from comment #49)
> https://koji.fedoraproject.org/koji/taskinfo?taskID=113897958 is the build
> in koji of 6.7.5 with that patch reverted.

$ cd $(mktemp -d) && koji download-build --arch=x86_64 kernel-6.7.5-201.fc39
No such build: kernel-6.7.5-201.fc39

Perhaps I'm doing something wrong, first time I use koji, bear with me.

Comment 51 Justin M. Forbes 2024-02-26 18:13:18 UTC

(In reply to Alessandro from comment #50)
> (In reply to Justin M. Forbes from comment #49)
> > https://koji.fedoraproject.org/koji/taskinfo?taskID=113897958 is the build
> > in koji of 6.7.5 with that patch reverted.
> 
> $ cd $(mktemp -d) && koji download-build --arch=x86_64 kernel-6.7.5-201.fc39
> No such build: kernel-6.7.5-201.fc39
> 
> Perhaps I'm doing something wrong, first time I use koji, bear with me.

Ahh yes, you can't do that with scratch builds because there could be several with the same name... 

cd $(mktemp -d) && koji download-task 113897958
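
The same idea as a tiny helper that turns a koji taskinfo link into the download command. A sketch only: it assumes the task ID is carried in the `taskID` query parameter, as in the link above, and the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def koji_task_id(taskinfo_url):
    """Extract the numeric task ID from a koji taskinfo URL."""
    values = parse_qs(urlparse(taskinfo_url).query).get("taskID")
    if not values or not values[0].isdigit():
        raise ValueError(f"no numeric taskID in {taskinfo_url!r}")
    return int(values[0])


def download_task_argv(taskinfo_url):
    """Build the 'koji download-task <id>' command for a scratch build."""
    return ["koji", "download-task", str(koji_task_id(taskinfo_url))]


if __name__ == "__main__":
    url = "https://koji.fedoraproject.org/koji/taskinfo?taskID=113897958"
    print(" ".join(download_task_argv(url)))
```

Run the printed command in an empty directory, as in the one-liner above, since it downloads every RPM the task produced.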

Comment 52 Alessandro 2024-02-26 18:34:24 UTC

(In reply to Justin M. Forbes from comment #51)
> (In reply to Alessandro from comment #50)
> > (In reply to Justin M. Forbes from comment #49)
> > > https://koji.fedoraproject.org/koji/taskinfo?taskID=113897958 is the build
> > > in koji of 6.7.5 with that patch reverted.
> > 
> > $ cd $(mktemp -d) && koji download-build --arch=x86_64 kernel-6.7.5-201.fc39
> > No such build: kernel-6.7.5-201.fc39
> > 
> > Perhaps I'm doing something wrong, first time I use koji, bear with me.
> 
> Ahh yes, you can't do that with scratch builds because there could be
> several with the same name... 
> 
> cd $(mktemp -d) && koji download-task 113897958

Thanks Justin, that works. TIL
I rebooted into 6.7.5-201 and it seems consistent:

 - I ran amd_s2idle.py --count 4 and it didn't break;
 - I closed the lid three times and it didn't break;
 - I suspended from the GNOME menu twice and it didn't break.

It looks good so far.

The only issue I keep seeing with amd_s2idle.py is this:
Explanations for your system
🚦 ACPI BIOS Errors detected
	When running a firmware component utilized for s2idle
	the ACPI interpreter in the Linux kernel encountered some
	problems. This usually means it's a bug in the system BIOS
	that should be fixed the system manufacturer.

	You may have problems with certain devices after resume or high
	power consumption when this error occurs.
	ACPI BIOS Error (bug): Failure creating named object [\_SB.PCI0.GP17.XHC0.PSTA], AE_ALREADY_EXISTS (20230628/dswload2-326)	
        ACPI Error: AE_ALREADY_EXISTS, During name lookup/catalog (20230628/psobject-220)

Perhaps I'll have to liaise with the manufacturer on the latter, though any hints would be appreciated.

Comment 53 Mario Limonciello 2024-02-26 19:13:50 UTC

Well, if the scratch build (and your rc5 test) works, it confirms there is either a race condition or a mutex deadlock occurring.

Let me explain the situation:
When you close the lid or suspend from the GNOME menu it uses logind to kick off the suspend sequence.
Logind isn't synchronous, and so the kernel suspend sequence will start while userspace is still active.
During this time the lock screen will come up, DPMS will be engaged, etc.

What's happening is that there is some SDMA traffic at this time from the lock screen coming up or the DPMS action.

The patch that you reverted intentionally blocks GFXOFF from occurring, to work around a low-level platform issue that was reported under SDMA stress.
During the suspend sequence there is a point when all pending GFXOFF requests are flushed, and I "think" that's conflicting.

Unfortunately, I can't reproduce the issue locally, so it's very hard for me to accurately hypothesize the specifics. 
So this is purely a guess; but does this help?  You can apply it to 6.8-rc6.

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
index 0058f3f7cf6e..c78aa71d8753 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v5_2.c
@@ -1655,7 +1655,8 @@ static void sdma_v5_2_ring_begin_use(struct amdgpu_ring *ring)
         * this GFXOFF will be disallowed anyway when SDMA is
         * active, this just makes it explicit.
         */
-       amdgpu_gfx_off_ctrl(adev, false);
+       if (!adev->in_s0ix)
+               amdgpu_gfx_off_ctrl(adev, false);
 }

 static void sdma_v5_2_ring_end_use(struct amdgpu_ring *ring)
@@ -1666,7 +1667,8 @@ static void sdma_v5_2_ring_end_use(struct amdgpu_ring *ring)
         * disallow GFXOFF in some cases leading to
         * hangs in SDMA.  Allow GFXOFF when SDMA is complete.
         */
-       amdgpu_gfx_off_ctrl(adev, true);
+       if (!adev->in_s0ix)
+               amdgpu_gfx_off_ctrl(adev, true);
 }

 const struct amd_ip_funcs sdma_v5_2_ip_funcs = {
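
A diff pasted into a web form tends to lose its tabs, so it may need whitespace-tolerant handling before it applies. Below is a rough sketch of a check-then-apply step, assuming the diff above was saved to a file and a 6.8-rc6 tree is checked out. `apply_patch_checked` is a hypothetical helper name, not something from the kernel tree.

```shell
# Hypothetical helper: dry-run a patch first, then apply it only if it is clean.
# Usage: apply_patch_checked <tree-dir> <absolute-path-to-patch>
# The patch path should be absolute because git changes into <tree-dir> first.
apply_patch_checked() {
    tree=$1
    patch_file=$2
    # --check only reports whether the patch would apply; no files are touched.
    # --ignore-whitespace tolerates tab/space damage in context lines.
    git -C "$tree" apply --check --ignore-whitespace "$patch_file" || {
        echo "patch does not apply cleanly to $tree" >&2
        return 1
    }
    git -C "$tree" apply --ignore-whitespace "$patch_file"
}
```

The dry run keeps a half-applied patch from leaving the tree in a state that would muddy the test result.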

Comment 54 Alessandro 2024-02-27 21:34:07 UTC

(In reply to Mario Limonciello from comment #53)

Thanks for your reply and for the patch, Mario.
I applied it to 6.8-rc6 without reverting 94b1e028e15c94362420f9f3f711fafbf9d52996.
I can consistently suspend the machine with amd_s2idle.py; however, when I close the lid the machine hangs/crashes.
There is nothing in /var/lib/systemd/pstore/ and this is the last line I found in the kernel log:

Feb 27 22:14:43 kernel: PM: suspend entry (s2idle)

Is there anything else I can do to help you debug further? I'm not very familiar with the amdgpu code, but I'm willing
to help you sort this out.

Thanks
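
To check for the same signature on other machines, one option is to look for a suspend entry with no later exit in the kernel log of the boot that hung. This is a minimal sketch, assuming a completed cycle logs a `PM: suspend exit` line after `PM: suspend entry`. `last_suspend_completed` is a hypothetical helper name.

```shell
# Hypothetical helper: read kernel log text on stdin and exit 0 if the last
# "PM: suspend entry" was followed by a "PM: suspend exit", or 1 if it was not.
# A log with no suspend entries at all also exits 0.
last_suspend_completed() {
    awk '
        BEGIN { pending = 0 }
        /PM: suspend entry/ { pending = 1 }   # a suspend attempt started
        /PM: suspend exit/  { pending = 0 }   # the attempt finished and resumed
        END { exit pending }
    '
}
```

For example, `journalctl -k -b -1 | last_suspend_completed || echo "previous boot never left suspend"` would flag a boot whose log ends like the line above.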

Comment 55 Mario Limonciello 2024-02-27 21:45:25 UTC

Thanks for checking it and confirming that the patch doesn't help.  Let's discuss ideas for next steps on the upstream bug report, as this one is closed.  If we come up with a solution, we'll nominate it for stable and ping jforbes so Fedora can pick it up more quickly, considering the regression.

Comment 56 Fedora Update System 2024-02-29 01:58:04 UTC

FEDORA-2024-355c0ca9d3 (linux-firmware-20240220-1.fc38) has been pushed to the Fedora 38 stable repository.
If problem still persists, please make note of it in this bug report.

Comment 57 Stéphane Klein 2024-03-01 09:56:27 UTC

> I rebooted into 6.7.5-201 and it seems consistent:
>
> - I ran amd_s2idle.py --count 4 and it didn't break;
> - I closed the lid three times and it didn't break;
> - I suspended from the Gnome menu two times and it didn't break;

@alessandro.cassese I didn't see kernel 201 in https://koji.fedoraproject.org/koji/packageinfo?packageID=8

I see only:

- kernel-6.7.5-200.fc39
- and kernel-6.7.6-200.fc39

Questions:

- Do you think that kernel-6.7.6-200.fc39 contains the bugfix patch?
- If not, why isn't kernel-6.7.5-201.fc39 available?

Best regards,
Stéphane

Comment 58 Peter Robinson 2024-03-01 11:08:05 UTC

> @alessandro.cassese I didn't see kernel 201 in
> https://koji.fedoraproject.org/koji/packageinfo?packageID=8

It was a scratch build as referenced in comment 47.

> - Do you think that the kernel kernel-6.7.6-200.fc39 contains the bugfix
> patch? 

There's no reference to it in the changelog, so unless it's in the upstream 6.7.6 changelog, no.

> - Else, why isn't the kernel-6.7.5-201.fc39 available?

It's a scratch build with a test in it.

Comment 59 Stéphane Klein 2024-03-02 10:22:32 UTC

> > - Do you think that the kernel kernel-6.7.6-200.fc39 contains the bugfix
> > patch? 
>
> No reference to it in the changelog so unless it's in the upstream 6.7.6 changelog no.

I applied all the Fedora 39 updates, and for the past 24 hours my initial problem seems to be fixed.

See package details here: https://discussion.fedoraproject.org/t/random-resume-after-suspend-issue-on-thinkpad-t14s-amd-gen3-radeon-680m-ryzen-7/103452/19

Best regards,
Stéphane
