Bug 2029207

Summary: Laptop does not wake up from suspend w/ kernel 5.15
Product: [Fedora] Fedora
Reporter: Phil <beaaegicfqmq6rykaqaakty3lqcg6btv>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 35
CC: acaringi, adscvr, airlied, alciregi, bjorn, bskeggs, dirk, hdegoede, ivan-bugzilla, jarodwilson, jeremy, jforbes, jglisse, jonathan, josef, kernel-maint, lgoncalv, linville, mail, masami256, mchehab, omalley_s, ptalbert, steved
Hardware: x86_64   
OS: Linux   
Last Closed: 2022-03-11 15:14:47 UTC
Type: Bug
Attachments:
- log file for 5.17.0-0.rc0.20220112gitdaadb3bd0e8d.63.fc36.x86_64
- dmesg - without option
- dmesg - with pci=use_e820 option
- dmesg UEFI / test kernel
- dmesg / 5.16.9-200.e820.fc35 test kernel / efi=debug
- dmesg / 5.16.9-200.e820_2.fc35 test kernel / efi=debug
- dmesg-5.16.12-200.e820.fc35.x86_64 / legacy bios
- dmesg.txt 6.0.11-300.efimmio.fc37.x86_64
- acpidump
- [PATCH] iio: light: cm32181: Stop suspend failures causing system suspend to fail
- dmesg 6.0.11-300.cm32181.fc37.x86_64

Description Phil 2021-12-05 18:50:45 UTC
1. Please describe the problem:

My laptop (Lenovo X1 Carbon (20A7)) does not resume from suspend-to-RAM in Fedora 35 when using either kernel-5.15.5-200.fc35.x86_64 or kernel-5.15.6-200.fc35.x86_64. I did not test other 5.15 kernels. kernel-5.14.17-301.fc35.x86_64 works fine, though.

2. What is the Version-Release number of the kernel:

kernel-5.15.6-200.fc35.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

kernel-5.15.5-200.fc35.x86_64

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

- boot 5.15 kernel
- close the lid to suspend the laptop
- wait
- open the lid and see that nothing happens

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

no, 5.16.0-0.rc3.20211203git5f58da2befa5.26.fc36.x86_64 works fine.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

no.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.
   
Unfortunately, once suspended I cannot get any logs.

Comment 1 Phil 2021-12-06 18:57:27 UTC
Well, now it happened with 5.16.0-0.rc3, too. Not reproducible; it happened randomly.

Comment 2 Leonard Ehrenfried 2021-12-14 15:12:43 UTC
I'm facing exactly the same issue. Sadly I don't have more information but will post the suspend log when I try booting into a 5.15 kernel next time.

I resorted to completely uninstalling the 5.15 series kernels with the following command:

sudo dnf remove kernel-core-5.15*

Comment 3 Leonard Ehrenfried 2021-12-20 13:21:39 UTC
Today I rebooted into a 5.15.8 kernel and tried again but suspend is broken for me, too. However, I managed to get the logs from the bad suspend:

Dec 20 14:07:54 fedora systemd-logind[951]: Power key pressed.
Dec 20 14:07:54 fedora ModemManager[1043]: <info>  [sleep-monitor] system is about to suspend
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4166] manager: sleep: sleep requested (sleeping: no  enabled: yes)
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4168] device (p2p-dev-wlp4s0): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4170] manager: NetworkManager state is now ASLEEP
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4170] device (wlp4s0): state change: activated -> deactivating (reason 'sleeping', sys-iface-state: 'managed')
Dec 20 14:07:54 fedora systemd[1]: Starting Network Manager Script Dispatcher Service...
Dec 20 14:07:54 fedora systemd[1]: Started Network Manager Script Dispatcher Service.
Dec 20 14:07:54 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Dec 20 14:07:54 fedora kernel: wlp4s0: deauthenticating from 74:83:c2:22:0b:95 by local choice (Reason: 3=DEAUTH_LEAVING)
Dec 20 14:07:54 fedora systemd-resolved[903]: wlp4s0: Bus client reset search domain list.
Dec 20 14:07:54 fedora systemd-resolved[903]: wlp4s0: Bus client set DNS server list to: fd00::de15:c8ff:fe45:7e67
Dec 20 14:07:54 fedora systemd-resolved[903]: wlp4s0: Bus client set default route setting: no
Dec 20 14:07:54 fedora systemd-resolved[903]: wlp4s0: Bus client reset DNS server list.
Dec 20 14:07:54 fedora wpa_supplicant[1150]: wlp4s0: CTRL-EVENT-DISCONNECTED bssid=74:83:c2:22:0b:95 reason=3 locally_generated=1
Dec 20 14:07:54 fedora wpa_supplicant[1150]: dbus: wpa_dbus_property_changed: no property SessionLength in object /fi/w1/wpa_supplicant1/Interfaces/0
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4781] device (wlp4s0): supplicant interface state: completed -> disconnected
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4782] device (wlp4s0): state change: deactivating -> disconnected (reason 'sleeping', sys-iface-state: 'managed')
Dec 20 14:07:54 fedora avahi-daemon[930]: Withdrawing address record for 2001:9e8:406:dc00:4f46:3d94:3ef9:20be on wlp4s0.
Dec 20 14:07:54 fedora avahi-daemon[930]: Leaving mDNS multicast group on interface wlp4s0.IPv6 with address 2001:9e8:406:dc00:4f46:3d94:3ef9:20be.
Dec 20 14:07:54 fedora avahi-daemon[930]: Joining mDNS multicast group on interface wlp4s0.IPv6 with address fe80::a03a:6da6:5135:c1c0.
Dec 20 14:07:54 fedora avahi-daemon[930]: Registering new address record for fe80::a03a:6da6:5135:c1c0 on wlp4s0.*.
Dec 20 14:07:54 fedora wpa_supplicant[1150]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=0 signal=0 noise=9999 txrate=0
Dec 20 14:07:54 fedora avahi-daemon[930]: Withdrawing address record for fe80::a03a:6da6:5135:c1c0 on wlp4s0.
Dec 20 14:07:54 fedora avahi-daemon[930]: Leaving mDNS multicast group on interface wlp4s0.IPv6 with address fe80::a03a:6da6:5135:c1c0.
Dec 20 14:07:54 fedora avahi-daemon[930]: Interface wlp4s0.IPv6 no longer relevant for mDNS.
Dec 20 14:07:54 fedora audit[1044]: NETFILTER_CFG table=firewalld:5 family=1 entries=5 op=nft_unregister_rule pid=1044 subj=system_u:system_r:firewalld_t:s0 comm="firewalld"
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4886] dhcp4 (wlp4s0): canceled DHCP transaction
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4886] dhcp4 (wlp4s0): state changed bound -> terminated
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4887] dhcp6 (wlp4s0): canceled DHCP transaction
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4887] dhcp6 (wlp4s0): state changed bound -> terminated
Dec 20 14:07:54 fedora avahi-daemon[930]: Interface wlp4s0.IPv4 no longer relevant for mDNS.
Dec 20 14:07:54 fedora avahi-daemon[930]: Leaving mDNS multicast group on interface wlp4s0.IPv4 with address 192.168.178.56.
Dec 20 14:07:54 fedora avahi-daemon[930]: Withdrawing address record for 192.168.178.56 on wlp4s0.
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4894] device (wlp4s0): set-hw-addr: set MAC address to FA:51:5D:AE:44:65 (scanning)
Dec 20 14:07:54 fedora avahi-daemon[930]: Joining mDNS multicast group on interface wlp4s0.IPv4 with address 192.168.178.56.
Dec 20 14:07:54 fedora avahi-daemon[930]: New relevant interface wlp4s0.IPv4 for mDNS.
Dec 20 14:07:54 fedora avahi-daemon[930]: Registering new address record for 192.168.178.56 on wlp4s0.IPv4.
Dec 20 14:07:54 fedora avahi-daemon[930]: Withdrawing address record for 192.168.178.56 on wlp4s0.
Dec 20 14:07:54 fedora avahi-daemon[930]: Leaving mDNS multicast group on interface wlp4s0.IPv4 with address 192.168.178.56.
Dec 20 14:07:54 fedora avahi-daemon[930]: Interface wlp4s0.IPv4 no longer relevant for mDNS.
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.4930] device (wlp4s0): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Dec 20 14:07:54 fedora gnome-shell[1905]: An active wireless connection, in infrastructure mode, involves no access point?
Dec 20 14:07:54 fedora chronyd[978]: Source 162.159.200.123 offline
Dec 20 14:07:54 fedora chronyd[978]: Source 185.120.22.12 offline
Dec 20 14:07:54 fedora chronyd[978]: Source 85.215.93.134 offline
Dec 20 14:07:54 fedora chronyd[978]: Can't synchronise: no selectable sources
Dec 20 14:07:54 fedora chronyd[978]: Source 85.220.190.246 offline
Dec 20 14:07:54 fedora NetworkManager[1058]: <info>  [1640005674.6033] device (wlp4s0): set-hw-addr: reset MAC address to 34:C9:3D:0F:E1:A5 (unmanage)
Dec 20 14:07:54 fedora wpa_supplicant[1150]: nl80211: deinit ifname=p2p-dev-wlp4s0 disabled_11b_rates=0
Dec 20 14:07:54 fedora wpa_supplicant[1150]: nl80211: deinit ifname=wlp4s0 disabled_11b_rates=0
Dec 20 14:07:55 fedora systemd[1]: Reached target Sleep.
Dec 20 14:07:55 fedora systemd[1]: Starting System Suspend...
Dec 20 14:07:55 fedora systemd-sleep[2907]: Entering sleep state 'suspend'...
Dec 20 14:07:55 fedora kernel: PM: suspend entry (deep)
-- Boot 28a2f3de1f254defac852cc20174ecf6 --

Compare this to the log for a working suspend:

Dec 03 22:00:11 fedora systemd[1]: flatpak-system-helper.service: Deactivated successfully.
Dec 03 22:00:11 fedora audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=flatpak-system-helper comm="systemd" exe="/usr/lib/systemd/systemd">
Dec 03 22:04:20 fedora gnome-shell[1895]: Window manager warning: last_user_time (1252394) is greater than comparison timestamp (1252393).  This most likely represents a buggy client sending inaccura>
Dec 03 22:04:20 fedora gnome-shell[1895]: Window manager warning: W0 appears to be one of the offending windows with a timestamp of 1252394.  Working around...
Dec 03 22:10:02 fedora systemd-logind[953]: Power key pressed.
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6230] manager: sleep: sleep requested (sleeping: no  enabled: yes)
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6232] device (p2p-dev-wlp4s0): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Dec 03 22:10:02 fedora ModemManager[1042]: <info>  [sleep-monitor] system is about to suspend
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6234] manager: NetworkManager state is now ASLEEP
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6235] device (wlp4s0): state change: activated -> deactivating (reason 'sleeping', sys-iface-state: 'managed')
Dec 03 22:10:02 fedora systemd[1]: Starting Network Manager Script Dispatcher Service...
Dec 03 22:10:02 fedora systemd[1]: Started Network Manager Script Dispatcher Service.
Dec 03 22:10:02 fedora audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-dispatcher comm="systemd" exe="/usr/lib/systemd/sys>
Dec 03 22:10:02 fedora kernel: wlp4s0: deauthenticating from 74:83:c2:22:0b:95 by local choice (Reason: 3=DEAUTH_LEAVING)
Dec 03 22:10:02 fedora systemd-resolved[903]: wlp4s0: Bus client reset search domain list.
Dec 03 22:10:02 fedora systemd-resolved[903]: wlp4s0: Bus client set DNS server list to: fd00::de15:c8ff:fe45:7e67
Dec 03 22:10:02 fedora systemd-resolved[903]: wlp4s0: Bus client set default route setting: no
Dec 03 22:10:02 fedora systemd-resolved[903]: wlp4s0: Bus client reset DNS server list.
Dec 03 22:10:02 fedora wpa_supplicant[1149]: wlp4s0: CTRL-EVENT-DISCONNECTED bssid=74:83:c2:22:0b:95 reason=3 locally_generated=1
Dec 03 22:10:02 fedora wpa_supplicant[1149]: dbus: wpa_dbus_property_changed: no property SessionLength in object /fi/w1/wpa_supplicant1/Interfaces/0
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6805] device (wlp4s0): supplicant interface state: completed -> disconnected
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6806] device (wlp4s0): state change: deactivating -> disconnected (reason 'sleeping', sys-iface-state: 'managed')
Dec 03 22:10:02 fedora avahi-daemon[929]: Withdrawing address record for 2001:9e8:41d:af00:5562:48d6:a2e4:959e on wlp4s0.
Dec 03 22:10:02 fedora wpa_supplicant[1149]: wlp4s0: CTRL-EVENT-SIGNAL-CHANGE above=0 signal=0 noise=9999 txrate=0
Dec 03 22:10:02 fedora avahi-daemon[929]: Leaving mDNS multicast group on interface wlp4s0.IPv6 with address 2001:9e8:41d:af00:5562:48d6:a2e4:959e.
Dec 03 22:10:02 fedora avahi-daemon[929]: Joining mDNS multicast group on interface wlp4s0.IPv6 with address fe80::a03a:6da6:5135:c1c0.
Dec 03 22:10:02 fedora avahi-daemon[929]: Registering new address record for fe80::a03a:6da6:5135:c1c0 on wlp4s0.*.
Dec 03 22:10:02 fedora avahi-daemon[929]: Withdrawing address record for fe80::a03a:6da6:5135:c1c0 on wlp4s0.
Dec 03 22:10:02 fedora avahi-daemon[929]: Leaving mDNS multicast group on interface wlp4s0.IPv6 with address fe80::a03a:6da6:5135:c1c0.
Dec 03 22:10:02 fedora avahi-daemon[929]: Interface wlp4s0.IPv6 no longer relevant for mDNS.
Dec 03 22:10:02 fedora audit[1043]: NETFILTER_CFG table=firewalld:5 family=1 entries=5 op=nft_unregister_rule pid=1043 subj=system_u:system_r:firewalld_t:s0 comm="firewalld"
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6881] dhcp4 (wlp4s0): canceled DHCP transaction
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6882] dhcp4 (wlp4s0): state changed bound -> terminated
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6883] dhcp6 (wlp4s0): canceled DHCP transaction
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6883] dhcp6 (wlp4s0): state changed bound -> terminated
Dec 03 22:10:02 fedora avahi-daemon[929]: Interface wlp4s0.IPv4 no longer relevant for mDNS.
Dec 03 22:10:02 fedora avahi-daemon[929]: Leaving mDNS multicast group on interface wlp4s0.IPv4 with address 192.168.178.56.
Dec 03 22:10:02 fedora avahi-daemon[929]: Withdrawing address record for 192.168.178.56 on wlp4s0.
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6893] device (wlp4s0): set-hw-addr: set MAC address to D2:55:6C:A3:22:FE (scanning)
Dec 03 22:10:02 fedora avahi-daemon[929]: Joining mDNS multicast group on interface wlp4s0.IPv4 with address 192.168.178.56.
Dec 03 22:10:02 fedora avahi-daemon[929]: New relevant interface wlp4s0.IPv4 for mDNS.
Dec 03 22:10:02 fedora avahi-daemon[929]: Registering new address record for 192.168.178.56 on wlp4s0.IPv4.
Dec 03 22:10:02 fedora avahi-daemon[929]: Withdrawing address record for 192.168.178.56 on wlp4s0.
Dec 03 22:10:02 fedora avahi-daemon[929]: Leaving mDNS multicast group on interface wlp4s0.IPv4 with address 192.168.178.56.
Dec 03 22:10:02 fedora avahi-daemon[929]: Interface wlp4s0.IPv4 no longer relevant for mDNS.
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.6936] device (wlp4s0): state change: disconnected -> unmanaged (reason 'sleeping', sys-iface-state: 'managed')
Dec 03 22:10:02 fedora chronyd[978]: Source 176.9.42.91 offline
Dec 03 22:10:02 fedora chronyd[978]: Source 81.7.16.52 offline
Dec 03 22:10:02 fedora chronyd[978]: Source 116.203.151.74 offline
Dec 03 22:10:02 fedora chronyd[978]: Can't synchronise: no selectable sources
Dec 03 22:10:02 fedora chronyd[978]: Source 85.214.83.151 offline
Dec 03 22:10:02 fedora gnome-shell[1895]: An active wireless connection, in infrastructure mode, involves no access point?
Dec 03 22:10:02 fedora NetworkManager[1057]: <info>  [1638565802.8039] device (wlp4s0): set-hw-addr: reset MAC address to 34:C9:3D:0F:E1:A5 (unmanage)
Dec 03 22:10:02 fedora wpa_supplicant[1149]: nl80211: deinit ifname=p2p-dev-wlp4s0 disabled_11b_rates=0
Dec 03 22:10:02 fedora wpa_supplicant[1149]: nl80211: deinit ifname=wlp4s0 disabled_11b_rates=0
Dec 03 22:10:03 fedora systemd[1]: Reached target Sleep.
Dec 03 22:10:03 fedora systemd[1]: Starting System Suspend...
Dec 03 22:10:03 fedora systemd-sleep[7129]: Entering sleep state 'suspend'...
Dec 03 22:10:03 fedora kernel: PM: suspend entry (deep)
Dec 03 22:10:03 fedora kernel: Filesystems sync: 0.030 seconds

In the working one we see "Filesystems sync: 0.030 seconds" but in the broken one we don't. Could this be related to the problem, i.e. the filesystem sync never finishes?
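
The comparison above can be automated. What follows is a minimal Python sketch, not part of the original report: it scans saved journal output for each "PM: suspend entry" line and records whether a "Filesystems sync" message was logged before the next boot marker. The function name suspend_attempts is hypothetical, and the marker strings are taken from the two logs quoted above. A missing sync line may also just mean the message never reached the disk before the hang, so treat the result as a hint rather than proof.

```python
def suspend_attempts(lines):
    """Return one dict per 'PM: suspend entry' line found in journal output.

    Each dict records whether a 'Filesystems sync:' message was logged
    before the next boot marker or the next suspend entry.
    """
    attempts = []
    current = None
    for line in lines:
        if line.startswith("-- Boot "):
            # A new boot ends whatever suspend attempt was in progress.
            current = None
        elif "PM: suspend entry" in line:
            current = {"entry": line.strip(), "sync_logged": False}
            attempts.append(current)
        elif current is not None and "Filesystems sync:" in line:
            current["sync_logged"] = True
    return attempts
```

It can be fed the lines of a saved log, for example suspend_attempts(open("journal.txt")), where journal.txt is a placeholder name for output captured with journalctl.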

Comment 4 ivan 2022-01-08 12:16:58 UTC
Same problem here on a Lenovo X1 Carbon Gen 2: after going into suspend, the power LED slowly blinks (as usual) but the function keys are lit, which seems to indicate that the issue happens during the suspend process. Installing the latest 5.16 kernel from rawhide solves the issue.

Everything was fine on f34 until I updated to f35 (although I can't say which kernel was OK back in f34 as the laptop hadn't been updated for the past 1-2 months).

- kernel-5.15.12-200.fc35.x86_64 (from updates repo) doesn't work
- kernel-5.15.13-200.fc35.x86_64 (from updates-testing repo) doesn't work
- kernel-5.16.0-0.rc8.55.fc36.x86_64 (from rawhide repo) works

Comment 5 solanum 2022-01-18 10:56:23 UTC
Created attachment 1851546 [details]
log file for  5.17.0-0.rc0.20220112gitdaadb3bd0e8d.63.fc36.x86_64

This log probably contains sensitive info, but it is from a boot of the 5.17.0 kernel on an HP Pavilion 15-cc123cl laptop. I booted, logged in, double-checked which kernel was loaded, closed the lid, and it wouldn't come out of suspend. No keys, touchscreen, or trackpad were working. I couldn't even get the backlit keyboard to light up. The fans weren't running. I had to power off. This was a clean install of f35.

Is this a configuration issue?  I saw 
"kernel: Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7"
in the logs like 12 times.

Comment 6 solanum 2022-01-18 13:23:11 UTC
Okay, I screwed around with it. I had the Intel SGX extensions disabled in BIOS and it was screwing up. I enabled them, and it still screwed up. Then I went to change it back and it gave a third option for 'software control of sgx', and now it is coming out of suspend with the 5.17.0 kernel. I didn't try the 5.15.x kernel or whatever is current. I am not sure why I need SGX enabled to get out of suspend.

Comment 7 solanum 2022-01-18 15:38:14 UTC
The kernel lockdown feature falls through to the old security settings if it doesn't have lockdown enabled.. so.

I disabled intel sgx in bios again.  I edited /boot/grub2/grubenv to add security=0, which also eliminated the timeout for the select the kernel menu. It does say something like 'grub efi secure boot enabled' in the startup screen scroll, which it shouldn't be because none is enabled.

With the current 5.15.x kernel if I closed the lid to enter suspend it came out of suspend, if I used the suspend menu in gnome, it locked me out. 

With security=0 and the 5.17.x rawhide kernel, both closing the lid and gnome worked right and I am able to get out of suspend the first time. Closing the lid the second time and hitting suspend in the gnome menu is locking it up. You can tell if you are going to get out, because the power light flashes when you enable suspend in the gnome menu. If it isn't flashing, it isn't going to work.

This might be a security bug. Gnome is locking up the user interface by going into suspend, but it =seems= like gnome is somehow changing the security level of the system (escalating it), which is causing the lockup. That may mean something else with escalated permissions in userspace could do the same thing while lockdown isn't enabled, but it looks like it is overriding both BIOS and kernel settings.

Is it worth testing Intel SGX enabled and security=0, with and without software extensions enabled?

Comment 8 ivan 2022-02-06 17:39:07 UTC
FWIW neither 5.16.x from the updates repo nor the latest 5.17.x from rawhide works.

To sum up:

- kernels from the 5.15 series didn't work
- kernel-5.16.0-0.rc8.55.fc36.x86_64 (from rawhide) worked
- kernel-5.16.5-200.fc35.x86_64 (from updates repo) doesn't work
- kernel-5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64 doesn't work.

Unfortunately I assumed 5.16.x from the updates repo would work so I've removed the 5.16.0-0.rc8 rawhide version, and I can't reinstall it. 

This is annoying - I hadn't paid attention that the 5.16 rawhide kernel was automatically updated with kernel-5.16.5-200.fc35.x86_64 and the laptop stayed in the broken suspend state for a day while in my bag, which in turn fully drained the battery - so much that it won't charge anymore.

(by the way the security=0 boot option doesn't have any effect on that issue).

Comment 9 Justin M. Forbes 2022-02-06 21:40:40 UTC
(In reply to solanum from comment #7)
> The kernel lockdown feature falls through to the old security settings if it
> doesn't have lockdown enabled.. so.
> 
> I disabled intel sgx in bios again.  I edited /boot/grub2/grubenv to add
> security=0, which also eliminated the timeout for the select the kernel
> menu. It does say something like 'grub efi secure boot enabled' in the
> startup screen scroll, which it shouldn't be because none is enabled.

This makes sense, if you didn't disable secure boot in bios, the system is operating in secure boot mode until the kernel is booted, meaning it is still in secure boot mode while in grub.
 
> With the current 5.15.x kernel if I closed the lid to enter suspend it came
> out of suspend, if I used the suspend menu in gnome, it locked me out. 
> 
> With security=0 it and the 5.17.x rawhide kernel, both closing the lid and
> gnome worked right and I am able to get out of suspend the first time.
> Closing the lid the second time and hitting suspend in the gnome menu is
> locking it up. You can tell if you are going to get out, because the power
> light flashes when you enable in suspend in the gnome menu. If it isn't
> flashing, it isn't going to work. 
> 
> This might be a security bug. Gnome is locking up the user interface by
> going into suspend, but it =seems= like gnome is somehow changing the
> security level of the system (escalating it) which is causing the lockup. 
> Which may mean something else with escalated permissions in userspace could
> do the same thing which lockdown isn't enabled, but it looks like it is
> overriding both bios and kernel settings.

You are operating in a weird state. Your bios has secure boot enabled, the kernel does not when you boot this way. I haven't looked at what exactly gnome is doing when you click the suspend button, but I would expect it is simply passing the command to the kernel.  Technically gnome doesn't need to do anything here, suspend is a kernel function and gnome is just giving you an easier way to initiate suspend than doing it through cli and the kernel.

> Is it worth testing intel sgx enabled and security=0 with without software
> extensions enabled?

I would not expect that SGX has much of anything to do with the suspend cycle, and it has nothing to do with secure boot.

Comment 10 Justin M. Forbes 2022-02-06 21:45:16 UTC
(In reply to ivan from comment #8)
> FWIW neither 5.16.x from the updates repo and the latest 5.17.x from rawhide
> work.
> 
> To sum up:
> 
> - kernels from the 5.15 series didn't work
> - kernel-5.16.0-0.rc8.55.fc36.x86_64 (from rawhide) worked
> - kernel-5.16.5-200.fc35.x86_64 (from updates repo) doesn't work
> - kernel-5.17.0-0.rc2.20220204gitdcb85f85fa6f.86.fc36.x86_64 doesn't work.
> 
> Unfortunately I assumed 5.16.x from the updates repo would work so I've
> removed the 5.16.0-0.rc8 rawhide version, and I can't reinstall it. 

Why can't you reinstall it? dnf won't install an older kernel, but you can either dnf remove the newer kernel and then install it, or you can use "rpm -ivh --oldpackage foo.rpm". I am rather interested in how rc8 was working while the 5.16 stable kernels are not. It might be worth checking to see if 5.16.0 (rawhide) kernels work, which would help determine if things were broken again during rc8 or if it was a patch backported from 5.17 for a stable release that broke it.

> This is annoying - I hadn't paid attention that the 5.16 rawhide kernel was
> automatically updated with kernel-5.16.5-200.fc35.x86_64 and the laptop
> stayed in the broken suspend state for a day while in my bag, which in turn
> fully drained the battery - so much that it won't charge anymore.
> 

Placing a running laptop in a backpack can cause considerable heat build up, enough to kill one or more cells in the battery.

Comment 11 ivan 2022-02-07 04:59:24 UTC
(In reply to Justin M. Forbes from comment #10)

> > Unfortunately I assumed 5.16.x from the updates repo would work so I've
> > removed the 5.16.0-0.rc8 rawhide version, and I can't reinstall it. 
> 
> Why can't you reinstall it? dnf won't install an older kernel, but you can
> either dnf remove the newer kernel and then install it, or you can use "rpm
> -ivh --oldpackage foo.rpm"  I am rather interested in how rc8 was working
> and then 5.16 stable kernels are not. It might be worth checking to see if
> 5.16.0 (rawhide) kernels work, which would help determine if things were
> broken again during rc8 or if it was a patch backported from 5.17 for a
> stable release that broke it.

I simply can't find rc8 rpms anywhere: dnf only lists the newer 5.17rc2, which was confirmed by looking at a rawhide mirror. No luck with Koji either.

Is there an archive where I can download the rpms ? 


> Placing a running laptop in a backpack can cause considerable heat build up,
> enough to kill one or more cells in the battery.

I know :( The thing is, it shouldn't have been running (I never had a problem putting a suspended laptop in a bag while traveling for the last 20+ years - this is just bad luck).

Comment 12 Leonard Ehrenfried 2022-02-07 08:21:39 UTC
BTW, this issue has completely gone away for me after upgrading to the rawhide 5.17 kernel. I never changed my bios or grub settings.

Perhaps there are multiple issues at play here.

Comment 13 Hans de Goede 2022-02-07 10:19:21 UTC
https://koji.fedoraproject.org/koji/packageinfo?packageID=8

has all past Fedora kernel builds, 5.16-rc8 is here:
https://koji.fedoraproject.org/koji/buildinfo?buildID=1875372

Comment 14 ivan 2022-02-07 11:33:07 UTC
(In reply to Hans de Goede from comment #13)
> https://koji.fedoraproject.org/koji/packageinfo?packageID=8
> 
> has all past Fedora kernel builds, 5.16-rc8 is here:
> https://koji.fedoraproject.org/koji/buildinfo?buildID=1875372

Oh - my bad. I'm not really familiar with Koji; the first link returned by a web search for "kernel-5.16.0-0.rc8.55.fc36.x86_64" led to a page without rpms [1], so I assumed old rawhide rpms were automatically deleted to avoid consuming disk space. (a lack of sleep + Monday morning tasks didn't help either!).

Thanks !

[1] https://koji.fedoraproject.org/koji/taskinfo?taskID=80818310

(In reply to Justin M. Forbes from comment #10)

> (In reply to ivan from comment #8)
> I am rather interested in how rc8 was working
> and then 5.16 stable kernels are not. It might be worth checking to see if
> 5.16.0 (rawhide) kernels work, which would help determine if things were
> broken again during rc8 or if it was a patch backported from 5.17 for a
> stable release that broke it.

Suspend works with kernel-5.16.0-60.fc36 (which seems to be the most recent rawhide build of 5.16.x). Let me know if you'd like me to test other versions...

Comment 15 Justin M. Forbes 2022-02-07 16:57:16 UTC
(In reply to ivan from comment #14)
> Suspend works with kernel-5.16.0-60.fc36 (which seems to be the most recent
> rawhide build of 5.16.x). Let me know if you'd like me to test other
> versions...

Thanks, that means it was fixed, and then came back, possibly with something different. Can you try 5.16.2-200.fc35 5.16.3-200.fc35 and 5.16.4-200.fc35.  That might help us narrow it down.

Comment 16 ivan 2022-02-07 18:09:56 UTC
(In reply to Justin M. Forbes from comment #15)

> Can you try 5.16.2-200.fc35 5.16.3-200.fc35 and 5.16.4-200.fc35. 
> That might help us narrow it down.

So - suspend is broken with 5.16.2-200.fc35 (I didn't try .3 and .4 since .2 didn't work); it's also broken with 5.16.1-200.fc35.

I don't see anything related to suspend in the changelog entries between 0.rc8 and .0 (maybe a corner case of the x86/PCI patch ?).

No problem to do more tests...

Comment 17 Hans de Goede 2022-02-07 19:10:48 UTC
(In reply to ivan from comment #16)
> (In reply to Justin M. Forbes from comment #15)
> 
> > Can you try 5.16.2-200.fc35 5.16.3-200.fc35 and 5.16.4-200.fc35. 
> > That might help us narrow it down.
> 
> So - suspend is broken with 5.16.2-200.fc35 (I didn't try .3 and .4 since .2
> didn't work); it's also broken with 5.16.1-200.fc35.
> 
> I don't see anything related to suspend in the changelog entries between
> 0.rc8 and .0 (maybe a corner case of the x86/PCI patch ?).

Hmm, I wrote that patch and I really hope it is not the cause of this. It should not be though, what it does is add:

"""
        int year = dmi_get_bios_year();

        if (year >= 2018)
                return;

        pr_info_once("PCI: Removing E820 reservations from host bridge windows\n");
"""

to remove_e820_regions() in arch/x86/kernel/resource.c. Your attached log shows a BIOS date of 06/30/2017, so the added if should not trigger.

To verify, can you boot the non suspending 5.16.2 or 5.16.4 and then do:

dmesg | grep "PCI: Removing E820 reservations from host bridge windows"

That should show the message being grepped for, showing that the added 'if (year >= 2018)' check is not triggering, which turns that patch into a no-op.

Comment 18 ivan 2022-02-08 11:56:44 UTC
(In reply to Hans de Goede from comment #17)
> (In reply to ivan from comment #16)

> To remove_e820_regions() in arch/x86/kernel/resource.c, your attached log
> shows a BIOS date of 06/30/2017 so the added if should not trigger.

The attached log was from another poster; my bios is from 03/27/2020 (version GRET63WW / 1.40) so the added `if` should trigger.

> To verify, can you boot the non suspending 5.16.2 or 5.16.4 and then do:
> 
> dmesg | grep "PCI: Removing E820 reservations from host bridge windows"
> 
> That should show the message being grepped for, showing that the added 'if
> (year >= 2018)' check is not triggering, turning that patch into a no-op

There isn't any "PCI: Removing E820 reservations from host bridge windows" log in dmesg with any of the 5.16.0 (OK) and 5.16.1 (broken) kernels. But given that my bios is >= 2018 and looking at your patch [1] (if that's the right one) it would be normal not to see such message with both kernels.
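
To make the reasoning concrete, here is a small Python sketch that mirrors only the check quoted in comment 17 (it is an illustration, not the kernel code): a BIOS year of 2018 or later returns early, so the "Removing E820 reservations" message is expected only with an older BIOS. The helper names are hypothetical, and the BIOS date is assumed to be in the MM/DD/YYYY form shown in this thread.

```python
REMOVAL_MESSAGE = "PCI: Removing E820 reservations from host bridge windows"


def bios_year(bios_date):
    """Extract the year from a BIOS date string such as '06/30/2017'."""
    return int(bios_date.strip().rsplit("/", 1)[1])


def expects_removal_message(bios_date):
    """Mirror the quoted check: year >= 2018 returns before the message,
    so the message is only expected for a BIOS older than 2018."""
    return bios_year(bios_date) < 2018


def has_removal_message(dmesg_text):
    """Return True if the kernel log text contains the grepped-for message."""
    return REMOVAL_MESSAGE in dmesg_text
```

With the two dates mentioned in this thread, expects_removal_message("06/30/2017") is true and expects_removal_message("03/27/2020") is false, which matches the message being absent on the 2020 BIOS.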

I tried booting with pci=use_e820 (as per your patch's documentation) but I get a `PCI: Unknown option `use_e820'` error. A quick search [2] hints that this commandline option might have been removed (haven't had time to investigate this further nor to find the exact patch that went in).

[1] https://lkml.org/lkml/2021/10/11/262
[2] https://lkml.org/lkml/2021/12/15/763

Comment 19 Hans de Goede 2022-02-08 12:28:46 UTC
> The attached log was from another poster; my bios is from 03/27/2020 (version GRET63WW / 1.40) so the added `if` should trigger.

Ah, I see. And I also see that your problem also started with the 5.15 kernels, which is when the x86/PCI patch first landed in the Fedora kernels.

I believe that the 5.16-rc builds did not have that patch because it was expecting to land upstream (but didn't, at least not for 5.16) and then the patch got added to the Fedora kernels again around 5.16.0 time.

So this all nicely lines up with when you started seeing your issues.

> There isn't any "PCI: Removing E820 reservations from host bridge windows" log in dmesg with any of the 5.16.0 (OK) and 5.16.1 (broken) kernels. But given that my bios is >= 2018 and looking at your patch [1] (if that's the right one) it would be normal not to see such message with both kernels.

Right, that is correct.

> I tried booting with pci=use_e820 (as per your patch's documentation) but I get a `PCI: Unknown option `use_e820'` error. A quick search [2] hints that this commandline option might have been removed (haven't had time to investigate this further nor to find the exact patch that went in).

Yes, the option was removed at the request of the upstream PCI devs, they were worried that people would see it on some forum and use it for the wrong reasons instead of reporting bugs upstream and root-causing the issue.

But the 5.15 Fedora kernels do have the option. Can you try a Fedora 5.15.y kernel with that option set?

The 5.15 version of the patch:
https://lkml.org/lkml/2021/10/11/262

Has this logging in it:

	printk(KERN_INFO "PCI: %s E820 reservations for host bridge windows\n",
	       pci_use_e820 ? "Honoring" : "Ignoring");

So on your laptop a 5.15.y kernel should by default say:

PCI: Ignoring E820 reservations for host bridge windows

With the kernel cmdline option added this should turn into:

PCI: Honoring E820 reservations for host bridge windows

and I guess this may fix your problem, but I really hope not because honoring the reservations is causing a bunch of issues on other laptops. So I really hope that your issue is caused by something else, but we will see...
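
As a rough sketch, the check for which of the two messages a saved dmesg log contains could be scripted like this. The message text is taken from the printk quoted above; the function name `e820_policy` is hypothetical:

```python
import re
from typing import Optional

# Matches the log line produced by the printk in the 5.15 version of the patch.
E820_POLICY_RE = re.compile(
    r"PCI: (Honoring|Ignoring) E820 reservations for host bridge windows"
)


def e820_policy(dmesg_text: str) -> Optional[str]:
    """Return 'honoring', 'ignoring', or None if the message is absent."""
    match = E820_POLICY_RE.search(dmesg_text)
    return match.group(1).lower() if match else None


print(e820_policy("PCI: Ignoring E820 reservations for host bridge windows"))
```

A result of None would suggest the kernel in question does not carry this version of the patch at all.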

Comment 20 ivan 2022-02-08 13:44:49 UTC
(In reply to Hans de Goede from comment #19)

> So on your laptop by default a 5.15.y kernel should say (on your laptop):
> 
> PCI: Ignoring E820 reservations for host bridge windows
> 
> With the kernel cmdline option added this should turn into:
> 
> PCI: Honoring E820 reservations for host bridge windows

Yes, it did (tested with 5.15.18.fc35 from Koji).

> and I guess this may fix your problem, but I really hope not because
> honoring the reservations is causing a bunch of issues on other laptops. So
> I really hope that your issue is caused by something else, but we will see...

Suspend works with pci=use_e820... I'm sorry for you as it seems that without the commandline option it isn't a straightforward issue to fix ; let me know if I can help with more tests down the road (just bear in mind that I haven't compiled/used a custom kernel for the past 10+ years!).

Comment 21 Hans de Goede 2022-02-08 13:57:54 UTC
> Suspend works with pci=use_e820... I'm sorry for you as it seems that without the commandline option it isn't a straightforward issue to fix ; let me know if I can help with more tests down the road (just bear in mind that I haven't compiled/used a custom kernel for the past 10+ years!).

Ugh. Note the missing cmdline option is not really the big issue. That would just provide a workaround and we really want Linux to "just work" without needing any kernel cmdline options.

But it seems that my fix fixes some systems and breaks some others, so it is no good. It was known that some really old (new in 2010) systems need the old behavior of honoring the e820 reservations. But your machine is the first case of a somewhat newer model also needing what is in essence a workaround for things to work properly.

Anyways, thank you for reporting this and thank you for your continued work on testing this.

Can you run:

dmesg > dmesg-honor-e820.txt

resp:

dmesg > dmesg-ignore-e820.txt

Immediately after booting with the 5.15 kernel you use to test (with resp without pci=use_e820) on the kernel commandline please and attach both generated .txt files here?

And no worries about building kernels, if necessary I can have koji build a special kernel for you to test and you can grab it from koji.
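
Comparing the two logs by hand is tedious because every line carries a different timestamp. A minimal sketch of one way to compare them, assuming the usual "[ seconds.micros]" dmesg prefix; the helper names are hypothetical:

```python
import difflib
import re

# Leading "[    0.966573] " style timestamp on a dmesg line.
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(dmesg_text):
    """Return the log lines with their leading timestamps removed."""
    return [TIMESTAMP_RE.sub("", line) for line in dmesg_text.splitlines()]


def dmesg_diff(honor_text, ignore_text):
    """Return only the added/removed lines between the two logs."""
    diff = list(difflib.unified_diff(
        strip_timestamps(honor_text),
        strip_timestamps(ignore_text),
        fromfile="dmesg-honor-e820.txt",
        tofile="dmesg-ignore-e820.txt",
        lineterm="",
        n=0,
    ))
    # Skip the two file header lines, then keep only "+"/"-" lines.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

Usage would be along the lines of `dmesg_diff(open("dmesg-honor-e820.txt").read(), open("dmesg-ignore-e820.txt").read())`. Lines that merely moved in time drop out, so differences in PCI resource assignment stand out.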

Comment 22 ivan 2022-02-08 14:16:52 UTC
(In reply to Hans de Goede from comment #21)
> > Suspend works with pci=use_e820... I'm sorry for you as it seems that without the commandline option it isn't a straightforward issue to fix ; let me know if I can help with more tests down the road (just bear in mind that I haven't compiled/used a custom kernel for the past 10+ years!).
> 
> Ugh. Note the missing cmdline option is not really the big issue. That would
> just provide a workaround and we really want Linux to "just work" without
> needing any kernel cmdline options.

Sure.
 
> But it seems that my fix fixes some systems and breaks some others, so it is
> no good. It was known that some really old (new in 2010) systems need the
> old behavior of honoring the e820 reservations. But your machine is the
> first case of a somewhat newer model also needing what is in essence a
> workaround for things to work properly.

offtopic - I'm not sure I'll buy Thinkpads anymore (nor Lenovo); they used to just work with Linux - usually because kernel devs used that hardware - but newer models (other than this old X1) have been increasingly buggy.

> Anyways, thank you for reporting this and thank you for your continued work
> on testing this.

No problem, I'm happy to contribute what I can !
 
> Can you run:
> 
> dmesg > dmesg-honor-e820.txt
> 
> resp:
> 
> dmesg > dmesg-ignore-e820.txt

(attaching logs)

Comment 23 ivan 2022-02-08 14:17:41 UTC
Created attachment 1859801 [details]
dmesg - without option

Comment 24 ivan 2022-02-08 14:18:20 UTC
Created attachment 1859802 [details]
dmesg - with pci=use_e820 option

Comment 25 Hans de Goede 2022-02-08 16:21:40 UTC
Thanks, I've reported the issue upstream now also, since the patch causing this has made its way into the upcoming 5.17 release: https://lore.kernel.org/linux-pci/a7ad05fe-c2ab-a6d9-b66e-68e8c5688420@redhat.com/

And I'm discussing possible other avenues of fixing the issue the patch tries to adjust upstream now.

Justin, I'm not sure what is the best thing to do here from Fedora's pov.

I hope to have a different fix which is very narrowly tailored to fix the original issue the patch fixes without impacting other systems in a couple of days. So it might be worthwhile to wait for that and swap the patch to avoid regressions on the original issue the patch fixes.

OTOH if you decide to just drop the troublesome patch and want to wait for upstream to sort this out that is completely understandable too. So it is your call.

Comment 26 Hans de Goede 2022-02-08 16:26:19 UTC
Offtopic:

> I'm not sure I'll buy Thinkpads anymore (nor Lenovo); they used to just work with Linux ...

Actually a couple of years ago I would have agreed with you, but since Lenovo is now installing some models with Fedora pre-installed they have been actively helping with and working on making sure their hw works well with mainline Linux kernels. So atm Lenovo, at least one of the models which are available with Fedora pre-installed, is not a bad choice if you care about Linux compatibility.

See e.g.: https://bugzilla.kernel.org/show_bug.cgi?id=211313, where Lenovo has done a BIOS update for old 4th gen X1 Carbons to fix a fan blowing at max-speed issue, even though they never shipped that model with Fedora pre-installed and it is quite old, which is really good service IMHO.

Comment 27 Hans de Goede 2022-02-09 17:13:35 UTC
Ok, so while working on fixing the issue the troublesome patch caused in another way I realized that that issue is only present when booting in UEFI mode and that you are still using classic BIOS booting to boot your x1 2nd gen.

So I've prepared a new patch which skips the e820 reservations when booting in EFI mode. I have started a test kernel-build with this new patch (replacing the previous one) here:
https://koji.fedoraproject.org/koji/taskinfo?taskID=82606916

Here are some generic instructions for installing a Fedora kernel directly from koji (the Fedora buildsystem):
https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

Can you please give this one a test and let us know if suspend/resume works with this one (I suspect it will).


I also have a favor to ask, I expect the test kernel to work since your system is booting in BIOS mode, so it will not change the behavior. But I do wonder if suspend/resume still works on the X1 2nd gen with the test-kernel when booted in EFI mode, since then the mem-reservation tables will get ignored as with the stock Fedora 5.15 kernels.

So I was wondering if you could do a test F35 install in UEFI mode on e.g. a spare disk or an external USB disk; and then :

1. Test suspend/resume works with a 5.15 Fedora kernel with pci=use_e820, to make sure that there are not e.g. some other unrelated issues with the test F35 install, e.g. losing the connection to the external USB disk.

2. Test suspend/resume works with the test kernel

Comment 28 ivan 2022-02-11 12:13:25 UTC
My apologies for the late reply...

(In reply to Hans de Goede from comment #27)

> So I've prepared a new patch which skips the e820 reservations when booting
> in EFI mode. I have started a test kernel-build with this new patch
> (replacing the previous one) here:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=82606916

> Can you please give this one a test and let us know if suspend/resume works
> with this one (I suspect it will).

Yes, suspend works.

As for testing with UEFI it'll take a bit of time - hopefully I'll get back to you with more info later today or over the week-end.

Thanks !

Comment 29 Hans de Goede 2022-02-11 14:01:08 UTC
(In reply to ivan from comment #28)
> Yes, suspend works.

Great, thank you.

> As for testing with UEFI it'll take a bit of time - hopefully I'll get back
> to you with more info later today or over the week-end.

And an especially big thank you for willing to go the extra mile to also test with UEFI, that is great!

Note there is no need to rush it to get the UEFI test done today, if you can get the UEFI testing done sometime this weekend that already is amazingly fast.

Comment 30 ivan 2022-02-11 16:39:33 UTC
(In reply to Hans de Goede from comment #27)

> So I was wondering if you could do a test F35 install in UEFI mode on e.g. a
> spare disk or an external USB disk; and then :
> 
> 1. Test suspend/resume works with a 5.15 Fedora kernel with pci=use_e820, to
> make sure that there are not e.g. some other unrelated issues with the test
> F35 install, e.g. loosing the connection to the external USB disk.

That one works...
 
> 2. Test suspend/resume works with the test kernel

... but this one doesn't (attaching the output of dmesg). The Fn keys stay lit as before and it's impossible to resume from suspend.

(In reply to Hans de Goede from comment #29)

> And an especially big thank you for willing to go the extra mile to also
> test with UEFI, that is great!

Happy to help !


(In reply to Hans de Goede from comment #26)
> Offtopic:
> Actually a couple of years ago I would have agreed with you, but since
> Lenovo is now installing some models with Fedora pre-installed they have
> been actively helping with and working on making sure their hw works well
> with mainline Linux kernels. So atm Lenovo, at least one of the models which
> are available with Fedora pre-installed, is not a bad choice if you care
> about Linux compatibility.

Hopefully that'll still be the case when I buy new hardware :) (by the way I looked at the current offers but there were only desktop models with linux).
Maybe I was a bit harsh though - the T450s that I use daily works well (modulo having to disable USB3 to be able to use pci passthrough in Qubes OS, but that's more an issue with Intel chipset); the kids' x250 work perfectly as well as an old T410 used as jukebox. But for instance an X1 tablet (1st gen - so an older model) that I bought unused a few months ago for drawing and taking notes is really buggy - it randomly takes 10 seconds to reach the UEFI prompt, sometimes discharges while powered off, battery charge thresholds are randomly ignored, the trackpoint works only after a suspend, the camera works only under Windows, regular S3 suspend doesn't work, etc, etc.

Have a nice week-end ahead !

Comment 31 ivan 2022-02-11 16:42:47 UTC
Created attachment 1860638 [details]
dmesg UEFI / test kernel

Comment 32 Hans de Goede 2022-02-14 12:24:17 UTC
(In reply to ivan from comment #30)
> > 2. Test suspend/resume works with the test kernel
> 
> ... but this one doesn't (attaching the output of dmesg). The Fn keys stay
> lit as before and it's impossible to resume from suspend.

Ok, so fixing this is not just as simple as ignoring E820 reservations for PCI bridge windows when booting in EFI mode while honoring them for classic BIOS boots.

I already was afraid that this would be the case, which is why I asked you to test, but it was worth a shot.

Can you keep the UEFI install around for further testing, please ?

Also can you boot the UEFI install with "efi=debug" added to the kernel commandline and then collect dmesg output and attach that here please?

Comment 33 Hans de Goede 2022-02-14 12:36:17 UTC
p.s. Thank you for the testing!

A note about the root-cause of this (mostly for myself). The E820 reservations table has the following in both BIOS and EFI boot modes:

[    0.000000] BIOS-e820: [mem 0x00000000dceff000-0x00000000dfa0ffff] reserved

Which has a small overlap with:

[    0.884684] pci_bus 0000:00: root bus resource [mem 0xdfa00000-0xfebfffff window]

This leads to the following difference in assignments of PCI resources with pci=use_e820:

[    0.966573] pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfb00000-0xdfcfffff]
[    0.966698] pci_bus 0000:02: resource 1 [mem 0xdfb00000-0xdfcfffff]

vs without pci=use_e820:

[    0.966850] pci 0000:00:1c.0: BAR 14: assigned [mem 0xdfa00000-0xdfbfffff]
[    0.966973] pci_bus 0000:02: resource 1 [mem 0xdfa00000-0xdfbfffff]

And the overlap of 0xdfa00000-0xdfa0ffff from the e820 reservations seems to be what is causing the suspend/resume issue.
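
The overlap arithmetic above can be checked with a short sketch. The address ranges are copied from the log lines in this comment; the helper names are hypothetical:

```python
def parse_range(text):
    """Parse 'START-END' (hex, inclusive) into a (start, end) tuple."""
    start, end = text.split("-")
    return int(start, 16), int(end, 16)


def overlap(a, b):
    """Return the inclusive overlap of two ranges, or None if disjoint."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    return (start, end) if start <= end else None


e820_reserved = parse_range("0x00000000dceff000-0x00000000dfa0ffff")
root_window = parse_range("0xdfa00000-0xfebfffff")
bar14_honoring = parse_range("0xdfb00000-0xdfcfffff")  # with pci=use_e820
bar14_ignoring = parse_range("0xdfa00000-0xdfbfffff")  # without pci=use_e820

shared = overlap(e820_reserved, root_window)
print([hex(v) for v in shared])  # ['0xdfa00000', '0xdfa0ffff']
print(overlap(e820_reserved, bar14_honoring))  # None
print(overlap(e820_reserved, bar14_ignoring) == shared)  # True
```

So the bridge window assigned without pci=use_e820 covers the whole 64 KiB reserved overlap, while the one assigned with pci=use_e820 avoids it.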

Comment 34 Hans de Goede 2022-02-14 15:23:14 UTC
OK, I've prepared a series of 2 patches which tries to fix the original problem in a new way, which will hopefully not break suspend/resume for you:
https://lore.kernel.org/linux-pci/20220214151759.98267-1-hdegoede@redhat.com/T/

A Fedora test-kernel with these patches is building here:
https://koji.fedoraproject.org/koji/taskinfo?taskID=82813003
this should be done in a couple of hours.

Here are some generic instructions for installing a Fedora kernel directly from koji (the Fedora buildsystem):
https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

Can you please give this one a test using the UEFI install you did and check if suspend/resume still works with this new test-kernel?

Please add efi=debug to the kernel commandline when testing this, collect dmesg output with this kernel and attach it here.

Comment 35 ivan 2022-02-14 17:58:30 UTC
Created attachment 1861035 [details]
dmesg / 5.16.9-200.e820.fc35 test kernel / efi=debug

(In reply to Hans de Goede from comment #34)

> A Fedora test-kernel with these patches is building here:
> https://koji.fedoraproject.org/koji/taskinfo?taskID=82813003
> this should be done in a couple of hours.

> Can you please give this one a test using the UEFI install you did and check
> if suspend/resume still works with this new test-kernel?

Suspend works !

> Please add efi=debug to the kernel commandline when testing this, collect
> dmesg output with this kernel and attach it here.

dmesg attached (I've stripped audit: ... lines for clarity).

Cheers
Ivan

Comment 36 Justin M. Forbes 2022-02-15 19:42:54 UTC
(In reply to Hans de Goede from comment #34)
> OK, I've prepared a series of 2 patches which tries to fix the original
> problem in a new way, which will hopefully not break suspend/resume for you:
> https://lore.kernel.org/linux-pci/20220214151759.98267-1-hdegoede@redhat.com/

I am not sure if you are doing a v2 based on the comments, but I expect 5.16.10 will release tomorrow if you want to create an MR or  just tell me which patches to pull.

Comment 37 Hans de Goede 2022-02-15 19:51:35 UTC
(In reply to Justin M. Forbes from comment #36)
> I am not sure if you are doing a v2 based on the comments, but I expect
> 5.16.10 will release tomorrow if you want to create an MR or  just tell me
> which patches to pull.

I'm afraid we are not at the end of the story here yet, see below. I believe it is probably best to just drop the downstream patch Fedora is carrying to fix the touchpad issue from bug 1868899 for now; and then once this is all sorted out I'll submit a pull-req to get the new patches re-added. 


(In reply to ivan from comment #35)
> > Can you please give this one a test using the UEFI install you did and check
> > if suspend/resume still works with this new test-kernel?
> 
> Suspend works !

Thank you for testing.

Unfortunately the efi=debug output shows that the address-range which is causing the issue on your laptop: 0xdfa00000-0xdfa0ffff is marked as MMIO; and that is actually what my latest RFC series is using as an indication to allow assigning PCI bars to that range despite it being reserved. By some weird sheer luck no BARs end up getting assigned with the series. But sheer luck is not something I want to count on.

So I've prepared a test-kernel with yet another approach to working around the original issue from bug 1868899:

https://koji.fedoraproject.org/koji/taskinfo?taskID=82854981

Note this time the kernel is already fully built, so you can grab it right away. Please give this one a spin in UEFI mode and collect debug output. FWIW I fully expect this one to work fine, but you never know.

Comment 38 ivan 2022-02-16 17:55:53 UTC
Created attachment 1861539 [details]
dmesg / 5.16.9-200.e820_2.fc35 test kernel / efi=debug

(In reply to Hans de Goede from comment #37)

> Unfortunately the efi=debug output shows that the address-range which is
> causing the issue on your laptop: 0xdfa00000-0xdfa0ffff is marked as MMIO;
> and that is actually what my latest RFC series is using as an indication to
> allow assigning PCI bars to that range despite it being reserved. By some
> weird sheer luck no BARs end up getting assigned with the series. But sheer
> luck is not something I want to count on.

To be honest, despite being an "advanced user" the above sounds like double Dutch to me :)

> So I've prepared a test-kernel with yet another approach to working around
> the original issue from bug 1868899.:
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=82854981
> 
> Note this time the kernel is already fully build, so you can grab it right
> away. Please give this one a spin in UEFI mode and collect debug output.
> FWIW I fully expect this one to work fine, but you never know.

Suspend works ; dmesg attached.

Thank you for your work !

Cheers
Ivan

Comment 39 Hans de Goede 2022-02-16 20:20:55 UTC
(In reply to ivan from comment #38)
> Suspend works ; dmesg attached.
> 
> Thank you for your work !

Great, thank you for all the testing!

I've posted what will hopefully really be the final final version of the original fix which caused your regression upstream now:
https://lore.kernel.org/linux-pci/20220216150121.9400-2-hdegoede@redhat.com/T/

Comment 40 Hans de Goede 2022-03-04 14:13:37 UTC
Hi Ivan,

Bjorn, the upstream PCI subsystem maintainer has come up with a slightly different version of my final final fix. I've done a Fedora test kernel build of 5.16.12 with the fix from Bjorn added:

https://koji.fedoraproject.org/koji/taskinfo?taskID=83634323

Building is already done, so you can grab it right away. As always if you need install instructions they are here:

https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

Please give this new kernel a try and please collect dmesg output after booting it and attach the dmesg output here.

Thanks & Regards,

Hans

Comment 41 ivan 2022-03-07 09:29:30 UTC
(In reply to Hans de Goede from comment #40)
> Hi Ivan,
> [...]
> Please give this a new kernel a try and please collect dmesg output after
> booting it and attach the dmesg output here.

Hi Hans,

My partner is still travelling with the (actually, her!) laptop so I couldn't test the kernel yet. I'll do the tests remotely and I'll send the results once I manage to have ssh access to her laptop (hopefully in 1 or 2 days).

Apologies for the delay...

Cheers
Ivan

Comment 42 Hans de Goede 2022-03-07 09:59:29 UTC
(In reply to ivan from comment #41)
> (In reply to Hans de Goede from comment #40)
> > Hi Ivan,
> > [...]
> > Please give this a new kernel a try and please collect dmesg output after
> > booting it and attach the dmesg output here.
> 
> Hi Hans,
> 
> My partner is still travelling with the (actually, her!) laptop so I
> couldn't test the kernel yet. I'll do the tests remotely and I'll send the
> results once I manage to have ssh access to her laptop (hopefully in 1 or 2
> days).
> 
> Apologies for the delay...

No problem, thank you for the status update.

Comment 43 ivan 2022-03-11 14:30:50 UTC
Created attachment 1865472 [details]
dmesg-5.16.12-200.e820.fc35.x86_64 / legacy bios

(In reply to Hans de Goede from comment #40)

> Bjorn, the upstream PCI subsystem maintainer has come up with a slightly
> different version of my final final fix. I've done a Fedora test kernel
> build of 5.16.12 with the fix from Bjorn added:
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=83634323

> Please give this a new kernel a try and please collect dmesg output after
> booting it and attach the dmesg output here.

Suspend works fine with that version too ! (dmesg attached / booting with legacy bios).

Comment 44 Hans de Goede 2022-03-11 15:11:37 UTC
(In reply to ivan from comment #43)
> Created attachment 1865472 [details]
> dmesg-5.16.12-200.e820.fc35.x86_64 / legacy bios
> 
> (In reply to Hans de Goede from comment #40)
> 
> > Bjorn, the upstream PCI subsystem maintainer has come up with a slightly
> > different version of my final final fix. I've done a Fedora test kernel
> > build of 5.16.12 with the fix from Bjorn added:
> > 
> > https://koji.fedoraproject.org/koji/taskinfo?taskID=83634323
> 
> > Please give this a new kernel a try and please collect dmesg output after
> > booting it and attach the dmesg output here.
> 
> Suspend works fine with that version too ! (dmesg attached / booting with
> legacy bios).

Thank you for testing and the dmesg attachment also shows some new log messages about the special handling your laptop needs:

[    0.326504] acpi PNP0A08:00: clipped [mem 0xdfa00000-0xfebfffff window] to [mem 0xdfa10000-0xfebfffff window] for e820 entry [mem 0xdceff000-0xdfa0ffff]
[    0.326515] acpi PNP0A08:00: clipped [mem 0xdfa10000-0xfebfffff window] to [mem 0xdfa10000-0xf7ffffff window] for e820 entry [mem 0xf8000000-0xfbffffff]

This is expected with the new patches, so everything is working as it should.
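
The two "clipped" lines can be reproduced with a small sketch of a "keep the larger remaining side" clipping rule. That this mirrors the kernel implementation line for line is an assumption; it is only meant to show that the logged results are consistent with such a rule. The function name is hypothetical:

```python
def clip(window, reserved):
    """Clip an inclusive (start, end) window against a reserved range,
    keeping whichever side of the conflict is larger."""
    w_start, w_end = window
    r_start, r_end = reserved
    if w_end < r_start or w_start > r_end:
        return window  # no conflict
    low = r_start - w_start if w_start < r_start else 0   # space below
    high = w_end - r_end if w_end > r_end else 0           # space above
    if low > high:
        return (w_start, r_start - 1)
    return (r_end + 1, w_end)


window = (0xdfa00000, 0xfebfffff)
for entry in [(0xdceff000, 0xdfa0ffff), (0xf8000000, 0xfbffffff)]:
    window = clip(window, entry)
    print(hex(window[0]), hex(window[1]))
```

The first step moves the window start up to 0xdfa10000 and the second moves the end down to 0xf7ffffff, as in the log.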

Comment 45 Hans de Goede 2022-03-11 15:14:47 UTC
In the meantime the original fix causing the suspend/resume issue has been dropped from the Fedora kernels, so this bug can be closed now. Thank you for all your testing!

Comment 46 dirk 2022-03-22 13:31:44 UTC
Funny that this bug occurred on F35 just today for me, never had this before. Beside other things (gnome-shell, mutter...) I updated to 5.16.16-200.fc35.x86_64 yesterday. (Before I had kernel-5.16.15-x; always have the testing updates every day). My laptop is X1 Carbon gen8. After the problem today I have enabled Intel SGX in BIOS (was disabled before), but this does not help. Any idea? Maybe waiting for the next kernel version?

Comment 47 dirk 2022-03-23 14:29:16 UTC
I downgraded many packages (kernel, gnome-*, and more), but this did not help. I again disabled Intel SGX in BIOS (as before) and mask-ed two services (systemd-rfkill, power-profiles-daemon), and surprisingly now it works again, even after upgrading everything again. So it is completely unclear to me, what yesterday created this problem and which action finally has fixed it.

Comment 48 Hans de Goede 2022-12-03 12:52:23 UTC
Hi Ivan (and others who hit this problem),

Do you still have your thinkpad affected by this bug; and if yes would you be willing to run some more tests?  There is a new patch-set trying to address the same issues as the patches causing the suspend/resume problem originally:

https://lore.kernel.org/linux-pci/20221202211838.1061278-1-helgaas@kernel.org/

And I'm afraid that this might again cause suspend/resume problems on your laptop(s).

If you are willing to test please let me know and I will prepare a Fedora kernel build with the new patches added; then you can test to see if these patches re-introduce the problem, as I fear they might do.

Note you will need to test this new kernel in UEFI boot mode.

Regards,

Hans

Comment 49 ivan 2022-12-07 11:40:37 UTC
Hi Hans,

Just seeing this, sorry.

If it's not too late, OK to test...

Comment 50 Hans de Goede 2022-12-07 15:29:50 UTC
Hi Ivan,

(In reply to ivan from comment #49)
> Hi Hans,
> 
> Just seeing this, sorry.

No worries, actually a response time of 4 days is pretty good in my book :)

> If it's not too late, OK to test...

Thank you. We are still discussing things on the list, I'll get back to you with a kernel to test once we have a better idea which testing would be useful.

For now can you please make an acpidump of your laptop while booted in EFI mode? To check the boot mode, e.g.:

dmesg | grep efi

Should show lines similar to these:

[    0.000000] efi: EFI v2.70 by American Megatrends
[    0.261238] pci 0000:30:00.0: BAR 0: assigned to efifb
[    0.263122] Registered efivars operations

Once you have verified the machine was booted in EFI mode please do:

sudo dnf install acpica-tools
sudo acpidump -o acpidump.txt

And then attach the generated acpidump.txt file here.
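
As an alternative to grepping dmesg, the boot mode can also be checked from sysfs: on Linux the /sys/firmware/efi directory is present only when the system was booted via (U)EFI. A minimal sketch, with a hypothetical function name and the path made a parameter so it can be tried against any directory:

```python
import os


def booted_in_efi_mode(sysfs_efi_dir="/sys/firmware/efi"):
    """Return True if the EFI sysfs directory exists (system booted via EFI)."""
    return os.path.isdir(sysfs_efi_dir)


print("EFI boot" if booted_in_efi_mode() else "legacy BIOS boot (or not Linux)")
```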

Thank you.

Regards,

Hans

Comment 51 Hans de Goede 2022-12-08 19:59:52 UTC
Ivan,

Bjorn has posted a v2 of his patch series, this new version should keep things working on your laptop.

To make sure this does not cause regressions it would be great if you can give this v2 series a test.

I have started a test Fedora kernel build with the v2 patches added:
https://koji.fedoraproject.org/koji/taskinfo?taskID=95100621

Note this is still building atm. It should be finished in a couple of hours.

Here are some generic instructions for installing a kernel directly from koji (Fedora's buildsystem):
https://fedorapeople.org/~jwrdegoede/kernel-test-instructions.txt

After installing please boot with "efi=debug" added to the kernel commandline and collect dmesg output directly after boot. 

And test if suspend/resume still works of course.

Regards,

Hans

Comment 52 Hans de Goede 2022-12-09 08:13:01 UTC
Note the kernel build is done now. If you don't have time to test right away please at least download the rpms from:
https://koji.fedoraproject.org/koji/taskinfo?taskID=95100621

koji will remove the rpms for test builds after about a week to reclaim the diskspace.

Comment 53 ivan 2022-12-09 09:26:30 UTC
Created attachment 1931275 [details]
dmesg.txt 6.0.11-300.efimmio.fc37.x86_64

Comment 54 ivan 2022-12-09 09:26:53 UTC
Created attachment 1931276 [details]
acpidump

Comment 55 ivan 2022-12-09 09:35:06 UTC
Hans,

With the new kernel the laptop can't be suspended ("Some devices failed to suspend" - you'll see that in dmesg).

I've attached dmesg/acpidump.

Note: I used the - now old - fedora35 installation we did tests on before (I'm using an external HD to test EFI as the laptop's fedora install is with legacy boot); I imagine it shouldn't be an issue with suspend, except if there's a firmware related bug/feature with newer kernels preventing the device(s) from being suspended. If needed I'll update to the latest fedora.

Comment 56 Hans de Goede 2022-12-09 09:57:42 UTC
(In reply to ivan from comment #55)
> Hans,
> 
> With the new kernel the laptop can't be suspended ("Some devices failed to
> suspend" - you'll see that in dmesg).

Thank you for testing. So looking at the dmesg output the extra patches in this kernel seem to work as intended and they do not appear to make any difference on your laptop wrt E820 memory reservations vs PCI resource allocations, which is what these patches are about.

So your new laptop suspend/resume issue does not appear to be caused by these patches.

Looking at the actual suspend error:

[  150.458421] cm32181 i2c-CPLM3218:00: PM: dpm_run_callback(): acpi_subsys_suspend+0x0/0x60 returns -121
[  150.458435] cm32181 i2c-CPLM3218:00: PM: failed to suspend async: error -121

There seems to be an I2C communication problem communicating with the ambient light sensor, which I'm 99.9% sure is completely unrelated to Bjorn's PCI patches.
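
For reference, error -121 corresponds to EREMOTEIO ("Remote I/O error") on Linux. A rough sketch for pulling the failing device and error out of such a "failed to suspend" line; the regex and the one-entry errno table are assumptions made for illustration, not an exhaustive parser:

```python
import re

# Optional dmesg timestamp, then "<device>: PM: failed to suspend[ async]: error <n>".
SUSPEND_FAIL_RE = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s+)?(?P<device>.+?): PM: failed to suspend"
    r"(?: async)?: error (?P<errno>-?\d+)$"
)

# Linux errno value relevant to this log; deliberately not a complete table.
LINUX_ERRNO_NAMES = {121: "EREMOTEIO"}


def parse_suspend_failure(line):
    """Return (device, errno, errno_name), or None if the line does not match."""
    match = SUSPEND_FAIL_RE.match(line.strip())
    if not match:
        return None
    code = int(match.group("errno"))
    return match.group("device"), code, LINUX_ERRNO_NAMES.get(abs(code), "unknown")


line = "[  150.458435] cm32181 i2c-CPLM3218:00: PM: failed to suspend async: error -121"
print(parse_suspend_failure(line))
```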

This problem likely is introduced by:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=68c1b3dd5c48b2323067f8c1f0649ae2f31ab20b

Which is new in kernel 6.0. So it seems that catching this regression is a happy coincidence of the testing of Bjorn's patches.

Can you give a regular 6.0 build:
https://koji.fedoraproject.org/koji/buildinfo?buildID=2096126

a try and confirm that this issue also happens in a regular 6.0 build?

I will also prepare a patch fixing this and do another test kernel build for you.

Comment 57 ivan 2022-12-09 10:43:54 UTC
(In reply to Hans de Goede from comment #56)
> Can you give a regular 6.0 build:
> https://koji.fedoraproject.org/koji/buildinfo?buildID=2096126
> 
> a try and confirm that this issue also happens in a regular 6.0 build?

Indeed - same issue with the regular 6.0 build.

> I will also prepare a patch fixing this and do another test kernel build for
> you.

OK!

Comment 58 Hans de Goede 2022-12-09 11:01:32 UTC
Ok, here is a new test kernel build:

https://koji.fedoraproject.org/koji/taskinfo?taskID=95126795

This is the standard Fedora 6.0.11 + Bjorn's 4 PCI resource patches (which should be harmless) + a fix for the cm32181 suspend/resume issue.

Note as before I just started the build, so please give this some time to finish building.

Please boot in EFI mode, with "efi=debug" and collect dmesg output after what hopefully will be a successful suspend/resume test.

Comment 59 Hans de Goede 2022-12-09 11:07:15 UTC
Created attachment 1931322 [details]
[PATCH] iio: light: cm32181: Stop suspend failures causing system suspend to fail

Here is the (proposed) patch to fix the cm32181 error causing the system suspend to fail.

Comment 60 ivan 2022-12-09 14:35:30 UTC
(In reply to Hans de Goede from comment #58)
> Ok, here is a new test kernel build:
> 
> https://koji.fedoraproject.org/koji/taskinfo?taskID=95126795
> 
> This is the standard Fedora 6.0.11 + Bjorn's 4 PCI resource patches (which
> should be harmless) + a fix for the cm32181 suspend/resume issue.
> 
> Note as before I just started the build, so please give this some time to
> finish building.
> 
> Please boot in EFI mode, with "efi=debug" and collect dmesg output after
> what hopefully will be a successful suspend/resume test.

With this kernel the laptop now suspends properly but freezes on resume: the display is lit, showing whatever there was on the screen when the machine was suspended, but the keyboard/mouse aren't responsive, network is down, etc.; so I couldn't get a dmesg trace after a suspend/resume cycle. I've tried a second time with Wayland disabled / in console mode but it's the same thing.

(attaching dmesg.txt in the next comment)

Comment 61 ivan 2022-12-09 14:36:12 UTC
Created attachment 1931358 [details]
dmesg 6.0.11-300.cm32181.fc37.x86_64

Comment 62 Bjorn Helgaas 2022-12-09 14:49:48 UTC
Ivan, thank you very much for testing this.  I know there's still a resume issue (possibly should be a new bugzilla?), but testing the PCI resource patches was very helpful.

Let me know if it's OK to credit you by name and email address in the commit log (this bugzilla is already public, but your email address is not).

Comment 63 ivan 2022-12-09 15:33:43 UTC
(In reply to Bjorn Helgaas from comment #62)
> Ivan, thank you very much for testing this.  I know there's still a resume
> issue (possibly should be a new bugzilla?), but testing the PCI resource
> patches was very helpful.

Glad I could help. I've tested the same kernel as above in legacy mode and the laptop also froze after resume. I'll search current bugzilla issues, maybe it's already reported.
BTW the laptop also has the i2c suspend issue with the last three f36 kernel updates - I'm wondering how my partner didn't notice her laptop wasn't suspending recently.

> Let me know if it's OK to credit you by name and email address in the commit
> log (this bugzilla is already public, but your email address is not).

If it's not a problem I'd prefer to keep this email address private...

Thanks !

Comment 64 Bjorn Helgaas 2022-12-09 15:40:40 UTC
> If it's not a problem I'd prefer to keep this email address private...

Not a problem at all!  I'll just mention the bugzilla itself, i.e.,

  Link: https://bugzilla.redhat.com/show_bug.cgi?id=2029207    X1 Carbon

Thanks again for your testing; this is a popular machine so it helps many other folks.

Comment 65 Hans de Goede 2022-12-09 18:24:41 UTC
(In reply to ivan from comment #60)
> With this kernel the laptop now suspends properly but freezes on resume: the
> display is lit on, showing whatever there was on the screen when the machine
> was suspended, but the keyboard/mouse aren't responsive, network is down,
> etc.; so I couldn't get a dmesg trace after a suspend/resume cycle. I've
> tried a second time with Wayland disabled / in console mode but it's the
> same thing.

Hmm, ok, what happens if you create a:

/etc/modprobe.d/cm32181-blacklist.conf

With:

blacklist cm32181

in there, then reboot into a normal Fedora 6.0 kernel and try suspend/resume ?

If that works, can you also test suspend/resume again with the test kernel with cm32181 in its version ?
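
The blacklist step above can be sketched as follows. This is only an illustration: the function names are hypothetical, the file name is arbitrary (modprobe.d reads any file ending in .conf), and writing into /etc/modprobe.d needs root. The check assumes the /proc/modules format, where the module name is the first field on each line.

```python
from pathlib import Path


def blacklist_conf(module: str) -> str:
    """Render the one-line modprobe.d entry that stops automatic loading."""
    return f"blacklist {module}\n"


def write_blacklist(module: str, conf_dir: Path) -> Path:
    """Write <module>-blacklist.conf into conf_dir and return its path."""
    path = conf_dir / f"{module}-blacklist.conf"
    path.write_text(blacklist_conf(module))
    return path


def is_loaded(module: str, proc_modules: str) -> bool:
    """Scan /proc/modules content for a line whose first field is `module`."""
    return any(
        line.split()[0] == module
        for line in proc_modules.splitlines()
        if line.strip()
    )
```

After a reboot, is_loaded("cm32181", Path("/proc/modules").read_text()) should return False if the blacklist entry took effect.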

Comment 66 ivan 2022-12-10 06:37:59 UTC
(In reply to Hans de Goede from comment #65)

> Hmm, ok, what happens if you create a:
> 
> /etc/modprobe.d/cm32181-blacklist.conf
> 
> With:
> 
> blacklist cm32181
> 
> in there, then reboot into a normal Fedora 6.0 kernel and try suspend/resume
> ?

It works !

When doing tests yesterday I tried rmmod'ing i2c but had the same issue. I thought that, cm32181 being an i2c device, removing i2c would fix it, and lsmod didn't show any relevant i2c dependencies either, so I assumed the driver was built into the kernel and it didn't occur to me that I could simply blacklist it. Interestingly, loading the driver manually triggers a kernel oops. I'll file another bug if someone hasn't reported that issue yet.
Anyway - thanks for the fix !

> If that works, can you also test suspend/resume again with the test kernel
> with cm32181 in its version ?

It works well too.

Thanks !

Comment 67 Hans de Goede 2022-12-12 12:00:10 UTC
(In reply to ivan from comment #66)
> (In reply to Hans de Goede from comment #65)
> 
> > Hmm, ok, what happens if you create a:
> > 
> > /etc/modprobe.d/cm32181-blacklist.conf
> > 
> > With:
> > 
> > blacklist cm32181
> > 
> > in there, then reboot into a normal Fedora 6.0 kernel and try suspend/resume
> > ?
> 
> It works !

That is good to know, thank you for testing. Let's continue discussing the cm32181 issue in the separate bug (bug 2152281) which you filed for this.