Bug 1671504

Summary: disabling secondary CPU hangs with kernel 4.19+ on Lenovo ThinkPad X1 Carbon 5th (was: Lenovo ThinkPad X1 Carbon 5th fails to suspend with kernel 4.19+)
Product: [Fedora] Fedora Reporter: Thomas Müller <thomas>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 29 CC: airlied, andy, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, mchehab, mjg59, steved, y9t7sypezp
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
URL: https://bugzilla.kernel.org/show_bug.cgi?id=202679
Last Closed: 2019-04-12 05:30:04 UTC Type: Bug
Attachments:
Description                                                               Flags
dmesg from failed suspend attempt                                         none
dmesg from failed suspend attempt with 4.20.6 and wifi and LTE disabled   none
dmesg from 4.18.18 with successful suspend                                none

Description Thomas Müller 2019-01-31 18:33:06 UTC
Created attachment 1525522 [details]
dmesg from failed suspend attempt

1. Please describe the problem:
Starting with kernel 4.19 my Lenovo ThinkPad X1 Carbon 5th fails to suspend to RAM.
When closing the lid or executing "systemctl suspend", the screen goes black and the status LED starts to blink rapidly (just like when power is plugged in).
The keyboard lights can still be toggled using Fn+space, so the firmware appears to be (partly) alive.


2. What is the Version-Release number of the kernel:
kernel-4.20.5-200.fc29.x86_64


3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :
Yes, it still works with kernel-4.18.18-300.fc29.x86_64.
It basically started failing with the first 4.19 kernel that hit updates-testing.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:
Yes, this happens every single time suspend is triggered with a 4.19+ kernel.
I only have to boot it up and try to suspend, even in runlevel 3.


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:
Hmm... I'll try that and add a comment.


6. Are you running any modules that are not shipped directly with Fedora's kernel?:
No, no out-of-tree modules.


7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Thomas Müller 2019-01-31 18:46:40 UTC
I've installed kernel-5.0.0-0.rc4.git2.1.fc30.x86_64 from koji... still fails to suspend.

Comment 2 Thomas Müller 2019-02-01 06:52:53 UTC
Just realized I probably wasn't very clear about the implications of the failed suspend...

After the screen goes blank and the LED starts blinking, there is no way to wake up or recover the system. The only way out is to hold the power button for several seconds to force it off, then power it back on.

Comment 3 Steve 2019-02-01 19:59:46 UTC
Several of the entries at the end of the log are for the wireless device, so it might be worth seeing if disabling wireless allows suspending to complete. If the BIOS supports it, you could try disabling wireless there.

Comment 4 Thomas Müller 2019-02-05 18:39:00 UTC
(In reply to Steve from comment #3)
> Several of the entries at the end of the log are for the wireless device, so
> it might be worth seeing if disabling wireless allows suspending to
> complete. If the BIOS supports it, you could try disabling wireless there.

I've just disabled wireless in the BIOS and booted 4.20.6-200.fc29.x86_64, but suspend still fails.

The kernel log doesn't show anything interesting at the end, probably because it doesn't hit the disk :(

Comment 5 Steve 2019-02-05 21:28:54 UTC
(In reply to Thomas Müller from comment #4)
...
> I've just disabled wireless in the BIOS and booted 4.20.6-200.fc29.x86_64,
> but suspend still fails.
...

Thanks for checking that. There is also a USB Sierra Wireless EM7455 Qualcomm Snapdragon X7 LTE-A device. Is there a way to disable it?

Snippet from attached log:

$ grep -n 'usb 1-6' dmesg_4.20.5_failedSuspend-1.txt 
704:Jan 31 20:11:51 kernel: usb 1-6: new high-speed USB device number 2 using xhci_hcd
709:Jan 31 20:11:51 kernel: usb 1-6: config 1 has an invalid interface number: 12 but max is 1
710:Jan 31 20:11:51 kernel: usb 1-6: config 1 has an invalid interface number: 13 but max is 1
711:Jan 31 20:11:51 kernel: usb 1-6: config 1 has an invalid interface number: 13 but max is 1
712:Jan 31 20:11:51 kernel: usb 1-6: config 1 has no interface number 0
713:Jan 31 20:11:51 kernel: usb 1-6: config 1 has no interface number 1
714:Jan 31 20:11:51 kernel: usb 1-6: New USB device found, idVendor=1199, idProduct=9079, bcdDevice= 0.06
715:Jan 31 20:11:51 kernel: usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
716:Jan 31 20:11:51 kernel: usb 1-6: Product: Sierra Wireless EM7455 Qualcomm Snapdragon X7 LTE-A
717:Jan 31 20:11:51 kernel: usb 1-6: Manufacturer: Sierra Wireless, Incorporated

Comment 6 Steve 2019-02-05 22:24:20 UTC
Does your BIOS have an option to set the sleep state to "Linux"?

This is for Carbon, 6th, but it mentions "BIOS version 1.30", and the attached log shows "1.36", so this might be applicable:

Lenovo ThinkPad X1 Carbon (Gen 6)
Suspend issues
https://wiki.archlinux.org/index.php/Lenovo_ThinkPad_X1_Carbon_(Gen_6)#Suspend_issues

Snippet from attached log:

$ egrep 'DMI:|ACPI.*supports' dmesg_4.20.5_failedSuspend-1.txt
Jan 31 20:11:51 kernel: DMI: LENOVO 20HRCTO1WW/20HRCTO1WW, BIOS N1MET51W(1.36) 01/11/2019
Jan 31 20:11:51 kernel: ACPI: (supports S0 S3 S4 S5)

Comment 7 Thomas Müller 2019-02-06 07:23:47 UTC
(In reply to Steve from comment #5)
> (In reply to Thomas Müller from comment #4)
> ...
> > I've just disabled wireless in the BIOS and booted 4.20.6-200.fc29.x86_64,
> > but suspend still fails.
> ...
> 
> Thanks for checking that. There is also a USB Sierra Wireless EM7455
> Qualcomm Snapdragon X7 LTE-A device. Is there a way to disable it?
Actually, it was already disabled alongside wireless LAN during the last boot of 4.20.6.
I'll add the full kernel log from that unsuccessful experiment for reference.



(In reply to Steve from comment #6)
> Does your BIOS have an option to set the sleep state to "Linux"?
> 
> This is for Carbon, 6th, but it mentions "BIOS version 1.30", and the
> attached log shows "1.36", so this might be applicable:
> 
> Lenovo ThinkPad X1 Carbon (Gen 6)
> Suspend issues
> https://wiki.archlinux.org/index.php/Lenovo_ThinkPad_X1_Carbon_(Gen_6)#Suspend_issues
> 
> Snippet from attached log:
> 
> $ egrep 'DMI:|ACPI.*supports' dmesg_4.20.5_failedSuspend-1.txt
> Jan 31 20:11:51 kernel: DMI: LENOVO 20HRCTO1WW/20HRCTO1WW, BIOS N1MET51W(1.36) 01/11/2019
> Jan 31 20:11:51 kernel: ACPI: (supports S0 S3 S4 S5)

No, the 5th does not have this option, even with the current BIOS version.
However, S3 is advertised as supported by the firmware according to the kernel log:
> $ cat dmesg_4.20.6-200.fc29.x86_64_noWifi_noLTE | grep -i "acpi: (supports"
> Feb 05 19:28:08 kernel: ACPI: (supports S0 S3 S4 S5)

Comment 8 Thomas Müller 2019-02-06 07:25:01 UTC
Created attachment 1527433 [details]
dmesg from failed suspend attempt with 4.20.6 and wifi and LTE disabled

Comment 9 Steve 2019-02-07 13:29:49 UTC
Thanks for your followup report and for attaching the 4.20.6 output. For the record, could you post the output from:

$ grep . /sys/power/*

Comment 10 Thomas Müller 2019-02-07 18:09:41 UTC
(In reply to Steve from comment #9)
> Thanks for your followup report and for attaching the 4.20.6 output. For the
> record, could you post the output from:
> 
> $ grep . /sys/power/*

4.18.18:
> /sys/power/disk:[disabled]
> /sys/power/image_size:6609518592
> /sys/power/mem_sleep:s2idle [deep]
> /sys/power/pm_async:1
> /sys/power/pm_debug_messages:0
> /sys/power/pm_freeze_timeout:20000
> /sys/power/pm_print_times:0
> /sys/power/pm_test:[none] core processors platform devices freezer
> /sys/power/pm_trace:0
> /sys/power/pm_trace_dev_match:acpi
> /sys/power/pm_trace_dev_match:memory
> grep: /sys/power/pm_wakeup_irq: No data available
> /sys/power/reserved_size:1048576
> /sys/power/resume:0:0
> /sys/power/resume_offset:0
> /sys/power/state:freeze mem
> /sys/power/wakeup_count:75

4.20.6:
> /sys/power/disk:[disabled]
> /sys/power/image_size:6607970304
> /sys/power/mem_sleep:s2idle [deep]
> /sys/power/pm_async:1
> /sys/power/pm_debug_messages:0
> /sys/power/pm_freeze_timeout:20000
> /sys/power/pm_print_times:0
> /sys/power/pm_test:[none] core processors platform devices freezer
> /sys/power/pm_trace:0
> /sys/power/pm_trace_dev_match:memory
> grep: /sys/power/pm_wakeup_irq: No data available
> /sys/power/reserved_size:1048576
> /sys/power/resume:0:0
> /sys/power/resume_offset:0
> /sys/power/state:freeze mem
> /sys/power/wakeup_count:1
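
For anyone repeating this comparison, the two listings can be captured to files and diffed instead of being compared by eye. This is only a minimal sketch, assuming a POSIX-like shell with `local` support (bash, dash) and grep; the function name snapshot_power and the file names in the usage note are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of any tool): print "file:value" for every
# readable attribute in a directory, like `grep . /sys/power/*`, but sorted
# and with read errors (e.g. pm_wakeup_irq "No data available") suppressed.
snapshot_power() {
    local dir="${1:-/sys/power}"
    grep -H . "$dir"/* 2>/dev/null | sort
}
```

Run it as `snapshot_power > power-$(uname -r).txt` under each kernel, then diff the two files. For the listings above, the only differences are image_size, wakeup_count, and the extra "acpi" line for pm_trace_dev_match under 4.18.18.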

Comment 11 Steve 2019-02-07 22:03:20 UTC
(In reply to Thomas Müller from comment #10)
...
> > /sys/power/pm_test:[none] core processors platform devices freezer
...

Thanks for posting the /sys/power/ output. Here is a possible debugging strategy using "pm_test".

In a terminal window, run:

$ dmesg -w

In a separate terminal window, run as root:

# sync
# cat /sys/power/pm_test
# echo devices > /sys/power/pm_test  # Echo "devices" or one of the other strings in pm_test.
# cat /sys/power/pm_test # This should show "[devices]" (in square brackets).
# echo mem > /sys/power/state

Wait for about 10 seconds -- the system should automatically resume.

Documentation:

Debugging hibernation and suspend
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt

Scroll down to "2. Testing suspend to RAM (STR)".

I'm not sure how to use this information, but you don't need to do a mount before running:
# cat /sys/kernel/debug/suspend_stats

This documents the files in /sys/power/, but not "pm_test":
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-power
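
The manual steps above can be wrapped in two small functions so that each pm_test level is tried in turn. This is a sketch under assumptions, not a tested procedure for this machine: the function names are made up, run_pm_test needs root, and it simply repeats the sync / pm_test / state writes shown above.

```shell
#!/bin/bash
# List the test levels offered by a pm_test file (default /sys/power/pm_test),
# one per line, dropping "none" and the square brackets that mark the
# currently active level.
pm_test_levels() {
    local file="${1:-/sys/power/pm_test}"
    tr ' ' '\n' < "$file" | tr -d '[]' | grep -v -x -e none -e ''
}

# Run one suspend test at the given level (needs root). If the level works,
# the system should resume by itself after a short delay; if it hangs, the
# last level printed is the one that failed.
run_pm_test() {
    local level="$1"
    echo "testing pm_test level: $level"
    sync
    echo "$level" > /sys/power/pm_test
    echo mem > /sys/power/state
    echo none > /sys/power/pm_test   # switch test mode off again
}

# Example (as root), shallowest level first:
#   for level in $(pm_test_levels | tac); do run_pm_test "$level"; done
```

The `tac` in the example reverses the order in which the file lists the levels here ("core processors platform devices freezer"), so that "freezer" is tried first and "core" last.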

Comment 12 Thomas Müller 2019-02-10 18:58:29 UTC
When I execute
> echo core > /sys/power/pm_test
and then
> echo mem > /sys/power/state
the system immediately goes blank and freezes just like when I really try to activate suspend...

No chance to get anything from `dmesg -w` :(


The other options (processors platform devices freezer) worked without any errors.

Comment 13 Thomas Müller 2019-02-10 19:12:41 UTC
Correction, both "core" and "processors" fail.

Comment 14 Steve 2019-02-10 20:02:27 UTC
(In reply to Thomas Müller from comment #13)
> Correction, both "core" and "processors" fail.

Thanks for your report. The documentation says:

'If the "processors" test fails, the disabling/enabling of nonboot CPUs does not
work (of course, this only may be an issue on SMP systems) and the problem
should be reported.  In that case you can also try to switch the nonboot CPUs
off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and
see if that works.'
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt

Try:

# cat /sys/devices/system/cpu/cpu*/online

# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > /sys/devices/system/cpu/cpu2/online
# echo 0 > /sys/devices/system/cpu/cpu3/online

# cat /sys/devices/system/cpu/cpu*/online

(NB: There is no "online" file for "cpu0".)

After that, try:

# echo processors > /sys/power/pm_test
# echo mem > /sys/power/state

For the record, the Intel i7-7600U has two cores and four threads:

$ grep 'smpboot: CPU0:' dmesg_4.20.6-noWifi_noLTE-1.txt 
Feb 05 19:28:08 kernel: smpboot: CPU0: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz (family: 0x6, model: 0x8e, stepping: 0x9)

https://ark.intel.com/products/97466/Intel-Core-i7-7600U-Processor-4M-Cache-up-to-3-90-GHz-
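
Since a plain `cat` of the online files prints the values without saying which CPU they belong to, a labelled listing may be easier to read before and after the echo commands. A minimal sketch, assuming the sysfs layout quoted above; the function name list_hotplug_cpus is hypothetical.

```shell
#!/bin/bash
# Print "cpuN online" or "cpuN offline" for every CPU that has an "online"
# attribute under the given directory (default /sys/devices/system/cpu).
# CPUs without that attribute (cpu0 here) are skipped.
list_hotplug_cpus() {
    local root="${1:-/sys/devices/system/cpu}" f state name
    for f in "$root"/cpu[0-9]*/online; do
        [ -r "$f" ] || continue
        state=$(cat "$f")
        name=$(basename "$(dirname "$f")")
        if [ "$state" = 1 ]; then
            echo "$name online"
        else
            echo "$name offline"
        fi
    done
}
```

Calling `list_hotplug_cpus` before and after `echo 0 > /sys/devices/system/cpu/cpu1/online` should show cpu1 change from online to offline if CPU hotplug works.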

Comment 15 Steve 2019-02-10 20:15:11 UTC
Here is a more elegant way to manage CPUs:

# lscpu -e # list

# chcpu -d 1,2,3 # disable

# lscpu -e

# chcpu -e 1,2,3 # enable

Documentation:

$ man lscpu
$ man chcpu

Comment 16 Thomas Müller 2019-02-15 11:41:52 UTC
Well, we are getting closer to the actual problem, I guess...

Initially, lscpu -e shows the following (output translated from a German locale):
> CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
> 0   0    0      0    0:0:0:0       yes    3900.0000 400.0000
> 1   0    0      1    1:1:1:0       yes    3900.0000 400.0000
> 2   0    0      0    0:0:0:0       yes    3900.0000 400.0000
> 3   0    0      1    1:1:1:0       yes    3900.0000 400.0000


If I try to execute
# chcpu -d 1,2,3
or
# echo 0 > /sys/devices/system/cpu/cpu1/online
on 4.20.8-200.fc29.x86_64, the command blocks, while the system itself remains (mostly) usable.

lscpu still shows the same output (i.e. all CPUs online), but if I try to read directly from /sys/devices/system/cpu/cpu1/online (i.e. `cat /sys/devices/system/cpu/cpu1/online`), that command also blocks indefinitely.
Unfortunately, no message whatsoever is shown in the kernel logs. Also, reboot and poweroff no longer work, and the system needs a hard reset. :(


On 4.18.18-300.fc29.x86_64 the above commands successfully take a cpu offline (and online again).
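
Because the blocked write ties up the terminal, it may help to start it in the background and only report whether it came back in time. The following is a rough sketch, not something that was run on this machine; the function name finishes_within is made up, and the command is deliberately not killed afterwards, since a task stuck inside the kernel may not react to signals.

```shell
#!/bin/bash
# Run a command in the background and report whether it finished within the
# given number of seconds. Returns 0 if it finished, 1 if it is still running
# (in which case the temporary marker file is left behind), 2 on setup error.
finishes_within() {
    local seconds="$1" marker i=0
    shift
    marker=$(mktemp) || return 2
    # The subshell writes to the marker only after the command has returned.
    ( "$@"; echo done > "$marker" ) &
    while [ "$i" -lt "$seconds" ]; do
        sleep 1
        i=$((i + 1))
        if [ -s "$marker" ]; then
            rm -f "$marker"
            return 0
        fi
    done
    echo "still running after ${seconds}s: $*" >&2
    return 1
}
```

As root, something like `finishes_within 10 sh -c 'echo 0 > /sys/devices/system/cpu/cpu1/online' || echo w > /proc/sysrq-trigger` would then ask the kernel to log its blocked tasks when the write does not return. Whether the hung write actually shows up in that dump is an assumption; nothing in this report confirms which state the task is in.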

Comment 17 Steve 2019-02-15 14:40:15 UTC
Thanks for testing and for your report. I suggest updating the bug summary to say something like this:

"disabling secondary CPU hangs with kernel 4.19+ on Lenovo ThinkPad X1 Carbon 5th"

Comment 18 Steve 2019-02-15 14:47:31 UTC
These messages could be related. For comparison, could you attach a log for 4.18.18-300.fc29.x86_64?

$ grep -n 'CPU.*temp' dmesg_4.20.6-noWifi_noLTE-1.txt 
693:Feb 05 19:28:08 kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
694:Feb 05 19:28:08 kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
695:Feb 05 19:28:08 kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
696:Feb 05 19:28:08 kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
697:Feb 05 19:28:08 kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
698:Feb 05 19:28:08 kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
701:Feb 05 19:28:08 kernel: CPU0: Core temperature/speed normal
702:Feb 05 19:28:08 kernel: CPU2: Core temperature/speed normal
703:Feb 05 19:28:08 kernel: CPU2: Package temperature/speed normal
704:Feb 05 19:28:08 kernel: CPU3: Package temperature/speed normal
705:Feb 05 19:28:08 kernel: CPU1: Package temperature/speed normal
706:Feb 05 19:28:08 kernel: CPU0: Package temperature/speed normal

Comment 19 Thomas Müller 2019-02-17 08:36:28 UTC
Created attachment 1535625 [details]
dmesg from 4.18.18 with successful suspend

(In reply to Steve from comment #18)
> These messages could be related. For comparison, could you attach a log for
> 4.18.18-300.fc29.x86_64?
I've attached a log from 4.18.18 for reference. It also contains a successful suspend and resume at the end of the log.

I'm pretty sure those messages are unrelated: I have always seen them, and they also appear with 4.18.18.
The X1 is quite small and its cooling seems a bit undersized, which is why the CPUs get throttled every now and then.

Comment 20 Thomas Müller 2019-02-24 13:23:14 UTC
I have bisected the kernel and found the culprit (or at least something that triggers the bad behavior):

[be45bf5395e0886a93fc816bbe41a008ec2e42e2] watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
be45bf5395e0886a93fc816bbe41a008ec2e42e2 is the first bad commit
commit be45bf5395e0886a93fc816bbe41a008ec2e42e2
Author: Peter Zijlstra <peterz>
Date:   Fri Jul 13 12:42:08 2018 +0200

    watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug
    
    When scheduling is delayed for longer than the softlockup interrupt
    period it is possible to double-queue the cpu_stop_work, causing list
    corruption.
    
    Cure this by adding a completion to track the cpu_stop_work's
    progress.
    
    Reported-by: kernel test robot <lkp>
    Tested-by: Rong Chen <rong.a.chen>
    Signed-off-by: Peter Zijlstra (Intel) <peterz>
    Cc: Linus Torvalds <torvalds>
    Cc: Peter Zijlstra <peterz>
    Cc: Thomas Gleixner <tglx>
    Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
    Link: http://lkml.kernel.org/r/20180713104208.GW2494@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo>

:040000 040000 6aca2dbb84bc33fe442b18b3d0a135c27adff7b9 2710af12d32e4b98df07768716689b213bce45fc M      kernel

Comment 21 Laura Abbott 2019-04-09 20:44:40 UTC
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.
 
Fedora XX has now been rebased to 5.0.6. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.
 
If you experience different issues, please open a new bug report for those.

Comment 22 Thomas Müller 2019-04-12 05:30:04 UTC
Good news: starting with 5.0.6 suspend is working again.