Bug 1671504
Summary: | disabling secondary CPU hangs with kernel 4.19+ on Lenovo ThinkPad X1 Carbon 5th (was: Lenovo ThinkPad X1 Carbon 5th fails to suspend with kernel 4.19+) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Thomas Müller <thomas> | ||||||||
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> | ||||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||||
Severity: | unspecified | Docs Contact: | |||||||||
Priority: | unspecified | ||||||||||
Version: | 29 | CC: | airlied, andy, bskeggs, hdegoede, ichavero, itamar, jarodwilson, jeremy, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, mchehab, mjg59, steved, y9t7sypezp | ||||||||
Target Milestone: | --- | ||||||||||
Target Release: | --- | ||||||||||
Hardware: | x86_64 | ||||||||||
OS: | Linux | ||||||||||
URL: | https://bugzilla.kernel.org/show_bug.cgi?id=202679 | ||||||||||
Last Closed: | 2019-04-12 05:30:04 UTC | Type: | Bug | ||||||||
Description
Thomas Müller
2019-01-31 18:33:06 UTC
I've installed kernel-5.0.0-0.rc4.git2.1.fc30.x86_64 from koji... still fails to suspend.

Just realized I probably wasn't very clear about the implications of the failed suspend... After the screen goes blank and the LED starts blinking, there is no way to wake up or recover the system. The only way out is to hold the power button for several seconds to force it off, then power cycle.

Several of the entries at the end of the log are for the wireless device, so it might be worth seeing if disabling wireless allows suspending to complete. If the BIOS supports it, you could try disabling wireless there.

(In reply to Steve from comment #3)
> Several of the entries at the end of the log are for the wireless device, so
> it might be worth seeing if disabling wireless allows suspending to
> complete. If the BIOS supports it, you could try disabling wireless there.

I've just disabled wireless in the BIOS and booted 4.20.6-200.fc29.x86_64, but suspend still fails. The kernel log doesn't show anything interesting at the end, probably because it doesn't hit the disk :(

(In reply to Thomas Müller from comment #4)
...
> I've just disabled wireless in the BIOS and booted 4.20.6-200.fc29.x86_64,
> but suspend still fails.
...

Thanks for checking that. There is also a USB Sierra Wireless EM7455 Qualcomm Snapdragon X7 LTE-A device. Is there a way to disable it?
Snippet from attached log:

$ grep -n 'usb 1-6' dmesg_4.20.5_failedSuspend-1.txt
704:Jan 31 20:11:51 kernel: usb 1-6: new high-speed USB device number 2 using xhci_hcd
709:Jan 31 20:11:51 kernel: usb 1-6: config 1 has an invalid interface number: 12 but max is 1
710:Jan 31 20:11:51 kernel: usb 1-6: config 1 has an invalid interface number: 13 but max is 1
711:Jan 31 20:11:51 kernel: usb 1-6: config 1 has an invalid interface number: 13 but max is 1
712:Jan 31 20:11:51 kernel: usb 1-6: config 1 has no interface number 0
713:Jan 31 20:11:51 kernel: usb 1-6: config 1 has no interface number 1
714:Jan 31 20:11:51 kernel: usb 1-6: New USB device found, idVendor=1199, idProduct=9079, bcdDevice= 0.06
715:Jan 31 20:11:51 kernel: usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
716:Jan 31 20:11:51 kernel: usb 1-6: Product: Sierra Wireless EM7455 Qualcomm Snapdragon X7 LTE-A
717:Jan 31 20:11:51 kernel: usb 1-6: Manufacturer: Sierra Wireless, Incorporated

Does your BIOS have an option to set the sleep state to "Linux"?

This is for Carbon, 6th, but it mentions "BIOS version 1.30", and the attached log shows "1.36", so this might be applicable:

Lenovo ThinkPad X1 Carbon (Gen 6)
Suspend issues
https://wiki.archlinux.org/index.php/Lenovo_ThinkPad_X1_Carbon_(Gen_6)#Suspend_issues

Snippet from attached log:

$ egrep 'DMI:|ACPI.*supports' dmesg_4.20.5_failedSuspend-1.txt
Jan 31 20:11:51 kernel: DMI: LENOVO 20HRCTO1WW/20HRCTO1WW, BIOS N1MET51W(1.36) 01/11/2019
Jan 31 20:11:51 kernel: ACPI: (supports S0 S3 S4 S5)

(In reply to Steve from comment #5)
> (In reply to Thomas Müller from comment #4)
> ...
> > I've just disabled wireless in the BIOS and booted 4.20.6-200.fc29.x86_64,
> > but suspend still fails.
> ...
>
> Thanks for checking that. There is also a USB Sierra Wireless EM7455
> Qualcomm Snapdragon X7 LTE-A device. Is there a way to disable it?

Actually it already was disabled alongside wireless lan during the last boot of 4.20.6.
I'll add the full kernel log from that unsuccessful experiment for reference.

(In reply to Steve from comment #6)
> Does your BIOS have an option to set the sleep state to "Linux"?
>
> This is for Carbon, 6th, but it mentions "BIOS version 1.30", and the
> attached log shows "1.36", so this might be applicable:
>
> Lenovo ThinkPad X1 Carbon (Gen 6)
> Suspend issues
> https://wiki.archlinux.org/index.php/Lenovo_ThinkPad_X1_Carbon_(Gen_6)#Suspend_issues
>
> Snippet from attached log:
>
> $ egrep 'DMI:|ACPI.*supports' dmesg_4.20.5_failedSuspend-1.txt
> Jan 31 20:11:51 kernel: DMI: LENOVO 20HRCTO1WW/20HRCTO1WW, BIOS
> N1MET51W(1.36) 01/11/2019
> Jan 31 20:11:51 kernel: ACPI: (supports S0 S3 S4 S5)

No, the 5th does not have this option, even with the current BIOS version. However, S3 is advertised as supported by the firmware according to the kernel log:
> $ cat dmesg_4.20.6-200.fc29.x86_64_noWifi_noLTE | grep -i "acpi: (supports"
> Feb 05 19:28:08 kernel: ACPI: (supports S0 S3 S4 S5)

Created attachment 1527433
dmesg from failed suspend attempt with 4.20.6 and wifi and LTE disabled
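
The log checks quoted above (the "ACPI: (supports ...)" line) can be wrapped in a small helper when comparing several saved logs. This is only a sketch: the function names acpi_sleep_states and log_supports_state are made up for illustration, and it assumes the log contains a line in the same format as the ones quoted in this report.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.

# Print the sleep states the firmware advertises, one per line,
# taken from the first "ACPI: (supports ...)" line of the log in $1.
acpi_sleep_states() {
    sed -n 's/.*ACPI: (supports \([^)]*\)).*/\1/p' "$1" | head -n 1 | tr ' ' '\n'
}

# Succeed if the log in $1 lists the state named in $2 (for example S3).
log_supports_state() {
    acpi_sleep_states "$1" | grep -qx "$2"
}
```

For example, `log_supports_state dmesg_4.20.6-noWifi_noLTE-1.txt S3 && echo "S3 advertised"` would confirm what the grep output above already shows. It says nothing about whether suspend actually works, only what the firmware reports.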
Thanks for your followup report and for attaching the 4.20.6 output. For the record, could you post the output from:

$ grep . /sys/power/*

(In reply to Steve from comment #9)
> Thanks for your followup report and for attaching the 4.20.6 output. For the
> record, could you post the output from:
>
> $ grep . /sys/power/*

4.18.18:
> /sys/power/disk:[disabled]
> /sys/power/image_size:6609518592
> /sys/power/mem_sleep:s2idle [deep]
> /sys/power/pm_async:1
> /sys/power/pm_debug_messages:0
> /sys/power/pm_freeze_timeout:20000
> /sys/power/pm_print_times:0
> /sys/power/pm_test:[none] core processors platform devices freezer
> /sys/power/pm_trace:0
> /sys/power/pm_trace_dev_match:acpi
> /sys/power/pm_trace_dev_match:memory
> grep: /sys/power/pm_wakeup_irq: No data available
> /sys/power/reserved_size:1048576
> /sys/power/resume:0:0
> /sys/power/resume_offset:0
> /sys/power/state:freeze mem
> /sys/power/wakeup_count:75

4.20.6:
> /sys/power/disk:[disabled]
> /sys/power/image_size:6607970304
> /sys/power/mem_sleep:s2idle [deep]
> /sys/power/pm_async:1
> /sys/power/pm_debug_messages:0
> /sys/power/pm_freeze_timeout:20000
> /sys/power/pm_print_times:0
> /sys/power/pm_test:[none] core processors platform devices freezer
> /sys/power/pm_trace:0
> /sys/power/pm_trace_dev_match:memory
> grep: /sys/power/pm_wakeup_irq: No data available
> /sys/power/reserved_size:1048576
> /sys/power/resume:0:0
> /sys/power/resume_offset:0
> /sys/power/state:freeze mem
> /sys/power/wakeup_count:1

(In reply to Thomas Müller from comment #10)
...
> > /sys/power/pm_test:[none] core processors platform devices freezer
...

Thanks for posting the /sys/power/ output.

Here is a possible debugging strategy using "pm_test".

In a terminal window, run:

$ dmesg -w

In a separate terminal window, run as root:

# sync
# cat /sys/power/pm_test
# echo devices > /sys/power/pm_test # Echo "devices" or one of the other strings in pm_test.
# cat /sys/power/pm_test # This should show "[devices]" (in square brackets).
# echo mem > /sys/power/state

Wait for about 10 seconds -- the system should automatically resume.

Documentation:

Debugging hibernation and suspend
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt

Scroll down to "2. Testing suspend to RAM (STR)".

I'm not sure how to use this information, but you don't need to do a mount before running:

# cat /sys/kernel/debug/suspend_stats

This documents the files in /sys/power/, but not "pm_test":
https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-power

When I execute
> echo core > /sys/power/pm_test
and then
> echo mem > /sys/power/state
the system immediately goes blank and freezes just like when I really try to activate suspend... No chance to get anything from `dmesg -w` :(

The other options (processors platform devices freezer) worked without any errors.

Correction, both "core" and "processors" fail.

(In reply to Thomas Müller from comment #13)
> Correction, both "core" and "processors" fail.

Thanks for your report. The documentation says:

'If the "processors" test fails, the disabling/enabling of nonboot CPUs does not work (of course, this only may be an issue on SMP systems) and the problem should be reported. In that case you can also try to switch the nonboot CPUs off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and see if that works.'
https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt

Try:

# cat /sys/devices/system/cpu/cpu*/online
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > /sys/devices/system/cpu/cpu2/online
# echo 0 > /sys/devices/system/cpu/cpu3/online
# cat /sys/devices/system/cpu/cpu*/online

(NB: There is no "online" file for "cpu0".)
After that, try:

# echo processors > /sys/power/pm_test
# echo mem > /sys/power/state

For the record, the Intel i7-7600U has two cores and four threads:

$ grep 'smpboot: CPU0:' dmesg_4.20.6-noWifi_noLTE-1.txt
Feb 05 19:28:08 kernel: smpboot: CPU0: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz (family: 0x6, model: 0x8e, stepping: 0x9)

https://ark.intel.com/products/97466/Intel-Core-i7-7600U-Processor-4M-Cache-up-to-3-90-GHz-

Here is a more elegant way to manage CPUs:

# lscpu -e # list
# chcpu -d 1,2,3 # disable
# lscpu -e
# chcpu -e 1,2,3 # enable

Documentation:

$ man lscpu
$ man chcpu

Well, we are coming closer to the actual problem I guess...
Initially, lscpu -e shows the following
> CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
> 0 0 0 0 0:0:0:0 ja 3900,0000 400,0000
> 1 0 0 1 1:1:1:0 ja 3900,0000 400,0000
> 2 0 0 0 0:0:0:0 ja 3900,0000 400,0000
> 3 0 0 1 1:1:1:0 ja 3900,0000 400,0000
If I try to execute
# chcpu -d 1,2,3
or
# echo 0 > /sys/devices/system/cpu/cpu1/online
on 4.20.8-200.fc29.x86_64, the command blocks, while the system itself remains (mostly) usable.
lscpu still shows the same output (i.e. all CPUs online), but if I try to read directly from /sys/devices/system/cpu/cpu1/online (i.e. `cat /sys/devices/system/cpu/cpu1/online`), that command also blocks indefinitely.
Unfortunately, no message whatsoever is shown in the kernel logs. Also, reboot or poweroff no longer works and the system needs a hard reset. :(
On 4.18.18-300.fc29.x86_64 the above commands successfully take a cpu offline (and online again).
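
Because the write to the "online" file can block forever on an affected kernel, one way to repeat this test without losing the terminal is to do the write in a background subshell and poll for its completion. The following is a rough sketch, not a fix: the function name try_cpu_offline, the default timeout of 5 seconds and the return code 124 for "still blocked" are all choices made for this example. It assumes, as described above, that the rest of the system stays responsive while the write hangs, and the stuck writer is simply left behind.

```shell
#!/bin/sh
# Hypothetical helper: try_cpu_offline CPU [TIMEOUT_SECONDS] [SYSFS_CPU_DIR]
# Returns 0 if the write finished successfully, 124 if it is still blocked
# after the timeout, and the writer's own non-zero status if the write failed.
try_cpu_offline() {
    cpu=$1
    timeout_s=${2:-5}
    cpu_dir=${3:-/sys/devices/system/cpu}
    online_file="$cpu_dir/cpu$cpu/online"

    done_file=$(mktemp) || return 1

    # Background writer: records its exit status once the write returns.
    ( echo 0 > "$online_file"; echo "$?" > "$done_file" ) 2>/dev/null &

    # Poll for the status file instead of waiting on the writer directly.
    waited=0
    while [ ! -s "$done_file" ]; do
        if [ "$waited" -ge "$timeout_s" ]; then
            echo "write to $online_file still blocked after ${timeout_s}s" >&2
            rm -f "$done_file"
            return 124
        fi
        sleep 1
        waited=$((waited + 1))
    done

    status=$(cat "$done_file")
    rm -f "$done_file"
    return "$status"
}
```

Run as root, `try_cpu_offline 1 10` would attempt to take cpu1 offline and report after 10 seconds if the write has not returned. Note that on the affected kernels reboot and poweroff also stopped working after the hang, so a hard reset may still be needed afterwards.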
Thanks for testing and for your report.

I suggest updating the bug summary to say something like this:

"disabling secondary CPU hangs with kernel 4.19+ on Lenovo ThinkPad X1 Carbon 5th"

These messages could be related. For comparison, could you attach a log for 4.18.18-300.fc29.x86_64?

$ grep -n 'CPU.*temp' dmesg_4.20.6-noWifi_noLTE-1.txt
693:Feb 05 19:28:08 kernel: CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
694:Feb 05 19:28:08 kernel: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
695:Feb 05 19:28:08 kernel: CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
696:Feb 05 19:28:08 kernel: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
697:Feb 05 19:28:08 kernel: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
698:Feb 05 19:28:08 kernel: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
701:Feb 05 19:28:08 kernel: CPU0: Core temperature/speed normal
702:Feb 05 19:28:08 kernel: CPU2: Core temperature/speed normal
703:Feb 05 19:28:08 kernel: CPU2: Package temperature/speed normal
704:Feb 05 19:28:08 kernel: CPU3: Package temperature/speed normal
705:Feb 05 19:28:08 kernel: CPU1: Package temperature/speed normal
706:Feb 05 19:28:08 kernel: CPU0: Package temperature/speed normal

Created attachment 1535625
dmesg from 4.18.18 with successful suspend

(In reply to Steve from comment #18)
> These messages could be related. For comparison, could you attach a log for
> 4.18.18-300.fc29.x86_64?

I've attached a log from 4.18.18 for reference. It also contains a successful suspend and resume at the end of the log.

I'm pretty sure those messages are unrelated as I've always been seeing them and they also appear with 4.18.18. The X1 is quite small and cooling seems to be a bit undersized which is why the cpus get throttled every now and then.
I have bisected the kernel and found the culprit (or at least something that triggers the bad behavior):

[be45bf5395e0886a93fc816bbe41a008ec2e42e2] watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug

be45bf5395e0886a93fc816bbe41a008ec2e42e2 is the first bad commit
commit be45bf5395e0886a93fc816bbe41a008ec2e42e2
Author: Peter Zijlstra <peterz>
Date: Fri Jul 13 12:42:08 2018 +0200

    watchdog/softlockup: Fix cpu_stop_queue_work() double-queue bug

    When scheduling is delayed for longer than the softlockup interrupt period it is possible to double-queue the cpu_stop_work, causing list corruption.

    Cure this by adding a completion to track the cpu_stop_work's progress.

    Reported-by: kernel test robot <lkp>
    Tested-by: Rong Chen <rong.a.chen>
    Signed-off-by: Peter Zijlstra (Intel) <peterz>
    Cc: Linus Torvalds <torvalds>
    Cc: Peter Zijlstra <peterz>
    Cc: Thomas Gleixner <tglx>
    Fixes: 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads with cpu_stop_work")
    Link: http://lkml.kernel.org/r/20180713104208.GW2494@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo>

:040000 040000 6aca2dbb84bc33fe442b18b3d0a135c27adff7b9 2710af12d32e4b98df07768716689b213bce45fc M kernel

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora XX has now been rebased to 5.0.6

Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Good news: starting with 5.0.6 suspend is working again.
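
For anyone landing here with the same machine, a quick way to see whether the running kernel is at or past the version reported as working is a version comparison with `sort -V`. This is a sketch only: kernel_at_least is a name invented for this example, it compares just the upstream part of the version string (before the first "-"), and the 5.0.6 threshold reflects one reporter's result in this bug, not a statement about which kernels carry a fix.

```shell
#!/bin/sh
# Hypothetical helper: kernel_at_least MIN [VERSION]
# Succeeds if VERSION (default: the running kernel from uname -r)
# is at or above MIN.
kernel_at_least() {
    min=$1
    ver=${2:-$(uname -r)}
    # Drop the distribution release suffix, e.g. "-200.fc29.x86_64".
    ver=${ver%%-*}
    # sort -V orders version strings; -C only checks whether the
    # two lines are already in sorted order (MIN first).
    printf '%s\n%s\n' "$min" "$ver" | sort -V -C
}
```

Usage: `kernel_at_least 5.0.6 && echo "at or past 5.0.6"`. Distribution kernels can carry backported patches, so an older version number does not necessarily mean the hang is present.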