Bug 1594675 - after suspend and resume the system cannot be powered off
Summary: after suspend and resume the system cannot be powered off
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 28
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-06-25 07:37 UTC by Peter Lesterhuis
Modified: 2019-02-21 21:11 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-02-21 21:11:06 UTC
Type: Bug
Embargoed:


Attachments:
mei error (6.48 MB, image/jpeg)
2018-07-20 20:26 UTC, Georg Müller

Description Peter Lesterhuis 2018-06-25 07:37:44 UTC
Description of problem:
Boot into kernel 4.17.2-200.fc28.x86_64 and suspend by closing the laptop lid; the system resumes normally. On a subsequent poweroff the screen goes black, but the system does not power down (the fan keeps running and the LED indicator on the power button stays on).

Version-Release number of selected component (if applicable):

kernel 4.17.2-200.fc28.x86_64

How reproducible:

It happens all the time.


Steps to Reproduce:
1. start fedora 28
2. suspend
3. resume
4. poweroff

Actual results: the system freezes and does not power off


Expected results: the system powers off


Additional info:
watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:3:583]
Modules linked in: fuse rfcomm ccm xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack devlink ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc vfat fat arc4 snd_soc_skl snd_soc_skl_ipc joydev snd_hda_ext_core snd_soc_sst_dsp snd_soc_sst_ipc snd_soc_acpi snd_hda_codec_hdmi intel_rapl snd_soc_core snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp hid_multitouch
 spi_pxa2xx_platform snd_compress kvm_intel snd_pcm_dmaengine iwlmvm iTCO_wdt ac97_bus iTCO_vendor_support kvm mac80211 irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore intel_rapl_perf iwlwifi btusb btrtl btbcm uvcvideo btintel videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 cfg80211 videobuf2_common asus_nb_wmi videodev asus_wmi snd_hda_intel snd_hda_codec sparse_keymap wmi_bmof media snd_hda_core ecdh_generic rfkill snd_hwdep snd_seq snd_seq_device snd_pcm int3403_thermal snd_timer mei_me snd int3400_thermal acpi_thermal_rel idma64 mei soundcore processor_thermal_device i2c_i801 intel_lpss_pci int340x_thermal_zone intel_lpss intel_pch_thermal intel_soc_dts_iosf shpchp asus_wireless acpi_pad binfmt_misc nouveau i915 ttm i2c_algo_bit drm_kms_helper
 mxm_wmi drm serio_raw i2c_hid crc32c_intel wmi video
CPU: 2 PID: 583 Comm: kworker/2:3 Tainted: G        W    L    4.17.2-200.fc28.x86_64 #1
Hardware name: ASUSTeK COMPUTER INC. X510UQR/X510UQR, BIOS X510UQR.301 09/25/2017
Workqueue: rcu_gp wait_rcu_exp_gp
RIP: 0010:smp_call_function_single+0x88/0xf0
RSP: 0018:ffffc27a81197de0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000001 RBX: ffffffffb525dd00 RCX: 0000000000000000
RDX: ffffffffb525dd00 RSI: ffffffffb411f1f0 RDI: 0000000000000006
RBP: ffffc27a81197e28 R08: ffff9db92eca1e00 R09: 0000000000000040
R10: 0000000000000000 R11: 0000000000000000 R12: ffff9db92eda1b80
R13: 0000000000000040 R14: 0000000000000006 R15: 0000000000000040
FS:  0000000000000000(0000) GS:ffff9db92ec80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fda7e9aec90 CR3: 000000023220a005 CR4: 00000000003606e0
Call Trace:

Comment 1 Peter Lesterhuis 2018-06-25 16:25:48 UTC
I forgot to mention that powering off without suspending first works properly.

Comment 2 Georg Müller 2018-06-26 07:21:51 UTC
I have the same problem. A second suspend also causes the issue.

Switching back to 4.16.16-300 fixes it for now.

I would raise the severity of this issue. Not thinking about it I put my laptop into my backpack and went home, only to find it really hot some hours later.

Comment 3 Georg Müller 2018-06-26 07:49:15 UTC
As a side note: I do not get any watchdog output in my logs.

I get several ACPI errors, but I get these with both 4.16 and 4.17.

Comment 4 Georg Müller 2018-06-26 09:21:58 UTC
I tried some of the techniques from https://www.kernel.org/doc/Documentation/power/basic-pm-debugging.txt

echo devices > /sys/power/pm_test
echo mem > /sys/power/state

This works for the first try, but not for the next one. Sometimes the screen stays enabled, but is frozen.

I will do some more testing later.

My system is a Dell Latitude E7440, intel graphics
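
A minimal sketch of a guard for the pm_test commands above, assuming the kernel lists the available test levels in /sys/power/pm_test with the active one in square brackets. The helper name pm_test_supported is hypothetical and not part of any existing tool.

```shell
# pm_test_supported LEVEL [FILE]
# Succeeds if LEVEL (e.g. "devices") is listed in the pm_test file.
# FILE defaults to /sys/power/pm_test; the optional argument exists
# only so the function can be tried against a sample file.
pm_test_supported() {
    level=$1
    file=${2:-/sys/power/pm_test}
    [ -r "$file" ] || return 1
    # The active level is shown in brackets, e.g. "[none]"; strip them.
    for mode in $(tr -d '[]' < "$file"); do
        [ "$mode" = "$level" ] && return 0
    done
    return 1
}
```

As root, it could be used as `pm_test_supported devices && echo devices > /sys/power/pm_test` before writing `mem` to /sys/power/state.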

Comment 5 za267 2018-06-27 17:44:29 UTC
I have a very similar issue on a Dell XPS 9570 - the symptoms are exactly the same.

Suspend stopped working following the upgrade, as described by the OP.  However:  I am running the third party NVIDIA drivers for the NVIDIA GeForce GTX 1050 included in the laptop.  The drivers were installed from the gnome software center using the third party repo rpmfusion.

As a troubleshooting step, I did a fresh install of Fedora 28 on the laptop.  
From this point on and using the base Intel video card, suspend was working well again.

I upgraded all the packages, including the kernel, but without installing the NVIDIA drivers, suspend still worked fine on kernel 4.17.2-200.

I then proceeded to install the NVIDIA drivers as described above, and suspend stopped working.

Just to be clear, this was working fine with the NVIDIA drivers on the previous kernel version: 4.16.3-301.

Even now, when I boot back into 4.16.3-301, the NVIDIA drivers are enabled and suspend works fine.

Comment 6 Georg Müller 2018-07-06 11:58:59 UTC
4.17.4-200 does not solve it.

Comment 7 Georg Müller 2018-07-20 20:20:21 UTC
I got some more info on this.

1. Boot kernel 4.17.7-200.fc28.x86_64
2. switch to VT2 and log in
3. suspend by closing lid
4. wake up by opening lid
-> messages are printed on console:

[  406.976821] mei_wdt mei::05b79a6f-4628-4d7f-899d-a91514cb32ab:01: get hw module failed
[  406.976822] mei_wdt mei::05b79a6f-4628-4d7f-899d-a91514cb32ab:01: Could not enable cl device

5. lsmod | grep mei
pn544_mei              16384  0
mei_phy                16384  1 pn544_mei
pn544                  20480  1 pn544_mei
hci                    53248  2 mei_phy,pn544
mei_wdt                16384  0
mei_me                 45056  -1
mei                   110592  5 mei_wdt,mei_phy,mei_me,pn544_mei

I tried to unload all these modules

When unloading pn544, I get the following error:

[  200.015393] Removing pn544
[  200.015761] WARNING: CPU: 2 PID: 2424 at kernel/module.c:1142 module_put+0x80/0x90
[  200.015764] Modules linked in: ccm xt_CHECKSUM tun ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype br_netfilter qcserial usb_wwan devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables overlay bnep vfat fat arc4 pn544_mei(-) mei_phy pn544 hci nfc iTCO_wdt iTCO_vendor_support mei_wdt intel_rapl dell_wmi wmi_bmof sparse_keymap x86_pkg_temp_thermal intel_powerclamp ppdev coretemp dell_laptop kvm_intel dell_smbios iwlmvm dell_wmi_descriptor dcdbas dell_smm_hwmon
[  200.015883]  kvm mac80211 irqbypass intel_cstate intel_uncore intel_rapl_perf btusb iwlwifi uvcvideo btrtl snd_hda_codec_realtek btbcm btintel videobuf2_vmalloc snd_hda_codec_hdmi videobuf2_memops videobuf2_v4l2 snd_hda_codec_generic videobuf2_common bluetooth cfg80211 snd_hda_intel joydev videodev snd_hda_codec i2c_i801 snd_hda_core ecdh_generic snd_hwdep snd_seq media snd_seq_device shpchp lpc_ich snd_pcm mei_me snd_timer mei snd soundcore wmi parport_pc parport dell_smo8800 dell_rbtn rfkill binfmt_misc dm_crypt cdc_mbim cdc_wdm cdc_ncm usbnet mii i915 uas usb_storage crct10dif_pclmul crc32_pclmul crc32c_intel i2c_algo_bit ghash_clmulni_intel drm_kms_helper sdhci_pci cqhci drm sdhci e1000e serio_raw mmc_core video i2c_dev
[  200.016017] CPU: 2 PID: 2424 Comm: rmmod Not tainted 4.17.7-200.fc28.x86_64 #1
[  200.016020] Hardware name: Dell Inc. Latitude E7440/07F3F4, BIOS A25 02/01/2018
[  200.016028] RIP: 0010:module_put+0x80/0x90
[  200.016032] RSP: 0018:ffffb198089d3df0 EFLAGS: 00010297
[  200.016037] RAX: ffffffffc07ae850 RBX: ffff8e5d7b734800 RCX: 00000000ffffffff
[  200.016040] RDX: 0000000000000000 RSI: ffffeff1d016af00 RDI: ffffffffc07aefc0
[  200.016044] RBP: ffff8e5d94993e00 R08: ffff8e5dc5abc360 R09: 00000001802a0028
[  200.016047] R10: ffffeff1d016af00 R11: 00000000ffffff00 R12: ffff8e5dc72c1900
[  200.016050] R13: ffff8e5dc72c1828 R14: 0000000000000000 R15: 0000000000000000
[  200.016056] FS:  00007f64f0f120c0(0000) GS:ffff8e5ddeb00000(0000) knlGS:0000000000000000
[  200.016060] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  200.016070] CR2: 0000559b97861e08 CR3: 0000000409fc0004 CR4: 00000000001606e0
[  200.016073] Call Trace:
[  200.016099]  mei_cldev_disable+0x5d/0xd0 [mei]
[  200.016111]  nfc_mei_phy_free+0x11/0x20 [mei_phy]
[  200.016119]  pn544_mei_remove+0x2b/0x2f [pn544_mei]
[  200.016133]  mei_cl_device_remove+0x37/0x70 [mei]
[  200.016146]  device_release_driver_internal+0x15a/0x220
[  200.016154]  driver_detach+0x32/0x5f
[  200.016162]  bus_remove_driver+0x74/0xc6
[  200.016178]  mei_cldev_driver_unregister+0xe/0x30 [mei]
[  200.016186]  __x64_sys_delete_module+0x139/0x270
[  200.016197]  do_syscall_64+0x5b/0x160
[  200.016208]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  200.016220] RIP: 0033:0x7f64f04199e7
[  200.016223] RSP: 002b:00007ffc43ffcc38 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  200.016229] RAX: ffffffffffffffda RBX: 0000559b97857860 RCX: 00007f64f04199e7
[  200.016232] RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559b978578c8
[  200.016236] RBP: 0000000000000000 R08: 00007ffc43ffbbb1 R09: 0000000000000000
[  200.016239] R10: 00007f64f0489f00 R11: 0000000000000206 R12: 00007ffc43ffce60
[  200.016242] R13: 00007ffc43ffdcdc R14: 0000559b97857260 R15: 0000559b97857860
[  200.016246] Code: 74 23 48 8b 45 00 48 89 fb 48 8b 7d 08 48 83 c5 18 4c 89 e2 48 89 de e8 bf 1f ac 00 48 8b 45 00 48 85 c0 75 e4 5b 5d 41 5c c3 c3 <0f> 0b eb a5 89 c2 eb 8c 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 
[  200.016354] ---[ end trace 17ee821c1ea0ca46 ]---


When I unload all the modules (including mei_me) BEFORE doing the suspend, I get no error and the subsequent poweroff works.
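
A rough sketch of how the unload order could be derived from the lsmod output shown above, assuming the usual three- or four-column lsmod format (name, size, use count, users). The function name unload_order is made up for illustration. It only follows the "used by" column, so modules such as pn544 and hci, which do not use mei, are not listed.

```shell
# unload_order MODULE
# Reads `lsmod` output on stdin and prints MODULE plus every loaded
# module that uses it (directly or indirectly), dependents first, so
# the list can be unloaded from top to bottom.
unload_order() {
    awk -v root="$1" '
        # Remember each loaded module and its comma-separated users.
        $1 != "Module" { users[$1] = $4; loaded[$1] = 1 }

        # Depth-first walk: print all users of m before m itself.
        function visit(m,    n, i, list) {
            if (!(m in loaded) || (m in seen)) return
            seen[m] = 1
            n = split(users[m], list, ",")
            for (i = 1; i <= n; i++) visit(list[i])
            print m
        }

        END { visit(root) }
    '
}
```

Before suspending, something like `lsmod | unload_order mei | while read -r m; do rmmod "$m"; done` (as root) would then try to remove the modules in that order.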

Comment 8 Georg Müller 2018-07-20 20:26:53 UTC
Created attachment 1468511 [details]
mei error

Steps to reproduce this:

1. Boot
2. Switch to VT2 and login
3. suspend once by closing lid
4. opening lid
5. execute "poweroff"

Comment 9 Georg Müller 2018-07-20 21:03:25 UTC
Blacklisting mei_me via /etc/modprobe.d solves the hang on the second suspend or on poweroff.

But with mei_me disabled, I run into other issues after resume, such as the xhci driver not starting:

[   72.614840] xhci_hcd 0000:00:14.0: xHCI host controller not responding, assume dead
[   72.619297] xhci_hcd 0000:00:14.0: HC died; cleaning up

Comment 10 Georg Müller 2018-08-02 07:44:58 UTC
With kernel 4.17.11-200.fc28.x86_64, the oops is gone (and also the USB problems described in my last comment disappeared), but without blacklisting mei_me, the system still hangs on second suspend.

So I will continue blacklisting the mei_me...

Comment 11 Marc Cortinas Val 2018-08-13 22:46:19 UTC
4.17.12-200 does not solve it. I also had to switch back to 4.16.16-301 to fix it.

Comment 12 Georg Müller 2018-08-14 06:28:47 UTC
Does blacklisting mei_me solve it for you?

cat > /etc/modprobe.d/blacklist-mei.conf << EOF
blacklist mei_me
EOF
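
To double-check that the entry is in place, here is a small sketch that looks for a matching blacklist line. It assumes the configuration lives in *.conf files under /etc/modprobe.d; the function name is_blacklisted is hypothetical, and the match is literal (it does not treat "-" and "_" as equivalent the way modprobe does).

```shell
# is_blacklisted MODULE [DIR]
# Succeeds if some *.conf file in DIR contains "blacklist MODULE".
# DIR defaults to /etc/modprobe.d.
is_blacklisted() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    # -q: no output, -s: stay silent if no .conf file exists.
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*$" "$dir"/*.conf
}
```

After a reboot, `is_blacklisted mei_me && ! lsmod | grep -q '^mei_me '` should then succeed if the module really stayed unloaded. Note that a blacklist entry only stops automatic loading via aliases; an explicit modprobe can still load the module.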

Comment 13 Marc Cortinas Val 2018-08-20 14:38:18 UTC
4.17.14-202 does not solve it.
Blacklisting the mei_me module fixes the issue for me, thanks @Georg Müller!

Comment 14 Georg Müller 2018-08-20 14:56:12 UTC
Ok. Good to hear. The question is whether it is mei itself or one of the modules depending on it.

With mei_me blacklisted, I also lose the pn544 NFC driver, which I currently do not use.

There were only three commits in the mei subdirectory in the 4.17 development cycle:

$ git shortlog v4.16..v4.17 -- drivers/misc/mei/
Alexander Usyskin (1):
      mei: limit the number of queued writes

Colin Ian King (1):
      mei: remove dev_err message on an unsupported ioctl

Tomas Winkler (1):
      mei: make module referencing local to the bus.c

One of them was just a removed log message, the other two are a bit bigger.

I will try to revert both of them in a local build and see what happens. If this fixes it, I will try to undo only one of them.

My guess would be the commit "mei: limit the number of queued writes"

The commit message states:
"Limit the number of queued writes per client.
Writes above this threshold are blocked till place
in the transmit queue is available."

Maybe it blocks because the transmitting is already suspended. But hey, just a guess. I will check.

Comment 15 Nicolas Trangez 2018-08-21 10:32:23 UTC
Related: https://bugzilla.redhat.com/show_bug.cgi?id=1597481

Comment 16 Georg Müller 2018-08-22 07:22:30 UTC
I think this is not just related, this looks like a duplicate.

As mentioned in bug 1597481, reverting commit 257355a44b9929e55d6fd47bfff66971dc4de948 (mei: make module referencing local to the bus.c) solved it for me.

Comment 17 za267 2018-08-29 18:19:11 UTC
Just a question: I didn't get a chance to try blacklisting mei, but was wondering if the revert of the above commit will be included in a future kernel build?  I assume so...?

Thanks!

Comment 18 Georg Müller 2018-08-29 21:57:03 UTC
There are already patches sent to lkml and stable, so they are hopefully included in 4.17.20.

Please see bug 1597481.

Comment 19 za267 2018-08-30 17:58:46 UTC
Thanks Georg, I am not yet familiar with the method to determine which version of the kernel a patch is included in (I'm reading up on it).  Thanks again for the version number - and for troubleshooting this issue.
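
One common way to answer this is to ask git which tag first contains a commit. The sketch below assumes a local clone of the kernel tree and a commit hash from that same tree; the function name tag_containing is invented for this example.

```shell
# tag_containing COMMIT [REPO]
# Prints the tag that `git describe --contains` reports for COMMIT,
# with the trailing "~N" / "^N" position suffix removed.
# REPO defaults to the current directory.
tag_containing() {
    commit=$1
    repo=${2:-.}
    git -C "$repo" describe --contains "$commit" 2>/dev/null |
        sed 's/[~^].*$//'
}
```

For example, `tag_containing 257355a44b9929e55d6fd47bfff66971dc4de948 ~/src/linux` (the path is only a placeholder) would report the tag that contains the commit named in comment 16. Backports to stable kernels get new hashes, so in a stable tree the patch has to be looked up by its subject line instead.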

Comment 20 Georg Müller 2018-09-10 19:07:37 UTC
The patches are now 2 weeks old and still nobody merged them into the mainline kernel or stable.

Can the patches please be included in the next 4.18.x release of fedora?

https://bugzilla.kernel.org/attachment.cgi?id=278055
https://bugzilla.kernel.org/attachment.cgi?id=278057

Comment 21 Georg Müller 2018-09-26 12:32:42 UTC
Kernel 4.18.10 contains the necessary patches which solve the issue
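
A quick sketch for checking whether a machine already runs a kernel at or above that version. It assumes GNU `sort -V` is available and that the part of `uname -r` before the first "-" is the upstream version; the name kernel_at_least is hypothetical.

```shell
# kernel_at_least WANT [HAVE]
# Succeeds if kernel version HAVE is WANT or newer.
# HAVE defaults to the running kernel (`uname -r`).
kernel_at_least() {
    want=$1
    have=${2:-$(uname -r)}
    # Drop the distribution release part, e.g. "-200.fc28.x86_64".
    have=${have%%-*}
    # With a version sort, WANT comes first when it is <= HAVE.
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}
```

Usage: `kernel_at_least 4.18.10 && echo "fix should be included"`. This only compares version numbers; it does not verify that a given distribution build actually carries the patches.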

Comment 22 za267 2018-09-26 14:45:13 UTC
Thanks for the update!

Comment 23 Justin M. Forbes 2019-01-29 16:26:14 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.20.5-100.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.

If you experience different issues, please open a new bug report for those.

Comment 24 Justin M. Forbes 2019-02-21 21:11:06 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

