Bug 1294415 - Removing the nf_conntrack module hangs at shutdown using 100% CPU
Status: NEW
Product: Fedora
Classification: Fedora
Component: kernel
Version: 27
Hardware: x86_64 Linux
Priority: unspecified
Severity: unspecified
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Keywords: Reopened
Reported: 2015-12-28 01:57 EST by Daniel Miranda
Modified: 2018-05-11 01:14 EDT
CC List: 42 users
Doc Type: Bug Fix
Last Closed: 2017-08-08 08:35:35 EDT
Type: Bug
Attachments
Outputs of lsmod and lshw (21.77 KB, text/plain)
2015-12-28 01:57 EST, Daniel Miranda
conntrack errors in dmesg (26.32 KB, text/plain)
2015-12-28 02:16 EST, Daniel Miranda
ftrace module nf_conntrack (1.11 MB, application/x-gzip)
2017-02-15 14:21 EST, fulminemizzega
100% cpu on rmmod nf_conntrack (636.83 KB, image/png)
2017-06-16 05:51 EDT, sandeep

Description Daniel Miranda 2015-12-28 01:57:48 EST
Created attachment 1109959
Outputs of lsmod and lshw

Description of problem:

firewalld's cleanup code, as enabled in the default Fedora 23 configuration, does `rmmod nf_conntrack`. When done on shutdown, the process hangs indefinitely using 100% CPU and holding up clean unmounting of filesystems. I can't reproduce it reliably, but it seems uptime is correlated somehow: shutting down after a longer time has a larger probability of failure.


Version-Release number of selected component (if applicable):

Fedora 23 x86_64
kernel-4.4.0-0.rc6.git1.2.fc24.x86_64 from the rawhide-kernel-nodebug repository
firewalld 0.3.14.2 (I don't think it matters though: rmmod is the command that hangs)


How reproducible:

Intermittently


Steps to Reproduce:

It is hard to know whether this is configuration-dependent, but for me it happens when shutting down the system with firewalld's cleanup enabled. Stopping firewalld manually or calling rmmod manually outside of shutdown does not hang. Disabling the cleanup also avoids the hang, with seemingly no ill effects.


Actual results:

rmmod hangs and holds up shutdown. Setting up a systemd debug console let me see that it hangs using 100% of a single core. I tried attaching GDB to the process but it does not succeed: GDB hangs before I get a prompt.


Expected results:

rmmod does not hang


Additional info:

I tried to reproduce with kernel-4.2.6-301.fc23.x86_64 to no avail. Unfortunately it also changes the hardware setup: the Wi-fi chipset (Broadcom BCM4350) only works in kernel 4.4. It is possible it is involved somehow, but I couldn't tell.

There are no messages of any interest in dmesg, other than the ebtables unload message indicating that the rmmod did not hang (as it is unloaded right afterwards).

I tried installing the conntrack-tools package, but I cannot run the `conntrack` command in the debug console: it fails with a bunch of messages about unknown symbols related to netlink.

I attached the output of lsmod with firewalld running, and a hardware description from lshw. I'm using a Dell XPS 13 laptop (Late 2015 edition).
Comment 1 Daniel Miranda 2015-12-28 02:16:22 EST
Just had firewalld's restart hang without shutting down. As before, I cannot attach GDB to the process. Running the `conntrack` utility hangs without output (although there seem to be some errors produced in the kernel log). Fortunately I can get a bit of information from /proc.

$ cat /proc/5701/cmdline
/sbin/rmmodnf_conntrack

$ cat /proc/5701/stack
[<ffffffffa0890478>] nf_conntrack_cleanup_net_list+0x48/0x110 [nf_conntrack]
[<ffffffffa0890bb0>] nf_conntrack_pernet_exit+0x70/0x80 [nf_conntrack]
[<ffffffff81677c82>] ops_exit_list.isra.4+0x52/0x60
[<ffffffff81678198>] unregister_pernet_operations+0x78/0xd0
[<ffffffff81678211>] unregister_pernet_subsys+0x21/0x30
[<ffffffffa0897fac>] nf_conntrack_standalone_fini+0x15/0x69 [nf_conntrack]
[<ffffffff811266e5>] SyS_delete_module+0x1b5/0x210
[<ffffffff8179932e>] entry_SYSCALL_64_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff
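
The cmdline output above reads as one word only because the kernel separates the arguments in /proc/<pid>/cmdline with NUL bytes, which cat prints without any visible separator. A minimal sketch of a more readable way to print it (show_cmdline is a made-up helper name, not an existing tool):

```shell
# Print the arguments stored in a /proc/<pid>/cmdline-style file,
# separated by spaces instead of NUL bytes.
# show_cmdline is a hypothetical helper name used for illustration.
show_cmdline() {
    # tr turns each NUL terminator into a newline; paste joins the lines.
    tr '\000' '\n' < "$1" | paste -sd ' ' -
}

# Example (PID taken from the report above):
#   show_cmdline /proc/5701/cmdline
```

Taking the file path as an argument keeps the helper usable on a saved copy of the file as well as on the live /proc entry.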
Comment 2 Daniel Miranda 2015-12-28 02:16 EST
Created attachment 1109963
conntrack errors in dmesg
Comment 3 Gamaliel 2016-02-08 09:08:42 EST
I am able to reproduce this bug in my Dell XPS 13 Windows Edition (Broadcom 4350 Wireless). It just continues to say "a stop job is running for firewalld", and keeps adding up hang time from 90 seconds upwards.
Comment 4 Ward 2016-02-22 03:47:07 EST
I'm also seeing this bug on a MacBook Pro 13" (2015) using a Broadcom BCM43602.

It is not consistently reproducible but long uptimes seem to trigger it.
Comment 5 Luca Giuzzi 2016-03-02 03:48:45 EST
I have the same problem on a Dell XPS 15 where the wireless controller is a broadcom BCM43602. I have not altered the firewall configuration from stock (apart from opening port 22).
Comment 6 Jason Birch 2016-04-24 12:16:08 EDT
I've noticed this twice under Fedora 24 Alpha, on a Dell XPS 13 (9350) with a Broadcom BCM4350 rev08, during normal day-to-day usage. I suspect the `rmmod` is being triggered as part of upgrading packages (the last `dnf upgrade` picked up firewalld-0.4.1-1.fc24.noarch). The impact is that, rather than my laptop being unable to shut down, an upgrade of firewalld leaves a core pegged at 100% and chews through the battery.

I can't really do much with the process. It's unkillable, _trace/gdb hang when attaching to it, and the stack is empty (I don't know whether that is useful, but someone else has given those details so I thought I would as well):

# cat /proc/25939/stack 
[<ffffffffffffffff>] 0xffffffffffffffff

`dmesg` and `journalctl` are absolutely littered with unknown symbols for nf_nat and xt_conntrack as in comment #2.

# uname -a
Linux sleepygary 4.5.1-300.fc24.x86_64 #1 SMP Tue Apr 12 18:55:06 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Comment 7 Matthew Garrett 2016-05-13 13:47:42 EDT
Is anybody seeing this *not* using the brcmfmac driver?
Comment 8 Ward 2016-05-18 15:14:47 EDT
All the people I know suffering from this problem are using brcmfmac. So yes, it's probably an issue with this driver.

Currently (on my up-to-date F23 system), it happens in 99% of cases if the uptime is long enough (10 minutes is enough).
Comment 9 dylanxmackenzie 2016-05-21 17:25:58 EDT
Also experiencing this problem on a Dell XPS 15 with the brcmfmac driver on both Fedora 23 and Fedora 24 Beta (Kernel 4.5.4-300.fc24.x86_64). Since I never stop firewalld except on shutdown and my laptop isn't exactly mission critical, I removed the two lines which call unload_firewall_modules() from /usr/lib/python3.5/site-packages/firewall/core/fw.py, which prevents firewalld from calling rmmod when it is stopped or restarted. According to https://bugzilla.redhat.com/show_bug.cgi?id=1031102#c22, firewalld removes these modules to provide a small performance boost. This fixes the problem, but is obviously not ideal.
Comment 10 Daniel Miranda 2016-05-21 19:15:50 EDT
dylanxmackenzie: it's not strictly necessary to modify firewalld's code, there is a CleanupOnExit option in /etc/firewalld/firewalld.conf that you can set to `no`. It also has the side effect of not resetting the iptables rules when the daemon exits, but if you're not usually stopping it manually it shouldn't make a difference.
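
A minimal sketch of applying that setting from a shell, assuming the file either already carries a `CleanupOnExit=` line or can take one appended at the end (disable_cleanup_on_exit is a hypothetical helper name; run it as root against /etc/firewalld/firewalld.conf):

```shell
# Set CleanupOnExit=no in a firewalld.conf-style file.
# disable_cleanup_on_exit is a hypothetical helper name used for illustration.
disable_cleanup_on_exit() {
    conf=$1
    if grep -q '^CleanupOnExit=' "$conf"; then
        # Rewrite the existing setting, then copy the result back over the
        # original so the file keeps its inode, permissions and labels.
        sed 's/^CleanupOnExit=.*/CleanupOnExit=no/' "$conf" > "$conf.tmp" &&
            cat "$conf.tmp" > "$conf" &&
            rm -f "$conf.tmp"
    else
        # No such line yet: append the setting.
        printf 'CleanupOnExit=no\n' >> "$conf"
    fi
}

# Example (as root):
#   disable_cleanup_on_exit /etc/firewalld/firewalld.conf
```

Editing the file by hand works just as well; the helper only makes the change repeatable across several machines.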
Comment 11 dylanxmackenzie 2016-05-21 19:28:35 EDT
(In reply to Daniel Miranda from comment #10)
> dylanxmackenzie: it's not strictly necessary to modify firewalld's code,
> there is a CleanupOnExit option in /etc/firewalld/firewalld.conf that you
> can set to `no`. It also has the side effect of not resetting the iptables
> rules when the daemon exits, but if you're not usually stopping it manually
> it shouldn't make a difference.

Whoops, that makes things much simpler. Thanks.
Comment 12 dylanxmackenzie 2016-05-21 19:29:16 EDT
(In reply to Daniel Miranda from comment #10)
> dylanxmackenzie: it's not strictly necessary to modify firewalld's code,
> there is a CleanupOnExit option in /etc/firewalld/firewalld.conf that you
> can set to `no`. It also has the side effect of not resetting the iptables
> rules when the daemon exits, but if you're not usually stopping it manually
> it shouldn't make a difference.

Whoops, that makes things much simpler. Thanks.
Comment 13 Jason Birch 2016-07-05 23:14:50 EDT
Still noticing this one under 4.6.3-300.fc24.x86_64 for kernel/net/netfilter/nf_conntrack.ko.xz. Is there any additional information that I can provide, or anything I can try, to help diagnose the cause?
Comment 14 d.engelbarts 2016-09-21 18:43:42 EDT
I actually see rmmod nf_conntrack eating 100% CPU even before rebooting, on a fresh install on a MBP 15" (recent model). I installed a bunch of stuff and did an update; dnf started to slow down and "top" showed rmmod hogging one core. dnf timed out after a long wait. I then reran the dnf command, and on this second run it performed as expected, although rmmod was still hogging the core. I then rebooted, which took a long time but did eventually proceed.
Comment 15 Laura Abbott 2016-09-23 15:09:59 EDT
*********** MASS BUG UPDATE **************
 
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.
 
Fedora 24 has now been rebased to 4.7.4-200.fc24.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 25, and are still experiencing this issue, please change the version to Fedora 25.
 
If you experience different issues, please open a new bug report for those.
Comment 16 OlliC 2016-10-04 09:32:49 EDT
Bug still appears on my Thinkpad L460 with kernel 4.7.5-200.fc24.
Comment 17 OlliC 2016-10-09 13:49:29 EDT
Also still happens with 4.7.6-200.fc24.
Comment 18 Dan Siemon 2016-10-16 22:12:24 EDT
I still see this with Fedora 25 beta and kernel 4.8.1-1.fc25.x86_64.
Comment 19 fulminemizzega 2016-11-20 14:04:28 EST
I still see this with F25 RC1.2 and F24 on a Dell XPS 15 9550
(same wifi/bt chip as others have already reported:
https://wikidevi.com/wiki/Dell_Wireless_1830_(DW1830) )
Comment 20 Ben Konrath 2016-12-30 07:05:29 EST
I'm seeing this on a MacBook Pro 13" (2015) in CentOS now that the BCM43602 is supported in 7.3. It wasn't an issue for CentOS 7.2 because the wifi card wasn't supported in that version. I realize this bug is filed against Fedora, not RHEL or CentOS, but I wanted to add this note in case it's helpful for diagnosing and solving the problem.
Comment 21 Trevor Flynn 2017-01-04 19:46:08 EST
Hello, I also have this issue on a Dell XPS 9550 with a Broadcom Limited BCM43602. I am running Fedora 25 with kernel 4.8.15-300.fc25.x86_64.
Comment 22 Maël Lavault 2017-01-12 04:15:00 EST
Same issue here Macbook pro mid 2015 with BCM43602. Haven't been able to do a clean shutdown since i installed fedora (which i use every day).
Comment 23 Enrico Tagliavini 2017-01-12 04:24:31 EST
(In reply to Maël Lavault from comment #22)
> Same issue here Macbook pro mid 2015 with BCM43602. Haven't been able to do
> a clean shutdown since i installed fedora (which i use every day).

As mentioned in comment #10 you can set CleanupOnExit=no in firewalld.conf as a temporary workaround.

Btw, is there an upstream bug for this? In the end this is a kernel bug: removing a module should not hang.
Comment 24 Thomas Cameron 2017-01-17 11:15:39 EST
I'm pretty sure bugs 1294415, 1293041, and 1397274 are all related. I'm having the same issue on my RHEL7 installations (six of them).
Comment 25 fulminemizzega 2017-02-15 14:21 EST
Created attachment 1250707
ftrace module nf_conntrack

Hello, I've run some more tests, now with F25 + kernel 4.9.9-200.fc25.x86_64 on an XPS 9550 (it's using brcmfmac). I've found that if I stop and start firewalld just after boot, everything works fine; if I stop it after some time, I can reproduce the issue. I've tried to dig deeper, but I don't really know what I'm doing. This is what I've found; I hope it can help:
firewalld running -> nf_conntrack has used by = 10 (from lsmod)
firewalld stop works -> lsmod | grep nf_ is empty
firewalld failing to stop -> nf_conntrack has used by = -1 (from lsmod)
used by = -1 means that the module is unloading (function module_refcount from kernel/module.c, there's a comment above); the output of lsmod | grep nf_ in this case is:
nf_reject_ipv6         16384  1 ip6t_REJECT
nf_defrag_ipv6         36864  0
nf_defrag_ipv4         16384  0
nf_conntrack          106496  -1
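
The "used by = -1" condition described above can be checked mechanically. A minimal sketch that reads lsmod-style output on stdin and reports whether a given module shows that count (module_is_unloading is a hypothetical helper name, and the three-column layout is assumed to match the listing above):

```shell
# Succeed if the named module appears in lsmod-style input (read from
# stdin) with a "Used by" count of -1, i.e. it is in the middle of unloading.
# module_is_unloading is a hypothetical helper name used for illustration.
module_is_unloading() {
    awk -v mod="$1" '$1 == mod && $3 == -1 { found = 1 }
                     END { exit !found }'
}

# Example:
#   lsmod | module_is_unloading nf_conntrack && echo "nf_conntrack is stuck unloading"
```

Reading from stdin rather than calling lsmod directly means the same check can be run against a saved listing attached to a report.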

I booted F25 in rescue mode (adding 1 to the kernel cmdline), then ran
 trace-cmd start -e module -f 'name == nf_conntrack'
then resumed boot and ran systemctl start/stop firewalld a few times. When it failed, I stopped ftrace and saved the report, which I've attached here.
Is this bug also filed upstream?
Comment 26 Trevor Flynn 2017-02-15 15:04:56 EST
There is a bug filed for this for F25 here: bug 1397274
Comment 27 Justin M. Forbes 2017-04-11 10:40:56 EDT
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.

Fedora 25 has now been rebased to 4.10.9-100.fc24.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.

If you experience different issues, please open a new bug report for those.
Comment 28 OlliC 2017-04-14 14:13:23 EDT
Still happens on Fedora 25 with 4.10.8-200.fc25.x86_64.
Comment 29 Cajus Pollmeier 2017-04-24 03:25:44 EDT
Same on 4.10.10-200.fc25.x86_64. Additionally, rmmod eats up 99% of my CPU time and I have to reboot.
Comment 30 Jeroen Tietema 2017-04-29 03:50:23 EDT
Still there on 4.10.12-200.fc25.x86_64

I also tested on the Fedora 26 Alpha live usb, and it is present there as well.

I am running on a MacbookPro Mid 2015 15".
Comment 31 sandeep 2017-06-16 05:51 EDT
Created attachment 1288289
100% cpu on rmmod nf_conntrack

I have a similar issue on Fedora 25

LSB Version:	:core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID:	Fedora
Description:	Fedora release 25 (Twenty Five)
Release:	25
Codename:	TwentyFive

Linux menzoberranzan 4.11.4-200.fc25.x86_64 #1 SMP Wed Jun 7 18:28:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

My machine sometimes hangs with 100% CPU and overheating. It is usually an rmmod nf_conntrack. Screenshot attached.
Comment 32 Paolo Abeni 2017-06-21 06:29:00 EDT
Hi,

this should be fixed in upstream kernel v4.12-rc1 and later - so the rawhide kernel should be good. Can you please have a run at it?

Thanks,

Paolo
Comment 33 Jan Schmidt 2017-06-26 08:13:12 EDT
Just encountered it on 4.12.0-0.rc2.git2.2.fc27.x86_64
Comment 34 Phea Duch 2017-07-20 06:31:06 EDT
Still encountering this in 4.13.0-0.rc1.git0.1.fc27. 

I am another Dell XPS 9350 user with the Broadcom chip and have been dealing with this issue for close to a year on Rawhide. The machine does eventually power off after the firewalld shutdown process times out in ~2 minutes.
Comment 35 Paolo Abeni 2017-07-20 07:03:08 EDT
(In reply to Phea Duch from comment #34)
> Still encountering this in 4.13.0-0.rc1.git0.1.fc27. 

This is quite unexpected: 4.13.0-0.rc1.git0.1.fc27 definitely does not contain the code path triggered by the issue, as described e.g. in comment#1.

Can you please provide the output of:

cat /proc/`pidof rmmod`/stack

?
(to be run when the issue occurs, as root)

Thanks,

Paolo
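
A single read of that file can come back with nothing but the end marker, as in comment 6, presumably because the task is busy on a CPU at that instant (an assumption, not something confirmed in this report). A minimal sketch that filters out the marker so that repeated samples are easier to compare; stack_frames is a made-up helper name:

```shell
# Print only the real frames from a /proc/<pid>/stack-style file,
# dropping the 0xffffffffffffffff end marker line.
# stack_frames is a hypothetical helper name used for illustration.
stack_frames() {
    grep -v '0xffffffffffffffff$' "$1"
}

# Example (as root, while rmmod is spinning): take a few samples a second apart.
#   for i in 1 2 3 4 5; do
#       stack_frames "/proc/$(pidof rmmod)/stack"
#       sleep 1
#   done
```

An empty result from every sample would match what comment 6 reported; any frames that do show up can be pasted into the bug as they are.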
Comment 36 Fedora End Of Life 2017-07-25 15:41:15 EDT
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '24'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Comment 37 Fedora End Of Life 2017-08-08 08:35:35 EDT
Fedora 24 changed to end-of-life (EOL) status on 2017-08-08. Fedora 24 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
Comment 38 Uwe Köcher 2017-08-11 13:10:15 EDT
(In reply to Paolo Abeni from comment #32)
> Hi,
> 
> this should be fixed in upstream kernel v4.12-rc1 and later - so the rawhide
> kernel should be good. Can you please have a run at it?
> 
> Thanks,
> 
> Paolo

Hej Paolo,

I have an XPS 13 (9359 / late 2015 with Broadcom Wifi). The system had the issue on F24 and now also has it on a freshly installed F26. For testing purposes, I've installed
* kernel-4.12.5-300.fc26.x86_64
* kernel-core-4.12.5-300.fc26.x86_64 
* kernel-modules-4.12.5-300.fc26.x86_64
from koji. But this does not resolve the issue of firewalld failing to stop on reboot or shutdown.

Only the workaround, i.e. setting CleanupOnExit=no in /etc/firewalld/firewalld.conf resolves the issue.

Maybe it is not a kernel issue; I think it is instead something involving firewalld and the Broadcom wifi.

Note: on the current XPS 13 (late 2016 with Killer Wifi) the issue is not present.
Comment 39 OlliC 2017-12-21 09:07:21 EST
Still having this issue on Fedora 27 with a Thinkpad L460 and Broadcom Wifi. 

Does it really need to be reopened?
Comment 40 Martin Bříza 2018-02-02 07:02:50 EST
Still happening consistently with Fedora 27, I guess about the same hardware as everyone else in this bug:

kernel-4.14.14-300.fc27.x86_64

02:00.0 Network controller: Broadcom Limited BCM43602 802.11ac Wireless LAN SoC (rev 01)


[root@localhost ~]# cat /proc/`pidof rmmod`/stack 
[<ffffffffffffffff>] 0xffffffffffffffff
Comment 41 Cajus Pollmeier 2018-05-11 01:14:52 EDT
F28 still has the issue, too.