Red Hat Bugzilla – Bug 1294415
Removing the nf_conntrack module hangs at shutdown using 100% CPU
Last modified: 2018-05-11 01:14:52 EDT
Created attachment 1109959 [details]
Outputs of lsmod and lshw
Description of problem:
firewalld's cleanup code, as enabled in the default Fedora 23 configuration, does `rmmod nf_conntrack`. When done on shutdown, the process hangs indefinitely using 100% CPU and holding up clean unmounting of filesystems. I can't reproduce it reliably, but it seems uptime is correlated somehow: shutting down after a longer time has a larger probability of failure.
Version-Release number of selected component (if applicable):
Fedora 23 x86_64
kernel-4.4.0-0.rc6.git1.2.fc24.x86_64 from the rawhide-kernel-nodebug repository
firewalld 0.3.14.2 (I don't think it matters though: rmmod is the command that hangs)
Steps to Reproduce:
Hard to know if configuration-dependent, but for me, shutting down the system with firewalld's cleanup enabled. Stopping firewalld manually or calling rmmod manually outside of shutdown does not hang. Disabling the cleanup also does not hang, with seemingly no ill effects.
Actual results:
rmmod hangs and holds up shutdown. Setting up a systemd debug console let me see that it hangs using 100% of a single core. I tried attaching GDB to the process but it does not succeed: GDB hangs before I get a prompt.
Expected results:
rmmod does not hang
Additional info:
I tried to reproduce with kernel-4.2.6-301.fc23.x86_64 to no avail. Unfortunately it also changes the hardware setup: the Wi-fi chipset (Broadcom BCM4350) only works in kernel 4.4. It is possible it is involved somehow, but I couldn't tell.
There are no messages on dmesg of any interest, other than the ebtables unload messaging indicating that the rmmod did not hang (as it is unloaded right afterwards).
I tried installing the conntrack-tools package, but I cannot run the `conntrack` command in the debug console: it fails with a bunch of messages about unknown symbols related to netlink.
I attached the output of lsmod with firewalld running, and a hardware description from lshw. I'm using a Dell XPS 13 laptop (Late 2015 edition).
Just had firewalld's restart hang without shutting down. As before, I cannot attach GDB to the process. Running the `conntrack` utility hangs without output (although there seem to be some errors produced in the kernel log). Fortunately I can get a bit of information from /proc.
$ cat /proc/5701/cmdline
$ cat /proc/5701/stack
[<ffffffffa0890478>] nf_conntrack_cleanup_net_list+0x48/0x110 [nf_conntrack]
[<ffffffffa0890bb0>] nf_conntrack_pernet_exit+0x70/0x80 [nf_conntrack]
[<ffffffffa0897fac>] nf_conntrack_standalone_fini+0x15/0x69 [nf_conntrack]
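For anyone collecting the same information, the frames in /proc/<pid>/stack can be split into symbol and module with a few lines of Python. This is only a sketch: `parse_stack` and `read_stack` are hypothetical helper names, and the regular expression assumes the `[<address>] symbol+offset/size [module]` layout seen in the trace above.

```python
import re

# Assumed frame layout, taken from the trace above:
# [<ffffffffa0890478>] nf_conntrack_cleanup_net_list+0x48/0x110 [nf_conntrack]
FRAME_RE = re.compile(
    r"^\[<(?P<addr>[0-9a-fA-F]+)>\]\s+"
    r"(?P<symbol>[^\s+]+)\+(?P<offset>0x[0-9a-fA-F]+)/(?P<size>0x[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[^\]]+)\])?\s*$"
)


def parse_stack(text):
    """Return a list of (symbol, module) tuples; module is None for built-in code."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group("symbol"), match.group("module")))
    return frames


def read_stack(pid):
    """Read and parse /proc/<pid>/stack (hypothetical helper; needs root)."""
    with open(f"/proc/{pid}/stack") as handle:
        return parse_stack(handle.read())
```

Calling `read_stack(5701)` as root while the process is stuck should give the same frames as the `cat` above, in a form that is easier to compare between reports.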
Created attachment 1109963 [details]
conntrack errors in dmesg
I am able to reproduce this bug in my Dell XPS 13 Windows Edition (Broadcom 4350 Wireless). It just continues to say "a stop job is running for firewalld", and keeps adding up hang time from 90 seconds upwards.
I'm also seeing this bug on a MacBook Pro 13" (2015) using a Broadcom BCM43602.
It is not consistently reproducible but long uptimes seem to trigger it.
I have the same problem on a Dell XPS 15 where the wireless controller is a broadcom BCM43602. I have not altered the firewall configuration from stock (apart from opening port 22).
I've noticed this twice under Fedora 24 Alpha, on a Dell XPS 13 (9350) with a Broadcom BCM4350 rev08, during normal day-to-day usage. I suspect the `rmmod` is being triggered as part of upgrading packages (last `dnf upgrade` picked up firewalld-0.4.1-1.fc24.noarch). The impact is that, instead of my laptop being unable to shut down, upgrading firewalld makes it chew through battery as a core is pegged at 100%.
I can't really do much with the process. It's unkillable, _trace/gdb hang when attaching to it, and the stack is empty (which I don't know is useful, but someone else has given those details so I thought I would as well):
# cat /proc/25939/stack
`dmesg` and `journalctl` are absolutely littered with unknown symbols for nf_nat and xt_conntrack as in comment #2.
# uname -a
Linux sleepygary 4.5.1-300.fc24.x86_64 #1 SMP Tue Apr 12 18:55:06 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Is anybody seeing this *not* using the brcmfmac driver?
All the people I know suffering from this problem are using brcmfmac. So yes, it's probably an issue with this driver.
Currently (on my up-to-date F23 system), it happens in 99% of the cases if the uptime is long enough (10 minutes is enough).
Also experiencing this problem on a Dell XPS 15 with the brcmfmac driver on both Fedora 23 and Fedora 24 Beta (kernel 4.5.4-300.fc24.x86_64). Since I never stop firewalld except on shutdown and my laptop isn't exactly mission critical, I removed the two lines which call unload_firewall_modules() from /usr/lib/python3.5/site-packages/firewall/core/fw.py, which prevents firewalld from calling rmmod when it is stopped or restarted. According to https://bugzilla.redhat.com/show_bug.cgi?id=1031102#c22, firewalld removes these modules to provide a small performance boost. This fixes the problem, but is obviously not ideal.
dylanxmackenzie: it's not strictly necessary to modify firewalld's code, there is a CleanupOnExit option in /etc/firewalld/firewalld.conf that you can set to `no`. It also has the side effect of not resetting the iptables rules when the daemon exits, but if you're not usually stopping it manually it shouldn't make a difference.
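As a rough illustration of that workaround, the option can be flipped programmatically. This is a minimal sketch, assuming firewalld.conf uses plain `Key=value` lines; `set_cleanup_on_exit` is a hypothetical helper that only transforms text, so nothing is written to disk here.

```python
def set_cleanup_on_exit(conf_text, value="no"):
    """Return conf_text with CleanupOnExit set to value (appended if absent)."""
    lines = conf_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        key, sep, _old = line.partition("=")
        # Only touch an active setting; commented-out lines are left alone.
        if sep and key.strip() == "CleanupOnExit":
            lines[index] = f"CleanupOnExit={value}"
            replaced = True
    if not replaced:
        lines.append(f"CleanupOnExit={value}")
    return "\n".join(lines) + "\n"
```

To apply it, read /etc/firewalld/firewalld.conf as root, pass the contents through the function, and write the result back. The running daemon presumably still has the old setting until it is next started.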
(In reply to Daniel Miranda from comment #10)
> dylanxmackenzie: it's not strictly necessary to modify firewalld's code,
> there is a CleanupOnExit option in /etc/firewalld/firewalld.conf that you
> can set to `no`. It also has the side effect of not resetting the iptables
> rules when the daemon exits, but if you're not usually stopping it manually
> it shouldn't make a difference.
Whoops, that makes things much simpler. Thanks.
Still noticing this one under 4.6.3-300.fc24.x86_64 for kernel/net/netfilter/nf_conntrack.ko.xz. Is there any other additional information that I can provide or things I can try to help diagnose the cause?
I actually see rmmod nf_conntrack eating 100% CPU even before rebooting, on a fresh install on a MBP 15" (recent model). I installed a bunch of stuff and did an update; dnf started to slow down and "top" showed rmmod hogging one core. dnf timed out after a long wait. I then reran the dnf command, and on this second run it performed as expected, although rmmod was still hogging the core. I then rebooted, which took a long time but did eventually proceed.
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.
Fedora 24 has now been rebased to 4.7.4-200.fc24. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you have moved on to Fedora 25, and are still experiencing this issue, please change the version to Fedora 25.
If you experience different issues, please open a new bug report for those.
Bug still appears on my Thinkpad L460 with kernel 4.7.5-200.fc24.
Also still happens with 4.7.6-200.fc24.
I still see this with Fedora 25 beta and kernel 4.8.1-1.fc25.x86_64.
I still see this with F25 RC1.2 and F24 on a Dell XPS 15 9550
(same wifi/bt chip as others have already reported:
I'm seeing this on a MacBook Pro 13" (2015) in CentOS now that the BCM43602 is supported in 7.3. It wasn't an issue for CentOS 7.2 because wifi card wasn't supported in that version. I realize this bug is filed against Fedora, not RHEL or CentOS but I wanted to add this note in case it's helpful for diagnosing and solving the problem.
Hello I also have this issue on a Dell XPS 9550 with a Broadcom Limited BCM43602. I am running Fedora 25 4.8.15-300.fc25.x86_64.
Same issue here Macbook pro mid 2015 with BCM43602. Haven't been able to do a clean shutdown since i installed fedora (which i use every day).
(In reply to Maël Lavault from comment #22)
> Same issue here Macbook pro mid 2015 with BCM43602. Haven't been able to do
> a clean shutdown since i installed fedora (which i use every day).
As mentioned in comment #10 you can set CleanupOnExit=no in firewalld.conf as a temporary workaround.
Btw is there an upstream bug for this? That's a kernel bug in the end, should not hang up when removing a module.
I'm pretty sure bugs 1294415, 1293041, and 1397274 are all related. I'm having the same issue on my RHEL7 installations (six of them).
Created attachment 1250707 [details]
ftrace module nf_conntrack
Hello, I've run some more tests, now with F25 + kernel 4.9.9-200.fc25.x86_64 on an XPS 9550 (it's using brcmfmac). I've found that if I stop and start firewalld just after boot, everything works fine, but if I stop it after some time, I can reproduce the issue. I've tried to dig deeper but I don't really know what I'm doing. This is what I've found; I hope it can help:
firewalld running -> nf_conntrack has used by = 10 (from lsmod)
firewalld stop works -> lsmod | grep nf_ is empty
firewalld failing to stop -> nf_conntrack has used by = -1 (from lsmod)
used by = -1 means that the module is unloading (function module_refcount from kernel/module.c, there's a comment above); the output of lsmod | grep nf_ in this case is:
nf_reject_ipv6 16384 1 ip6t_REJECT
nf_defrag_ipv6 36864 0
nf_defrag_ipv4 16384 0
nf_conntrack 106496 -1
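A check for this state could be scripted along the following lines. It is a sketch only: `unloading_modules` is a hypothetical helper, and it assumes the usual three leading lsmod columns (module name, size, use count) as in the output above.

```python
def unloading_modules(lsmod_text):
    """Return the names of modules whose lsmod use count is -1."""
    stuck = []
    for line in lsmod_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        name, used = fields[0], fields[2]
        try:
            refcount = int(used)
        except ValueError:
            # The "Module  Size  Used by" header has no numeric use count.
            continue
        if refcount == -1:
            stuck.append(name)
    return stuck
```

Feeding it the output of lsmod while firewalld is failing to stop should report nf_conntrack; an empty list would mean no module is reported as mid-unload.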
I booted F25 in rescue mode (add 1 to kernel cmdline), then ran
trace-cmd start -e module -f 'name == nf_conntrack'
then resumed boot, then ran systemctl start/stop firewalld a few times. When it failed, I stopped ftrace and saved the report, which I've attached here.
Is this bug also filed upstream?
There is a bug filed for this for F25 here: bug 1397274
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 24 kernel bugs.
Fedora 25 has now been rebased to 4.10.9-100.fc24. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you have moved on to Fedora 26, and are still experiencing this issue, please change the version to Fedora 26.
If you experience different issues, please open a new bug report for those.
Still happens on Fedora 25 with 4.10.8-200.fc25.x86_64.
Same on 4.10.10-200.fc25.x86_64. Additionally, rmmod eats up 99% of my CPU time and I have to reboot.
Still there on 4.10.12-200.fc25.x86_64
I also tested on the Fedora 26 Alpha live usb, and it is present there as well.
I am running on a MacbookPro Mid 2015 15".
Created attachment 1288289 [details]
100% cpu on rmmod nf_conntrack
I have similar issue on Fedora 25
LSB Version: :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: Fedora
Description: Fedora release 25 (Twenty Five)
Linux menzoberranzan 4.11.4-200.fc25.x86_64 #1 SMP Wed Jun 7 18:28:00 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
My machine sometimes hangs with 100% CPU and overheating. It is almost always an rmmod nf_conntrack. Screenshot attached.
this should be fixed in upstream kernel v4.12-rc1 and later - so the rawhide kernel should be good. Can you please have a run at it?
Just encountered it on 4.12.0-0.rc2.git2.2.fc27.x86_64
Still encountering this in 4.13.0-0.rc1.git0.1.fc27.
I am another Dell XPS 9350 user with the Broadcom chip and have been dealing with this issue for close to a year on Rawhide. The machine does eventually power off after the firewalld shutdown process times out in ~2 minutes.
(In reply to Phea Duch from comment #34)
> Still encountering this in 4.13.0-0.rc1.git0.1.fc27.
This is quite unexpected: 4.13.0-0.rc1.git0.1.fc27 definitely does not contain the code path triggered by the issue, as described e.g. in comment#1.
Can you please provide the output of:
cat /proc/`pidof rmmod`/stack
(to be run when the issue occurs, as root)
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.
Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.
Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.
Fedora 24 changed to end-of-life (EOL) status on 2017-08-08. Fedora 24 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.
If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
Thank you for reporting this bug and we are sorry it could not be fixed.
(In reply to Paolo Abeni from comment #32)
> this should be fixed in upstream kernel v4.12-rc1 and later - so the rawhide
> kernel should be good. Can you please have a run at it?
I have an XPS 13 (9359 / late 2015 with Broadcom Wifi). The system had the issue on F24 and now also on a freshly installed F26. For testing purposes, I've installed
from koji. But this does not resolve the issue with the non-stopping dynamic firewalld on reboot or shutdown.
Only the workaround, i.e. setting CleanupOnExit=no in /etc/firewalld/firewalld.conf resolves the issue.
Maybe it is not a kernel issue; I think it is instead something involving firewalld and the Broadcom wifi.
Note: on the current XPS 13 (late 2016 with Killer Wifi) the issue is not present.
Still having this issue on Fedora 27 with a Thinkpad L460 and Broadcom Wifi.
Does it really need to be reopened?
Still happening consistently with Fedora 27, I guess about the same hardware as everyone else in this bug:
02:00.0 Network controller: Broadcom Limited BCM43602 802.11ac Wireless LAN SoC (rev 01)
[root@localhost ~]# cat /proc/`pidof rmmod`/stack
F28 still has the issue, too.