Red Hat Bugzilla – Bug 1495893
ebtables --concurrent may leave a stale lock file
Last modified: 2018-04-10 13:34:58 EDT
Description of problem:
We recently updated the NetworkManager CI infrastructure to RHEL 7.4 and now see occasional failures connected to firewalld:

ERROR:dbus.proxies:Introspect error on :1.10172:/org/fedoraproject/FirewallD1: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

After this, firewalld no longer responds to NetworkManager. We don't have a proper reproducer, but it cripples a sizeable portion of our nightly tests every few days.

Version-Release number of selected component (if applicable):
kernel-3.10.0-693.2.1.el7.x86_64
NetworkManager-1.8.5-17607.b363a6cfad.el7.x86_64
firewalld-0.4.4.4-6.el7.noarch
systemd-219-42.el7_4.1.x86_64

How reproducible: random

Steps to Reproduce: N/A
Can you check the firewalld logs?

# systemctl status -n 50 firewalld
Nothing helpful there:

[root@gsm-r5s2-01 ~]# systemctl status -n 50 firewalld -l
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-09-26 12:07:49 EDT; 4h 57min ago
     Docs: man:firewalld(1)
 Main PID: 6444 (firewalld)
   CGroup: /system.slice/firewalld.service
           ├─6444 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
           └─6462 /usr/sbin/ebtables --concurrent -L

Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Started firewalld - dynamic firewall daemon.

[root@gsm-r5s2-01 ~]# journalctl -u firewalld
-- Logs begin at Tue 2017-09-26 10:48:25 EDT, end at Tue 2017-09-26 17:05:10 EDT. --
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Started firewalld - dynamic firewall daemon.
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Stopping firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Started firewalld - dynamic firewall daemon.
When the issue happens, every firewall-cmd invocation hangs. For example:

# firewall-cmd -V
# firewall-cmd --state

would hang indefinitely. After a while the message

"ERROR:dbus.proxies:Introspect error on :1.10172:/org/fedoraproject/FirewallD1: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken."

is printed on screen (firewall-cmd stays blocked). Restarting the firewalld service via systemctl does not help. Restarting the D-Bus service via systemctl does not help either.
I am able to reproduce the issue on an up-to-date RHEL-7.4 VM with the following script:

#!/bin/bash
i=1
while true; do
    echo "--- $i ---"
    systemctl unmask firewalld
    systemctl start firewalld
    usleep 40000
    systemctl restart firewalld
    firewall-cmd -V
    service firewalld stop
    i=$((i+1))
done

It hung on my VM on "firewall-cmd -V" after some iterations: it happened after 2, 6, 10, 15 and 36 (I needed a reboot after each hang). I then got an error like:

"ERROR:dbus.proxies:Introspect error on :1.84:/org/fedoraproject/FirewallD1: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken."

and firewall-cmd is no longer usable. There are no useful logs in the journal. Let me know if there is some log I can collect for you.

Note that changing the timing (e.g., removing the usleep entirely, replacing it with a sleep 1, or putting a "sleep 1" anywhere in the loop) prevents the issue from appearing. That timing works on my VM only: I tried the script on another (beaker) machine and it was much harder to reproduce there.
I was able to reproduce this in my VM as per comment 4. Thanks Francesco.

This is an issue with ebtables' --concurrent locking. My guess is that when firewalld is killed by systemd, ebtables is killed as well, but its lock file is left lying around. Further invocations just spin waiting on the lock file. As a result, firewalld is left waiting on ebtables to return and is therefore unable to process D-Bus requests. Since the lock file is left behind, restarting firewalld/dbus/etc. has no effect. Only removing the stale lock file, /run/ebtables.lock, or a reboot resolves the issue.

Quick workaround:

# systemctl stop firewalld
# rm /run/ebtables.lock
# systemctl start firewalld

--->8---

# systemctl status -n 50 -l firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-09-28 15:05:25 EDT; 4min 51s ago
     Docs: man:firewalld(1)
 Main PID: 6751 (firewalld)
   CGroup: /system.slice/firewalld.service
           ├─6751 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid --debug 5
           └─6762 /usr/sbin/ebtables --concurrent -L

# strace -p 6762
strace: Process 6762 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
rt_sigprocmask(SIG_BLOCK, [INT], NULL, 8) = 0
open("/run/ebtables.lock", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
rt_sigprocmask(SIG_UNBLOCK, [INT], NULL, 8) = 0
write(2, "Trying to obtain lock /run/ebtab"..., 41) = 41
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffdd7d5cab0) = 0
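For context, the strace above corresponds to a lock loop along these lines. This is a minimal reconstruction from the syscalls, with illustrative names; it is not the actual ebtables source. The lock is "owned" by whoever manages to create the file with O_EXCL; everyone else sleeps a second and retries. If the owner is killed before its cleanup handler can unlink the file, the file survives and every later caller spins here forever.

/* Sketch of the pre-fix --concurrent locking, reconstructed from the
 * strace output above. Names are illustrative, not from ebtables. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define LOCKFILE "/run/ebtables.lock"

static int acquire_lock(void)
{
    int fd;

    /* O_CREAT|O_EXCL succeeds only if the file does not already
     * exist, so successfully creating it acts as taking the lock. */
    while ((fd = open(LOCKFILE, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0) {
        fprintf(stderr, "Trying to obtain lock %s\n", LOCKFILE);
        sleep(1); /* matches nanosleep({1, 0}, ...) in the strace */
    }
    return fd;
}

/* The catch: releasing the lock means unlink(LOCKFILE), which only
 * runs from exit/signal cleanup. Kill the process before that runs
 * and the file stays, so every later caller loops above forever --
 * exactly the hang seen in the strace. */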
Reproducing with ebtables alone. In a first terminal:

# while true; do killall ebtables &> /dev/null; done

In a second terminal:

# ebtables --concurrent -L
Terminated

Then ctrl+c the loop in the first terminal and try ebtables again:

# ebtables --concurrent -L
Trying to obtain lock /run/ebtables.lock
Trying to obtain lock /run/ebtables.lock
# rm /run/ebtables.lock
rm: remove regular empty file ‘/run/ebtables.lock’? y
[root@dev-rhel7 ~]# ebtables --concurrent -L
Bridge table: filter

Bridge chain: INPUT, entries: 1, policy: ACCEPT
-j INPUT_direct

Bridge chain: FORWARD, entries: 1, policy: ACCEPT
-j FORWARD_direct

Bridge chain: OUTPUT, entries: 1, policy: ACCEPT
-j OUTPUT_direct

Bridge chain: INPUT_direct, entries: 0, policy: RETURN

Bridge chain: OUTPUT_direct, entries: 0, policy: RETURN

Bridge chain: FORWARD_direct, entries: 0, policy: RETURN
Locking in ebtables wasn't atomic, so it was possible for the lock file to stay in place after the process was killed. I fixed that by using flock(), taking the locking in iptables as an example: https://marc.info/?l=netfilter-devel&m=150728694430435&w=2
Upstream accepted:

commit 6a826591878db3fa9e2a94b87a3d5edd8e0fc442
Author: Phil Sutter <phil@nwl.cc>
Date:   Fri Oct 6 12:48:50 2017 +0200

    Use flock() for --concurrent option

    The previous locking mechanism was not atomic, hence it was possible
    that a killed ebtables process would leave the lock file in place
    which in turn made future ebtables processes wait indefinitely for
    the lock to become free.

    Fix this by using flock(). This also simplifies code quite a bit
    because there is no need for a custom signal handler or an __exit
    routine anymore.

    Signed-off-by: Phil Sutter <phil@nwl.cc>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
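For the record, the reason flock() fixes this: the kernel ties the lock to the open file description and releases it automatically when the process exits, no matter how it was killed. A minimal sketch of the approach, with illustrative names rather than the code from the actual patch:

/* Sketch of flock()-based locking as used by the fix; names are
 * illustrative, not copied from the patch. */
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>

#define LOCKFILE "/run/ebtables.lock"

static int acquire_lock(void)
{
    /* The file is just an anchor for the lock; its mere existence
     * no longer means "locked". */
    int fd = open(LOCKFILE, O_RDONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Blocks until the exclusive lock is free. The kernel drops the
     * lock when fd is closed -- including when the process is killed
     * -- so no stale lock can survive and no signal handler or exit
     * routine is needed. */
    if (flock(fd, LOCK_EX) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

With this scheme a leftover /run/ebtables.lock file is harmless, since only a held flock() lock, not the file's existence, blocks other ebtables invocations.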
Anita, could you please provide pm_ack+ for this ticket? Thanks, Phil
Ah, nevermind - all set now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0941