Bug 1495893 - ebtables --concurrent may leave a stale lock file
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: ebtables
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Assigned To: Phil Sutter
QA Contact: Tomas Dolezal
Reported: 2017-09-26 09:01 EDT by Vladimir Benes
Modified: 2018-04-10 13:34 EDT
CC List: 6 users
Fixed In Version: ebtables-2.0.10-16.el7
Doc Type: Bug Fix
Doc Text:
Cause: When killed, the ebtables process may leave a stale lock file in place.
Consequence: Subsequent calls to ebtables would hang indefinitely, waiting for the lock file to disappear and occasionally printing: "Trying to obtain lock /run/ebtables.lock".
Fix: The ebtables code was changed to use flock() for lock file handling, which makes the lock disappear automatically upon program termination.
Result: Unexpected ebtables program termination no longer leads to a stale lock file.
Last Closed: 2018-04-10 13:34:55 EDT
Type: Bug
External Tracker: Red Hat Product Errata RHBA-2018:0941 (last updated 2018-04-10 13:34 EDT)

Description Vladimir Benes 2017-09-26 09:01:33 EDT
Description of problem:
We recently updated the NetworkManager CI infrastructure to RHEL 7.4 and are seeing some failures connected to firewalld:

ERROR:dbus.proxies:Introspect error on :1.10172:/org/fedoraproject/FirewallD1: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.


Firewalld no longer seems to respond to NM. We don't have a proper reproducer, but the issue cripples a sizable portion of our nightly tests every few days.

Version-Release number of selected component (if applicable):
kernel-3.10.0-693.2.1.el7.x86_64
NetworkManager-1.8.5-17607.b363a6cfad.el7.x86_64
firewalld-0.4.4.4-6.el7.noarch
systemd-219-42.el7_4.1.x86_64

How reproducible:
random

Steps to Reproduce:
N/A
Comment 1 Eric Garver 2017-09-26 14:04:42 EDT
Can you check the firewalld logs?

  # systemctl status -n 50 firewalld
Comment 2 Vladimir Benes 2017-09-26 17:07:57 EDT
nothing helpful there
[root@gsm-r5s2-01 ~]# systemctl status -n 50 firewalld -l
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-09-26 12:07:49 EDT; 4h 57min ago
     Docs: man:firewalld(1)
 Main PID: 6444 (firewalld)
   CGroup: /system.slice/firewalld.service
           ├─6444 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
           └─6462 /usr/sbin/ebtables --concurrent -L

Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Started firewalld - dynamic firewall daemon.

[root@gsm-r5s2-01 ~]# journalctl -u firewalld
-- Logs begin at Tue 2017-09-26 10:48:25 EDT, end at Tue 2017-09-26 17:05:10 EDT. --
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Started firewalld - dynamic firewall daemon.
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Stopping firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Started firewalld - dynamic firewall daemon.
Comment 3 Francesco Giudici 2017-09-28 03:54:47 EDT
When the issue happens, every firewall-cmd invocation hangs. Even trivial commands like:
# firewall-cmd -V
or
# firewall-cmd --state
hang indefinitely. After a while, the message:
"ERROR:dbus.proxies:Introspect error on :1.10172:/org/fedoraproject/FirewallD1: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken."
is printed on screen (firewall-cmd stays blocked).

Restarting the firewalld service via systemctl does not help.
Restarting the DBUS service via systemctl does not help either.
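
As a stopgap for scripts or CI jobs that must not block forever on an unresponsive firewalld, one option (a sketch only, not a fix) is to bound each call with coreutils timeout(1):

  # timeout 10 firewall-cmd --state || echo "firewalld unresponsive"

The 10-second limit is an arbitrary choice; this only keeps the caller from hanging and does nothing about the underlying problem.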
Comment 4 Francesco Giudici 2017-09-28 08:05:44 EDT
I am able to reproduce the issue on an up-to-date RHEL-7.4 VM with the following script:

#!/bin/bash

i=1;
while true; do
	echo "---  $i  ---";
	systemctl unmask firewalld;
	systemctl start firewalld;
	usleep 40000
	systemctl restart firewalld;
	firewall-cmd -V;
	service firewalld stop;
	i=$((i+1));
done

It hung on my VM on "firewall-cmd -V" after some iterations: hangs occurred after iterations 2, 6, 10, 15, and 36 (a reboot was needed after each hang).
I then got an error like:
"ERROR:dbus.proxies:Introspect error on :1.84:/org/fedoraproject/FirewallD1: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken."

and firewall-cmd is no longer usable.
No useful logs in journal.
Let me know if there is some log I can collect for you.

Note that changing the timing (e.g., removing the usleep entirely or replacing it with a sleep 1) prevents the issue.
Putting a "sleep 1" anywhere else in the loop also prevents it.
The timing is specific to my VM: I tried the script on another (beaker) machine and it was much harder to reproduce there.
Comment 5 Eric Garver 2017-09-28 15:23:55 EDT
I was able to reproduce this in my VM as per comment 4. Thanks Francesco.

This is an issue with the ebtables concurrent locking. My guess is that when firewalld is killed by systemd, ebtables is also killed, but the lock file is left lying around. Further invocations just spin waiting on the lock file. As a result, firewalld is left waiting on ebtables to return and is therefore unable to process dbus requests.

Since the lock file is left lying around, restarting firewalld/dbus/etc. has no effect. Only removing the stale lock file, /run/ebtables.lock, or rebooting resolves the issue.

Quick workaround:
  # systemctl stop firewalld
  # rm /run/ebtables.lock
  # systemctl start firewalld
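
After step 1, no ebtables process should be left running, so a lock file that still exists at that point is stale by definition. A quick check sketch (assumes nothing else on the box runs ebtables):

  # [ -e /run/ebtables.lock ] && ! pgrep -x ebtables >/dev/null && echo "stale lock"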

--->8---

#  systemctl status -n 50 -l firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-09-28 15:05:25 EDT; 4min 51s ago
     Docs: man:firewalld(1)
 Main PID: 6751 (firewalld)
   CGroup: /system.slice/firewalld.service
           ├─6751 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid --debug 5
           └─6762 /usr/sbin/ebtables --concurrent -L


# strace -p 6762
strace: Process 6762 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
rt_sigprocmask(SIG_BLOCK, [INT], NULL, 8) = 0
open("/run/ebtables.lock", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
rt_sigprocmask(SIG_UNBLOCK, [INT], NULL, 8) = 0
write(2, "Trying to obtain lock /run/ebtab"..., 41) = 41
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffdd7d5cab0)       = 0
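
The strace above shows the old locking scheme: the lock is the mere existence of /run/ebtables.lock, created with O_CREAT|O_EXCL (which fails with EEXIST while the file is there), and the process retries once per second. In rough shell terms, an illustrative sketch of the same scheme (not the actual ebtables C code):

  # the file's existence is the lock; creation is atomic (set -C ~ O_EXCL)
  until (set -C; : > /run/ebtables.lock) 2>/dev/null; do
      echo "Trying to obtain lock /run/ebtables.lock"
      sleep 1
  done
  trap 'rm -f /run/ebtables.lock' EXIT   # cleanup never runs on SIGKILL
  # ... critical section ...

Because releasing the lock depends on the process surviving long enough to run its cleanup, any hard kill leaves the file behind and every later invocation loops forever.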
Comment 7 Eric Garver 2017-09-28 15:54:27 EDT
Reproducing with ebtables alone:

In first terminal:
  #  while true; do killall ebtables &> /dev/null; done

In second terminal:
  # ebtables --concurrent -L
  Terminated

Then Ctrl+C the loop in the first terminal.

Then try ebtables again:

  # ebtables --concurrent -L
  Trying to obtain lock /run/ebtables.lock

  Trying to obtain lock /run/ebtables.lock

# rm /run/ebtables.lock
rm: remove regular empty file ‘/run/ebtables.lock’? y

[root@dev-rhel7 ~]# ebtables --concurrent -L
Bridge table: filter

Bridge chain: INPUT, entries: 1, policy: ACCEPT
-j INPUT_direct

Bridge chain: FORWARD, entries: 1, policy: ACCEPT
-j FORWARD_direct

Bridge chain: OUTPUT, entries: 1, policy: ACCEPT
-j OUTPUT_direct

Bridge chain: INPUT_direct, entries: 0, policy: RETURN

Bridge chain: OUTPUT_direct, entries: 0, policy: RETURN

Bridge chain: FORWARD_direct, entries: 0, policy: RETURN
Comment 8 Phil Sutter 2017-10-06 06:58:19 EDT
Locking in ebtables wasn't atomic, so it was possible for the lock to stay in place after the process was killed. I fixed that by using flock() (taking the locking in iptables as an example):

https://marc.info/?l=netfilter-devel&m=150728694430435&w=2
Comment 9 Phil Sutter 2017-10-24 13:45:11 EDT
Upstream accepted:

commit 6a826591878db3fa9e2a94b87a3d5edd8e0fc442
Author: Phil Sutter <phil@nwl.cc>
Date:   Fri Oct 6 12:48:50 2017 +0200

    Use flock() for --concurrent option
    
    The previous locking mechanism was not atomic, hence it was possible
    that a killed ebtables process would leave the lock file in place which
    in turn made future ebtables processes wait indefinitely for the lock to
    become free.
    
    Fix this by using flock(). This also simplifies code quite a bit because
    there is no need for a custom signal handler or an __exit routine
    anymore.
    
    Signed-off-by: Phil Sutter <phil@nwl.cc>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
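
The guarantee the patch relies on can be seen from the shell as well, since flock(1) wraps the same flock() syscall: the kernel releases the lock whenever the holding process exits, no matter how it dies. A minimal sketch of the mechanism (not the patched C code):

  (
      flock -x 9        # blocks until the exclusive lock on fd 9 is granted
      # ... critical section ...
  ) 9>/run/ebtables.lock
  # when the subshell exits (even via SIGKILL), fd 9 is closed and the
  # kernel drops the lock; the lock file may remain, but that is harmless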
Comment 11 Phil Sutter 2017-10-25 09:03:45 EDT
Anita, could you please provide pm_ack+ for this ticket?

Thanks, Phil
Comment 12 Phil Sutter 2017-10-25 14:02:19 EDT
Ah, nevermind - all set now.
Comment 20 errata-xmlrpc 2018-04-10 13:34:55 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0941
