Red Hat Bugzilla – Bug 1495893
ebtables --concurrent may leave a stale lock file
Last modified: 2018-04-10 13:34:58 EDT
Description of problem:
We recently updated the NetworkManager CI infrastructure to RHEL 7.4 and now see occasional failures connected to firewalld:

ERROR:dbus.proxies:Introspect error on :1.10172:/org/fedoraproject/FirewallD1: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

After this, firewalld no longer responds to NetworkManager. We don't have a proper reproducer, but it cripples a sizeable portion of our nightly tests every few days.

Version-Release number of selected component (if applicable):
kernel-3.10.0-693.2.1.el7.x86_64
NetworkManager-1.8.5-17607.b363a6cfad.el7.x86_64
firewalld-0.4.4.4-6.el7.noarch
systemd-219-42.el7_4.1.x86_64

How reproducible: random

Steps to Reproduce: N/A
Can you check the firewalld logs?

# systemctl status -n 50 firewalld
Nothing helpful there:

[root@gsm-r5s2-01 ~]# systemctl status -n 50 firewalld -l
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; disabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-09-26 12:07:49 EDT; 4h 57min ago
     Docs: man:firewalld(1)
 Main PID: 6444 (firewalld)
   CGroup: /system.slice/firewalld.service
           ├─6444 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid
           └─6462 /usr/sbin/ebtables --concurrent -L

Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Started firewalld - dynamic firewall daemon.

[root@gsm-r5s2-01 ~]# journalctl -u firewalld
-- Logs begin at Tue 2017-09-26 10:48:25 EDT, end at Tue 2017-09-26 17:05:10 EDT. --
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Started firewalld - dynamic firewall daemon.
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Stopping firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 26 12:07:49 gsm-r5s2-01.wlan.rhts.eng.bos.redhat.com systemd[1]: Started firewalld - dynamic firewall daemon.
When the issue happens, every firewall-cmd invocation hangs. For example:

# firewall-cmd -V
# firewall-cmd --state

would hang indefinitely. After a while the message

"ERROR:dbus.proxies:Introspect error on :1.10172:/org/fedoraproject/FirewallD1: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken."

is printed on screen (firewall-cmd stays blocked). Restarting the firewalld service via systemctl does not help. Restarting the D-Bus service via systemctl does not help either.
I am able to reproduce the issue on an up-to-date RHEL-7.4 VM with the following script:

#!/bin/bash
i=1
while true; do
    echo "--- $i ---"
    systemctl unmask firewalld
    systemctl start firewalld
    usleep 40000
    systemctl restart firewalld
    firewall-cmd -V
    service firewalld stop
    i=$((i+1))
done

It hung on my VM on "firewall-cmd -V" after some iterations: it happened after 2, 6, 10, 15 and 36 (I needed a reboot after each hang). I then got an error like:

"ERROR:dbus.proxies:Introspect error on :1.84:/org/fedoraproject/FirewallD1: dbus.exceptions.DBusException: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken."

and firewall-cmd is no longer usable. There are no useful logs in the journal. Let me know if there is some log I can collect for you.

Note that changing the timing (e.g., removing the usleep entirely, replacing it with a sleep 1, or putting a "sleep 1" anywhere in the loop) prevents the issue from appearing. That timing works on my VM only: I tried the script on another (beaker) machine and it was much harder to reproduce there.
I was able to reproduce this in my VM as per comment 4. Thanks Francesco.

This is an issue with ebtables' --concurrent locking. My guess is that when firewalld is killed by systemd, ebtables is killed as well, but its lock file is left lying around. Further invocations just spin waiting on the lock file. As a result, firewalld is left waiting on ebtables to return and is therefore unable to process D-Bus requests. Since the lock file is left behind, restarting firewalld/dbus/etc. has no effect. Only removing the stale lock file, /run/ebtables.lock, or a reboot resolves the issue.

Quick workaround:

# systemctl stop firewalld
# rm /run/ebtables.lock
# systemctl start firewalld

--->8---

# systemctl status -n 50 -l firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2017-09-28 15:05:25 EDT; 4min 51s ago
     Docs: man:firewalld(1)
 Main PID: 6751 (firewalld)
   CGroup: /system.slice/firewalld.service
           ├─6751 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid --debug 5
           └─6762 /usr/sbin/ebtables --concurrent -L

# strace -p 6762
strace: Process 6762 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
rt_sigprocmask(SIG_BLOCK, [INT], NULL, 8) = 0
open("/run/ebtables.lock", O_WRONLY|O_CREAT|O_EXCL, 0600) = -1 EEXIST (File exists)
rt_sigprocmask(SIG_UNBLOCK, [INT], NULL, 8) = 0
write(2, "Trying to obtain lock /run/ebtab"..., 41) = 41
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({1, 0}, 0x7ffdd7d5cab0) = 0
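For context, the strace above corresponds to a lock loop along these lines. This is a minimal reconstruction from the syscalls, with illustrative names; it is not the actual ebtables source. The lock is "owned" by whoever manages to create the file with O_EXCL; everyone else sleeps a second and retries. If the owner is killed before its cleanup handler can unlink the file, the file survives and every later caller spins here forever.

/* Sketch of the pre-fix --concurrent locking, reconstructed from the
 * strace output above. Names are illustrative, not from ebtables. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define LOCKFILE "/run/ebtables.lock"

static int acquire_lock(void)
{
    int fd;

    /* O_CREAT|O_EXCL succeeds only if the file does not already
     * exist, so successfully creating it acts as taking the lock. */
    while ((fd = open(LOCKFILE, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0) {
        fprintf(stderr, "Trying to obtain lock %s\n", LOCKFILE);
        sleep(1); /* matches nanosleep({1, 0}, ...) in the strace */
    }
    return fd;
}

/* The catch: releasing the lock means unlink(LOCKFILE), which only
 * runs from exit/signal cleanup. Kill the process before that runs
 * and the file stays, so every later caller loops above forever --
 * exactly the hang seen in the strace. */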
Reproducing with ebtables alone. In a first terminal:

# while true; do killall ebtables &> /dev/null; done

In a second terminal:

# ebtables --concurrent -L
Terminated

Then ctrl+c the loop in the first terminal and try ebtables again:

# ebtables --concurrent -L
Trying to obtain lock /run/ebtables.lock
Trying to obtain lock /run/ebtables.lock
# rm /run/ebtables.lock
rm: remove regular empty file ‘/run/ebtables.lock’? y
[root@dev-rhel7 ~]# ebtables --concurrent -L
Bridge table: filter

Bridge chain: INPUT, entries: 1, policy: ACCEPT
-j INPUT_direct

Bridge chain: FORWARD, entries: 1, policy: ACCEPT
-j FORWARD_direct

Bridge chain: OUTPUT, entries: 1, policy: ACCEPT
-j OUTPUT_direct

Bridge chain: INPUT_direct, entries: 0, policy: RETURN

Bridge chain: OUTPUT_direct, entries: 0, policy: RETURN

Bridge chain: FORWARD_direct, entries: 0, policy: RETURN
Locking in ebtables wasn't atomic, so it was possible for the lock file to stay in place after the process was killed. I fixed that by using flock(), taking the locking in iptables as an example: https://marc.info/?l=netfilter-devel&m=150728694430435&w=2
Upstream accepted:

commit 6a826591878db3fa9e2a94b87a3d5edd8e0fc442
Author: Phil Sutter <phil@nwl.cc>
Date:   Fri Oct 6 12:48:50 2017 +0200

    Use flock() for --concurrent option

    The previous locking mechanism was not atomic, hence it was possible
    that a killed ebtables process would leave the lock file in place
    which in turn made future ebtables processes wait indefinitely for
    the lock to become free.

    Fix this by using flock(). This also simplifies code quite a bit
    because there is no need for a custom signal handler or an __exit
    routine anymore.

    Signed-off-by: Phil Sutter <phil@nwl.cc>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
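For the record, the reason flock() fixes this: the kernel ties the lock to the open file description and releases it automatically when the process exits, no matter how it was killed. A minimal sketch of the approach, with illustrative names rather than the code from the actual patch:

/* Sketch of flock()-based locking as used by the fix; names are
 * illustrative, not copied from the patch. */
#include <sys/file.h>
#include <fcntl.h>
#include <unistd.h>

#define LOCKFILE "/run/ebtables.lock"

static int acquire_lock(void)
{
    /* The file is just an anchor for the lock; its mere existence
     * no longer means "locked". */
    int fd = open(LOCKFILE, O_RDONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Blocks until the exclusive lock is free. The kernel drops the
     * lock when fd is closed -- including when the process is killed
     * -- so no stale lock can survive and no signal handler or exit
     * routine is needed. */
    if (flock(fd, LOCK_EX) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

With this scheme a leftover /run/ebtables.lock file is harmless, since only a held flock() lock, not the file's existence, blocks other ebtables invocations.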
Anita, could you please provide pm_ack+ for this ticket? Thanks, Phil
Ah, nevermind - all set now.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0941