Bug 1099031

Summary: firewall-cmd hangs after install of polkit
Product: [Fedora] Fedora Reporter: Ian Wienand <iwienand>
Component: dbusAssignee: Colin Walters <walters>
Status: CLOSED EOL QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high Docs Contact:
Priority: high    
Version: 20CC: agedosier, berrange, clalancette, itamar, iwienand, jeff, jforbes, jpopelka, kchamart, laine, libvirt-maint, lpoetter, mitr, nphilipp, twoerner, vedran, veillard, virt-maint, walters
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1098866
: 1109513 (view as bug list) Environment:
Last Closed: 2015-06-29 20:42:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1098866    
Bug Blocks: 1109513    

Description Ian Wienand 2014-05-19 10:47:28 UTC
+++ This bug was initially created as a clone of Bug #1098866 +++

Description of problem:

I am having this problem on a Racksapce "Fedora 20 (Heisenbug) (PVHVM)" image.

I can see "firewall-cmd --state" is hanging, and it causes libvirtd startup to hang as well.

You can replicate this with 

"firewall-cmd --state && yum install -y libvirt && firewall-cmd --state"

The last firewall-cmd command will hang.

I've found that restarting dbus *or* firewalld after installing libvirt will solve the problem, so something about the libvirt install gets it into a funny state.  

The next weird thing to happen was when trying to diagnose this, I strace'd firewalld while I ran firewall-cmd and saw it catching an exception.  After killing the hung firewall-cmd process and running it again, I started getting that exception on the command line: e.g.

---
[root@cloud-server-01 ~]# firewall-cmd --state


^\Quit
[root@cloud-server-01 ~]# firewall-cmd --state
Error: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/dbus/service.py", line 707, in _message_cb
    retval = candidate_method(self, *args, **keywords)
  File "/usr/lib/python2.7/site-packages/slip/dbus/service.py", line 153, in wrapped_method
    reply_handler=reply_handler, error_handler=error_handler)
  File "/usr/lib/python2.7/site-packages/slip/dbus/polkit.py", line 292, in IsSystemBusNameAuthorizedAsync
    details)
  File "/usr/lib/python2.7/site-packages/slip/dbus/polkit.py", line 276, in IsSystemBusNameAuthorizedAsync
    timeout=method_call_no_timeout)
  File "/usr/lib/python2.7/site-packages/dbus/proxies.py", line 137, in __call__
    **keywords)
  File "/usr/lib/python2.7/site-packages/dbus/connection.py", line 584, in call_async
    message.append(signature=signature, *args)
ValueError: Unable to guess signature from an empty dict

---

I will attach an strace of the first "good" run and one of the run that hangs after libvirt install.  I'll also attach lsof output of firewalld before & after

--- Additional comment from Ian Wienand on 2014-05-19 01:42:59 EDT ---



--- Additional comment from Ian Wienand on 2014-05-19 03:34:08 EDT ---



--- Additional comment from Ian Wienand on 2014-05-19 03:35:11 EDT ---



--- Additional comment from Ian Wienand on 2014-05-19 03:35:45 EDT ---



--- Additional comment from Daniel Berrange on 2014-05-19 06:01:53 EDT ---

Can you check whether there are any suspect kernel messages in dmesg. There have been some kernel bugs I've seen in F20, which cause firewalld (and indeed iptables in general) to hang - usually there is a kernel bug logged to dmesg when this happens.

--- Additional comment from Ian Wienand on 2014-05-19 06:30:10 EDT ---

(In reply to Daniel Berrange from comment #5)
> Can you check whether there are any suspect kernel messages in dmesg.

Nothing in dmesg, but yes there is something in the logs that looks suspicious

---
May 19 10:24:23 cloud-server-01 dbus-daemon[282]: dbus[282]: [system] Rejected send message, 1 matched rules; type="method_call", sender=":1.4" (uid=0 pid=277 comm="/usr/bin/python /usr/sbin/firewalld --nofork --nop") interface="org.freedesktop.DBus.Introspectable" member="Introspect" error name="(unset)" requested_reply="0" destination=":1.16" (uid=998 pid=2314 comm="/usr/lib/polkit-1/polkitd --no-debug ")
May 19 10:24:23 cloud-server-01 dbus[282]: [system] Rejected send message, 1 matched rules; type="method_call", sender=":1.4" (uid=0 pid=277 comm="/usr/bin/python /usr/sbin/firewalld --nofork --nop") interface="org.freedesktop.DBus.Introspectable" member="Introspect" error name="(unset)" requested_reply="0" destination=":1.16" (uid=998 pid=2314 comm="/usr/lib/polkit-1/polkitd --no-debug ")
---

--- Additional comment from Ian Wienand on 2014-05-19 06:46:11 EDT ---

Ok, so trying again, installing the polkit package causes the problem to appear

---
[root@cloud-server-01 ~]# yum install polkit

 ...

  Installing : mozjs17-17.0.0-8.fc20.x86_64                                                                                          1/3 
  Installing : polkit-0.112-2.fc20.x86_64                                                                                            2/3 
  Installing : polkit-pkla-compat-0.1-3.fc20.x86_64                                                                                  3/3 
[root@cloud-server-01 ~]# firewall-cmd --state
 ... hang...
---

to be clear -- replication is 

1) boot Racksapce "Fedora 20 (Heisenbug) (PVHVM)"
2) yum install polkit
3) firewall-cmd --state will hang after this until firewalld or dbus is restarted

Comment 1 Miloslav Trmač 2014-05-19 22:56:02 UTC
Thanks for your report.

This is something specific to firewalld (or the python-slip logic):

Directly after installing polkit, (pkaction) works fine (i.e. polkit basically works), and then
> dbus-send  --system --dest=org.freedesktop.PolicyKit1 --print-reply /org/freedesktop/PolicyKit1/Authority org.freedesktop.DBus.Introspectable.Introspect
works fine, and the same call using --dest=:1.$the_right_unique_name works as well.

OTOH firewalld/python-slip detects NameOwnerChanged, determines $the_right_unique_name, but AFAICS the same Introspect method call fails with an error.

(For reproducing, note that even a minimal Fedora install includes polkit: you’ll have to remove it (and possibly set up init.d/network instead of NM)).

Comment 2 Jiri Popelka 2014-05-20 10:12:37 UTC
Looks like python-slip problem, we've seen similar bug #895067 in past.

Comment 3 Nils Philippsen 2014-05-20 14:59:54 UTC
I don't think this is a bug in python-slip -- merely triggered by its trying to find out if polkitd exists on the dbus system bus.

Here's my take on what likely happens:

(In reply to Ian Wienand from comment #0)
> --- Additional comment from Ian Wienand on 2014-05-19 06:30:10 EDT ---
> 
> (In reply to Daniel Berrange from comment #5)
> > Can you check whether there are any suspect kernel messages in dmesg.
> 
> Nothing in dmesg, but yes there is something in the logs that looks
> suspicious
> 
> ---
> May 19 10:24:23 cloud-server-01 dbus-daemon[282]: dbus[282]: [system]
> Rejected send message, 1 matched rules; type="method_call", sender=":1.4"
> (uid=0 pid=277 comm="/usr/bin/python /usr/sbin/firewalld --nofork --nop")
> interface="org.freedesktop.DBus.Introspectable" member="Introspect" error
> name="(unset)" requested_reply="0" destination=":1.16" (uid=998 pid=2314
> comm="/usr/lib/polkit-1/polkitd --no-debug ")
> May 19 10:24:23 cloud-server-01 dbus[282]: [system] Rejected send message, 1
> matched rules; type="method_call", sender=":1.4" (uid=0 pid=277
> comm="/usr/bin/python /usr/sbin/firewalld --nofork --nop")
> interface="org.freedesktop.DBus.Introspectable" member="Introspect" error
> name="(unset)" requested_reply="0" destination=":1.16" (uid=998 pid=2314
> comm="/usr/lib/polkit-1/polkitd --no-debug ")
> ---

Dbus-python usually introspects methods before invoking them -- see http://dbus.freedesktop.org/doc/dbus-python/doc/tutorial.txt "Data Types" -- to figure out how to convert the python arguments.

This introspection is prohibited by dbus-daemon and I can only guess why:

1) When firewalld is first started, /etc/dbus-1/system.d/org.freedesktop.PolicyKit1.conf doesn't yet exist (because polkitd isn't installed).
2) Therefore dbus-daemon doesn't yet know about the org.freedesktop.PolicyKit1 destination and that anybody may send to it.
3) In order to be able to provide a fallback if polkit is not available (root may do everything, other users nothing), firewalld (via slip.dbus) "pings" the org.freedesktop.PolicyKit1.Authority interface which polkitd would provide, were it running.
4) Assumption: dbus-daemon prohibits that ping, and caches the result ("<firewalld process> may not send to <polkit destination>").
5) The polkit package gets installed, /etc/dbus-1/system.d/org.freedesktop.PolicyKit1.conf is available.
6) Firewalld (again via slip.dbus) notices polkitd appearing on the bus (via the NameOwnerChanged signal), successfully attempts to get hold of the org.freedesktop.PolicyKit1.Authority interface object (that may not even touch the bus, I'm not sure about dbus internals here), but the subsequent introspection of the method fails due to 4.
7) Restarting a) dbus-daemon or b) firewalld causes a) the complete bus policy to get reevaluated or b) firewalld being a new process makes dbus-daemon not use any cached information, therefore things work again.

If this is what happens (and it's the best explanation I have ;-), then dbus-daemon should forget about the cached permissions from step 4 when it applies the polkit dbus policy (/etc/dbus-1/system.d/org.freedesktop.PolicyKit1.conf) elsewhere. Changing component to dbus.

Comment 4 Colin Walters 2014-05-20 15:18:45 UTC
Did you try sending SIGHUP to dbus-daemon after #5 ?  

The messy thing here is that dbus tries to use inotify but it's going to race with the new polkit process starting.

Comment 5 Miloslav Trmač 2014-05-20 15:56:20 UTC
(In reply to Nils Philippsen from comment #3)
> 4) Assumption: dbus-daemon prohibits that ping, and caches the result
> ("<firewalld process> may not send to <polkit destination>").
> 5) The polkit package gets installed,
> /etc/dbus-1/system.d/org.freedesktop.PolicyKit1.conf is available.
> 6) Firewalld (again via slip.dbus) notices polkitd appearing on the bus (via
> the NameOwnerChanged signal), successfully attempts to get hold of the
> org.freedesktop.PolicyKit1.Authority interface object (that may not even
> touch the bus, I'm not sure about dbus internals here), but the subsequent
> introspection of the method fails due to 4.

The 4) cache might be an explanation.  I was looking at strace() of firewalld, and firewalld does get a NameOwnerChanged signal and does look up the “new” owner of PolicyKit1 by calling GetNameOwner, and only _then_ calls Introspect.

And I was able to use dbus-send to _successfully_ call Introspect even before invoking the firewalld command=>server code, which rules out a race in rereading the configuration or in name ownership; there was something different between a dbus-send caller and firewalld caller[1], and that cache might be an explanation.

[1] Or perhaps my reading of the strace was wrong and I wasn’t sending the right dbus-send command.

Comment 6 Ian Wienand 2014-06-11 22:30:23 UTC
Sorry but this has become a high-priority issue.

The problem is that the rackspace image used by upstream openstack has firewalld installed, while other cloud images don't.  Thus we need work-arounds in devstack to work in all situations.

Upstream are not too happy about having fedora specific work-arounds [1] and this is causing breakage

[1] https://review.openstack.org/#/c/99047/1

Comment 7 Nils Philippsen 2014-06-12 09:30:12 UTC
Let me ask the heretical question: why is polkit installed after the fact and not shipped with it?

Comment 8 Ian Wienand 2014-06-12 10:02:16 UTC
(In reply to Nils Philippsen from comment #7)
> Let me ask the heretical question: why is polkit installed after the fact
> and not shipped with it?

I don't know; it's the standard Rackspace F20 image ("Fedora 20 (Heisenbug) (PVHVM)").  It seems to differ in a few ways to the standard, perhaps it has something to do with issues they found running on xen

Comment 9 Vedran Miletić 2015-01-23 11:42:27 UTC
I had a similar problem on Fedora 21 x86_64 on DigitalOcean, a lot of messages like

Sij 23 12:36:24 gedora dbus[309]: [system] Rejected send message, 0 matched rules; type="error", sender=":1.2" (uid=0 pid=304 comm="/usr/bin/python -Es /usr/sbin/firewalld --nofork -") interface="(unset)" member="(unset)" error name="org.freedesktop.DBus.Error.ServiceUnknown" requested_reply="0" destination=":1.16" (uid=0 pid=1059 comm="/usr/bin/python -Es /usr/bin/firewall-cmd --list-a")
Sij 23 12:36:24 gedora dbus[309]: [system] Rejected send message, 0 matched rules; type="error", sender=":1.2" (uid=0 pid=304 comm="/usr/bin/python -Es /usr/sbin/firewalld --nofork -") interface="(unset)" member="(unset)" error name="org.freedesktop.DBus.Error.ServiceUnknown" requested_reply="0" destination=":1.16" (uid=0 pid=1059 comm="/usr/bin/python -Es /usr/bin/firewall-cmd --list-a")

were in the journal after running firewall-cmd. However, after installing polkit, I get

Sij 23 12:38:22 gedora groupadd[1072]: group added to /etc/group: name=polkitd, GID=991
Sij 23 12:38:22 gedora groupadd[1072]: group added to /etc/gshadow: name=polkitd
Sij 23 12:38:22 gedora groupadd[1072]: new group: name=polkitd, GID=991
Sij 23 12:38:22 gedora useradd[1077]: new user: name=polkitd, UID=994, GID=991, home=/, shell=/sbin/nologin
Sij 23 12:38:22 gedora dbus[309]: [system] Reloaded configuration
Sij 23 12:38:22 gedora dbus[309]: [system] Reloaded configuration
Sij 23 12:38:22 gedora dbus[309]: [system] Reloaded configuration
Sij 23 12:38:22 gedora dbus[309]: [system] Reloaded configuration
Sij 23 12:38:22 gedora dbus[309]: [system] Reloaded configuration
Sij 23 12:38:22 gedora systemd[1]: Reloading.
Sij 23 12:38:22 gedora systemd[1]: Configuration file /usr/lib/systemd/system/auditd.service is marked world-inaccessible. This has no effect as configuration data is accessible via APIs without restrictions. Proceeding anyway.
Sij 23 12:38:22 gedora yum[1066]: Installed: polkit-0.112-7.fc21.x86_64
Sij 23 12:38:22 gedora yum[1066]: Installed: polkit-pkla-compat-0.1-5.fc21.x86_64
Sij 23 12:38:25 gedora dbus[309]: [system] Activating via systemd: service name='org.freedesktop.PolicyKit1' unit='polkit.service'
Sij 23 12:38:25 gedora systemd[1]: Starting Authorization Manager...
Sij 23 12:38:25 gedora polkitd[1098]: Started polkitd version 0.112
Sij 23 12:38:25 gedora polkitd[1098]: Loading rules from directory /etc/polkit-1/rules.d
Sij 23 12:38:25 gedora polkitd[1098]: Loading rules from directory /usr/share/polkit-1/rules.d
Sij 23 12:38:25 gedora polkitd[1098]: Finished loading, compiling and executing 2 rules
Sij 23 12:38:25 gedora dbus[309]: [system] Successfully activated service 'org.freedesktop.PolicyKit1'
Sij 23 12:38:25 gedora systemd[1]: Started Authorization Manager.
Sij 23 12:38:25 gedora polkitd[1098]: Acquired the name org.freedesktop.PolicyKit1 on the system bus

and firewall-cmd works instantly, no more messages. So am I to understand this is fixed in F21?

Comment 10 Fedora End Of Life 2015-05-29 11:53:27 UTC
This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 20 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged  change the 'version' to a later Fedora 
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 11 Fedora End Of Life 2015-06-29 20:42:37 UTC
Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.