Bug 1531486 - Make connection to dbus asynchronous [NEEDINFO]
Summary: Make connection to dbus asynchronous
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: systemd
Version: 7.4
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: rc
: ---
Assignee: systemd-maint
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Depends On: 1530721
Blocks: 1716963 1551061 1643104 1719445
TreeView+ depends on / blocked
 
Reported: 2018-01-05 09:54 UTC by Michal Sekletar
Modified: 2020-08-07 15:36 UTC (History)
29 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-14 08:46:10 UTC
Target Upstream Version:
mbliss: needinfo? (systemd-maint)


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1654779 None None None 2019-12-09 02:40:22 UTC
Red Hat Bugzilla 1698969 None None None 2019-12-09 02:40:22 UTC
Red Hat Knowledge Base (Solution) 3900301 None None None 2019-02-11 15:19:54 UTC

Internal Links: 1654779

Description Michal Sekletar 2018-01-05 09:54:42 UTC
Description of problem:

systemd is a highly asynchronous, event-driven daemon. However, there are operations in systemd which are synchronous. One very important one is establishing a connection to the system bus. In particular, authentication, name and match registrations are synchronous. In a case when dbus-daemon, in response to our connection attempt, ends up waiting on systemd we have a deadlock. The system should eventually recover (usually after a 25s timeout on systemd side) but systemd is then not connected to the bus which makes many of its APIs unavailable.

Version-Release number of selected component (if applicable):
systemd-219-42.el7_4.4

How reproducible:
sometimes

Steps to Reproduce:
There are no clear steps to reproduce, but setups running RHEL-7.4, where Network Information Service (NIS) is used, are very susceptible to hit this issue. This is because in RHEL-7.4 rpcbind was made socket activatable via TCP socket used by glibc's NIS module.

Deadlock then looks following,

systemd->dbus->nis(glibc)->rpcbind->systemd 

Note that rpcbind is supposed to be started due to socket activation, but systemd can't start it since it is blocked on DBus.

Actual results:
Boot timeout of 25 seconds. After boot up, service org.freedesktop.systemd1 is not available on the system bus. 

Expected results:
No timeout. Systemd is properly connected to dbus.

Additional info:
Having dbus connection code async, we could do some work (e.g. activate service due to socket traffic) while we are waiting for a response from dbus-daemon).

Comment 2 Jan Synacek 2018-01-05 14:38:24 UTC
My take on this is that this is going to make things super complex to reason about and even harder to debug than they are now. I would suggest first to try thinking about why the dependency cycle is there and maybe redesigning things a bit. Just a thought.

Comment 5 Michal Sekletar 2018-01-05 15:23:08 UTC
(In reply to Jan Synacek from comment #3)
> My take on this is that this is going to make things super complex to reason
> about and even harder to debug than they are now. I would suggest first to
> try thinking about why the dependency cycle is there and maybe redesigning
> things a bit. Just a thought.

Are you suggesting that we should push dbus down to the kernel level ;) ? 

On a more serious note. We've already discussed this w/ Lennart on multiple occasions and making that code async seems to be the only viable solution. 

The problem here is that while systemd deliberately avoid NSS stack calls, dbus-daemon doesn't (for valid reasons btw). And since we don't control either random NSS modules users might use or /etc/nsswitch.conf configuration then we can't do much about the deadlock. IOW, we can't guarantee that those modules don't wait on systemd while called from dbus-daemon on which systemd is blocked. Unless we bite the bullet and do the move towards async code for connection establishment.

Comment 10 Lukáš Nykrýn 2018-06-21 10:44:37 UTC
This one still needs a solution, but we can't deliver it in 7.6 timeframe.

Comment 16 Lukáš Nykrýn 2019-03-04 14:13:03 UTC
This is not fully fixed in upstream yet, so it will not be fixed in 7.7 timeframe.

Comment 18 Tom Stocker 2019-04-05 17:16:59 UTC
11 months and still unfixed. It's a shame. This is exactly why systemd should not have been pushed as default until it is working as it should. Together with the various bugs alone in the installer (limited swap space, luks fail), one could say, that RH really have had better times in terms of reliability. I really hope the release 8 will be stable again.

Comment 19 Tom Stocker 2019-04-05 17:18:43 UTC
even 15 months.

Comment 22 RHEL Program Management 2019-08-14 08:46:10 UTC
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Comment 24 Bernd Schubert 2020-01-14 14:09:13 UTC
And you can't just make the workaround given here the default? https://access.redhat.com/discussions/3536621

Comment 25 Bernd Schubert 2020-01-14 14:11:57 UTC
I mean here: https://access.redhat.com/solutions/3900301

Comment 27 Jim Reuter 2020-02-12 19:30:37 UTC
FWIW, this issue becomes an order of magnitude worse when running multiple systems as VMs and rebooting them en masse.

I am converting a complex, unmanaged, and unmaintained VM mess to be Ansible installed and managed.  Currently I have deployed about 60 test VMs.  If I do updates with Ansible and it needs to reboot a bunch of these systems at the same time (remember they are VMs so they are really running on some shared hardware, which can impact scheduling) several of them will come up in the broken polkit/systemd state (about 5 of 30 in a few trials).  Similarly, if I use Ansible to clone multiple new VMs from a template and turn them on, which it seems to do five at a time, some will come up in the broken polkit/systemd state.

This isn't a configuration issue, it seems like a timing issue.

Going through and manually rebooting the broken systems one by one brings them back. This is a giant pain in the rear when managing a large farm of VMs.

Running up to date CentOS 7.7 (minimal installs plus my own list of packages, no desktop env) with systemd 67.el7_7.3.

Comment 30 Jeff Blaine 2020-06-24 19:02:01 UTC
We're seeing this on RHEL 7.8. The workaround presented in https://access.redhat.com/solutions/3900301 does not apply anymore as there ARE NO BindIPv6Only,ListenStream,ListenDatagram,ListenStream,ListenDatagram lines even in /usr/lib/systemd/system/rpcbind.socket in 7.8. There is no NIS or NIS+ mentioned in /etc/nsswitch.conf as the KB article alludes to.

Comment 31 Sadashiva Murthy M 2020-06-25 09:07:37 UTC
It is still happening on RHEL7.8 as well. Any fix for this?

Comment 32 Jeff Blaine 2020-06-25 13:24:12 UTC
Ours happened (reported above) EXACTLY right after this:

Jun 24 05:36:45 Updated: systemd-libs-219-73.el7_8.8.x86_64
Jun 24 05:37:52 Updated: systemd-219-73.el7_8.8.x86_64
Jun 24 05:37:54 Updated: systemd-sysv-219-73.el7_8.8.x86_64
Jun 24 05:38:37 Updated: systemd-python-219-73.el7_8.8.x86_64
Jun 24 05:38:37 Updated: systemd-devel-219-73.el7_8.8.x86_64
Jun 24 05:39:21 Updated: systemd-libs-219-73.el7_8.8.i686

[m26560@neon ~]$ sudo grep freedesktop /var/log/messages
Jun 24 04:00:46 neon dbus[1195]: [system] Successfully activated service 'org.freedesktop.hostname1'
Jun 24 04:02:12 neon dbus[1195]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 24 04:02:12 neon dbus[1195]: [system] Successfully activated service 'org.freedesktop.problems'
Jun 24 05:02:18 neon dbus[1195]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
Jun 24 05:02:18 neon dbus[1195]: [system] Successfully activated service 'org.freedesktop.hostname1'
Jun 24 05:04:45 neon dbus[1195]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 24 05:04:45 neon dbus[1195]: [system] Successfully activated service 'org.freedesktop.problems'
Jun 24 05:52:23 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:52:48 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:13 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:38 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:54:03 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
...


Note You need to log in before you can comment on or make changes to this bug.