|Summary:||Make connection to dbus asynchronous|
|Product:||Red Hat Enterprise Linux 7||Reporter:||Michal Sekletar <msekleta>|
|Status:||ASSIGNED ---||QA Contact:||qe-baseos-daemons|
|Version:||7.4||CC:||akaiser, amarirom, andreas.luik, apmukher, ayadav, bschubert, carsten.grohmann, cbesson, cwarfiel, jblaine, jmagrini, jreuter, jsynacek, kwalker, mbliss, mmezynsk, msekleta, mssmurthy.tech, neo, pdwyer, qguo, rmetrich, sbroz, steved, swhiteho, systemd-maint-list, systemd-maint, tcrider, yoyang|
|Fixed In Version:||Doc Type:||If docs needed, set a value|
|Doc Text:||Story Points:||---|
|Last Closed:||2019-08-14 08:46:10 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
|Bug Depends On:||1530721|
|Bug Blocks:||1716963, 1551061, 1643104, 1719445|
Description Michal Sekletar 2018-01-05 09:54:42 UTC
Description of problem: systemd is a highly asynchronous, event-driven daemon. However, there are operations in systemd which are synchronous. One very important one is establishing a connection to the system bus. In particular, authentication, name and match registrations are synchronous. In a case when dbus-daemon, in response to our connection attempt, ends up waiting on systemd we have a deadlock. The system should eventually recover (usually after a 25s timeout on systemd side) but systemd is then not connected to the bus which makes many of its APIs unavailable. Version-Release number of selected component (if applicable): systemd-219-42.el7_4.4 How reproducible: sometimes Steps to Reproduce: There are no clear steps to reproduce, but setups running RHEL-7.4, where Network Information Service (NIS) is used, are very susceptible to hit this issue. This is because in RHEL-7.4 rpcbind was made socket activatable via TCP socket used by glibc's NIS module. Deadlock then looks following, systemd->dbus->nis(glibc)->rpcbind->systemd Note that rpcbind is supposed to be started due to socket activation, but systemd can't start it since it is blocked on DBus. Actual results: Boot timeout of 25 seconds. After boot up, service org.freedesktop.systemd1 is not available on the system bus. Expected results: No timeout. Systemd is properly connected to dbus. Additional info: Having dbus connection code async, we could do some work (e.g. activate service due to socket traffic) while we are waiting for a response from dbus-daemon).
Comment 2 Jan Synacek 2018-01-05 14:38:24 UTC
My take on this is that this is going to make things super complex to reason about and even harder to debug than they are now. I would suggest first to try thinking about why the dependency cycle is there and maybe redesigning things a bit. Just a thought.
Comment 5 Michal Sekletar 2018-01-05 15:23:08 UTC
(In reply to Jan Synacek from comment #3) > My take on this is that this is going to make things super complex to reason > about and even harder to debug than they are now. I would suggest first to > try thinking about why the dependency cycle is there and maybe redesigning > things a bit. Just a thought. Are you suggesting that we should push dbus down to the kernel level ;) ? On a more serious note. We've already discussed this w/ Lennart on multiple occasions and making that code async seems to be the only viable solution. The problem here is that while systemd deliberately avoid NSS stack calls, dbus-daemon doesn't (for valid reasons btw). And since we don't control either random NSS modules users might use or /etc/nsswitch.conf configuration then we can't do much about the deadlock. IOW, we can't guarantee that those modules don't wait on systemd while called from dbus-daemon on which systemd is blocked. Unless we bite the bullet and do the move towards async code for connection establishment.
Comment 10 Lukáš Nykrýn 2018-06-21 10:44:37 UTC
This one still needs a solution, but we can't deliver it in 7.6 timeframe.
Comment 16 Lukáš Nykrýn 2019-03-04 14:13:03 UTC
This is not fully fixed in upstream yet, so it will not be fixed in 7.7 timeframe.
Comment 18 Tom Stocker 2019-04-05 17:16:59 UTC
11 months and still unfixed. It's a shame. This is exactly why systemd should not have been pushed as default until it is working as it should. Together with the various bugs alone in the installer (limited swap space, luks fail), one could say, that RH really have had better times in terms of reliability. I really hope the release 8 will be stable again.
Comment 19 Tom Stocker 2019-04-05 17:18:43 UTC
even 15 months.
Comment 22 RHEL Program Management 2019-08-14 08:46:10 UTC
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
Comment 24 Bernd Schubert 2020-01-14 14:09:13 UTC
And you can't just make the workaround given here the default? https://access.redhat.com/discussions/3536621
Comment 25 Bernd Schubert 2020-01-14 14:11:57 UTC
I mean here: https://access.redhat.com/solutions/3900301
Comment 27 Jim Reuter 2020-02-12 19:30:37 UTC
FWIW, this issue becomes an order of magnitude worse when running multiple systems as VMs and rebooting them en masse. I am converting a complex, unmanaged, and unmaintained VM mess to be Ansible installed and managed. Currently I have deployed about 60 test VMs. If I do updates with Ansible and it needs to reboot a bunch of these systems at the same time (remember they are VMs so they are really running on some shared hardware, which can impact scheduling) several of them will come up in the broken polkit/systemd state (about 5 of 30 in a few trials). Similarly, if I use Ansible to clone multiple new VMs from a template and turn them on, which it seems to do five at a time, some will come up in the broken polkit/systemd state. This isn't a configuration issue, it seems like a timing issue. Going through and manually rebooting the broken systems one by one brings them back. This is a giant pain in the rear when managing a large farm of VMs. Running up to date CentOS 7.7 (minimal installs plus my own list of packages, no desktop env) with systemd 67.el7_7.3.
Comment 30 Jeff Blaine 2020-06-24 19:02:01 UTC
We're seeing this on RHEL 7.8. The workaround presented in https://access.redhat.com/solutions/3900301 does not apply anymore as there ARE NO BindIPv6Only,ListenStream,ListenDatagram,ListenStream,ListenDatagram lines even in /usr/lib/systemd/system/rpcbind.socket in 7.8. There is no NIS or NIS+ mentioned in /etc/nsswitch.conf as the KB article alludes to.
Comment 31 Sadashiva Murthy M 2020-06-25 09:07:37 UTC
It is still happening on RHEL7.8 as well. Any fix for this?
Comment 32 Jeff Blaine 2020-06-25 13:24:12 UTC
Ours happened (reported above) EXACTLY right after this: Jun 24 05:36:45 Updated: systemd-libs-219-73.el7_8.8.x86_64 Jun 24 05:37:52 Updated: systemd-219-73.el7_8.8.x86_64 Jun 24 05:37:54 Updated: systemd-sysv-219-73.el7_8.8.x86_64 Jun 24 05:38:37 Updated: systemd-python-219-73.el7_8.8.x86_64 Jun 24 05:38:37 Updated: systemd-devel-219-73.el7_8.8.x86_64 Jun 24 05:39:21 Updated: systemd-libs-219-73.el7_8.8.i686 [m26560@neon ~]$ sudo grep freedesktop /var/log/messages Jun 24 04:00:46 neon dbus: [system] Successfully activated service 'org.freedesktop.hostname1' Jun 24 04:02:12 neon dbus: [system] Activating service name='org.freedesktop.problems' (using servicehelper) Jun 24 04:02:12 neon dbus: [system] Successfully activated service 'org.freedesktop.problems' Jun 24 05:02:18 neon dbus: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' Jun 24 05:02:18 neon dbus: [system] Successfully activated service 'org.freedesktop.hostname1' Jun 24 05:04:45 neon dbus: [system] Activating service name='org.freedesktop.problems' (using servicehelper) Jun 24 05:04:45 neon dbus: [system] Successfully activated service 'org.freedesktop.problems' Jun 24 05:52:23 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out Jun 24 05:52:48 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out Jun 24 05:53:13 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out Jun 24 05:53:38 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out Jun 24 05:54:03 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out ...