Description of problem:
systemd is a highly asynchronous, event-driven daemon. However, some operations in systemd are synchronous. A very important one is establishing a connection to the system bus: authentication, name registration, and match registration are all synchronous. If dbus-daemon, in response to our connection attempt, ends up waiting on systemd, we have a deadlock. The system should eventually recover (usually after a 25s timeout on the systemd side), but systemd is then not connected to the bus, which makes many of its APIs unavailable.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
There are no clear steps to reproduce, but setups running RHEL-7.4 where Network Information Service (NIS) is used are very susceptible to this issue. This is because in RHEL-7.4 rpcbind was made socket-activatable via the TCP socket used by glibc's NIS module.
The deadlock then looks as follows:
Note that rpcbind is supposed to be started due to socket activation, but systemd can't start it since it is blocked on DBus.
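The cycle can be sketched as a small wait-for graph. This is only a toy model: the node names and edges below are assumptions reconstructed from this report (systemd blocked on the bus connection, dbus-daemon calling into NSS, the NIS module connecting to the rpcbind socket, socket activation needing systemd), and `find_cycle` is a hypothetical helper, not anything from systemd.

```python
# Assumed wait-for edges, reconstructed from the report text (not from systemd code).
WAITS_ON = {
    "systemd (PID 1)": "dbus-daemon",             # blocked in synchronous bus connection setup
    "dbus-daemon": "glibc NIS module (NSS)",      # NSS lookup while handling the connection
    "glibc NIS module (NSS)": "rpcbind.socket",   # connects to rpcbind's TCP socket
    "rpcbind.socket": "systemd (PID 1)",          # socket activation needs systemd to start rpcbind
}


def find_cycle(waits_on, start):
    """Follow the wait-for edges from `start`; return the cycle as a list, or None."""
    seen = []
    node = start
    while node is not None and node not in seen:
        seen.append(node)
        node = waits_on.get(node)
    if node is None:
        return None  # the chain ended, so nobody is waiting in a circle
    return seen[seen.index(node):] + [node]


cycle = find_cycle(WAITS_ON, "systemd (PID 1)")
print(" -> ".join(cycle))
```

Every participant is waiting on the next one, and the chain leads back to systemd, so nothing makes progress until one of the waits times out.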
Actual results:
Boot timeout of 25 seconds. After boot-up, the service org.freedesktop.systemd1 is not available on the system bus.
Expected results:
No timeout. systemd is properly connected to D-Bus.
Additional info:
If the D-Bus connection code were async, we could do some work (e.g. activate a service due to socket traffic) while waiting for a response from dbus-daemon.
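A toy illustration of the difference, assuming nothing about systemd's real bus code: two socket pairs stand in for the bus connection and for a socket-activated service, and `fake_bus_daemon` and `connect` are hypothetical names. The peer cannot answer the handshake until the "activation" socket has been serviced, and the 0.3 s timeout stands in for the 25 s one.

```python
import selectors
import socket
import threading


def fake_bus_daemon(bus_peer, act_peer):
    """Toy stand-in for dbus-daemon (hypothetical, not real D-Bus traffic)."""
    try:
        bus_peer.recv(16)            # connection attempt arrives from the manager
        act_peer.sendall(b"lookup")  # handling it touches a socket the manager owns
        act_peer.recv(16)            # blocked until the manager services that socket
        bus_peer.sendall(b"OK")      # only now can the handshake be answered
    except OSError:
        pass                         # peer timed out: the manager never serviced us


def connect(asynchronous, timeout=0.3):
    """Return True if the toy handshake completes; `timeout` stands in for 25 s."""
    bus, bus_peer = socket.socketpair()
    act, act_peer = socket.socketpair()
    for sock in (bus_peer, act_peer):
        sock.settimeout(2 * timeout)
    daemon = threading.Thread(target=fake_bus_daemon, args=(bus_peer, act_peer))
    daemon.start()

    bus.sendall(b"AUTH")
    connected = False
    if not asynchronous:
        # Synchronous: block on the bus reply alone and ignore everything else.
        bus.settimeout(timeout)
        try:
            connected = bus.recv(16) == b"OK"
        except OSError:
            connected = False
    else:
        # Asynchronous: keep dispatching other events while the reply is pending.
        sel = selectors.DefaultSelector()
        sel.register(bus, selectors.EVENT_READ, "bus")
        sel.register(act, selectors.EVENT_READ, "activation")
        pending = True
        while pending:
            events = sel.select(timeout)
            if not events:
                break                        # nothing happened before the deadline
            for key, _mask in events:
                if key.data == "activation":
                    act.recv(16)
                    act.sendall(b"started")  # "start the service", unblocking the peer
                else:
                    connected = bus.recv(16) == b"OK"
                    pending = False
        sel.close()

    daemon.join()
    for sock in (bus, bus_peer, act, act_peer):
        sock.close()
    return connected


print("sync: ", connect(asynchronous=False))
print("async:", connect(asynchronous=True))
```

In the synchronous variant the caller waits only on the bus reply, so the handshake times out. In the asynchronous variant the same loop also services the activation socket, so the peer gets unblocked and the handshake completes.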
My take on this is that this is going to make things super complex to reason about and even harder to debug than they are now. I would suggest first to try thinking about why the dependency cycle is there and maybe redesigning things a bit. Just a thought.
(In reply to Jan Synacek from comment #3)
> My take on this is that this is going to make things super complex to reason
> about and even harder to debug than they are now. I would suggest first to
> try thinking about why the dependency cycle is there and maybe redesigning
> things a bit. Just a thought.
Are you suggesting that we should push dbus down to the kernel level ;) ?
On a more serious note, we've already discussed this w/ Lennart on multiple occasions, and making that code async seems to be the only viable solution.
The problem here is that while systemd deliberately avoids NSS stack calls, dbus-daemon doesn't (for valid reasons, btw). And since we control neither the NSS modules users might use nor the /etc/nsswitch.conf configuration, we can't do much about the deadlock. IOW, we can't guarantee that those modules don't wait on systemd while called from dbus-daemon, on which systemd is blocked. Unless we bite the bullet and move to async code for connection establishment.
This one still needs a solution, but we can't deliver it in the 7.6 timeframe.
This is not fully fixed upstream yet, so it will not be fixed in the 7.7 timeframe.
11 months and still unfixed. It's a shame. This is exactly why systemd should not have been pushed as default until it is working as it should. Together with the various bugs alone in the installer (limited swap space, luks fail), one could say, that RH really have had better times in terms of reliability. I really hope the release 8 will be stable again.
even 15 months.
Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.
And you can't just make the workaround given here the default? https://access.redhat.com/discussions/3536621
I mean here: https://access.redhat.com/solutions/3900301
FWIW, this issue becomes an order of magnitude worse when running multiple systems as VMs and rebooting them en masse.
I am converting a complex, unmanaged, and unmaintained VM mess to be Ansible installed and managed. Currently I have deployed about 60 test VMs. If I do updates with Ansible and it needs to reboot a bunch of these systems at the same time (remember they are VMs so they are really running on some shared hardware, which can impact scheduling) several of them will come up in the broken polkit/systemd state (about 5 of 30 in a few trials). Similarly, if I use Ansible to clone multiple new VMs from a template and turn them on, which it seems to do five at a time, some will come up in the broken polkit/systemd state.
This isn't a configuration issue, it seems like a timing issue.
Going through and manually rebooting the broken systems one by one brings them back. This is a giant pain in the rear when managing a large farm of VMs.
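To spot affected machines without logging in to each one, a check along these lines might help. It is only a sketch, assuming Python 3.7+ and the `dbus-send` tool on the target; it asks the bus daemon itself (via the standard `org.freedesktop.DBus.NameHasOwner` method) whether anyone owns `org.freedesktop.systemd1`. The helper names `reply_says_owned` and `systemd_on_bus` are made up for illustration.

```python
import subprocess

# Ask dbus-daemon whether the name org.freedesktop.systemd1 currently has an owner.
CMD = [
    "dbus-send", "--system", "--print-reply",
    "--dest=org.freedesktop.DBus", "/org/freedesktop/DBus",
    "org.freedesktop.DBus.NameHasOwner", "string:org.freedesktop.systemd1",
]


def reply_says_owned(reply_text):
    """Return True if the dbus-send reply contains a 'boolean true' line."""
    return any(line.strip() == "boolean true" for line in reply_text.splitlines())


def systemd_on_bus(timeout=10):
    """Return True if systemd owns its bus name; any failure counts as 'not connected'."""
    try:
        result = subprocess.run(
            CMD, capture_output=True, text=True, timeout=timeout, check=True
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return reply_says_owned(result.stdout)
```

Calling `systemd_on_bus()` on each freshly booted VM would give a list of the ones that need another reboot.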
Running up to date CentOS 7.7 (minimal installs plus my own list of packages, no desktop env) with systemd 67.el7_7.3.
We're seeing this on RHEL 7.8. The workaround presented in https://access.redhat.com/solutions/3900301 does not apply anymore as there ARE NO BindIPv6Only,ListenStream,ListenDatagram,ListenStream,ListenDatagram lines even in /usr/lib/systemd/system/rpcbind.socket in 7.8. There is no NIS or NIS+ mentioned in /etc/nsswitch.conf as the KB article alludes to.
It is still happening on RHEL7.8 as well. Any fix for this?
Ours happened (reported above) EXACTLY right after this:
Jun 24 05:36:45 Updated: systemd-libs-219-73.el7_8.8.x86_64
Jun 24 05:37:52 Updated: systemd-219-73.el7_8.8.x86_64
Jun 24 05:37:54 Updated: systemd-sysv-219-73.el7_8.8.x86_64
Jun 24 05:38:37 Updated: systemd-python-219-73.el7_8.8.x86_64
Jun 24 05:38:37 Updated: systemd-devel-219-73.el7_8.8.x86_64
Jun 24 05:39:21 Updated: systemd-libs-219-73.el7_8.8.i686
[m26560@neon ~]$ sudo grep freedesktop /var/log/messages
Jun 24 04:00:46 neon dbus: [system] Successfully activated service 'org.freedesktop.hostname1'
Jun 24 04:02:12 neon dbus: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 24 04:02:12 neon dbus: [system] Successfully activated service 'org.freedesktop.problems'
Jun 24 05:02:18 neon dbus: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
Jun 24 05:02:18 neon dbus: [system] Successfully activated service 'org.freedesktop.hostname1'
Jun 24 05:04:45 neon dbus: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 24 05:04:45 neon dbus: [system] Successfully activated service 'org.freedesktop.problems'
Jun 24 05:52:23 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:52:48 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:13 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:38 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:54:03 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
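The failures above repeat at a fixed interval, which a short script can confirm. This is a sketch only: `timeout_gaps` is a hypothetical helper, and the year is an assumption because syslog timestamps do not carry one.

```python
from datetime import datetime

LOG = """\
Jun 24 05:52:23 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:52:48 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:13 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:38 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:54:03 neon dbus: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
"""

MARKER = "Failed to activate service 'org.freedesktop.systemd1': timed out"


def timeout_gaps(text, year=2020):
    """Return the seconds between consecutive activation timeouts in a syslog excerpt."""
    stamps = []
    for line in text.splitlines():
        if MARKER in line:
            # The first 15 characters are the syslog timestamp, e.g. "Jun 24 05:52:23".
            stamps.append(datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"))
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]


print(timeout_gaps(LOG))  # [25, 25, 25, 25]
```

Each retry is 25 seconds after the previous one, the same figure as the timeout mentioned in the original description.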
As stated by Jeff Blaine (2020-06-24 19:02:01 UTC), the Red Hat fix is not applicable. I have just upgraded a CentOS 7.8 server, and it failed to come back after reboot.
It bounces back into emergency mode.
The error message listed is: [Authorization not available. Check if polkit service is running or see debug message for more information.]
polkitd reports that it [lost the name org.freedesktop.PolicyKit1].
I would try to give more info, but as this is emergency mode, and no services seem to have permission to do anything, there are no logs.
Removed all entries in nsswitch.conf regarding NIS.
Tried the RHEL fix, but it is not relevant in CentOS 7.8.
Tried booting from a different kernel, but the same error persists.
The same upgrade was completed on 70 other similar machines with no issues.
Checking /usr/share/polkit-1/ on both machines shows identical files.
Same for /etc/polkit as well.
RPMs installed on the machines are:
Any help to understand why this is happening would be great. I can rebuild this machine, but since this patch run was done on UAT, I am not going forward with Live until I know how I can fix this issue.
Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2, and they will now be closed.
From the RHEL life cycle page:
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7, Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."
If this BZ was closed in error and meets the above criteria, please re-open it, set the flag for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.
Please reach out to the applicable Product Experience Engineer if you have any questions or concerns.