Bug 1531486

Summary:	Make connection to dbus asynchronous
Product:	Red Hat Enterprise Linux 7	Reporter:	Michal Sekletar <msekleta>
Component:	systemd	Assignee:	systemd-maint
Status:	CLOSED WONTFIX	QA Contact:	qe-baseos-daemons
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	7.4	CC:	akaiser, amarirom, andreas.luik, apmukher, arjan.oosting, ayadav, bschubert, carsten.grohmann, cbesson, cwarfiel, jblaine, jmagrini, jreuter, kwalker, mbliss, mmezynsk, msekleta, mssmurthy.tech, neo, nermolov1, pdwyer, qguo, rmetrich, robert.weaver, sbroz, steved, swhiteho, systemd-maint-list, systemd-maint, tcrider, yoyang
Target Milestone:	rc	Keywords:	Reopened
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-11-11 21:54:37 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1530721
Bug Blocks:	1551061, 1643104, 1716963, 1719445

Description Michal Sekletar 2018-01-05 09:54:42 UTC

Description of problem:

systemd is a highly asynchronous, event-driven daemon. However, some of its operations are synchronous. A very important one is establishing a connection to the system bus: authentication, name registration, and match registration are all synchronous. If dbus-daemon, in response to our connection attempt, ends up waiting on systemd, we have a deadlock. The system should eventually recover (usually after a 25s timeout on the systemd side), but systemd is then not connected to the bus, which makes many of its APIs unavailable.

Version-Release number of selected component (if applicable):
systemd-219-42.el7_4.4

How reproducible:
sometimes

Steps to Reproduce:
There are no clear steps to reproduce, but setups running RHEL-7.4 where Network Information Service (NIS) is used are very susceptible to this issue. This is because in RHEL-7.4 rpcbind was made socket activatable via the TCP socket used by glibc's NIS module.

The deadlock then looks as follows:

systemd->dbus->nis(glibc)->rpcbind->systemd

Note that rpcbind is supposed to be started due to socket activation, but systemd can't start it since it is blocked on DBus.

Actual results:
Boot timeout of 25 seconds. After boot up, service org.freedesktop.systemd1 is not available on the system bus.

Expected results:
No timeout. Systemd is properly connected to dbus.

Additional info:
With the dbus connection code made async, we could do other work (e.g. activate a service due to socket traffic) while we are waiting for a response from dbus-daemon.
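
The cycle can be modelled in a few lines. The following is only a toy sketch in Python asyncio, not systemd code: the names bus_daemon, serve_activation, connect_blocking and connect_async are made up for illustration, the blocking case is modelled by simply never running the activation handler, and the 25s timeout is scaled down to 50 ms.

```python
import asyncio


async def bus_daemon(requests, rpcbind_started):
    # Stand-in for dbus-daemon handling our connection attempt: its NSS
    # lookup needs rpcbind, which only the manager can socket-activate.
    await requests.put("rpcbind")
    await rpcbind_started.wait()
    return "connected"


async def serve_activation(requests, rpcbind_started):
    # Stand-in for the manager's event loop reacting to socket traffic.
    while True:
        await requests.get()       # an activation request arrives
        rpcbind_started.set()      # pretend the unit is now running


async def connect_blocking(timeout):
    # Synchronous connect (assumption of this model): nothing services
    # activation requests while we wait, so only the timeout ends the wait.
    requests, started = asyncio.Queue(), asyncio.Event()
    try:
        return await asyncio.wait_for(bus_daemon(requests, started), timeout)
    except asyncio.TimeoutError:
        return "timed out"


async def connect_async(timeout):
    # Asynchronous connect: activation requests keep being served while
    # the connection attempt is pending.
    requests, started = asyncio.Queue(), asyncio.Event()
    server = asyncio.create_task(serve_activation(requests, started))
    try:
        return await asyncio.wait_for(bus_daemon(requests, started), timeout)
    except asyncio.TimeoutError:
        return "timed out"
    finally:
        server.cancel()


if __name__ == "__main__":
    print(asyncio.run(connect_blocking(0.05)))  # timed out
    print(asyncio.run(connect_async(0.05)))     # connected
```

With the handler running concurrently, the simulated rpcbind gets started and the connection completes; without it, the only way out is the timeout.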

Comment 2 Jan Synacek 2018-01-05 14:38:24 UTC

My take on this is that this is going to make things super complex to reason about and even harder to debug than they are now. I would suggest first to try thinking about why the dependency cycle is there and maybe redesigning things a bit. Just a thought.

Comment 5 Michal Sekletar 2018-01-05 15:23:08 UTC

(In reply to Jan Synacek from comment #3)
> My take on this is that this is going to make things super complex to reason
> about and even harder to debug than they are now. I would suggest first to
> try thinking about why the dependency cycle is there and maybe redesigning
> things a bit. Just a thought.

Are you suggesting that we should push dbus down to the kernel level ;) ? 

On a more serious note. We've already discussed this w/ Lennart on multiple occasions and making that code async seems to be the only viable solution. 

The problem here is that while systemd deliberately avoids NSS stack calls, dbus-daemon doesn't (for valid reasons btw). And since we control neither the random NSS modules users might use nor the /etc/nsswitch.conf configuration, we can't do much about the deadlock. IOW, we can't guarantee that those modules don't wait on systemd while called from dbus-daemon, on which systemd is blocked. Unless we bite the bullet and move towards async code for connection establishment.

Comment 10 Lukáš Nykrýn 2018-06-21 10:44:37 UTC

This one still needs a solution, but we can't deliver it in 7.6 timeframe.

Comment 16 Lukáš Nykrýn 2019-03-04 14:13:03 UTC

This is not fully fixed in upstream yet, so it will not be fixed in 7.7 timeframe.

Comment 18 Tom Stocker 2019-04-05 17:16:59 UTC

11 months and still unfixed. It's a shame. This is exactly why systemd should not have been pushed as default until it is working as it should. Together with the various bugs alone in the installer (limited swap space, luks fail), one could say, that RH really have had better times in terms of reliability. I really hope the release 8 will be stable again.

Comment 19 Tom Stocker 2019-04-05 17:18:43 UTC

even 15 months.

Comment 22 RHEL Program Management 2019-08-14 08:46:10 UTC

Development Management has reviewed and declined this request. You may appeal this decision by reopening this request.

Comment 24 Bernd Schubert 2020-01-14 14:09:13 UTC

And you can't just make the workaround given here the default? https://access.redhat.com/discussions/3536621

Comment 25 Bernd Schubert 2020-01-14 14:11:57 UTC

I mean here: https://access.redhat.com/solutions/3900301

Comment 27 Jim Reuter 2020-02-12 19:30:37 UTC

FWIW, this issue becomes an order of magnitude worse when running multiple systems as VMs and rebooting them en masse.

I am converting a complex, unmanaged, and unmaintained VM mess to be Ansible installed and managed.  Currently I have deployed about 60 test VMs.  If I do updates with Ansible and it needs to reboot a bunch of these systems at the same time (remember they are VMs so they are really running on some shared hardware, which can impact scheduling) several of them will come up in the broken polkit/systemd state (about 5 of 30 in a few trials).  Similarly, if I use Ansible to clone multiple new VMs from a template and turn them on, which it seems to do five at a time, some will come up in the broken polkit/systemd state.

This isn't a configuration issue, it seems like a timing issue.

Going through and manually rebooting the broken systems one by one brings them back. This is a giant pain in the rear when managing a large farm of VMs.

Running up to date CentOS 7.7 (minimal installs plus my own list of packages, no desktop env) with systemd 67.el7_7.3.

Comment 30 Jeff Blaine 2020-06-24 19:02:01 UTC

We're seeing this on RHEL 7.8. The workaround presented in https://access.redhat.com/solutions/3900301 does not apply anymore as there ARE NO BindIPv6Only,ListenStream,ListenDatagram,ListenStream,ListenDatagram lines even in /usr/lib/systemd/system/rpcbind.socket in 7.8. There is no NIS or NIS+ mentioned in /etc/nsswitch.conf as the KB article alludes to.

Comment 31 Sadashiva Murthy M 2020-06-25 09:07:37 UTC

It is still happening on RHEL7.8 as well. Any fix for this?

Comment 32 Jeff Blaine 2020-06-25 13:24:12 UTC

Ours happened (reported above) EXACTLY right after this:

Jun 24 05:36:45 Updated: systemd-libs-219-73.el7_8.8.x86_64
Jun 24 05:37:52 Updated: systemd-219-73.el7_8.8.x86_64
Jun 24 05:37:54 Updated: systemd-sysv-219-73.el7_8.8.x86_64
Jun 24 05:38:37 Updated: systemd-python-219-73.el7_8.8.x86_64
Jun 24 05:38:37 Updated: systemd-devel-219-73.el7_8.8.x86_64
Jun 24 05:39:21 Updated: systemd-libs-219-73.el7_8.8.i686

[m26560@neon ~]$ sudo grep freedesktop /var/log/messages
Jun 24 04:00:46 neon dbus[1195]: [system] Successfully activated service 'org.freedesktop.hostname1'
Jun 24 04:02:12 neon dbus[1195]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 24 04:02:12 neon dbus[1195]: [system] Successfully activated service 'org.freedesktop.problems'
Jun 24 05:02:18 neon dbus[1195]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service'
Jun 24 05:02:18 neon dbus[1195]: [system] Successfully activated service 'org.freedesktop.hostname1'
Jun 24 05:04:45 neon dbus[1195]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Jun 24 05:04:45 neon dbus[1195]: [system] Successfully activated service 'org.freedesktop.problems'
Jun 24 05:52:23 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:52:48 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:13 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:38 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:54:03 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
...
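
The spacing of the failures above can be checked with a short script. This is a sketch only: timeout_gaps is a hypothetical helper, the sample lines are copied from the log above, and the year 2020 is an assumption because syslog timestamps do not carry one. It prints the number of seconds between consecutive activation timeouts, which can be compared with the 25s timeout from the description.

```python
from datetime import datetime

# Sample lines copied from the /var/log/messages excerpt above.
LOG = """\
Jun 24 05:52:23 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:52:48 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:13 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:53:38 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
Jun 24 05:54:03 neon dbus[1195]: [system] Failed to activate service 'org.freedesktop.systemd1': timed out
"""


def timeout_gaps(log_text, service="org.freedesktop.systemd1", year=2020):
    """Return the seconds between consecutive activation timeouts."""
    needle = f"Failed to activate service '{service}': timed out"
    stamps = [
        # The first 15 characters are the syslog timestamp, e.g. "Jun 24 05:52:23".
        datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        for line in log_text.splitlines()
        if needle in line
    ]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    print(timeout_gaps(LOG))  # [25, 25, 25, 25]
```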

Comment 33 robert 2020-10-03 23:17:22 UTC

Hi,

As stated by Jeff Blaine (2020-06-24 19:02:01 UTC), the Red Hat fix is not applicable. I have just upgraded a CentOS 7.8 server, and it failed to come back after reboot.
It bounces back into emergency mode.
The error message shown is: [Authorization not available. Check if polkit service is running or see debug message for more information.]

polkitd reports that it [lost the name org.freedesktop.PolicyKit1].

I would try to give more info, but since this is emergency mode and no services seem to have permission to do anything, there are no logs.

Things tried:
   Removed all entries in nsswitch.conf regarding NIS.
   Tried the RHEL fix [3900301], but it is not relevant on CentOS 7.8.
   Tried booting from a different kernel, but the same error persists.

The same upgrade was completed on 70 other similar machines with no issues.

Checking /usr/share/polkit-1/ on both machines shows identical files.
The same goes for /etc/polkit.

RPMs installed on the machines:
   
 polkit-pkla-compat-0.1-4.el7.x86_64
 polkit-0.112-26.el7.x86_64

Any help understanding why this is happening would be great. I can rebuild this machine, but since this patch run was done on UAT, I am not going forward with live until I know how to fix this issue.

Help?

Thanks

Rob.

Comment 35 Chris Williams 2020-11-11 21:54:37 UTC

Red Hat Enterprise Linux 7 shipped its final minor release on September 29th, 2020. 7.9 was the last minor release scheduled for RHEL 7.
From initial triage it does not appear the remaining Bugzillas meet the inclusion criteria for Maintenance Phase 2 and will now be closed.

From the RHEL life cycle page:
https://access.redhat.com/support/policy/updates/errata#Maintenance_Support_2_Phase
"During Maintenance Support 2 Phase for Red Hat Enterprise Linux version 7,Red Hat defined Critical and Important impact Security Advisories (RHSAs) and selected (at Red Hat discretion) Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available."

If this BZ was closed in error and meets the above criteria, please re-open it, flag it for 7.9.z, provide suitable business and technical justifications, and follow the process for Accelerated Fixes:
https://source.redhat.com/groups/public/pnt-cxno/pnt_customer_experience_and_operations_wiki/support_delivery_accelerated_fix_release_handbook  

Feature Requests can be re-opened and moved to RHEL 8 if the desired functionality is not already present in the product.

Please reach out to the applicable Product Experience Engineer[0] if you have any questions or concerns.  

[0] https://bugzilla.redhat.com/page.cgi?id=agile_component_mapping.html&product=Red+Hat+Enterprise+Linux+7

Comment 37 Plumber Bot 2022-01-21 15:39:01 UTC

Dropping the stale needinfo. If our input is still needed, please set the needinfo again.