Bug 1187695

Summary: systemd "Looping too fast" after reexec with broken dbus socket
Product: [Fedora] Fedora
Component: systemd
Version: 23
Status: CLOSED EOL
Severity: unspecified
Priority: unspecified
Reporter: Tomáš Bžatek <tbzatek>
Assignee: Michal Schmidt <mschmidt>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: johannbg, jsynacek, lnykryn, lsu, mschmidt, msekleta, s, systemd-maint, trevor, tsmetana, vpavlin, zbyszek
Keywords: Reopened
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-12-20 13:11:27 UTC
Attachments:
  strace -p 1 fragment (flags: none)
  journalctl -b (flags: none)

Description Tomáš Bžatek 2015-01-30 17:04:47 UTC
Created attachment 986084 [details]
strace -p 1 fragment

Description of problem:
This is the first time I've seen this problem; the system was working properly before. I ran a routine "dnf update" on my rawhide system, with about 700 updates (~2 months since the last update). The update process eventually got stuck, and ps showed a stuck "firewall-cmd --reload --quiet", so I killed that with SIGTERM. The update process continued and got stuck again a few packages later. This time ps showed a stuck "systemctl condrestart xxxx.service". Killing that with SIGTERM did nothing, and killing it with SIGKILL did nothing either. It eventually completed after some time, and the update process continued until it got stuck on another "systemctl condrestart yyyyy.service"...

From that point on, any request to systemd, including "reboot", gets stuck.

Version-Release number of selected component (if applicable):
systemd-216-11.fc22.x86_64 (running)
systemd-218-3.fc22.x86_64 (freshly updated by dnf update)
kernel-3.18.0-0.rc0.git9.1.fc22.x86_64

Additional info:
This is a VM under KVM, 8 logical CPUs, x86_64, Haswell, started by "qemu-system-x86_64 -enable-kvm -cpu host -smp 8 -m 8G"

The symptoms match http://lists.freedesktop.org/archives/systemd-devel/2014-December/025867.html except for the memory usage, which is low. According to top, systemd uses between 10 and 50 % CPU.

Have a look at the attached "strace -p 1" fragment. PID 1 is reading from fds 4 and 14:
  File: ‘/proc/1/fd/4’ -> ‘anon_inode:[eventpoll]’
  File: ‘/proc/1/fd/14’ -> ‘anon_inode:[timerfd]’
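
For anyone who wants to repeat the fd lookup above, here is a minimal Python sketch that resolves the /proc/<pid>/fd symlinks. The helper name describe_fds is hypothetical (it is not part of systemd or any existing tool), it only works on Linux, and inspecting PID 1 normally requires root.

```python
import os


def describe_fds(pid, fds=None):
    """Map open file descriptors of `pid` to their /proc/<pid>/fd symlink targets.

    `fds` is an optional set of descriptor numbers to restrict the lookup to.
    Hypothetical helper for illustration only.
    """
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        if not name.isdigit():
            continue
        fd = int(name)
        if fds is not None and fd not in fds:
            continue
        try:
            # For anonymous inodes the target looks like "anon_inode:[eventpoll]".
            result[fd] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor may have been closed between listdir and readlink.
            continue
    return result
```

Calling describe_fds(1, fds={4, 14}) as root on the affected machine should report the same kind of targets as listed above (an eventpoll and a timerfd anonymous inode), assuming the descriptors are still open.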

Comment 1 Tomáš Bžatek 2015-01-30 17:05:51 UTC
Created attachment 986086 [details]
journalctl -b

Comment 2 Jan Synacek 2015-02-02 07:22:19 UTC

*** This bug has been marked as a duplicate of bug 1186018 ***

Comment 3 Jan Synacek 2015-02-02 07:24:02 UTC
I'm about 90% sure that this is a dup of bug 1186018. If killing "dbus-send" doesn't work around the issue, please reopen this bugzilla.

Comment 4 Michal Schmidt 2015-03-13 17:54:35 UTC
Unduplicating. Bug 1186018 is related, but it's not the same bug.

I can now reproduce it using these steps:
1. Lose the DBus system bus socket by following the steps from
   https://bugzilla.redhat.com/show_bug.cgi?id=1186018#c7
2. Tell systemd to reexec:
   kill -TERM 1

After 25 seconds I see the first error message:
   Failed to register match for Disconnected message: Connection timed out
This repeats a couple more times. Eventually the "Looping too fast" messages start appearing too.
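
To illustrate what a "Looping too fast" warning usually means, here is a minimal Python sketch of an event loop that throttles itself once it turns over more often than a rate limit allows. This is not systemd's actual code: the RateLimit class, the run_loop function, the thresholds and the one-second sleep are all assumptions made for illustration.

```python
import time


class RateLimit:
    """Allow at most `burst` events per `interval` seconds (fixed window)."""

    def __init__(self, interval, burst, clock=time.monotonic):
        self.interval = interval
        self.burst = burst
        self.clock = clock
        self.window_start = None
        self.count = 0

    def allow(self):
        now = self.clock()
        if self.window_start is None or now - self.window_start >= self.interval:
            # The previous window has expired (or never started): open a new one.
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False


def run_loop(wait_for_events, dispatch, limit, iterations,
             sleep=time.sleep, log=print):
    """Run `iterations` loop turns; warn and pause when turns exceed `limit`.

    Returns the number of times the loop had to throttle itself.
    """
    throttled = 0
    for _ in range(iterations):
        if not limit.allow():
            # An event source that is always "ready" (for example a broken
            # connection that keeps signalling) makes the loop spin.
            log("Looping too fast")
            sleep(1)
            throttled += 1
        dispatch(wait_for_events())
    return throttled
```

In this sketch the throttle only hides the symptom: if wait_for_events keeps returning immediately, the loop burns CPU between pauses, which would be consistent with the 10 to 50 % CPU usage reported above.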

(I have no reproducer or explanation for the unkillable systemctl observed by Tomáš. It should not be possible to cause that from userspace.)

Comment 5 Michal Sekletar 2015-03-17 17:04:38 UTC
It seems to me that this bug was caused by updating dbus (to a version that changed the ListenStream option in its unit file) and systemd in a single dnf transaction.

Also, I think that the patch for bug 1186018 should fix this one as well, because we would not end up with two sockets after a daemon reload in the first place.

It would be great if systemd could handle cases like this one more robustly than by entering a busy loop after reexec. However, with kdbus this whole class of issues should go away.

Comment 6 Michal Schmidt 2015-03-18 12:07:15 UTC
(In reply to Michal Sekletar from comment #5)
Michal, I agree with everything you wrote there. Still, I find the failure mode interesting enough to make me want to look into it further. And that's why I reopened this BZ. I'm assigning it to myself so you don't have to waste your time on it.

Comment 8 Jan Kurik 2015-07-15 14:34:57 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.

(As we did not run this process for some time, it could also affect bugs from development cycles
before Fedora 23. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23

Comment 9 Fedora End Of Life 2016-11-24 11:24:28 UTC
This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 23 reaches end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 10 Fedora End Of Life 2016-12-20 13:11:27 UTC
Fedora 23 changed to end-of-life (EOL) status on 2016-12-20. Fedora 23 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.