Description of problem:
systemd sends SIGKILL immediately after SIGTERM to cockpit child processes when shutting down or stopping the unit.
The children of a cockpit login session all get SIGKILL immediately after SIGTERM (less than a tenth of a second apart). cockpit-agent and cockpit-session take more than a tenth of a second to shut down cleanly.
The easiest way to reproduce this here is a system shutdown. Even the 'reboot' that started the system shutdown (executed via ssh) gets a SIGKILL before it can exit().
Here's some output from a simple systemtap probe which shows this:
You can see how a cockpit unit, and its login session scope looks here:
Version-Release number of selected component (if applicable):
Pretty easily reproducible using our cockpit integration test system.
Steps to Reproduce:
If necessary email@example.com would be willing to help someone duplicate the problem on their machine.
Actual results:
SIGKILL received immediately after SIGTERM
Expected results:
Wait for the timeout (default is 1min 30sec) before sending SIGKILL.
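To illustrate the difference between the two behaviours, here is a minimal Python sketch. It is not systemd code: the child script, the helper names `stop` and `run_and_stop`, and the 0.5 s cleanup time are all made up for illustration, and it assumes a POSIX system. The child needs about half a second to exit cleanly after SIGTERM; `grace` stands in for the stop timeout.

```python
import subprocess
import sys

# Hypothetical child: needs ~0.5 s of cleanup after SIGTERM before it can exit.
CHILD = r"""
import signal, sys, time

def on_term(signum, frame):
    time.sleep(0.5)   # stand-in for flushing state to disk
    sys.exit(0)       # orderly exit

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)
while True:
    time.sleep(1)
"""

def stop(proc, grace):
    """Send SIGTERM, wait up to `grace` seconds, then fall back to SIGKILL."""
    proc.terminate()                      # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL on POSIX
        return proc.wait()

def run_and_stop(grace):
    """Start the child, stop it with the given grace period, return its exit status."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    proc.stdout.readline()                # block until the SIGTERM handler is installed
    status = stop(proc, grace)
    proc.stdout.close()
    return status
```

With a grace period shorter than the cleanup time (for example `run_and_stop(0.01)`) the child is killed mid-cleanup and the status is the negative signal number; with a generous one (`run_and_stop(5.0)`) it exits on its own with status 0.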
This commit breaks cockpit orderly shutdown:
Author: Lennart Poettering <firstname.lastname@example.org>
Date: Fri Feb 7 16:12:09 2014 +0100
core: one step back again, for nspawn we actually can't wait for
cgroups running empty since systemd will get exactly zero
notifications about it
This commit was introduced in v209, and the problem is present in Fedora 21. Reverting the commit resolves the problem.
I've been noticing for some time that bits of my bash history are missing. I got really annoyed by it today and tried to track it down a bit.
I think it happens when you shut down / reboot with GNOME Terminal running; history from whatever sessions you have open is sometimes saved but sometimes not. I've done several test reboots with 'canary' commands (strings of random words) in the history buffer and checked if they were there on reboot. Each time I had two terminal tabs open. Sometimes the history from neither would be saved, sometimes the history from one but not the other, sometimes both.
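The canary test could be scripted roughly as follows. This is only a sketch: the word list, the four-word canary format and the function names are my own invention, and it assumes bash writes its history to the default `~/.bash_history`.

```python
import random
from pathlib import Path

# Hypothetical word list; any words will do as long as the resulting line is
# unlikely to appear in the history by accident.
WORDS = ["walrus", "granite", "teapot", "nebula", "cactus", "harbor", "violin", "pepper"]

def make_canary(rng=random):
    """Return a harmless command line to type into one terminal tab before rebooting."""
    return "echo " + " ".join(rng.sample(WORDS, 4))

def missing_canaries(history_text, canaries):
    """Return the canaries that do not appear as a line in the saved history."""
    lines = {line.strip() for line in history_text.splitlines()}
    return [c for c in canaries if c not in lines]

def check_history(canaries, path=None):
    """After the reboot: read the history file and report which canaries were lost."""
    if path is None:
        path = Path.home() / ".bash_history"   # assumed default HISTFILE
    return missing_canaries(Path(path).read_text(errors="replace"), canaries)
```

Type one canary per open tab, reboot, then call `check_history` with the list of canaries; an empty result means every tab's history was saved.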
I'm wondering if this bug could be affecting more than just cockpit. Shutdown sure seems to happen *fast*, which I've noticed people calling a 'feature', but maybe it's a bug - things are getting SIGKILLed too fast and not having a chance to shut down in an orderly way?
Some discussion upstream in this thread: http://lists.freedesktop.org/archives/systemd-devel/2014-October/024452.html
Yes, I've seen that, but I'm really not sure it justifies 'shutting down' by basically kill -9ing half the system! This is presumably why you quite often get the Firefox 'something went wrong' screen when you do a perfectly normal reboot, too?
I'm pretty certain this bug and bug 1183194 are one and the same.
I really think having to wait a full stop timeout because the empty cgroup notification was missed is a better alternative than immediately KILLing scope units when they're stopped. I'm surprised the latter behaviour hasn't caused more bug reports.
I would certainly appreciate having that commit reverted in Fedora's systemd package.
See also: https://bugs.launchpad.net/ubuntu/+source/mosh/+bug/1446982
FTR, in Debian and Ubuntu we also reverted 743970d to fix this. Immediately KILLing everything in the user session causes mayhem with bash, mosh, and presumably lots of other things that run in sessions.
What is the (or, is there a) drawback to reverting the commit?
Does reverting just mean shutting down is slower?
Or does reverting risk another type of data loss?
(In reply to jamespharvey20 from comment #8)
> What is the (or, is there a) drawback to reverting the commit?
> Does reverting just mean shutting down is slower?
> Or does reverting risk another type of data loss?
http://lists.freedesktop.org/archives/systemd-devel/2014-November/025734.html has the best description of the problem. But as my reply at http://lists.freedesktop.org/archives/systemd-devel/2014-November/025755.html indicates, I think it can be solved so it works sufficiently well in the majority of cases.
Anybody want to try bringing it up on the systemd mailing list for a fourth time? Having each and every distro work around the bug is silly.
Do you agree with the assessment from here, under "Regression Potential"?
>The original commit was applied because of an inherent race condition with cgroup's release_agent -- in rare corner cases an nspawn container (probably also LXC) can miss them. In that case it's possible that you instead get a 90s timeout on the unit that is shutting down. But this does not mean data loss, just a rare shutdown hang from containers (for the record, I never actually saw that hanging with LXC), so I think it's a good trade-off.
So is this issue ONLY in containers that have spawned a session, and a rare corner case as suggested?
If so, I think the data loss outweighs the possible race condition in a specific configuration.
This commit reverts it
What is the process with Fedora to move this bug along? Who has final say or has to sign off on it? Does it have to be fixed directly by systemd, as the primary maintainers, with what they package being incorporated directly into Fedora (as opposed to, say, Red Hat, which has its own fork)?
(In reply to sforsyt from comment #10)
> Do you agree with the assessment from here under "Regression Potential"
I'm not sure if the race condition described there applies only to containers or whether it can also happen with *any* scope unit. I suspect the latter.
Nevertheless, I still would definitely prefer an occasional pause when stopping a scope because systemd misses the notification over an almost-guaranteed data loss because systemd decides to SIGKILL everything.
(In reply to sforsyt from comment #10)
> What is the process with Fedora to move this bug along? Who has final say
> or has to sign off on it?
As with many other critical Fedora packages, the Fedora maintainers are also upstream maintainers and they typically only backport patches to Fedora that have already been applied upstream, so the best place to start would be to get this fixed upstream: https://github.com/systemd/systemd
Forgive my ignorance, I'm still new to the governance/politics of the project, but where are the backported patches kept? Is there a separate repo/branch for each Fedora release?
From reading the below issue, and other comments, Poettering seems to disdain or distance himself from "Distributions" and pushes or attributes issues due to configuration/defaults to them.
>Systemd upstream is not a product, we shouldnt register it as one. distributions such as fedora have their own pool.
How does that mesh with Fedora if he is the maintainer? Who picks the "sane" defaults like for ntp server or any myriad of other settings I've no idea about?
How can I get the "Fedora" specific configs (if there are any) to recompile on my own? Would just downloading the SRPM and applying a one-line patch be sufficient?
The Fedora package has several maintainers, including Lennart, Zbigniew Jędrzejewski-Szmek and Jan Synáček.
As with all Fedora packages, you can check the build out of git:
There's a guide to working with the git-based packaging system in the wiki:
You can of course just grab an SRPM from the repos and mess with that too.
The main patch series in the package usually comes from the matching systemd-stable tree in upstream git:
which is where upstream keeps backported fixes for stable releases of systemd. As of right now, there isn't a 220 stable tree, so there's no main patch series in the spec; the current Rawhide package is in sync with systemd 220 as released.
Patches specific to Fedora are Patch1000 onwards; currently there's only one, related to kernel management.
The default NTP pool is set with an option to 'configure' in the spec:
Thanks for the info, the links were very helpful.
Is the lack of a 220 stable tree because they've moved to github?
I'm assuming diffs can be synced from here?
When I was a kid, my dad used to say: "Never kill -9 a process, unless it is a real emergency, and always after trying all other signals (1, 2, 3, 15, ...), with due wait between them".
Since 1973 it is well known that "kill -9" is "terminate without prejudice" and it will almost always cause data loss.
Now, I see the following (which might be connected to this "feature" or not) after a shutdown / reboot:
firefox reports a crash and loses its tabs (always; it is well known that FF is pretty slow to react, but this is relatively new)
xfce4-weather-plugin loses its configuration (often) and I have to reconfigure again (pretty annoying)
xfce4-sensors-plugin loses its configuration (sometimes) and I have to reconfigure again (pretty annoying, too)
bash, as already mentioned, loses the history (sometimes), which is really annoying, if you use it systematically (very annoying)
thunderbird was behaving strangely (seldom), maybe unrelated
I suspect xfce4 might have issues saving its state before closing, but I'm not really sure and I do not want to test it
This seems to me quite a critical issue for a good "user experience" (mildly put).
If this "kill -9" is only to avoid some longer waiting later, then it is the *wrong* solution, since you already have to wait before using "kill -9" anyway.
A better thing to do would be to have a timeout (long) and then "kill -9" whatever did not stop in time.
So, I join the chorus and would like to ask that this commit be reverted, or that a more sensible timeout be added before using SIGKILL. Better still would be not using SIGKILL at all, since, AFAIK, it is *not* supposed to be used in normal situations (like a reboot).
One further comment, though I'm not sure it's relevant; you can tell me.
For a couple of weeks now, instead of shutting down directly from the xfce4 menu, I have first logged out and then shut down from the GDM interface.
The idea is that all the applications (like bash, plugins and so on) are closed at log out in a cleaner way.
As written above, I'm not sure if this is the case, but I *never* had any problem using this method since I started.
No history lost, no plugins losing configuration.
I did not test firefox.
Maybe I just got lucky, or the hypothesis is correct and giving the processes a chance to close properly helps here.
Hope this helps,
Piergiorgio, I would welcome you to post your data loss experiences upstream at systemd. I think most people at Red Hat, and elsewhere, agree with you that this behavior is the wrong solution. Upstream disagrees. So far, for the most part, upstream reports have been about bash history.
Upstream bug report at: https://github.com/systemd/systemd/issues/317
Upstream pull request to revert the behavior, preventing data loss, with risk of slower shutdown at: https://github.com/jamespharvey20/systemd/commit/2bd2850d73046a45b6bfa574ac1dc5cd298ea072
(In reply to jamespharvey20 from comment #18)
> Piergiorgio, I would welcome you to post your data loss experiences upstream
> at systemd. I think most from Redhat agree with you, and elsewhere, that
> this behavior is the wrong solution. Upstream disagrees. So far, for the
> most part, upstream reports have been about bash history.
> Upstream bug report at: https://github.com/systemd/systemd/issues/317
> Upstream pull request to revert the behavior, preventing data loss, with
> risk of slower shutdown at:
Your suggestion is very sensible, but:
1) I'm not in the mood to subscribe to another bug reporting system or mailing list.
2) I've installed Fedora, not systemd, so I think Fedora is the place to report issues.
3) Fedora people can revert the patch too, making upstream the only place with this "feature" and, possibly, forcing them to revert their decision.
4) Fedora people can, with some more effort, remove systemd completely. I guess if upstream is not cooperative, this is a possible choice.
5) This is not the first time systemd people have introduced problems with the mentality "we know better". If I post something upstream it will be difficult for me to refrain from insulting them (heavily). Personally I think they lack strong technical leadership that carefully oversees the changes. And this is very unfortunate, since systemd is the second key component, after the kernel.
6) I already have experience with this "report to the distribution" -> "report upstream" -> "report to distribution" -> ... ping-pong, so, no thanks.
I would prefer Fedora developers / engineers revert the "feature" and communicate this upstream. Possibly with all the references. If Fedora or upstream ask me something, I'll be happy to support them, of course.
Anyway, thanks for the suggestion,
This issue will be discussed at the Fedora Workstation Working Group meeting beginning at 14:00 UTC (10:00 EDT) on Wednesday, September 2, in #fedora-meeting on freenode, as it was requested that we revert 743970d in Fedora, as many other distros have already done. systemd maintainers and developers are invited to attend and provide input. Our understanding is that reverting 743970d will avoid many cases of data loss. (To be clear, we don't need anyone to attend and tell us that data loss is bad.)
Adam, if you're planning to put this through the blocker process, then I don't think the WG needs to discuss this. But if no blocker criterion applies, then we will need to.
Michael: https://bugzilla.redhat.com/show_bug.cgi?id=1170765 is already an accepted final blocker. I'm not sure why we still have two bugs for this.
(In reply to Michael Catanzaro from comment #20)
> This issue will be discussed at the Fedora Workstation Working Group meeting
> beginning at 14:00 UTC (10:00 EDT) on Wednesday, September 2, in
> #fedora-meeting on freenode
FYI: not planning to discuss this anymore, since the QA folks have this under control in bug #1170765.
*** This bug has been marked as a duplicate of bug 1170765 ***