1141137 – systemd sends SIGKILL immediately after SIGTERM during shutdown

Keywords:
Status:	CLOSED DUPLICATE of bug 1170765
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	systemd
Sub Component:
Version:	21
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	systemd-maint
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1139380

Reported:	2014-09-12 09:48 UTC by Stef Walter
Modified:	2015-08-31 23:30 UTC
CC List:	35 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2015-08-31 23:30:15 UTC
Type:	Bug
Embargoed:
Dependent Products:

Links:	Launchpad 1448259

Description Stef Walter 2014-09-12 09:48:48 UTC

Description of problem:

systemd sends SIGKILL immediately after SIGTERM to cockpit child processes when shutting down or stopping the unit.

The children of a cockpit login session all get SIGKILL immediately after SIGTERM (less than a tenth of a second apart). cockpit-agent and cockpit-session take more than a tenth of a second to shut down cleanly.

The easiest way to reproduce this is a system shutdown. Even the 'reboot' that started the system shutdown (executed via ssh) gets a SIGKILL before it can exit().

Here's some output from a simple systemtap probe which shows this:

https://github.com/cockpit-project/cockpit/issues/1155#issuecomment-55374240

You can see how a cockpit unit, and its login session scope looks here:

https://github.com/cockpit-project/cockpit/issues/1155#issuecomment-55381385

Version-Release number of selected component (if applicable):

systemd-215-14.fc21.x86_64

How reproducible:

Pretty easily reproducible using our cockpit integration test system.

Steps to Reproduce:

If necessary stefw would be willing to help someone duplicate the problem on their machine.

Actual results:

SIGKILL received immediately after SIGTERM

Expected results:

Wait for the timeout (default is 1min 30sec) before sending SIGKILL.
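
The gap between the actual and expected results can be reproduced in miniature without systemd. The following is a minimal sketch, not taken from the report: the child script, the `stop_child` helper and the 0.3 s cleanup delay are all assumptions made for illustration. It sends SIGTERM, waits a configurable grace period, then sends SIGKILL, and reports whether the child managed to save its state.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Hypothetical child: on SIGTERM it needs 0.3 s of cleanup before it saves its state.
CHILD = r"""
import signal, sys, time

def on_term(signum, frame):
    time.sleep(0.3)                      # simulated slow cleanup
    with open(sys.argv[1], "w") as f:    # state that is lost if SIGKILL arrives first
        f.write("saved")
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)
print("ready", flush=True)               # tell the parent the handler is installed
while True:
    signal.pause()
"""


def stop_child(grace_seconds):
    """Send SIGTERM, wait up to grace_seconds, then SIGKILL.

    Returns True if the child had time to save its state.
    """
    with tempfile.TemporaryDirectory() as tmp:
        marker = os.path.join(tmp, "state")
        proc = subprocess.Popen(
            [sys.executable, "-c", CHILD, marker],
            stdout=subprocess.PIPE,
            text=True,
        )
        proc.stdout.readline()           # block until the child prints "ready"
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()                  # SIGKILL cannot be caught; cleanup is cut off
            proc.wait()
        proc.stdout.close()
        return os.path.exists(marker)


if __name__ == "__main__":
    print("immediate SIGKILL, state saved:", stop_child(0.0))
    print("5 s grace period, state saved:", stop_child(5.0))
```

With a grace period of 0 the state file is never written; with a grace period longer than the child's cleanup time it is.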

Comment 1 Stef Walter 2014-09-12 09:49:49 UTC

This commit breaks cockpit orderly shutdown:

  commit 743970d2ea6d08aa7c7bff8220f6b7702f2b1db7
  Author: Lennart Poettering <lennart>
  Date:   Fri Feb 7 16:12:09 2014 +0100
  
      core: one step back again, for nspawn we actually can't wait for
  cgroups running empty since systemd will get exactly zero
  notifications about it

This commit was introduced in v209, and the problem is present in Fedora 21. Reverting the commit resolves the problem.

Comment 2 DO NOT USE account not monitored (old adamwill) 2015-01-16 22:09:08 UTC

I've been noticing for some time that bits of my bash history are missing. I got really annoyed by it today and tried to track it down a bit.

I think it happens when you shut down / reboot with GNOME Terminal running; history from whatever sessions you have open is sometimes saved but sometimes not. I've done several test reboots with 'canary' commands (strings of random words) in the history buffer and checked if they were there on reboot. Each time I had two terminal tabs open. Sometimes the history from neither would be saved, sometimes the history from one but not the other, sometimes both.
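
The canary check described above could be automated roughly as follows. This is only a sketch: the `missing_canaries` helper and the example canary commands are hypothetical, and it assumes bash's default history file, `~/.bash_history`.

```python
from pathlib import Path


def missing_canaries(history_text, canaries):
    """Return the canary commands that do not appear as a line in the history text."""
    recorded = {line.strip() for line in history_text.splitlines()}
    return [c for c in canaries if c not in recorded]


if __name__ == "__main__":
    # Hypothetical canaries typed into two terminal tabs before the reboot.
    canaries = ["echo walrus-teapot-orbit", "echo gravel-lantern-mint"]
    history_file = Path.home() / ".bash_history"   # assumed default HISTFILE
    text = history_file.read_text(errors="replace") if history_file.exists() else ""
    print("missing after reboot:", missing_canaries(text, canaries))
```

Run after the reboot, an empty list means both tabs saved their history; any entries listed were lost.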

I'm wondering if this bug could be affecting more than just cockpit. Shutdown sure seems to happen *fast*, which I've noticed people calling a 'feature', but maybe it's a bug - things are getting SIGKILLed too fast and not having a chance to shut down in an orderly way?

Comment 3 Stef Walter 2015-02-05 14:39:44 UTC

Some discussion upstream in this thread: http://lists.freedesktop.org/archives/systemd-devel/2014-October/024452.html

Comment 4 Adam Williamson 2015-02-05 17:34:09 UTC

Yes, I've seen that, but I'm really not sure it justifies 'shutting down' by basically kill -9ing half the system! This is presumably why you quite often get the Firefox 'something went wrong' screen when you do a perfectly normal reboot, too?

Comment 5 Michael Chapman 2015-02-06 08:06:58 UTC

I'm pretty certain this bug and bug 1183194 are one and the same.

I really think having to wait a full stop timeout because the empty cgroup notification was missed is a better alternative than immediately KILLing scope units when they're stopped. I'm surprised the latter behaviour hasn't caused more bug reports.

I would certainly appreciate having that commit reverted in Fedora systemd's package.

Comment 6 Olli Niemi 2015-04-24 17:51:38 UTC

See also: https://bugs.launchpad.net/ubuntu/+source/mosh/+bug/1446982

Comment 7 Martin Pitt 2015-06-02 06:13:30 UTC

FTR, in Debian and Ubuntu we also reverted 743970d to fix this. Immediately KILLing everything in the user session causes mayhem with bash, mosh, and presumably lots of other things that run in sessions.

Comment 8 jamespharvey20 2015-06-24 17:41:16 UTC

What is the (or, is there a) drawback to reverting the commit?

Does reverting just mean shutting down is slower?

Or does reverting risk another type of data loss?

Comment 9 Michael Chapman 2015-06-25 08:39:35 UTC

(In reply to jamespharvey20 from comment #8)
> What is the (or, is there a) drawback to reverting the commit?
> 
> Does reverting just mean shutting down is slower?
> 
> Or does reverting risk another type of data loss?

http://lists.freedesktop.org/archives/systemd-devel/2014-November/025734.html has the best description of the problem. But as my reply at http://lists.freedesktop.org/archives/systemd-devel/2014-November/025755.html indicates, I think it can be solved so it works sufficiently well in the majority of cases. 

Anybody want to try bringing it up on the systemd mailing list for a fourth time? Having each and every distro work around the bug is silly.

Comment 10 sforsyt 2015-06-26 01:15:26 UTC

Michael, 
Do you agree with the assessment from here under "Regression Potential"

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1448259

>The original commit was applied because of an inherent race condition with cgroup's release_agent -- in rare corner cases an nspawn container (probably also LXC) can miss them. In that case it's possible that you instead get a 90s timeout on the unit that is shutting down. But this does not mean data loss, just a rare shutdown hang from containers (for the record, I never actually saw that hanging with LXC), so I think it's a good trade-off.

So is this issue ONLY in containers that have spawned a session, and a rare corner case as suggested?

If so, I think the data loss outweighs the possible race condition in a specific configuration.

This commit reverts it:
https://github.com/lnykryn/systemd-rhel/commit/647a7761e2fa423c6e1bd6785b043dbe7b525e3c

What is the process with Fedora to move this bug along?  Who has final say or has to sign off on it? Does it have to be directly fixed by systemd as the primary maintainers and what they package is directly incorporated into Fedora?  (vs say Redhat that has its own fork?)

Comment 11 Michael Chapman 2015-06-26 01:40:06 UTC

(In reply to sforsyt from comment #10)
> Michael, 
> Do you agree with the assessment from here under "Regression Potential"

I do.

I'm not sure if the race condition described there applies only to containers or whether it can also happen with *any* scope unit. I suspect the latter.

Nevertheless, I still would definitely prefer an occasional pause when stopping a scope because systemd misses the notification over an almost-guaranteed data loss because systemd decides to SIGKILL everything.

Comment 12 Michael Catanzaro 2015-06-26 13:27:20 UTC

(In reply to sforsyt from comment #10)
> What is the process with Fedora to move this bug along?  Who has final say
> or has to sign off on it?

As with many other critical Fedora packages, the Fedora maintainers are also upstream maintainers and they typically only backport patches to Fedora that have already been applied upstream, so the best place to start would be to get this fixed upstream: https://github.com/systemd/systemd

Comment 13 sforsyt 2015-07-01 13:18:05 UTC

Forgive my ignorance, I'm still new to the governance/politics of the project; but where are the backport patches kept, is there a separate repo/branch for fedora release?

From reading the below issue, and other comments, Poettering seems to disdain or distance himself from "Distributions" and pushes or attributes issues due to configuration/defaults to them. 

https://github.com/systemd/systemd/issues/437
>Systemd upstream is not a product, we shouldnt register it as one. distributions such as fedora have their own pool.

How does that mesh with Fedora if he is the maintainer? Who picks the "sane" defaults like for ntp server or any myriad of other settings I've no idea about?

How can I get the "Fedora" specific configs (if there are any) to recompile on my own?  Would just downloading the SRPM and applying a one-line patch be sufficient?

Comment 14 Adam Williamson 2015-07-01 16:16:52 UTC

The Fedora package has several maintainers, including Lennart, Zbigniew Jędrzejewski-Szmek and Jan Synáček.

As with all Fedora packages, you can check the build out of git:

http://pkgs.fedoraproject.org/cgit/systemd.git

There's a guide to working with the git-based packaging system in the wiki:

https://fedoraproject.org/wiki/Package_maintenance_guide

You can of course just grab an SRPM from the repos and mess with that too.

The main patch series in the package usually comes from the matching systemd-stable tree in upstream git:

http://cgit.freedesktop.org/systemd/systemd-stable/

which is where upstream keeps backported fixes for stable releases of systemd. As of right now, there isn't a 220 stable tree, so there's no main patch series in the spec; the current Rawhide package is in sync with systemd 220 as released.

Patches specific to Fedora are Patch1000 onwards; currently there's only one, related to kernel management.

The default NTP pool is set with an option to 'configure' in the spec:

http://pkgs.fedoraproject.org/cgit/systemd.git/tree/systemd.spec#n280

Comment 15 sforsyt 2015-07-01 16:35:58 UTC

Thanks for the info, the links were very helpful.

Is the lack of a 220 stable tree because they've moved to github?
http://lists.freedesktop.org/archives/systemd-devel/2015-June/032652.html

I'm assuming diffs can be synced from here?
https://github.com/systemd/systemd/tree/v220

Comment 16 Piergiorgio Sartor 2015-08-12 20:32:37 UTC

When I was a kid, my dad used to say: "Never kill -9 a process, unless it is a real emergency, and always after trying all other signals (1, 2, 3, 15, ...), with due wait between them".

Since 1973 it is well known that "kill -9" is "terminate without prejudice" and it will almost always cause data loss.

Now, I see the following (which might be connected to this "feature" or not) after a shutdown / reboot:

firefox does report a crash and loses its tabs (always, it is quite known that FF is pretty slow to react, but this is relatively new)
xfce4-weather-plugin loses its configuration (often) and I have to reconfigure again (pretty annoying)
xfce4-sensors-plugin loses its configuration (sometimes) and I have to reconfigure again (pretty annoying, too)
bash, as already mentioned, loses the history (sometimes), which is really annoying, if you use it systematically (very annoying)
thunderbird was behaving strangely (seldom), maybe unrelated
I suspect, xfce4 might have issues in saving the situation before closing, not really sure and I do not want to test it

It seems to me quite a critical issue for a good "user experience" (mildly put).

If this "kill -9" is only to avoid some longer waiting later, then it is a *wrong* solution, since you already have to wait before using "kill -9" anyway.
A better thing to do would be to have a timeout (long) and then "kill -9" whatever did not stop in time.

So, I join the chorus and I would like to ask to revert this commit or to add a more sensible timeout before using SIGKILL. Better not using SIGKILL at all, since, AFAIK, it is *not* supposed to be used in normal situations (like a reboot).
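
The escalation described in this comment (milder signals first, a wait after each, SIGKILL only as the last resort) might look like the sketch below. It is an illustration under stated assumptions, not how systemd behaves: the `escalate` helper, the signal ladder, the 0.5 s wait and the demo child are all hypothetical.

```python
import signal
import subprocess
import sys

# Hypothetical child that ignores SIGHUP/SIGINT/SIGQUIT and exits cleanly on SIGTERM.
CHILD = r"""
import signal, sys
for s in (signal.SIGHUP, signal.SIGINT, signal.SIGQUIT):
    signal.signal(s, signal.SIG_IGN)
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
print("ready", flush=True)               # handlers are installed
while True:
    signal.pause()
"""

# Signals 1, 2, 3 and 15, in the order given in the comment.
LADDER = (signal.SIGHUP, signal.SIGINT, signal.SIGQUIT, signal.SIGTERM)


def escalate(proc, wait_seconds, ladder=LADDER):
    """Send each signal in turn, waiting wait_seconds after each one.

    Returns the signal after which the process exited. SIGKILL is used
    only if nothing on the ladder stopped the process in time.
    """
    for sig in ladder:
        proc.send_signal(sig)
        try:
            proc.wait(timeout=wait_seconds)
            return sig
        except subprocess.TimeoutExpired:
            pass                         # still running, try the next signal
    proc.kill()
    proc.wait()
    return signal.SIGKILL


def demo():
    """Start the demo child and stop it; returns (signal used, exit code)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    )
    proc.stdout.readline()               # block until the child prints "ready"
    sig = escalate(proc, wait_seconds=0.5)
    proc.stdout.close()
    return sig, proc.returncode


if __name__ == "__main__":
    sig, code = demo()
    print("stopped by", sig.name, "with exit code", code)
```

The wait between steps is the important part: SIGKILL is reached only after every catchable signal has had its chance.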

Thanks,

bye,

pg

Comment 17 Piergiorgio Sartor 2015-08-27 20:53:49 UTC

Hi again,

Some further comment, even if I'm not sure it's relevant, you'll tell me.

For the past couple of weeks, instead of shutting down directly from the xfce4 menu, I have first logged out and then shut down from the GDM interface.

The idea is that all the applications (like bash, plugins and so on) are closed at log out in a cleaner way.
As written above, I'm not sure if this is the case, but I *never* had any problem using this method since I started.
No history lost, no plugins losing configuration.
I did not test firefox.

Maybe I just got lucky, or the hypothesis is correct and giving a chance to the process to close properly helps here.

Hope this helps,

bye,

pg

Comment 18 jamespharvey20 2015-08-29 03:26:30 UTC

Piergiorgio, I would welcome you to post your data loss experiences upstream at systemd.  I think most from Redhat agree with you, and elsewhere, that this behavior is the wrong solution.  Upstream disagrees.  So far, for the most part, upstream reports have been about bash history.

Upstream bug report at: https://github.com/systemd/systemd/issues/317

Upstream pull request to revert the behavior, preventing data loss, with risk of slower shutdown at: https://github.com/jamespharvey20/systemd/commit/2bd2850d73046a45b6bfa574ac1dc5cd298ea072

Comment 19 Piergiorgio Sartor 2015-08-29 10:23:08 UTC

(In reply to jamespharvey20 from comment #18)
> Piergiorgio, I would welcome you to post your data loss experiences upstream
> at systemd.  I think most from Redhat agree with you, and elsewhere, that
> this behavior is the wrong solution.  Upstream disagrees.  So far, for the
> most part, upstream reports have been about bash history.
> 
> Upstream bug report at: https://github.com/systemd/systemd/issues/317
> 
> Upstream pull request to revert the behavior, preventing data loss, with
> risk of slower shutdown at:
> https://github.com/jamespharvey20/systemd/commit/2bd2850d73046a45b6bfa574ac1dc5cd298ea072

Hi James,

your suggestion is very sensible, but:

1) I'm not in the mood to subscribe to another bug reporting system or mailing list.
2) I've installed Fedora, not systemd, so I think Fedora is the place to report issues.
3) Fedora people can revert the patch too, making upstream the only place with this "feature" and, possibly, forcing them to revert their decision.
4) Fedora people can, with some more effort, remove systemd completely. I guess if upstream is not cooperative, this is a possible choice.
5) This is not the first time systemd people have introduced problems with the mentality "we know better". If I post something upstream, it will be difficult for me to refrain from insulting them (heavily). Personally I think they lack strong technical leadership that carefully oversees the changes. And this is very unfortunate, since systemd is the second key component, after the kernel.
6) I have already had experience with this "report to the distribution" -> "report upstream" -> "report to distribution" -> ... ping-pong, so, no thanks.

I would prefer Fedora developers / engineers revert the "feature" and communicate this upstream. Possibly with all the references. If Fedora or upstream ask me something, I'll be happy to support them, of course.

Anyway, thanks for the suggestion,

bye,

pg

Comment 20 Michael Catanzaro 2015-08-29 15:27:38 UTC

This issue will be discussed at the Fedora Workstation Working Group meeting beginning at 14:00 UTC (10:00 EDT) on Wednesday, September 2, in #fedora-meeting on freenode, as it was requested that we revert 743970d in Fedora, as many other distros have already done. systemd maintainers and developers are invited to attend and provide input. Our understanding is that reverting 743970d will avoid many cases of data loss. (To be clear, we don't need anyone to attend and tell us that data loss is bad.)

Comment 21 Michael Catanzaro 2015-08-31 19:50:12 UTC

Adam, if you're planning to put this through the blocker process, then I don't think the WG needs to discuss this. But if no blocker criterion applies, then we will need to.

Comment 22 Adam Williamson 2015-08-31 22:16:30 UTC

Michael: https://bugzilla.redhat.com/show_bug.cgi?id=1170765 is already an accepted final blocker. I'm not sure why we still have two bugs for this.

Comment 23 Michael Catanzaro 2015-08-31 23:30:15 UTC

OK then.

(In reply to Michael Catanzaro from comment #20)
> This issue will be discussed at the Fedora Workstation Working Group meeting
> beginning at 14:00 UTC (10:00 EDT) on Wednesday, September 2, in
> #fedora-meeting on freenode

FYI: not planning to discuss this anymore, since the QA folks have this under control in bug #1170765.

*** This bug has been marked as a duplicate of bug 1170765 ***
