Bug 1394937 - Running processes killed (SIGKILL) without a chance to shut down cleanly (SIGTERM) on logout or shutdown
Summary: Running processes killed (SIGKILL) without a chance to shut down cleanly (SIG...
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: systemd
Version: 29
Hardware: x86_64
OS: Linux
unspecified
high
Target Milestone: ---
Assignee: systemd-maint
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: https://fedoraproject.org/wiki/Common...
: 1652312 (view as bug list)
Depends On:
Blocks: F25FinalBlocker
TreeView+ depends on / blocked
 
Reported: 2016-11-14 19:58 UTC by Adam Williamson
Modified: 2019-05-26 22:31 UTC (History)
28 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:


Attachments (Terms of Use)
journal output from the test run (265.57 KB, text/plain)
2016-11-15 02:33 UTC, Adam Williamson
no flags Details
journal output from KDE test run (181.24 KB, text/plain)
2016-11-15 06:45 UTC, Adam Williamson
no flags Details
for Patrick Dung's db shutdown script problem (systemd) (334 bytes, text/plain)
2016-11-15 22:49 UTC, Patrick Dung
no flags Details
for Patrick Dung's db shutdown script problem (actual script to stop db) (336 bytes, application/x-shellscript)
2016-11-15 22:49 UTC, Patrick Dung
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1366897 None CLOSED Many apps crash in gdk_event_source_check when logging out of GNOME 2019-08-27 11:16:46 UTC
Red Hat Bugzilla 1395389 None CLOSED gnome-terminal crashes with SIGTRAP when gnome-wayland is shutting down 2019-08-27 11:16:46 UTC

Internal Links: 1366897 1395389

Description Adam Williamson 2016-11-14 19:58:51 UTC
Several people have reported recently that on Fedora 25 - and also on updated Fedora 23 and Fedora 24 systems, apparently - processes running as part of a GNOME session are SIGKILLed when they logout or shutdown. The most obvious symptom of this is when you shut down or log out and, on logging in again, Firefox shows its 'whoops, I crashed last time I ran' screen when you run it.

We accepted a previous bug along these lines - https://bugzilla.redhat.com/show_bug.cgi?id=1170765 - as a blocker. This bug started out a a re-opening of that bug, but I have investigated and determined that the original fix for that bug is still present in Fedora 23 and F24+ have somewhat different code, so while the symptom is similarly this clearly seems to be a different actual bug.

Zbigniew has just remarked, as a comment on that bug:

"I think the following is happening:
systemd will not wait for processes to terminate under certain conditions. In
this case the relevant one is that Delegate=yes is set for the unit.
Currently, gnome-terminal runs as systemd --user unit, which means that from
the POV of PID1, it is part of user@.service, which has Delegate=yes.

So... we have the following conundrum: current code SGIKILLs graphical terminal
sessions. If we simply always wait, systemd will "hang" (for 180s probably) in
some cases during shutdown. Dunno, there's no nice solution here, but it's
probably better to hang (if it happens rarely) than the current situation. I'll
prep a patch to drop the Delegate=yes part of the condition for F25."

I think he means "gnome-shell" rather than "gnome-terminal" there, but not sure.

Comment 1 Adam Williamson 2016-11-14 19:59:18 UTC
Transferring proposed blocker status here.

Comment 2 Adam Williamson 2016-11-14 20:04:40 UTC
To be clear, Zbigniew, we care about *all* processes, not just terminal ones (the original 'bash history lost' bug is considered only one possible symptom among many here). Is it the case that all apps running inside a regular user GNOME session will be considered part of the user@.service and killed when the user logs out or shuts down?

Comment 3 Adam Williamson 2016-11-15 02:32:06 UTC
So I was able to reproduce this quite easily with a fresh F25 RC-1.2 install. Procedure:

1. Clean install of Fedora 25 RC-1.2 Workstation live image, create user 'test' who is an admin
2. Boot installed system, log in as 'test', run through gnome-initial-setup
3. Run Firefox, close all the 'first run' tabs, open a new tab, browse to www.bluesnews.com
4. Log out

At this point, the system flips to gdm basically *instantly*, it's really a very fast transition.

5. Log back in
6. Open Firefox again

At this point I did not see the 'Well, this is embarrassing.' Firefox screen indicating a crash, but I got a single tab which went to one of the startup URLs, rather than bluesnews.com.

7. Browse to www.fedoraproject.org/wiki
8. Log out again (once again, very fast transition to gdm)
9. Log in again
10. Run Firefox again

At this point I did get the "Well, this is embarrassing." Firefox screen indicating a previous crash, with the list of open tabs showing the Fedora wiki page I had previously opened.

11. Hit Restore
12. Browse to https://docs.fedoraproject.org/en-US/index.html
13. Shut down (note shutdown happens extremely fast)
14. Power up, log in as test
15. Run Firefox again
16. Get the "Well, this is embarrassing." screen again, this time with the list of open tabs showing the docs page
17. Run journalctl -b-1, note it shows abrt catching three GNOME processes 'crashing' but ignoring the crashes because abrtd is not running (probably because it's already been killed/shut down itself)

Here are those journal lines:

Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3343]: Process 3161 (gnome-calendar) of user 1000 killed by SIGTRAP - ignoring (abrtd is not running)
Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3345]: Process 2899 (abrt-applet) of user 1000 killed by SIGTRAP - ignoring ('DebugLevel' == 0)
Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3346]: Process 2887 (gnome-software) of user 1000 killed by SIGTRAP - ignoring (abrtd is not running)
Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3346]: If abrtd crashed, /proc/sys/kernel/core_pattern contains a stale value, consider resetting it to 'core'
Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3344]: Process 2901 (evolution-alarm-notify) of user 1000 killed by SIGTRAP - ignoring (abrtd is not running)
Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3344]: If abrtd crashed, /proc/sys/kernel/core_pattern contains a stale value, consider resetting it to 'core'
Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3343]: If abrtd crashed, /proc/sys/kernel/core_pattern contains a stale value, consider resetting it to 'core'

Will attach the entire journal contents from the test.

Comment 4 Adam Williamson 2016-11-15 02:33:16 UTC
Created attachment 1220655 [details]
journal output from the test run

Comment 5 Adam Williamson 2016-11-15 02:34:06 UTC
Sorry, this step:

7. Browse to www.fedoraproject.org/wiki

should be just:

7. Browse to https://fedoraproject.org/wiki

Comment 6 Michael Chapman 2016-11-15 03:24:11 UTC
(In reply to Adam Williamson from comment #0)
[...]
> So... we have the following conundrum: current code SGIKILLs graphical
> terminal sessions. If we simply always wait, systemd will "hang" (for
> 180s probably) in some cases during shutdown.

I've always wondered just how likely this hang is.

The systemd in RHEL 7, for instance, has been patched to always run with wait_for_exit = true (it doesn't support Delegate= at all). Has that ever been a problem with RHEL 7 users?

As far as I can see, a 180s hang can only occur if the *last* process to exit in the cgroup is not parented by PID 1. After all, any time systemd receives a SIGCHLD it can manually check to see whether the cgroup is empty -- it doesn't need to rely solely on the cgroup emptiness notification.

Comment 7 Adam Williamson 2016-11-15 04:13:05 UTC
Oh, I've seen it happen. We had KDE lives failing to shut down until a systemd timeout expired pretty recently.

Comment 8 Adam Williamson 2016-11-15 06:45:04 UTC
Tested with KDE (which uses sddm as its login manager) and got more or less the same results. I don't *think* KDE has a user session like GNOME - at least, 'journalctl -b' as user and root show the same thing.

On KDE test I got the right page without a 'crash' page on the *first* logout/login test, but I got a crash page on both the *second* logout/login test and the shutdown/startup test, as with GNOME.

Will attach journal.

Comment 9 Adam Williamson 2016-11-15 06:45:52 UTC
Created attachment 1220693 [details]
journal output from KDE test run

Comment 10 Adam Williamson 2016-11-15 06:50:23 UTC
Hrrrrrm. Welp. It appears there may be a bit of a problem with my test procedure, namely:

https://bugzilla.mozilla.org/show_bug.cgi?id=336193

apparently Firefox doesn't do the right bloody thing when you send it a SIGTERM in any case.

Will re-test tomorrow with something else.

Comment 11 Adam Williamson 2016-11-15 19:15:20 UTC
I tried my test procedure from https://bugzilla.redhat.com/show_bug.cgi?id=1170765#c13 here:

1. Install a clean Fedora 21
2. Open a terminal, open two tabs, 'su' in one
3. Run 'test01' in one terminal, 'test02' in the other
4. Reboot (by running 'reboot' in either terminal or using the GNOME 'Restart' button, it doesn't seem to matter)
5. Open a terminal, check history for both user and root and see if you see 'test01' and 'test02'
6. Open two tabs, su in one, run 'test03' in one, 'test04' in the other
7. GOTO 4 (at step 5 check for 'test03' and 'test04', obviously)

I tried several iterations of that on clean installs of F25 RC-1.2 Workstation and KDE, and could not produce history loss in either case.

Can any of the folks who reported that they were seeing this issue again in #1170765 (now CCed) provide a precise reproducer? Thanks.

Comment 12 Bojan Smojver 2016-11-15 20:40:32 UTC
It doesn't happen every time, but what you did in comment #11 occasionally leads to it. I only saw it on a VM I have thus far. I did not see it yet on my ThinkPad T450s, for instance.

Comment 13 Patrick Dung 2016-11-15 21:39:26 UTC
1. For the loss of bash history

I have this observation:

SSH to the server (session 1)
type date
type date
type history

10046  2016-11-16 00:29:58 date
10047  2016-11-16 00:30:00 date
10048  2016-11-16 00:30:01 history

$ echo $$
7793

The PID 7793 is the bash PID
Parent Process ID of 7793 is the sshd login process of session 1.

In another ssh session to the server. (session 2)
Killing the PID of the bash process (7793) or the ssh login process of session 1 would destroy session 1. By default, SIGTERM is used for the kill command if no signal is specified.

I think the loss of bash history is due to the SIGTERM being sent when the server is shutdown.


2. For my case in #Bug 1170765, comment 82
https://bugzilla.redhat.com/show_bug.cgi?id=1170765#c82

I had shutdown script for the database. From my observation, when the server is shutdown, SIGTERM is send out immediately.
The shutdown script do not have enough time to finish, and the SIGTERM killed the database process and cause problem.

Comment 14 Zbigniew Jędrzejewski-Szmek 2016-11-15 21:43:06 UTC
So... after some testing and putting log_unit_debug in systemd in some crucial places, I'm not convinced that there's a bug in systemd.

Before I wrote:
(In reply to Adam Williamson from comment #0)
> "I think the following is happening:
> systemd will not wait for processes to terminate under certain conditions. In
> this case the relevant one is that Delegate=yes is set for the unit.
> Currently, gnome-terminal runs as systemd --user unit, which means that from
> the POV of PID1, it is part of user@.service, which has Delegate=yes.

While user@.service has Delegate=yes, it also has MainPID= set. The condition mentioned above is that Delegate=yes, but also that Main is NOT set. So this quick killing path does not apply in this case. gnome-terminal will get a normal SIGTERM + SIGHUP combo, and then systemd will wait for it to terminate.

When I logout by clicking the logout button, I see that gnome-terminal "crashes" fairly reliably. I filed #1395389.

So what I think is happening:
- the logout procedure is started (for example by the user clicking "logout")
- systemd notices that the session should terminate, so it sends SIGSTERM and SIGHUP to various processes in session-N.scope
- various processes start exiting
- in particular gnome-shell exits, terminating the wayland server
- gnome-terminal tries to communicate with wayland server, it cannot, and crashes

I'm not certain why that would lead to history loss: I'm not sure what happens to the bash process when the terminal emulator does SIGTRAP. It seems like it is able to save history, but maybe occasionally it fails.

Looking at the logs in comment #c4, various other wayland-enabled stuff crashes in a similar way: gnome-calendar, gnome-software, evolution-alarm, abrt-applet, gnome-settings-daemon.

There's also this tidbit:
Nov 14 18:27:11 localhost.localdomain org.gnome.Shell.desktop[2813]: (EE)
Nov 14 18:27:11 localhost.localdomain org.gnome.Shell.desktop[2813]: Fatal server error:
Nov 14 18:27:11 localhost.localdomain org.gnome.Shell.desktop[2813]: (EE) failed to read Wayland events: Connection reset by peer
Nov 14 18:27:11 localhost.localdomain org.gnome.Shell.desktop[2813]: (EE)
When looking at the journalctl output, this comes from gnome-shell itself, those messages are tagged with _COMM=gnome-shell. I'm not sure what is going on here, maybe gnome-wayland is not aware that it's shutting down.

tl;dr
I don't see evidence of systemd killing things, just of various processes dying messily in the shutdown sequence. For example, gnome-terminal which is socket-activated must be ready to terminate cleanly in the display server goes away.

Comment 15 Zbigniew Jędrzejewski-Szmek 2016-11-15 21:44:42 UTC
(In reply to Patrick Dung from comment #13)
> I think the loss of bash history is due to the SIGTERM being sent when the
> server is shutdown.

SIGTERM is *the* signal to terminate processes. A process should exit cleanly when it receives SIGTERM. So the fact that a SIGTERM is sent is not a sign of anything being wrong, it should be a normal part of the shutdown sequence.

Comment 16 Patrick Dung 2016-11-15 21:48:50 UTC
And update on my last comment in part 1 (loss of bash history)

The missing bash history is not visible in session 2.
If you open a new ssh session (after session 1 is killed(by SIGTERM)), the bash history is there...

10051  2016-11-16 00:29:58 date
10052  2016-11-16 00:30:00 date
10053  2016-11-16 00:30:01 history

If SIGKILL was used, those bash history would be gone.

Sorry for the confusion, and thanks Zbigniew for the explanation.

Comment 17 Adam Williamson 2016-11-15 21:49:25 UTC
Patrick Dung: did you observe any kind of *change* in behaviour here recently? It's hard to tell whether your case is a bug or not without details on what database you're using and what exactly these 'shutdown scripts' are and how they work.

Comment 18 Adam Williamson 2016-11-15 21:52:52 UTC
Zbigniew: thanks for the investigation. One question that remains for me is why several people in the original bug think they saw something change in F23/F24 recently, since F23/F24 have not switched to Wayland. (But then, there hasn't been a relevant systemd update either, AFAICT).

Comment 19 Zbigniew Jędrzejewski-Szmek 2016-11-15 21:56:26 UTC
gnome-terminal started running in the systemd --user session in F24 (I believe, it is certainly doing that in F24, but I don't have a F23 installation at hand to check, but I think it wasn't then). This changes behaviour and timing.

Comment 20 Christian Stadelmann 2016-11-15 22:26:13 UTC
I think this bug and #1366897 are related. One of them might be mistaken for the other if they are different at all.

Comment 21 Bojan Smojver 2016-11-15 22:27:04 UTC
(In reply to Adam Williamson from comment #18)
> Zbigniew: thanks for the investigation. One question that remains for me is
> why several people in the original bug think they saw something change in
> F23/F24 recently, since F23/F24 have not switched to Wayland. (But then,
> there hasn't been a relevant systemd update either, AFAICT).

No wayland on the VM where it occasionally happens for me. I'm running xrdp there, so Xvnc is running behind the scenes. Just FYI.

Comment 22 Zbigniew Jędrzejewski-Szmek 2016-11-15 22:37:27 UTC
Yeah, #1366897 is what I was getting at.

Comment 23 Patrick Dung 2016-11-15 22:49:00 UTC
Created attachment 1220977 [details]
for Patrick Dung's db shutdown script problem (systemd)

This is for systemd

Comment 24 Patrick Dung 2016-11-15 22:49:46 UTC
Created attachment 1220978 [details]
for Patrick Dung's db shutdown script problem (actual script to stop db)

for Patrick Dung's db shutdown script problem (actual script to stop db)

Comment 25 Adam Williamson 2016-11-15 22:53:28 UTC
Patrick: OK, I don't think your case really has much to do with anyone else's. I still can't tell without information on how the Oracle database itself is started and shut down, but most likely you've simply got a race going on there. It'd probably be best to follow up with a separate bug (although I suspect in the end this won't be a bug in Fedora at all but a bug in your design) or a thread on a mailing list about how to properly do something like this.

Comment 26 Adam Williamson 2016-11-15 22:55:56 UTC
I agree this and #1366897 seem similar, but the discussion in #1366897 seems fairly heavily based on the idea that it's specific to Wayland, and we do have people here claiming they're seeing similar behaviour on X.

Comment 27 Patrick Dung 2016-11-15 22:58:49 UTC
(In reply to Adam Williamson from comment #25)
> Patrick: OK, I don't think your case really has much to do with anyone
> else's. I still can't tell without information on how the Oracle database
> itself is started and shut down, but most likely you've simply got a race
> going on there. It'd probably be best to follow up with a separate bug
> (although I suspect in the end this won't be a bug in Fedora at all but a
> bug in your design) or a thread on a mailing list about how to properly do
> something like this.

Ok, I may open a separate bug report or check in mailing list.

FYI
Observed behaviour: When shutdown is called, the DB process is terminated (by SIGTERM).
The shutdown script do not have enough time to run.
I see SIGKILL later on the console, so the DB is probably not killed by SIGKILL.

If you want to go further, you can append ps and redirect the current process list right after the sqlplus shutdown command is run. In my case, I found that the DB processes are already gone.

Comment 28 Zbigniew Jędrzejewski-Szmek 2016-11-15 23:28:14 UTC
Yeah, please open a separate bug, this one is already messy enough. In particular I'm not seeing any After/Before relationship in your shutdown service. In the new bug, please also post the logs from the shutdown, preferably after 'systemd set-log-level debug'.

Comment 29 Matthew Miller 2016-11-16 15:27:28 UTC
-1 to this as a blocker. It might be a candidate for our nascent "important non-blocking bugs" process.

Comment 30 Paul W. Frields 2016-11-17 16:37:43 UTC
Also -1 for blocker but do think it warrants attention.  I may not be able to attend opening of go/no-go today but wanted to have that in record.

Comment 31 Matthew Miller 2016-11-17 17:12:29 UTC
Rationale for -1 to blocker, in case it's relevant: since the problem exists in updated F23 and F24, blocking F25 does not seem to be the right tool.

Comment 32 Adam Williamson 2016-11-17 17:27:33 UTC
Matthew: I'm honestly reluctant to claim that. The #1366897 case is the only really clear one we have here, and that one is simply a GNOME + Wayland bug; it technically 'exists' in F23 and F24 because you can use Wayland there, but few people do. It's obviously much more important in F25. But we did already discuss that case on its own and reject it as a blocker.

We have people claiming that "this problem" has suddenly re-appeared in F23 and F24, but on closer inspection I'm not at all sure that they're all actually talking about the same thing, or even precisely what they're seeing. No-one's yet posted a really clear-cut reproducer with information on when it was working and when it stopped working, AFAICT. Bojan's case is maybe the clearest, but I still can't reproduce it on an OOTB install...

Comment 33 Eric Smith 2016-11-19 09:59:01 UTC
This also prevents Emacs for saving modified files for later recovery, which it would do on SIGTERM.

Comment 34 Zbigniew Jędrzejewski-Szmek 2016-11-19 14:26:28 UTC
(In reply to Eric Smith from comment #33)
> This also prevents Emacs for saving modified files for later recovery, which
> it would do on SIGTERM.

What is "this"? If you read the discussion, the only thing that is clearly wrong is some applications using wayland crashing when the wayland server goes away. But emacs still uses X, i.e. Xwayland or Xorg if you are not using wayland... Please be more precise in your statements.

Comment 35 Bojan Smojver 2016-11-19 21:12:04 UTC
(In reply to Zbigniew Jędrzejewski-Szmek from comment #34)
> If you read the discussion, the only thing that is clearly
> wrong is some applications using wayland crashing when the wayland server
> goes away.

No Wayland here. Xvnc behind xrdp.

Comment 36 Alexander Korsunsky 2016-11-21 11:58:32 UTC
Happens to me on definitely not on Wayland. I am using google-chrome on F24 with the X-Server, and more often than not (but not always), logging out or shutting down causes chrome to ask for recovery on the next start.

Comment 37 Adam Williamson 2016-11-29 00:12:12 UTC
Alexander: do you know if that's not just expected behaviour for Chrome? I thought there was a similar 'bug' for Firefox, but it turns out that's just what Firefox does when it's SIGTERMed:

https://bugzilla.mozilla.org/show_bug.cgi?id=336193

Comment 38 Alexander Korsunsky 2016-12-05 17:35:49 UTC
(In reply to Adam Williamson from comment #37)
> Alexander: do you know if that's not just expected behaviour for Chrome? I
> thought there was a similar 'bug' for Firefox, but it turns out that's just
> what Firefox does when it's SIGTERMed:

Chrome seems to be doing the right thing though. When I

pkill -SIGTERM chrome

then chrome closes nicely and after startup shows my previous tabs.
When I do a SIGKILL, tabs are dead and I get the recovery option.

So I don't think chrome is bugged regarding shutdown.

Worth Noting:

1) Same behavior on Wayland and Xorg
2) I can't get it to crash when just logging out
3) Rebooting or shutting down crashes it for me every time
4) The history in gnome-terminal according to your test case is saving just fine for me, while chrome reliably crashes. The history is probably a different bug and shouldn't be used to reproduce this one.

Comment 39 Fedora End Of Life 2017-11-16 19:23:54 UTC
This message is a reminder that Fedora 25 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 25. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '25'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 25 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged  change the 'version' to a later Fedora
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 40 Christian Stadelmann 2017-11-16 23:39:05 UTC
Still happens on F26.

Comment 41 Fedora End Of Life 2018-05-03 08:16:08 UTC
This message is a reminder that Fedora 26 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 26. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '26'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 26 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged  change the 'version' to a later Fedora
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 42 Anass Ahmed 2018-05-03 14:30:15 UTC
I think this still happens in F27, F28 and Rawhide.

Comment 43 Jan Kurik 2018-08-14 10:31:31 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle.
Changing version to '29'.

Comment 44 Matthew Miller 2018-11-27 20:25:15 UTC
*** Bug 1652312 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.