Several people have reported recently that on Fedora 25 - and also on updated Fedora 23 and Fedora 24 systems, apparently - processes running as part of a GNOME session are SIGKILLed when they logout or shutdown. The most obvious symptom of this is when you shut down or log out and, on logging in again, Firefox shows its 'whoops, I crashed last time I ran' screen when you run it. We accepted a previous bug along these lines - https://bugzilla.redhat.com/show_bug.cgi?id=1170765 - as a blocker. This bug started out a a re-opening of that bug, but I have investigated and determined that the original fix for that bug is still present in Fedora 23 and F24+ have somewhat different code, so while the symptom is similarly this clearly seems to be a different actual bug. Zbigniew has just remarked, as a comment on that bug: "I think the following is happening: systemd will not wait for processes to terminate under certain conditions. In this case the relevant one is that Delegate=yes is set for the unit. Currently, gnome-terminal runs as systemd --user unit, which means that from the POV of PID1, it is part of user@.service, which has Delegate=yes. So... we have the following conundrum: current code SGIKILLs graphical terminal sessions. If we simply always wait, systemd will "hang" (for 180s probably) in some cases during shutdown. Dunno, there's no nice solution here, but it's probably better to hang (if it happens rarely) than the current situation. I'll prep a patch to drop the Delegate=yes part of the condition for F25." I think he means "gnome-shell" rather than "gnome-terminal" there, but not sure.
Transferring proposed blocker status here.
To be clear, Zbigniew, we care about *all* processes, not just terminal ones (the original 'bash history lost' bug is considered only one possible symptom among many here). Is it the case that all apps running inside a regular user GNOME session will be considered part of the user@.service and killed when the user logs out or shuts down?
So I was able to reproduce this quite easily with a fresh F25 RC-1.2 install. Procedure: 1. Clean install of Fedora 25 RC-1.2 Workstation live image, create user 'test' who is an admin 2. Boot installed system, log in as 'test', run through gnome-initial-setup 3. Run Firefox, close all the 'first run' tabs, open a new tab, browse to www.bluesnews.com 4. Log out At this point, the system flips to gdm basically *instantly*, it's really a very fast transition. 5. Log back in 6. Open Firefox again At this point I did not see the 'Well, this is embarrassing.' Firefox screen indicating a crash, but I got a single tab which went to one of the startup URLs, rather than bluesnews.com. 7. Browse to www.fedoraproject.org/wiki 8. Log out again (once again, very fast transition to gdm) 9. Log in again 10. Run Firefox again At this point I did get the "Well, this is embarrassing." Firefox screen indicating a previous crash, with the list of open tabs showing the Fedora wiki page I had previously opened. 11. Hit Restore 12. Browse to https://docs.fedoraproject.org/en-US/index.html 13. Shut down (note shutdown happens extremely fast) 14. Power up, log in as test 15. Run Firefox again 16. Get the "Well, this is embarrassing." screen again, this time with the list of open tabs showing the docs page 17. Run journalctl -b-1, note it shows abrt catching three GNOME processes 'crashing' but ignoring the crashes because abrtd is not running (probably because it's already been killed/shut down itself) Here are those journal lines: Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3343]: Process 3161 (gnome-calendar) of user 1000 killed by SIGTRAP - ignoring (abrtd is not running) Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3345]: Process 2899 (abrt-applet) of user 1000 killed by SIGTRAP - ignoring ('DebugLevel' == 0) Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3346]: Process 2887 (gnome-software) of user 1000 killed by SIGTRAP - ignoring (abrtd is not running) Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3346]: If abrtd crashed, /proc/sys/kernel/core_pattern contains a stale value, consider resetting it to 'core' Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3344]: Process 2901 (evolution-alarm-notify) of user 1000 killed by SIGTRAP - ignoring (abrtd is not running) Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3344]: If abrtd crashed, /proc/sys/kernel/core_pattern contains a stale value, consider resetting it to 'core' Nov 14 18:27:12 localhost.localdomain abrt-hook-ccpp[3343]: If abrtd crashed, /proc/sys/kernel/core_pattern contains a stale value, consider resetting it to 'core' Will attach the entire journal contents from the test.
Created attachment 1220655 [details] journal output from the test run
Sorry, this step: 7. Browse to www.fedoraproject.org/wiki should be just: 7. Browse to https://fedoraproject.org/wiki
(In reply to Adam Williamson from comment #0) [...] > So... we have the following conundrum: current code SGIKILLs graphical > terminal sessions. If we simply always wait, systemd will "hang" (for > 180s probably) in some cases during shutdown. I've always wondered just how likely this hang is. The systemd in RHEL 7, for instance, has been patched to always run with wait_for_exit = true (it doesn't support Delegate= at all). Has that ever been a problem with RHEL 7 users? As far as I can see, a 180s hang can only occur if the *last* process to exit in the cgroup is not parented by PID 1. After all, any time systemd receives a SIGCHLD it can manually check to see whether the cgroup is empty -- it doesn't need to rely solely on the cgroup emptiness notification.
Oh, I've seen it happen. We had KDE lives failing to shut down until a systemd timeout expired pretty recently.
Tested with KDE (which uses sddm as its login manager) and got more or less the same results. I don't *think* KDE has a user session like GNOME - at least, 'journalctl -b' as user and root show the same thing. On KDE test I got the right page without a 'crash' page on the *first* logout/login test, but I got a crash page on both the *second* logout/login test and the shutdown/startup test, as with GNOME. Will attach journal.
Created attachment 1220693 [details] journal output from KDE test run
Hrrrrrm. Welp. It appears there may be a bit of a problem with my test procedure, namely: https://bugzilla.mozilla.org/show_bug.cgi?id=336193 apparently Firefox doesn't do the right bloody thing when you send it a SIGTERM in any case. Will re-test tomorrow with something else.
I tried my test procedure from https://bugzilla.redhat.com/show_bug.cgi?id=1170765#c13 here: 1. Install a clean Fedora 21 2. Open a terminal, open two tabs, 'su' in one 3. Run 'test01' in one terminal, 'test02' in the other 4. Reboot (by running 'reboot' in either terminal or using the GNOME 'Restart' button, it doesn't seem to matter) 5. Open a terminal, check history for both user and root and see if you see 'test01' and 'test02' 6. Open two tabs, su in one, run 'test03' in one, 'test04' in the other 7. GOTO 4 (at step 5 check for 'test03' and 'test04', obviously) I tried several iterations of that on clean installs of F25 RC-1.2 Workstation and KDE, and could not produce history loss in either case. Can any of the folks who reported that they were seeing this issue again in #1170765 (now CCed) provide a precise reproducer? Thanks.
It doesn't happen every time, but what you did in comment #11 occasionally leads to it. I only saw it on a VM I have thus far. I did not see it yet on my ThinkPad T450s, for instance.
1. For the loss of bash history I have this observation: SSH to the server (session 1) type date type date type history 10046 2016-11-16 00:29:58 date 10047 2016-11-16 00:30:00 date 10048 2016-11-16 00:30:01 history $ echo $$ 7793 The PID 7793 is the bash PID Parent Process ID of 7793 is the sshd login process of session 1. In another ssh session to the server. (session 2) Killing the PID of the bash process (7793) or the ssh login process of session 1 would destroy session 1. By default, SIGTERM is used for the kill command if no signal is specified. I think the loss of bash history is due to the SIGTERM being sent when the server is shutdown. 2. For my case in #Bug 1170765, comment 82 https://bugzilla.redhat.com/show_bug.cgi?id=1170765#c82 I had shutdown script for the database. From my observation, when the server is shutdown, SIGTERM is send out immediately. The shutdown script do not have enough time to finish, and the SIGTERM killed the database process and cause problem.
So... after some testing and putting log_unit_debug in systemd in some crucial places, I'm not convinced that there's a bug in systemd. Before I wrote: (In reply to Adam Williamson from comment #0) > "I think the following is happening: > systemd will not wait for processes to terminate under certain conditions. In > this case the relevant one is that Delegate=yes is set for the unit. > Currently, gnome-terminal runs as systemd --user unit, which means that from > the POV of PID1, it is part of user@.service, which has Delegate=yes. While user@.service has Delegate=yes, it also has MainPID= set. The condition mentioned above is that Delegate=yes, but also that Main is NOT set. So this quick killing path does not apply in this case. gnome-terminal will get a normal SIGTERM + SIGHUP combo, and then systemd will wait for it to terminate. When I logout by clicking the logout button, I see that gnome-terminal "crashes" fairly reliably. I filed #1395389. So what I think is happening: - the logout procedure is started (for example by the user clicking "logout") - systemd notices that the session should terminate, so it sends SIGSTERM and SIGHUP to various processes in session-N.scope - various processes start exiting - in particular gnome-shell exits, terminating the wayland server - gnome-terminal tries to communicate with wayland server, it cannot, and crashes I'm not certain why that would lead to history loss: I'm not sure what happens to the bash process when the terminal emulator does SIGTRAP. It seems like it is able to save history, but maybe occasionally it fails. Looking at the logs in comment #c4, various other wayland-enabled stuff crashes in a similar way: gnome-calendar, gnome-software, evolution-alarm, abrt-applet, gnome-settings-daemon. There's also this tidbit: Nov 14 18:27:11 localhost.localdomain org.gnome.Shell.desktop[2813]: (EE) Nov 14 18:27:11 localhost.localdomain org.gnome.Shell.desktop[2813]: Fatal server error: Nov 14 18:27:11 localhost.localdomain org.gnome.Shell.desktop[2813]: (EE) failed to read Wayland events: Connection reset by peer Nov 14 18:27:11 localhost.localdomain org.gnome.Shell.desktop[2813]: (EE) When looking at the journalctl output, this comes from gnome-shell itself, those messages are tagged with _COMM=gnome-shell. I'm not sure what is going on here, maybe gnome-wayland is not aware that it's shutting down. tl;dr I don't see evidence of systemd killing things, just of various processes dying messily in the shutdown sequence. For example, gnome-terminal which is socket-activated must be ready to terminate cleanly in the display server goes away.
(In reply to Patrick Dung from comment #13) > I think the loss of bash history is due to the SIGTERM being sent when the > server is shutdown. SIGTERM is *the* signal to terminate processes. A process should exit cleanly when it receives SIGTERM. So the fact that a SIGTERM is sent is not a sign of anything being wrong, it should be a normal part of the shutdown sequence.
And update on my last comment in part 1 (loss of bash history) The missing bash history is not visible in session 2. If you open a new ssh session (after session 1 is killed(by SIGTERM)), the bash history is there... 10051 2016-11-16 00:29:58 date 10052 2016-11-16 00:30:00 date 10053 2016-11-16 00:30:01 history If SIGKILL was used, those bash history would be gone. Sorry for the confusion, and thanks Zbigniew for the explanation.
Patrick Dung: did you observe any kind of *change* in behaviour here recently? It's hard to tell whether your case is a bug or not without details on what database you're using and what exactly these 'shutdown scripts' are and how they work.
Zbigniew: thanks for the investigation. One question that remains for me is why several people in the original bug think they saw something change in F23/F24 recently, since F23/F24 have not switched to Wayland. (But then, there hasn't been a relevant systemd update either, AFAICT).
gnome-terminal started running in the systemd --user session in F24 (I believe, it is certainly doing that in F24, but I don't have a F23 installation at hand to check, but I think it wasn't then). This changes behaviour and timing.
I think this bug and #1366897 are related. One of them might be mistaken for the other if they are different at all.
(In reply to Adam Williamson from comment #18) > Zbigniew: thanks for the investigation. One question that remains for me is > why several people in the original bug think they saw something change in > F23/F24 recently, since F23/F24 have not switched to Wayland. (But then, > there hasn't been a relevant systemd update either, AFAICT). No wayland on the VM where it occasionally happens for me. I'm running xrdp there, so Xvnc is running behind the scenes. Just FYI.
Yeah, #1366897 is what I was getting at.
Created attachment 1220977 [details] for Patrick Dung's db shutdown script problem (systemd) This is for systemd
Created attachment 1220978 [details] for Patrick Dung's db shutdown script problem (actual script to stop db) for Patrick Dung's db shutdown script problem (actual script to stop db)
Patrick: OK, I don't think your case really has much to do with anyone else's. I still can't tell without information on how the Oracle database itself is started and shut down, but most likely you've simply got a race going on there. It'd probably be best to follow up with a separate bug (although I suspect in the end this won't be a bug in Fedora at all but a bug in your design) or a thread on a mailing list about how to properly do something like this.
I agree this and #1366897 seem similar, but the discussion in #1366897 seems fairly heavily based on the idea that it's specific to Wayland, and we do have people here claiming they're seeing similar behaviour on X.
(In reply to Adam Williamson from comment #25) > Patrick: OK, I don't think your case really has much to do with anyone > else's. I still can't tell without information on how the Oracle database > itself is started and shut down, but most likely you've simply got a race > going on there. It'd probably be best to follow up with a separate bug > (although I suspect in the end this won't be a bug in Fedora at all but a > bug in your design) or a thread on a mailing list about how to properly do > something like this. Ok, I may open a separate bug report or check in mailing list. FYI Observed behaviour: When shutdown is called, the DB process is terminated (by SIGTERM). The shutdown script do not have enough time to run. I see SIGKILL later on the console, so the DB is probably not killed by SIGKILL. If you want to go further, you can append ps and redirect the current process list right after the sqlplus shutdown command is run. In my case, I found that the DB processes are already gone.
Yeah, please open a separate bug, this one is already messy enough. In particular I'm not seeing any After/Before relationship in your shutdown service. In the new bug, please also post the logs from the shutdown, preferably after 'systemd set-log-level debug'.
-1 to this as a blocker. It might be a candidate for our nascent "important non-blocking bugs" process.
Also -1 for blocker but do think it warrants attention. I may not be able to attend opening of go/no-go today but wanted to have that in record.
Rationale for -1 to blocker, in case it's relevant: since the problem exists in updated F23 and F24, blocking F25 does not seem to be the right tool.
Matthew: I'm honestly reluctant to claim that. The #1366897 case is the only really clear one we have here, and that one is simply a GNOME + Wayland bug; it technically 'exists' in F23 and F24 because you can use Wayland there, but few people do. It's obviously much more important in F25. But we did already discuss that case on its own and reject it as a blocker. We have people claiming that "this problem" has suddenly re-appeared in F23 and F24, but on closer inspection I'm not at all sure that they're all actually talking about the same thing, or even precisely what they're seeing. No-one's yet posted a really clear-cut reproducer with information on when it was working and when it stopped working, AFAICT. Bojan's case is maybe the clearest, but I still can't reproduce it on an OOTB install...
This also prevents Emacs for saving modified files for later recovery, which it would do on SIGTERM.
(In reply to Eric Smith from comment #33) > This also prevents Emacs for saving modified files for later recovery, which > it would do on SIGTERM. What is "this"? If you read the discussion, the only thing that is clearly wrong is some applications using wayland crashing when the wayland server goes away. But emacs still uses X, i.e. Xwayland or Xorg if you are not using wayland... Please be more precise in your statements.
(In reply to Zbigniew Jędrzejewski-Szmek from comment #34) > If you read the discussion, the only thing that is clearly > wrong is some applications using wayland crashing when the wayland server > goes away. No Wayland here. Xvnc behind xrdp.
Happens to me on definitely not on Wayland. I am using google-chrome on F24 with the X-Server, and more often than not (but not always), logging out or shutting down causes chrome to ask for recovery on the next start.
Alexander: do you know if that's not just expected behaviour for Chrome? I thought there was a similar 'bug' for Firefox, but it turns out that's just what Firefox does when it's SIGTERMed: https://bugzilla.mozilla.org/show_bug.cgi?id=336193
(In reply to Adam Williamson from comment #37) > Alexander: do you know if that's not just expected behaviour for Chrome? I > thought there was a similar 'bug' for Firefox, but it turns out that's just > what Firefox does when it's SIGTERMed: Chrome seems to be doing the right thing though. When I pkill -SIGTERM chrome then chrome closes nicely and after startup shows my previous tabs. When I do a SIGKILL, tabs are dead and I get the recovery option. So I don't think chrome is bugged regarding shutdown. Worth Noting: 1) Same behavior on Wayland and Xorg 2) I can't get it to crash when just logging out 3) Rebooting or shutting down crashes it for me every time 4) The history in gnome-terminal according to your test case is saving just fine for me, while chrome reliably crashes. The history is probably a different bug and shouldn't be used to reproduce this one.
This message is a reminder that Fedora 25 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '25'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 25 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Still happens on F26.
This message is a reminder that Fedora 26 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '26'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 26 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
I think this still happens in F27, F28 and Rawhide.
This bug appears to have been reported against 'rawhide' during the Fedora 29 development cycle. Changing version to '29'.
*** Bug 1652312 has been marked as a duplicate of this bug. ***
This message is a reminder that Fedora 29 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 29 on 2019-11-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '29'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 29 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 29 changed to end-of-life (EOL) status on 2019-11-26. Fedora 29 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.