Bug 1047614
Summary: [GSS 7.0 Disc] Powering off remote node doesn't close ssh session
Product: Red Hat Enterprise Linux 7
Reporter: Madison Kelly <mkelly>
Component: systemd
Assignee: Michal Sekletar <msekleta>
Status: CLOSED CURRENTRELEASE
QA Contact: Leos Pol <lpol>
Severity: high
Docs Contact:
Priority: urgent
Version: 7.0
CC: alexander.hass, andriusb, ayadav, ebenes, fdanapfe, ffotorel, fwissing, h.reindl, ichute, juzhang, kajtzu, lmiccini, lnykryn, lpol, mihai, mkolman, msekleta, mullens, myllynen, nagata3333333, nparmar, ohudlick, pasteur, plautrba, pwouters, rjones, rsawhill, sbeal, sgrubb, ssahani, stephan.wiesand, systemd-maint-list, tessarek, theinric, toracat, vanhoof
Target Milestone: rc
Keywords: Regression, Reopened
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: systemd-208-9.el7
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-07-28 08:58:56 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 717785, 860099, 1018952, 1050219, 1203710, 1289485, 1313485
Attachments:
Description
Madison Kelly
2014-01-01 04:39:25 UTC
This is the problem in systemd logic. systemd doesn't stop user sessions before it shuts down the network. It's already reported for Fedora 20 - https://bugzilla.redhat.com/show_bug.cgi?id=1023788

how come such regressions compared to systemd-204 of F19 make it into a systemd release and stay there for many weeks? even with ssh root@host "systemctl reboot; exit" you have a frozen VT. have fun if you are on a physical machine with no X11 and VT1-VT6 are connected to F20/RHEL7 machines you would like to reboot

(In reply to Petr Lautrbach from comment #2)
> This is the problem in systemd logic. systemd doesn't stop user sessions
> before it shuts down a network. It's already reported for Fedora 20 -
> https://bugzilla.redhat.com/show_bug.cgi?id=1023788

Hmmm, I am not convinced this is the problem. AFAICT, in my tests I made sure that the order in which services are stopped is correct. But the following message popped up in the journal:

Jan 17 20:29:36 localhost systemd[1]: Failed to destroy cgroup /user.slice/user-0.slice: Device or resource busy

As I see it, systemd should successfully destroy the cgroup corresponding to the slice/session (and the processes in it), or try harder later if it is not possible at the first try. Petr, can you please attach the journal log from your machine? Make sure systemd is running with log level set to debug (kill -56 1). Please use a persistent journal (mkdir -p /var/log/journal && systemctl restart systemd-journald), because I want everything, not just what rsyslog is able to dig from the journal files. Thanks!

I'm seeing this issue too when running libreswan testcases using f20/rhel7 VMs. It is causing many false positives, so it would be _really_ nice to get this fixed.

nice? that should have been a release blocker for F20 given that i reported this *months* before GA at https://bugzilla.redhat.com/show_bug.cgi?id=1023788

here you have a hit-list of systemd-troubles in F20/RHEL7 which are the biggest regressions since Fedora 15
https://bugzilla.redhat.com/show_bug.cgi?id=1023820
https://bugzilla.redhat.com/show_bug.cgi?id=1010572
https://bugzilla.redhat.com/show_bug.cgi?id=1057811
https://bugzilla.redhat.com/show_bug.cgi?id=1057618
https://bugzilla.redhat.com/show_bug.cgi?id=1023788#c

Just for the record, this problem is also messing up the audit trail. I can't see user sessions getting terminated, and they look like a crash. I also think there are PAM modules that allocate things like namespaces, mounts, devices, etc., meaning that not being able to properly close out PAM means the resources never get released back to the OS. So, this bug is kind of important to have fixed.

Created attachment 862821 [details]
journal.log
You need to disable NetworkManager.service and enable network.service.
[root@rhel-7-devel ~]# kill -56 1
[root@rhel-7-devel ~]# date
Thu Feb 13 15:43:52 CET 2014
[root@rhel-7-devel ~]# reboot
Write failed: Broken pipe
$ ssh root@rhel-7-devel
root@rhel-7-devel's password:
Last login: Thu Feb 13 15:43:24 2014 from master.virt
[root@rhel-7-devel ~]# journalctl -l --since="15:43:52" > journal.log
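For anyone else collecting the same data, the log-capture steps requested earlier in this report could be wrapped roughly as follows. This is only a sketch: the function names `prepare_debug_logging` and `save_journal_since` are made up for illustration, the commands are the ones quoted in the comments above, and signal 56 is assumed to correspond to SIGRTMIN+22 (the systemd "log level debug" signal) as it does on typical Linux/glibc systems.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical; commands are those quoted in this report.

# Run as root BEFORE the reboot: switch PID 1 to debug logging and
# make the journal persistent so the shutdown messages survive.
prepare_debug_logging() {
    kill -56 1                          # assumed to be SIGRTMIN+22 (debug log level)
    mkdir -p /var/log/journal           # persistent journal directory
    systemctl restart systemd-journald  # pick up the persistent storage
}

# Run AFTER the reboot: dump everything logged since the given time.
#   $1 = start time as accepted by journalctl --since (e.g. "15:43:52")
#   $2 = output file (defaults to journal.log)
save_journal_since() {
    local since="$1" out="${2:-journal.log}"
    journalctl -l --since="$since" > "$out"
}
```

Usage would be `prepare_debug_logging; date; reboot`, and after logging back in, `save_journal_since "15:43:52" journal.log` with the time printed by `date`.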
Created attachment 862834 [details]
journal-NM.log
using NetworkManager.service it seems to work:
[root@rhel-7-devel ~]# systemctl disable network.service
network.service is not a native service, redirecting to /sbin/chkconfig.
Executing /sbin/chkconfig network off
[root@rhel-7-devel ~]# systemctl enable NetworkManager.service
ln -s '/usr/lib/systemd/system/NetworkManager.service' '/etc/systemd/system/dbus-org.freedesktop.NetworkManager.service'
ln -s '/usr/lib/systemd/system/NetworkManager.service' '/etc/systemd/system/multi-user.target.wants/NetworkManager.service'
ln -s '/usr/lib/systemd/system/NetworkManager-dispatcher.service' '/etc/systemd/system/dbus-org.freedesktop.nm-dispatcher.service'
[root@rhel-7-devel ~]# reboot
Write failed: Broken pipe
$ ssh root@rhel-7-devel
root@rhel-7-devel's password:
Last login: Thu Feb 13 15:45:29 2014 from master.virt
[root@rhel-7-devel ~]# kill -56 1
[root@rhel-7-devel ~]# date
Thu Feb 13 15:51:51 CET 2014
[root@rhel-7-devel ~]# reboot
Broadcast message from root@rhel-7-devel on pts/0 (Thu 2014-02-13 15:51:54 CET):
The system is going down for reboot NOW!
[root@rhel-7-devel ~]# Connection to rhel-7-devel closed by remote host.
Connection to rhel-7-devel closed.
$ ssh root@rhel-7-devel
root@rhel-7-devel's password:
X11 forwarding request failed on channel 0
Last login: Thu Feb 13 15:51:35 2014 from master.virt
[root@rhel-7-devel ~]# journalctl -l --since="15:51:51" > journal-NM.log
systemd-upstream claims this to be fixed somewhere and sometime. i asked yesterday on the systemd list and only got an arrogant reply that only active systemd developers are allowed to criticize

-------- Original Message --------
Subject: Re: [systemd-devel] https://bugzilla.redhat.com/show_bug.cgi?id=1047614
Date: Wed, 12 Feb 2014 21:19:02 +0100
From: Lennart Poettering <lennart>
Organization: Red Hat, Inc.
To: Reindl Harald <h.reindl>
CC: Mailing-List systemd <systemd-devel.org>

On Wed, 12.02.14 20:05, Reindl Harald (h.reindl) wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=1047614
>
> Product: Red Hat Enterprise Linux 7
> Component: systemd (Show other bugs)
> Version: 7.0
> Hardware: Unspecified Unspecified
> Priority urgent Severity high
>
> first reported more than 3 months ago
> https://bugzilla.redhat.com/show_bug.cgi?id=1023788
>
> maybe systemd-upstream should consider slow down development
> and spend more energy in quality and stability

Well, firstly, it's hardly your business how we spend our time. Secondly, this bug is fixed upstream. Thirdly, patches count more than complaining.

*** Bug 1032109 has been marked as a duplicate of this bug. ***

*** Bug 1039806 has been marked as a duplicate of this bug. ***

systemd-210 in Fedora Rawhide fixes this problem and some other nasty things - hopefully a switch to version 210 is considered for RHEL7 as well as F20 instead of trying to backport cherry-picked fixes

It is not planned to rebase to 210 in RHEL7 or Fedora 20. A backport of the required fixes is underway; however, a ton of changes were introduced in the 209 release cycle, so backporting is hard. Anyway, this will be fixed soon.

[root@rhel7 ~]# rpm -q systemd
systemd-208-9.el7.x86_64
[root@rhel7 ~]# poweroff
Connection to rhel7.virt closed by remote host.
Connection to rhel7.virt closed.

*** Bug 1078906 has been marked as a duplicate of this bug. ***

This request was resolved in Red Hat Enterprise Linux 7.0.
Contact your manager or support representative in case you have further questions about the request.

Possible workaround: add a drop-in configuration file /etc/systemd/system/systemd-user-sessions.service.d/after-network.conf with the following content,
[Unit]
After=network.target
and reload systemd,
systemctl daemon-reload

I'm on RHEL 7.2 using network.service (NetworkManager.service is stopped / disabled) because we're 100% static IP, no wifi. NetworkManager keeps modifying crap, so we just disabled it and enabled network.service. I've tried Michal's workaround, no luck. systemd-219-19.el7.x86_64 is installed. I'm still having this issue. Any ideas how to fix this?

I hesitate to call it a workaround, but at least it prevents having to kill the session if you're rebooting a server through a jump box.
# (nohup sleep 10;reboot) &
# logout

I was unable to reproduce this issue. Could you try to get a shutdown log for the issue with the new version of systemd?
https://freedesktop.org/wiki/Software/systemd/Debugging/#index2h1

I've built some test packages which contain a related fix. Feel free to try them out.
http://people.redhat.com/~msekleta/systemd-219-20.el7.0.bz1047614/

I believe this bug should be fixed by this upstream commit. Could you please try it out?
https://github.com/systemd/systemd/issues/2390
https://github.com/systemd/systemd/commit/8c856804780681e135d98ca94d08afe247557770

please network.target in the After= directive.
----------------------------------------------------
# cat system/systemd-user-sessions.service
# This file is part of systemd.
#
# systemd is free software; you can redistribute it and/or modify it
# under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
[Unit]
Description=Permit User Sessions
Documentation=man:systemd-user-sessions.service(8)
After=remote-fs.target nss-user-lookup.target network.target
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/lib/systemd/systemd-user-sessions start
ExecStop=/usr/lib/systemd/systemd-user-sessions stop
--------------------------------------------------

The hang on the client side happens because the processes under the PAM session created by ssh are not put into a .scope systemd unit, i.e. systemd is not aware of those processes (loginctl knows about 0 sessions). Because of that, this group of processes is not scheduled to stop at shutdown and stays running up until the final killing spree, but even then, we first send SIGTERM and then SIGKILL. Hence, if the network connection is still up, there should be a chance to close the ssh connection correctly. From the shutdown log I see that the customer is using network initscripts, and the ifdown script will put the interface down if NetworkManager is not used.

The weird thing is that the PAM config actually looks ok, because pam_systemd is listed in password-auth and that is then included by the sshd PAM config, so processes should really be inside .scope units. Can you verify that the pam_systemd module is present on the system? To do that you can run rpm -qV systemd-libs.

I've analysed the sos_report and confirmed my suspicion about an incorrect PAM configuration. I think there are two ways to resolve this issue:
1) either fix the PAM config to include the pam_systemd.so module, so that all user processes are registered in their respective scope units and those are scheduled to shut down before network connections are terminated by the ifdown scripts, or
2) use NetworkManager instead of initscripts. NM doesn't put interfaces down when it is stopped, so the ssh session can get gracefully terminated.
Either way, there is nothing to fix in systemd or a related component. Closing as CURRENTRELEASE. Feel free to reopen in case I've missed something.

This bug still exists in Redhat 7.3, btw.
$ reboot
PolicyKit daemon disconnected from the bus.
We are no longer a registered authentication agent.
<after about 60 seconds>
packet_write_wait: Connection to xx.xx.xx.xx: Broken pipe

(In reply to Helmut Tessarek from comment #49)
> This bug still exists in Redhat 7.3, btw.

This issue should be reproducible only when you don't use pam_systemd and you use legacy network initscripts instead of NetworkManager. In case you see the issue but your system is not set up in a way I described above then please file a new bug report.
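
For readers who want to check whether their setup matches the conditions described above, a rough way to test the PAM side is sketched below. The helper name `pam_systemd_configured` is hypothetical, and the check is deliberately simple: it only greps the files it is given for an active session line that loads pam_systemd.so and does not follow include or substack directives, so both the sshd file and the file it includes have to be passed explicitly.

```shell
#!/usr/bin/env bash
# Sketch only: pam_systemd_configured is a hypothetical helper name.

# Succeeds (exit 0) if any of the given PAM files contains an active,
# uncommented "session" line that loads pam_systemd.so.
# A leading "-" before the type (e.g. "-session optional pam_systemd.so")
# is valid PAM syntax and is accepted here.
pam_systemd_configured() {
    [ "$#" -gt 0 ] || return 2   # refuse to read stdin when no files are given
    grep -Eq '^[[:space:]]*-?session[[:space:]].*pam_systemd\.so' "$@"
}
```

On a system like the one discussed here this might be called as `pam_systemd_configured /etc/pam.d/sshd /etc/pam.d/password-auth`. If it succeeds, running `loginctl list-sessions` from inside an ssh login should show that session; an empty list would match the "loginctl knows about 0 sessions" symptom from the analysis.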