Bug 1047614

Summary:

[GSS 7.0 Disc] Powering off remote node doesn't close ssh session

Product:

Red Hat Enterprise Linux 7

Reporter:

Madison Kelly <mkelly>

Component:

systemd

Assignee:

Michal Sekletar <msekleta>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Leos Pol <lpol>

Severity:

high

Priority:

urgent

Version:

7.0

CC:

alexander.hass, andriusb, ayadav, ebenes, fdanapfe, ffotorel, fwissing, h.reindl, ichute, juzhang, kajtzu, lmiccini, lnykryn, lpol, mihai, mkolman, msekleta, mullens, myllynen, nagata3333333, nparmar, ohudlick, pasteur, plautrba, pwouters, rjones, rsawhill, sbeal, sgrubb, ssahani, stephan.wiesand, systemd-maint-list, tessarek, theinric, toracat, vanhoof

Keywords:

Regression, Reopened

Hardware:

All

OS:

Linux

Fixed In Version:

systemd-208-9.el7

Doc Type:

Bug Fix

Last Closed:

2016-07-28 08:58:56 UTC

Type:

Bug

Bug Blocks:

717785, 860099, 1018952, 1050219, 1203710, 1289485, 1313485

Attachments:

Description	Flags
journal.log	none
journal-NM.log	none

Description Madison Kelly 2014-01-01 04:39:25 UTC

Description of problem:

Simple one. If you ssh into a RHEL 7 beta server and power it off with 'poweroff', the ssh session hangs instead of closing.


Version-Release number of selected component (if applicable):

openssh-6.4p1-1.el7.x86_64


How reproducible:

Seems to be 100% (based on minimal installs on KVM VMs)


Steps to Reproduce:
1. Install RHEL 7 minimal
2. SSH into RHEL 7 machine
3. Type 'poweroff'.

Actual results:

Terminal hangs until ~.<enter> is pressed.


Expected results:

ssh session closes


Additional info:

Comment 2 Petr Lautrbach 2014-01-02 08:49:21 UTC

This is a problem in systemd's logic: systemd doesn't stop user sessions before it shuts down the network. It's already reported for Fedora 20 - https://bugzilla.redhat.com/show_bug.cgi?id=1023788

Comment 3 Harald Reindl 2014-01-15 02:01:26 UTC

how come such regressions compared to systemd-204 of F19 make it into a systemd-release and stay there for many weeks?

even with ssh root@host "systemctl reboot; exit" you have a frozen VT 

have fun if you are on a physical machine with no X11
and VT1-VT6 are connected to F20/RHEL7 machines you
like to reboot

Comment 5 Michal Sekletar 2014-01-17 19:59:34 UTC

(In reply to Petr Lautrbach from comment #2)
> This is the problem in systemd logic. systemd doesn't stop user sessions
> before it shuts down a network. It's already reported for Fedora 20 -
> https://bugzilla.redhat.com/show_bug.cgi?id=1023788

Hmmm, I am not convinced this is the problem. AFAICT, in my tests I made sure that the order in which services are stopped is correct. But the following message popped up in the journal:

Jan 17 20:29:36 localhost systemd[1]: Failed to destroy cgroup /user.slice/user-0.slice: Device or resource busy

As I see it, systemd should successfully destroy the cgroup corresponding to the slice/session (and the processes in it), or try harder later if that is not possible on the first try.

Petr, can you please attach the journal log from your machine? Make sure systemd is running with the log level set to debug (kill -56 1). Please use a persistent journal (mkdir -p /var/log/journal && systemctl restart systemd-journald), because I want everything, not just what rsyslog is able to dig out of the journal files.

Thanks!

Comment 7 Paul Wouters 2014-02-06 20:28:44 UTC

I'm seeing this issue too when running libreswan testcases using f20/rhel7 VMs. It is causing many false positives, so it would be _really_ nice to get this fixed.

Comment 8 Harald Reindl 2014-02-06 20:31:30 UTC

nice?

that should have been a release blocker for F20 given that i reported this *months* before GA at https://bugzilla.redhat.com/show_bug.cgi?id=1023788

here you have a hit-list of systemd-troubles in F20/RHEL7
which are the biggest regressions since Fedora 15 

https://bugzilla.redhat.com/show_bug.cgi?id=1023820
https://bugzilla.redhat.com/show_bug.cgi?id=1010572
https://bugzilla.redhat.com/show_bug.cgi?id=1057811
https://bugzilla.redhat.com/show_bug.cgi?id=1057618
https://bugzilla.redhat.com/show_bug.cgi?id=1023788#c

Comment 9 Steve Grubb 2014-02-12 16:54:48 UTC

Just for the record, this problem is also messing up the audit trail. I can't see user sessions getting terminated and they look like a crash.

I also think there are PAM modules that allocate things like namespaces, mounts, devices, etc. That means that not being able to properly close out PAM leaves those resources unreleased. So, this bug is kind of important to have fixed.

Comment 10 Petr Lautrbach 2014-02-13 14:49:51 UTC

Created attachment 862821
journal.log

To reproduce, you need to disable NetworkManager.service and enable network.service.

[root@rhel-7-devel ~]# kill -56 1
[root@rhel-7-devel ~]# date
Thu Feb 13 15:43:52 CET 2014
[root@rhel-7-devel ~]# reboot
Write failed: Broken pipe
$ ssh root@rhel-7-devel 
root@rhel-7-devel's password: 
Last login: Thu Feb 13 15:43:24 2014 from master.virt
[root@rhel-7-devel ~]# journalctl -l --since="15:43:52" > journal.log

Comment 11 Petr Lautrbach 2014-02-13 14:55:50 UTC

Created attachment 862834
journal-NM.log

Using NetworkManager.service, it seems to work:

[root@rhel-7-devel ~]# systemctl disable network.service
network.service is not a native service, redirecting to /sbin/chkconfig.
Executing /sbin/chkconfig network off
[root@rhel-7-devel ~]# systemctl enable NetworkManager.service
ln -s '/usr/lib/systemd/system/NetworkManager.service' '/etc/systemd/system/dbus-org.freedesktop.NetworkManager.service'
ln -s '/usr/lib/systemd/system/NetworkManager.service' '/etc/systemd/system/multi-user.target.wants/NetworkManager.service'
ln -s '/usr/lib/systemd/system/NetworkManager-dispatcher.service' '/etc/systemd/system/dbus-org.freedesktop.nm-dispatcher.service'
[root@rhel-7-devel ~]# reboot
Write failed: Broken pipe
$ ssh root@rhel-7-devel 
root@rhel-7-devel's password: 
Last login: Thu Feb 13 15:45:29 2014 from master.virt
[root@rhel-7-devel ~]# kill -56 1
[root@rhel-7-devel ~]# date
Thu Feb 13 15:51:51 CET 2014
[root@rhel-7-devel ~]# reboot

Broadcast message from root@rhel-7-devel on pts/0 (Thu 2014-02-13 15:51:54 CET):

The system is going down for reboot NOW!

[root@rhel-7-devel ~]# Connection to rhel-7-devel closed by remote host.
Connection to rhel-7-devel closed.

$ ssh root@rhel-7-devel 
root@rhel-7-devel's password: 
X11 forwarding request failed on channel 0
Last login: Thu Feb 13 15:51:35 2014 from master.virt

[root@rhel-7-devel ~]# journalctl -l --since="15:51:51" > journal-NM.log

Comment 12 Harald Reindl 2014-02-13 15:16:48 UTC

systemd-upstream claims this to be fixed somewhere and sometime
asked yesterday on the systemd-list and only got an arrogant
reply that only active systemd developers are allowed to criticize

-------- Original Message --------
Subject: Re: [systemd-devel] https://bugzilla.redhat.com/show_bug.cgi?id=1047614
Date: Wed, 12 Feb 2014 21:19:02 +0100
From: Lennart Poettering <lennart>
Organization: Red Hat, Inc.
To: Reindl Harald <h.reindl>
Cc: Mailing-List systemd <systemd-devel.org>

On Wed, 12.02.14 20:05, Reindl Harald (h.reindl) wrote:

> https://bugzilla.redhat.com/show_bug.cgi?id=1047614
> 
> Product: 	Red Hat Enterprise Linux 7
> Component: 	systemd (Show other bugs)
> Version: 	7.0
> Hardware: 	Unspecified Unspecified
> Priority 	urgent Severity high
> 
> first reported more than 3 months ago
> https://bugzilla.redhat.com/show_bug.cgi?id=1023788
> 
> maybe systemd-upstream should consider slow down development
> and spend more energy in quality and stability

Well, firstly, it's hardly your business how we spend our time.

Secondly, this bug is fixed upstream.

Thirdly, patches count more than complaining.

Comment 16 Lukáš Nykrýn 2014-02-25 16:13:46 UTC

*** Bug 1032109 has been marked as a duplicate of this bug. ***

Comment 17 Lukáš Nykrýn 2014-02-25 16:51:26 UTC

*** Bug 1039806 has been marked as a duplicate of this bug. ***

Comment 19 Harald Reindl 2014-03-02 15:49:36 UTC

systemd-210 in Fedora Rawhide fixes this problem and some other nasty things - hopefully switching to version 210 will be considered for RHEL7 as well as F20, instead of trying to backport cherry-picked fixes

Comment 20 Michal Sekletar 2014-03-05 07:01:43 UTC

It is not planned to rebase to 210 in RHEL7 or Fedora 20. A backport of the required fixes for this is underway; however, there have been a ton of changes introduced in the 209 release cycle, so backporting is hard. Anyway, this will be fixed soon.

Comment 22 Leos Pol 2014-03-17 14:14:09 UTC

[root@rhel7 ~]# rpm -q systemd
systemd-208-9.el7.x86_64
[root@rhel7 ~]# poweroff
Connection to rhel7.virt closed by remote host.
Connection to rhel7.virt closed.

Comment 23 Lukáš Nykrýn 2014-03-20 15:56:18 UTC

*** Bug 1078906 has been marked as a duplicate of this bug. ***

Comment 24 Ludek Smid 2014-06-13 11:46:33 UTC

This request was resolved in Red Hat Enterprise Linux 7.0.

Contact your manager or support representative in case you have further questions about the request.

Comment 29 Michal Sekletar 2016-01-25 12:21:16 UTC

Possible workaround:

add a drop-in configuration file /etc/systemd/system/systemd-user-sessions.service.d/after-network.conf

with the following content,

[Unit]
After=network.target

and reload systemd,

systemctl daemon-reload
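
The steps above could be scripted roughly as follows. This is only a sketch of the same workaround, not an additional fix. The ROOT variable is a hypothetical helper introduced here (it is not part of systemd) so that the file can first be written under a scratch directory and inspected; on a real system the script would be run as root with ROOT=/.

```shell
# ROOT is an illustrative variable, not a systemd setting. It defaults to a
# scratch directory so the sketch can be tried without touching /etc.
ROOT="${ROOT:-$(mktemp -d)}"
dropin_dir="${ROOT%/}/etc/systemd/system/systemd-user-sessions.service.d"

mkdir -p "$dropin_dir"

# Order systemd-user-sessions.service after network.target. At shutdown the
# ordering is applied in reverse, so the unit is stopped before network.target.
cat > "$dropin_dir/after-network.conf" <<'EOF'
[Unit]
After=network.target
EOF

# Reload the running manager only when the file went into the real /etc.
if [ -z "${ROOT%/}" ]; then
    systemctl daemon-reload
fi

echo "wrote $dropin_dir/after-network.conf"
```

Running it once without ROOT set shows where the file would land; rerunning with ROOT=/ applies the drop-in and reloads systemd.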

Comment 30 Sean Mullen 2016-02-17 14:15:04 UTC

I'm on RHEL 7.2 using network.service (NetworkManager.service is stopped / disabled) because we're 100% static IP, no wifi.  NetworkManager keeps modifying crap so we just disabled it and enabled network.service.

I've tried Michal's workaround, no luck.

systemd-219-19.el7.x86_64 is installed.

I'm still having this issue. Any ideas how to fix this?

Comment 31 Freddy Wissing 2016-02-17 14:46:45 UTC

I hesitate to call it a workaround, but at least it prevents having to kill the session if you're rebooting a server through a jump box.

# (nohup sleep 10;reboot) &

# logout

Comment 32 Lukáš Nykrýn 2016-04-11 11:21:50 UTC

I was unable to reproduce this issue. Could you try to get a shutdown log for the issue with the new version of systemd?

https://freedesktop.org/wiki/Software/systemd/Debugging/#index2h1

Comment 33 Michal Sekletar 2016-05-05 13:27:57 UTC

I've built some test packages which contain a related fix. Feel free to try them out.

http://people.redhat.com/~msekleta/systemd-219-20.el7.0.bz1047614/

Comment 34 Susant Sahani 2016-05-24 07:58:52 UTC

I believe this bug should be fixed by this upstream commit.

Could you please try it out?

https://github.com/systemd/systemd/issues/2390
https://github.com/systemd/systemd/commit/8c856804780681e135d98ca94d08afe247557770

Please add network.target to the After= directive.
----------------------------------------------------
# cat system/systemd-user-sessions.service
#  This file is part of systemd.
#
#  systemd is free software; you can redistribute it and/or modify it
#  under the terms of the GNU Lesser General Public License as published by
#  the Free Software Foundation; either version 2.1 of the License, or
#  (at your option) any later version.

[Unit]
Description=Permit User Sessions
Documentation=man:systemd-user-sessions.service(8)
After=remote-fs.target nss-user-lookup.target network.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/lib/systemd/systemd-user-sessions start
ExecStop=/usr/lib/systemd/systemd-user-sessions stop
--------------------------------------------------

Comment 39 Michal Sekletar 2016-05-26 16:01:56 UTC

The hang on the client side happens because processes under the PAM session created by ssh are not put into a .scope systemd unit, i.e. systemd is not aware of those processes (loginctl knows about 0 sessions). Because of that, this group of processes is not scheduled to stop at shutdown and stays running until the final killing spree, but even then we first send SIGTERM and then SIGKILL. Hence, if the network connection is still up, there should be a chance to close the ssh connection correctly. From the shutdown log I see that the customer is using network initscripts, and the ifdown script will put the interface down if NetworkManager is not used.
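
One way to check this condition from inside an ssh session is to look at the cgroup of the current shell. The sketch below is an illustration only: in_session_scope is a hypothetical helper name, and it assumes session scopes appear in /proc/self/cgroup with names of the form session-<id>.scope, as they usually do when pam_systemd registers the login.

```shell
# in_session_scope FILE: succeed if the cgroup listing in FILE (for example
# /proc/self/cgroup) places the process inside a session-<id>.scope unit.
in_session_scope() {
    grep -Eq '/session-[^/]+\.scope' "$1" 2>/dev/null
}

# Report on the current shell. A login that pam_systemd did not register
# typically stays in the cgroup of sshd.service instead.
if in_session_scope /proc/self/cgroup; then
    echo "this shell is tracked in a session scope"
else
    echo "this shell is NOT in a session scope"
fi
```

If the second message is printed for an interactive ssh login, running loginctl will most likely also show no session for it.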

The weird thing is that the PAM config actually looks OK, because pam_systemd is listed in password-auth, which is then included by the sshd PAM config, so processes should really be inside .scope units.

Can you verify that the pam_systemd module is present on the system? To do that you can run rpm -qV systemd-libs.
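
In addition to the rpm check, the PAM configuration itself can be searched for the module. This is a rough sketch under stated assumptions: pam_uses_systemd is a hypothetical helper name, the two file paths are the ones mentioned above (sshd and password-auth under /etc/pam.d), and the helper only looks at the files it is given rather than following every include directive.

```shell
# pam_uses_systemd FILE...: succeed if any of the given PAM config files has
# an uncommented session line that loads pam_systemd.so.
pam_uses_systemd() {
    grep -Eqs '^[[:space:]]*-?session[[:space:]].*pam_systemd\.so' "$@"
}

if pam_uses_systemd /etc/pam.d/sshd /etc/pam.d/password-auth; then
    echo "pam_systemd is referenced in the sshd PAM stack"
else
    echo "pam_systemd is not referenced in the checked files"
fi
```

The pattern accepts both "session" and the "-session" form (the leading dash tells PAM to stay quiet if the module is missing) and ignores lines that are commented out.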

Comment 48 Michal Sekletar 2016-07-28 08:58:56 UTC

I've analysed the sos_report and confirmed my suspicion about incorrect PAM configuration.

I think there are two ways to resolve this issue,

1) either fix the PAM config to include the pam_systemd.so module, so that all user processes are registered in their respective scope units and those are scheduled to shut down before network connections are terminated by the ifdown scripts.

or

2) use NetworkManager instead of initscripts. NM doesn't put interfaces down when it is stopped, so the ssh session can be terminated gracefully.

Either way, there is nothing to fix in systemd or a related component. Closing as CURRENTRELEASE. Feel free to reopen in case I've missed something.

Comment 49 Helmut K. C. Tessarek 2017-05-10 22:12:10 UTC

This bug still exists in Redhat 7.3, btw.

$ reboot

PolicyKit daemon disconnected from the bus.
We are no longer a registered authentication agent.
<after about 60 seconds>
packet_write_wait: Connection to xx.xx.xx.xx: Broken pipe

Comment 50 Michal Sekletar 2017-05-11 07:01:58 UTC

(In reply to Helmut Tessarek from comment #49)

> This bug still exists in Redhat 7.3, btw.

This issue should be reproducible only when you don't use pam_systemd and you use legacy network initscripts instead of NetworkManager.

If you see the issue but your system is not set up in the way I described above, please file a new bug report.