Bug 626477 - ssh connection left hanging after poweroff
ssh connection left hanging after poweroff
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: systemd (Show other bugs)
20
All Linux
low Severity medium
: ---
: ---
Assigned To: systemd-maint
Fedora Extras Quality Assurance
: Reopened
: 735508 949049 1023788 (view as bug list)
Depends On:
Blocks: FedoraServerTracker
  Show dependency treegraph
 
Reported: 2010-08-23 12:18 EDT by Zing
Modified: 2014-04-15 20:48 EDT (History)
28 users (show)

See Also:
Fixed In Version: systemd-208-16.fc20.x86_64
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-04-15 20:47:46 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments (Terms of Use)
console log after poweroff command (51.17 KB, text/plain)
2010-11-18 08:27 EST, Petr Lautrbach
no flags Details

  None (edit)
Description Zing 2010-08-23 12:18:54 EDT
Description of problem:
After poweroff, ssh connections are left hanging and must be force disconnected (~.) from the client side.  Behavior one usually gets is a "Connection to XXX closed by remote host. Connection to XXX closed and the ssh connection dies."

Version-Release number of selected component (if applicable):
systemd-7-3.fc14.x86_64

How reproducible:
always

Steps to Reproduce:
1. ssh into system
2. $ poweroff     or  $ systemctl poweroff
  
Actual results:
server disappears, the ssh client hangs.  

Expected results:
connection is closed and ssh exits.

Additional info:
Comment 1 Lennart Poettering 2010-08-26 17:55:17 EDT
So my guess here is that this is caused by the network being shut down before ssh. There's probably some kind of ordering dep missing. But so far that's just a guess.
Comment 2 Lennart Poettering 2010-08-31 19:42:28 EDT
Hmm, actually I think the guess was wrong. It probably is related to the fact that systemd <= 8 didn't kill daemon processes that escaped their original cgroup.
Comment 3 Lennart Poettering 2010-09-02 22:27:31 EDT
Fixed now in git upstream. Upload follows shortly.
Comment 4 Fedora Update System 2010-09-02 23:38:18 EDT
systemd-9-2.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/systemd-9-2.fc14
Comment 5 Fedora Update System 2010-09-03 12:44:20 EDT
systemd-9-2.fc14 has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update systemd'.  You can provide feedback for this update here: https://admin.fedoraproject.org/updates/systemd-9-2.fc14
Comment 6 Fedora Update System 2010-09-03 18:37:12 EDT
systemd-9-3.fc14, initscripts-9.18-1.fc14 has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update systemd initscripts'.  You can provide feedback for this update here: https://admin.fedoraproject.org/updates/systemd-9-3.fc14,initscripts-9.18-1.fc14
Comment 7 Fedora Update System 2010-09-07 23:18:31 EDT
initscripts-9.19-1.fc14, systemd-9-3.fc14 has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update initscripts systemd'.  You can provide feedback for this update here: https://admin.fedoraproject.org/updates/systemd-9-3.fc14,initscripts-9.19-1.fc14
Comment 8 Petr Lautrbach 2010-10-25 08:34:07 EDT
# rpm -q systemd
systemd-11-1.fc15.x86_64

Neither poweroff nor reboot set runlevel to 0 resp 6 so /etc/init.d/sshd can not figure out that system is shutting down and leaves actual ssh sessions running.

/etc/init.d/sshd:

47 runlevel=$(set -- $(runlevel); eval "echo \$$#" )

149         # if we are in halt or reboot runlevel kill all running sessions
150         # so the TCP connections are closed cleanly
151         if [ "x$runlevel" = x0 -o "x$runlevel" = x6 ] ; then
152             trap '' TERM
153             killall $prog 2>/dev/null
154             trap TERM
155         fi
Comment 9 Lennart Poettering 2010-11-16 18:37:15 EST
systemd now keeps track of user sessions via pam_systemd in cgroups and kills them on shutdown using this. A hack like the shell script you pasted is not necessary anymore.
Comment 10 Petr Lautrbach 2010-11-18 08:25:01 EST
That's really nice but it still doesn't work for me.

systemd-12-3.fc15.x86_64

All sshd processes have to be killed before network is stopped. I've added this line to /etc/init.d/network as first command for stop) target:

        ps xawf -eo pid,user,cgroup,args | grep ssh

and after shutdown I can see this log:
 1040 root     name=systemd:/system/sshd.service   sshd: root@pts/0 
 1044 root     name=systemd:/system/sshd.service    \_ -bash
Comment 11 Petr Lautrbach 2010-11-18 08:27:00 EST
Created attachment 461292 [details]
console log after poweroff command
Comment 12 Lennart Poettering 2010-11-18 18:21:52 EST
unfortunately bash does not react to SIGTERM. Just uploaded v13 now which tries harder to get rid of all processes. This should also fix this problem.
Comment 13 Petr Lautrbach 2010-11-19 03:39:47 EST
You haven't really tried it, have you? 

/etc/init.d/sshd calls 'killall sshd' in case of reboot/shutdown to kill all running ssh sessions. This has to be done before network is stopped if you don't want to ssh client hang.

For now /etc/init.d/sshd determines reboot/shutdown through the runlevel, see #c8. systemd doesn't set this and tries to solve it otherwise. That's ok but again all sshd processes must be killed before network is stopped.

With this modification of /etc/init.d/network ssh clients don't hang anymore:

--- /etc/init.d/network.orig    2010-11-19 09:32:08.938001916 +0100
+++ /etc/init.d/network 2010-11-19 09:33:43.691999956 +0100
@@ -172,6 +172,12 @@
         ;;
   stop)
 
+       set -x
+       ps xawf -eo pid,user,cgroup,args | grep sshd.service
+       killall sshd
+       ps xawf -eo pid,user,cgroup,args | grep sshd.service
+       set +x
+
        [ "$EUID" != "0" ] && exit 4
        # Don't shut the network down if root is on NFS or a network
        # block device.
Comment 14 Petr Lautrbach 2010-11-19 03:57:57 EST
Patch from previous comment is not fix, it's just demonstration how it is supposed to work.
Comment 15 Lennart Poettering 2010-11-21 16:49:19 EST
I have tried it. It works fine here.

If this continues to be a problem for you then most likely because systemd-user-sessions gets stopped after the network for you which means it probably just needs an ordering dep. Are you using NM or the old scripts?

("killall sshd" is an evil thing, the script should'nt do that)
Comment 16 Petr Lautrbach 2010-11-22 06:53:11 EST
it's set up with network scripts. NM is not even installed.

# systemctl  | grep -i network
network.service           loaded active exited        LSB: Bring up/down network
network.target            loaded active active        Network
Comment 17 Zing 2010-12-13 14:41:17 EST
I am also still getting ssh client hangs on poweroff and/or reboot.  (no NM installed if that matters)

systemd-15-1.fc15.x86_64
systemd-units-15-1.fc15.x86_64
kernel-2.6.37-0.rc5.git2.1.fc15.x86_64

Just before running poweroff/reboot:
# systemctl |egrep -i 'network|ssh'
network.service           loaded active running       LSB: Bring up/down network
sshd.service              loaded active running       LSB: Start up the OpenSSH 
network.target            loaded active active        Network
Comment 18 Lennart Poettering 2011-09-03 10:08:45 EDT
*** Bug 735508 has been marked as a duplicate of this bug. ***
Comment 19 Fedora Admin XMLRPC Client 2011-10-20 12:26:16 EDT
This package has changed ownership in the Fedora Package Database.  Reassigning to the new owner of this component.
Comment 20 Jason Tibbitts 2011-10-20 12:31:01 EDT
I haven't seen any issues of this type in quite some time.
Comment 21 Wolfgang Rupprecht 2011-10-20 13:38:13 EDT
(In reply to comment #20)
> I haven't seen any issues of this type in quite some time.

This happens when a wifi-connected machine shuts down with an inward ssh still active.  See Bug 735508 for details.  Unfortunately that bug was closed and this one stayed open even though it isn't really an ssh issue but a tcp and wifi issue when a system is shut down.  Systemd needs to let the tcp queue drain for perhaps 10ms - 100ms before shutting the wifi-managing daemons down.

yum list systemd kernel ssh wpa_supplicant NetworkManager
...
Installed Packages
NetworkManager.x86_64 1:0.9.1.90-3.git20110927.fc16 @fedora                     
kernel.x86_64         3.1.0-0.rc6.git0.3.fc16       @koji-override-0/$releasever
kernel.x86_64         3.1.0-0.rc9.git0.0.fc16       @fedora                     
kernel.x86_64         3.1.0-0.rc10.git0.1.fc16      @fedora                     
systemd.x86_64        37-2.fc16                     @updates-testing            
wpa_supplicant.x86_64 1:0.7.3-11.fc16               @fedora                     

# shutdown -r now
<hangs till reboot finishes and IP can send an ICMP>
<1-2 minutes later:>
Write failed: Broken pipe
Comment 22 Jóhann B. Guðmundsson 2012-02-27 05:47:13 EST
What's the current status on this as in is this still relevant? 

Somehow this seems more like a initscript ( network ) issue or NetworkManager one rather systemd.
Comment 23 Wolfgang Rupprecht 2012-02-27 11:27:03 EST
This still happens. Comment #21 still applies.
Comment 24 Jóhann B. Guðmundsson 2012-02-27 11:54:52 EST
Are you using the native systemd units or the legacy sysv init script. 

If you are using the legacy sysv init script try the units in git repo for ssh as in copy them to /etc/systemd/system directory run systemctl daemon-reload and recreate your case.
Comment 25 Wolfgang Rupprecht 2012-02-27 12:07:10 EST
I'm using the stock fedora-16 setup from a clean, from-scratch install. Whatever F16 does is what I do.  A bug report against fedora-16 would be worthless otherwise.

Looking at /etc/rc5.d/ I don't see any *ssh* files.  I assume that means stock f16 isn't using the old systemv init stuff for sshd.

If you want me to install something non-stock, you'll have to be much more explicit than get something from a git repo.  (like what git repo and what command line args/tags to get the right thing)
Comment 26 Jóhann B. Guðmundsson 2012-02-27 12:37:57 EST
The original report is against systemd on F14 ( comment 1 ) then in comment 10 we are on F15 with legacy sysv init script and now you are on F16 which has native systemd units so you can see this can be a bit confusing. 

Anyway now that we have established that this is happening with you with units we can start trying couple of things. 

The problem here can be that the [Install] section of the wpa_supplicant.service is pointing to multi-user.target as opposed to network.target which means that when openssh is stopped wpa-supplicant.service may actually have been stopped before it. 

To test if that is the case then you can add wpa-supplicant.service to the After= line of the openssh.service reload the daemon ( systemctl daemon-reload ) and recreate the scene that is happening.
Comment 27 Wolfgang Rupprecht 2012-02-27 12:57:05 EST
It still hangs.  Was the edit correct?

[root@ancho ~]# locate wpa_supplicant.service
/lib/systemd/system/wpa_supplicant.service
[root@ancho ~]# emacs /lib/systemd/system/wpa_supplicant.service
[root@ancho ~]# systemctl daemon-reload
[root@ancho ~]# shutdown -r 
Shutdown scheduled for Mon, 27 Feb 2012 09:55:05 -0800, use 'shutdown -c' to cancel.
[root@ancho ~]# 
Broadcast message from root@ancho.wsrcc.com (Mon, 27 Feb 2012 09:54:05 -0800):

The system is going down for reboot at Mon, 27 Feb 2012 09:55:05 -0800!
<hang>

Write failed: Broken pipe


diff -c /lib/systemd/system/sshd.service\~ /lib/systemd/system/sshd.service
*** /lib/systemd/system/sshd.service~	2012-02-22 00:04:28.000000000 -0800
--- /lib/systemd/system/sshd.service	2012-02-27 09:52:10.952396541 -0800
***************
*** 1,6 ****
  [Unit]
  Description=OpenSSH server daemon
! After=syslog.target network.target auditd.service
  
  [Service]
  EnvironmentFile=/etc/sysconfig/sshd
--- 1,6 ----
  [Unit]
  Description=OpenSSH server daemon
! After=syslog.target network.target auditd.service wpa-supplicant.service
  
  [Service]
  EnvironmentFile=/etc/sysconfig/sshd

Diff finished.  Mon Feb 27 09:52:25 2012
Comment 28 Jóhann B. Guðmundsson 2012-02-27 13:23:44 EST
(In reply to comment #27)
> It still hangs.  Was the edit correct?

Yup but you might have had to restart the service for it to take full effect.
( That reboot certanly has taken care of that now. )
 
> The system is going down for reboot at Mon, 27 Feb 2012 09:55:05 -0800!
> <hang>

Is this system hang that times out in 90s ? If so is there something in the logs on why it's hanging?

> Write failed: Broken pipe

This usually indicates that you have to tweak your ServerAliveInterval in sshd config for the ssh client ( if you want to tweak that to the network stack level do "echo <seconds> > /proc/sys/net/ipv4/tcp_keepalive_time"
Comment 29 Wolfgang Rupprecht 2012-02-27 13:39:16 EST
> Yup but you might have had to restart the service for it to take full effect.
> ( That reboot certanly has taken care of that now. )

I rebooted a few times after that just to make certain that a clean restart had the same problems.  The same hang happened every time.

> Is this system hang that times out in 90s ? If so is there something in the
> logs on why it's hanging?

The hang appeared to be 63 seconds (timed by hand via typing "date" in another window and subtracting.)

I assume the timeout is on the local system and is the local tcp trying to get the extra <cr>'s I typed (after the "shutdown -r<cr>") to be accepted by the remote system.  63 sec delay is also the expected number if one has a 1,2,4,8,16,32 second timeout in succession.

I'm happy that this timeout is happening as a fallback.  My ssh scripts that update the remote systems would hang indefinitely if this timeout didn't happen.
Comment 30 Rudd-O DragonFear 2012-07-25 21:00:49 EDT
IMO sshd server processes that get killed should not end immediately after sigterm but rather wait until every byte has been written to the network and the connection is successfully closed.

Would this fix the problem?
Comment 31 Fedora End Of Life 2013-04-03 16:17:25 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19
Comment 32 Lennart Poettering 2013-04-09 06:51:11 EDT
*** Bug 949049 has been marked as a duplicate of this bug. ***
Comment 33 Michal Jaegermann 2013-10-10 19:36:15 EDT
(In reply to Fedora End Of Life from comment #31)
> This bug appears to have been reported against 'rawhide' during the Fedora
> 19 development cycle.
> Changing version to '19'.

This bug, or at least something with the same effects, started to show up consistently on the current rawhide (this will be '21').  Previously it was hitting from time to time but now it become a "constant feature". So far I did not observe such thing on Fedora 18 installations.

If a new report is required then please leave a note saying so.
Comment 34 Ville Skyttä 2013-12-31 07:46:40 EST
Comment 33 seconded, I didn't see this in Fedora 18 or 19. But now I see it every time when remotely shutting down a Fedora 20 box (no wifi involved here).

systemd-208-9.fc20.x86_64
Comment 35 Lennart Poettering 2014-02-23 11:59:04 EST
*** Bug 1023788 has been marked as a duplicate of this bug. ***
Comment 36 Harald Reindl 2014-02-26 06:15:28 EST
how can https://bugzilla.redhat.com/show_bug.cgi?id=1023788 be a duplicate of this 3 years old bugreport? this one was reported for F14 and 1023788 is for F20, there are *six* dist-upgrades and god knows how many systemd major versions between

systemd-210-2.fc21.x86_64 is fixing that behavior, well, hopefully F20 will get systemd-210 soon - unsure how to verify if it also fixes the "rely-on-typo-behavior" from https://bugzilla.redhat.com/show_bug.cgi?id=1023788#c12
Comment 37 Mikhail Strizhov 2014-02-27 15:20:55 EST
I confirm the same problem for Fedora 20.  systemd-208-14.fc20.x86_64
Comment 39 Michal Sekletar 2014-03-11 10:00:19 EDT
I've pushed patches to the stable branch (and couple of followup fixups) which should solve the issue.
Comment 40 Harald Reindl 2014-03-11 11:09:54 EDT
> For 208 in fedora 20 we should include

sounds promising, thank you

> I've pushed patches to the stable branch

for now on koji i only see a F21 build which was already fixed, no F20
http://koji.fedoraproject.org/koji/packageinfo?packageID=10477
Comment 41 Michal Sekletar 2014-03-11 11:20:45 EDT
(In reply to Harald Reindl from comment #40)
> for now on koji i only see a F21 build which was already fixed, no F20
> http://koji.fedoraproject.org/koji/packageinfo?packageID=10477

I haven't build anything yet. I've just pushed code to the stable branch, I believe Zbyszek will release it in the next F20 update.

Let me know if you'd like to test out the scratch build. I could do a scratch-build of v208-stable for you. Otherwise wait until the next systemd update in F20. Thanks!
Comment 42 Harald Reindl 2014-03-11 11:22:47 EDT
scratch build sounds fine, i can install that on my VMware-test-server and look for remaining issues and/or confirm fixed ones
Comment 43 Michal Sekletar 2014-03-11 12:17:14 EDT
Please give it a try. Thanks!

http://koji.fedoraproject.org/koji/taskinfo?taskID=6622144
Comment 44 Harald Reindl 2014-03-11 12:54:00 EDT
[root@testserver:~]$ rpm -qa | grep systemd
systemd-libs-208-16.fc20.x86_64
systemd-208-16.fc20.x86_64

the issue of this bugreport is solved, boot looks fine, also dracut -f called to make sure the build sits also in the initrd 
_____________________________________________

that issue still exists and should at least be fixed for the 
beauty of the eye but not critical and another topic

if will take a closer look tomorrow after some hours of running

https://bugzilla.redhat.com/show_bug.cgi?id=1010572
Failed to open private bus connection: Failed to connect to socket /run/user/0/dbus/user_bus_socket: No such file or directory
Comment 45 Harald Reindl 2014-03-11 17:12:33 EDT
seems to work fine so long and i will update my both physical machines now to the scratchbuild, maybe the messages below are worth to note, no idea where to report them and if a own bugreport in context of a scratch-build is a good idea

[root@testserver:~]$ cat messages | grep "Failed to mark scope session"
Mar 11 18:45:02 testserver systemd[1]: Failed to mark scope session-25.scope as abandoned : Stale file handle
Mar 11 19:15:01 testserver systemd[1]: Failed to mark scope session-39.scope as abandoned : Stale file handle
Mar 11 20:10:01 testserver systemd[1]: Failed to mark scope session-63.scope as abandoned : Stale file handle
Mar 11 20:15:01 testserver systemd[1]: Failed to mark scope session-64.scope as abandoned : Stale file handle
Mar 11 21:45:01 testserver systemd[1]: Failed to mark scope session-102.scope as abandoned : Stale file handle
Comment 46 Harald Reindl 2014-03-12 09:13:28 EDT
ok *that* was bad, while update samba from koji packages a 
segfault in systemd itself and after that re-exec don't work 

[root@srv-rhsoft:~]$ systemctl daemon-reexec 
Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Verbindungsaufbau abgelehnt

[root@srv-rhsoft:~]$ cat messages 
Mar 12 14:00:03 srv-rhsoft systemd[1]: Failed to mark scope session-712.scope as abandoned : Stale file handle
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /sys/devices/virtual/block/md0.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/md0.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-name-localhost.localdomain:0.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-uuid-1d691642:baed26df:1d197496:4fb00ff8.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-label/boot.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /sys/devices/virtual/block/md2.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/md2.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-name-localhost.localdomain:2.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-uuid-ea253255:cb915401:f32794ad:ce0fe396.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-label/data.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Stopping Samba SMB Daemon...
Mar 12 14:09:14 srv-rhsoft systemd[1]: Starting Samba SMB Daemon...
Mar 12 14:09:14 srv-rhsoft systemd[1]: smb.service failed to run 'start' task: File exists
Mar 12 14:09:14 srv-rhsoft kernel: systemd[1]: segfault at a8 ip 00007f8ea03d561e sp 00007fffdfa52180 error 4 in systemd[7f8ea0341000+10b000]
Mar 12 14:09:14 srv-rhsoft systemd[1]: Caught <SEGV>, dumped core as pid 23370.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Freezing execution.
Comment 47 Michal Sekletar 2014-03-12 10:30:11 EDT
Can you provide coredump or at least backtrace?
Comment 48 Harald Reindl 2014-03-12 12:10:04 EDT
sadly no coredumps enabled and i can't reproduce it

i noted that in the KDE noticiations "window xyz sent signal" thought "uhm while a yum upgrade is running? why" finally i rebooted the machine with "sync; reboot -f" because systemctl no longer worked

the only thing i can offer is try to reproduce that with downgrades/upgrades and if that is possible enable specific debugging or at least provide steps that you or another fedora/systemd-developer can trigger it too
Comment 49 Harald Reindl 2014-03-12 12:36:19 EDT
uhm recently my KDE desktop was away, the system switched to TTY1 with the boot-messages while the graphical user session runs on TTY2 - originally i thought WTF is my desktop gone, after CTRL+ALT+F2 all is still here, but the messages smells like the same problem

not sure if the "systemd-logind[787]: Failed to store session release timer fd" below shortly before that happened have anything in common

[root@srv-rhsoft:~]$ 
Message from syslogd@rh at Mar 12 17:30:37 ...
 kernel:systemd[1]: segfault at a8 ip 00007f4ece2ca61e sp 00007fffe460b110 error 4 in systemd[7f4ece236000+10b000]
______________________________________________________

Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 292 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 292 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 293 of user root.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 293 of user root.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 294 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 294 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 295 of user root.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 295 of user root.
Mär 12 17:30:01 rh.thelounge.net CROND[10363]: (wwwcron) CMD (  umask 006; nice php /www/distribute/call_cronjob.php modules/podcast/cron.php passthru,popen,pclose)
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 296 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 296 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 298 of user apache.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 298 of user apache.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 297 of user root.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 297 of user root.
Mär 12 17:30:01 rh.thelounge.net CROND[10375]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Mär 12 17:30:01 rh.thelounge.net CROND[10382]: (wwwcron) CMD (  umask 006; nice php /www/distribute/call_cronjob.php modules/doc-managment/cron.php passthru)
Mär 12 17:30:01 rh.thelounge.net CROND[10391]: (wwwcron) CMD (  umask 006; nice php /www/distribute/call_cronjob.php schedule.php)
Mär 12 17:30:01 rh.thelounge.net CROND[10392]: (root) CMD (     php /usr/local/scripts/rh_watchdog.php)
Mär 12 17:30:01 rh.thelounge.net CROND[10395]: (apache) CMD (   nice -n 19 ionice -c 3 bash /scripts/cleanup-uploadtemp.cron)
Mär 12 17:30:01 rh.thelounge.net CROND[10396]: (root) CMD (     nice rsync --quiet --no-motd --times --force --recursive --delete-after --links --hard-links --devices --specials --perms --owner --group --executability --acls --xattrs root@adminserver:/var/lib/clamav/ /var/lib/clamav/ )
Mär 12 17:30:01 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:01 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:01 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:02 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:02 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:02 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:02 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:37 rh.thelounge.net kernel: systemd[1]: segfault at a8 ip 00007f4ece2ca61e sp 00007fffe460b110 error 4 in systemd[7f4ece236000+10b000]
Mär 12 17:30:37 rh.thelounge.net systemd[1]: Caught <SEGV>, dumped core as pid 10912.
Mär 12 17:30:37 rh.thelounge.net systemd[1]: Freezing execution.
Comment 50 Harald Reindl 2014-03-12 14:14:02 EDT
the segfault happened again on my home-machine (recently after come home from office), i managed to supsend my VMware guests followed by "sync; reboot --force" and at boot time fsck cleaned orphaned inodes

well, after that i decided to reboot my virtual machines after "touch /forcefsck" for safety, due shutdown the testserver had a systemd-crash due stopping servicces and needed a hard reset

the scratch-build solves the reboot-problem but is not that stable
may i ask one of the developers to install it on a own-dogfoot-machine because debugging systemd feels really outside my scope :-(
Comment 51 Harald Reindl 2014-03-13 19:46:51 EDT
i downgraded to systemd-208-15.fc20.x86_64 because these segfault happening randomly and leading to file-system errors at boot, damaged mysqld tables and finally if that happens at the wrong moment in risk of work data

it is *unacceptable that systemd upstream has a 208-209-10-*BULLSHIT*-release policy, Lennart happily makes the initial builds and no longer cares about anything happens downstream - you can do that for f**g mediaplayer but not for PID1 

upstream needs to realize that POINT-RELEASES instead this cheery-pick bullshit is needed and as long as we Fedora users are downstream testers for upstream RHEL the project leader paied by Redhat is much more responsible than change some bugreport subjects and initial commits

*WHERE IS* sytemd 208.1, 208.2, 208.3....

what is that? "i build a new major and than someone bites the result"?
http://koji.fedoraproject.org/koji/packageinfo?packageID=10477
Comment 52 Michal Jaegermann 2014-03-13 20:42:37 EDT
(In reply to Harald Reindl from comment #50)
> the segfault happened again on my home-machine (recently after come home
> from office) ....

So far I failed to segfault systemd-208-16.fc20.x86_64 on my test system although I tried.  Afer 'daemon-reexec' I found the following AVC in my logs:

 type=1403 audit(1394755727.603:3): policy loaded auid=0 ses=1
 type=1107 audit(1394755735.582:4): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc:  received policyload notice (seqno=2)
/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
 type=1305 audit(1394755761.385:5): audit_pid=2397 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1

OTOH selinux there is set to "permissive".  Harald, are you running by any chance with "enforcing"? 0xffffffff for auid and ses do not look very convincing.

BTW - I do not have here any VM guests.
Comment 53 Harald Reindl 2014-03-13 20:49:35 EDT
> Harald, are you running by any chance with "enforcing"?

no, any of my machines has selinux=0 as kernel-param and so no selinux

> BTW - I do not have here any VM guests

i speak about my *physical machine* running my daily workloaf and hosting development-virtual machines, the segfault happened 3 times today and there is no chance to get services stopped clean 

since this cherry picking of the broken systemd-208 in F20 leads to nowhere what about upgrade systemd in F20 at all to latest upstream and only rule out  https://bugzilla.redhat.com/show_bug.cgi?id=1072368 because the other bugs are fixed until upstream realizes that the "cool" Google-Chrome-Versioning is not the way to go for PID1?
Comment 54 Michal Schmidt 2014-03-14 06:48:28 EDT
Harald Reindl,

for your information, systemd upstream already maintains stable branches (currently "v208-stable" and "v210-stable"):
http://www.freedesktop.org/wiki/Software/systemd/Backports/
Although there are currently no tagged point releases made from them, these git branches are meant for distros to base their updates on.

I don't understand your objection to cherry-picking. How do you think any software project creates bugfix releases in the first place? Mostly by cherry-picking selected fixes, of course.

Your innuendo that Lennart should himself be made responsible for maintaining bugfix releases shows an inflated sense of entitlement. Also, doing that would be counter-productive. Division of labour is awesome.

Would you please stop posting such angry comments in BZs? They are demotivating.

A productive contribution would be to provide the coredump or backtrace. Your logs show it did dump the core somewhere. What is your /proc/sys/kernel/core_pattern set to?
Comment 55 Harald Reindl 2014-03-14 07:14:55 EDT
> for your information, systemd upstream already maintains stable branches

2014-02-28: [systemd-devel] [HEADSUP] Formalizing Backports
all the bugs and troubles in systemd of F20 are there for months

> I don't understand your objection to cherry-picking

than subscribe to httpd-devel, there you can see how maintaing of important software works for even more than one stable release over many years, frankly longer than systemd exists at all, currently updates for 2.2 and 2.4 are in vote-state

> Your innuendo that Lennart should himself be made responsible for 
> maintaining bugfix releases shows an inflated sense of entitlement

i do not care if Lennart or whoever does but that the current release strategy of systemd is broken becomes obviously by see reported bugs before F20 release still there and trying to fix them ending in completly unstable systems

> Would you please stop posting such angry comments in BZs? 
> They are demotivating

report bugs months ago and see them go into a stable release too

see things like "Failed to open private bus connection: Failed to connect to socket /run/user/0/dbus/user_bus_socket" at every reboot and for every user is also demotivating because it leaves the question if the person who introduced the upgrade ever did read any systemlogs

> A productive contribution would be to provide the coredump or 
> backtrace. Your logs show it did dump the core somewhere. What 
> is your /proc/sys/kernel/core_pattern set to?

cat /proc/sys/kernel/core_pattern
core

"somewhere" does not help me much, if you see what my logs showing while i don't understand how to debug systemd as a web-developer it's up to you help me to help and where i find the informations to submit 

all the time we are wasting in the large amount of systemd-related bugs in F20 are still leading to the question why not consider a upgrade instead spent another month in try-to-get-the-same-behavior
Comment 56 Harald Reindl 2014-03-14 08:23:13 EDT
check private mail with the coredumps found directly in the 
rootfs, that would have been a important info because for 
you it's obviously where to find such things, for a user not

sent private because trying to avoid leak of possible sensitive data
Comment 57 Michal Schmidt 2014-03-14 09:22:28 EDT
Thank you. I received the coredumps and also forwarded them to Michal Sekletar.

The stack traces in the four coredumps all point to unit_unwatch_pid().
That makes
0328-core-watch-SIGCHLD-more-closely-to-track-processes-o.patch
the primary suspect.

Here's one of the stack traces:
$ gdb -c cores/core.5677 
[...]
(gdb) bt full
#0  0x00007fd0cbf9162b in raise () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fd0cdadca7e in crash (sig=11) at ../src/core/main.c:152
        rl = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        sa = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {
              0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        pid = 0
        __func__ = "crash"
        __PRETTY_FUNCTION__ = "crash"
#2  <signal handler called>
No symbol table info available.
#3  0x00007fd0cdb5061e in unit_unwatch_pid (u=u@entry=0x7fd0ce34fe40, pid=32328)
    at ../src/core/unit.c:1686
        __PRETTY_FUNCTION__ = "unit_unwatch_pid"
#4  0x00007fd0cdade5a6 in manager_dispatch_sigchld (m=m@entry=0x7fd0ce2bd100)
    at ../src/core/manager.c:1392
        si = {si_signo = 17, si_errno = 0, si_code = 1, _sifields = {_pad = {32328, 500, 
              0 <repeats 26 times>}, _kill = {si_pid = 32328, si_uid = 500}, _timer = {
              si_tid = 32328, si_overrun = 500, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, 
            _rt = {si_pid = 32328, si_uid = 500, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, 
            _sigchld = {si_pid = 32328, si_uid = 500, si_status = 0, si_utime = 0, si_stime = 0}, 
            _sigfault = {si_addr = 0x1f400007e48, si_addr_lsb = 0}, _sigpoll = {
              si_band = 2147483680328, si_fd = 0}, _sigsys = {_call_addr = 0x1f400007e48, 
              _syscall = 0, _arch = 0}}}
        u = 0x7fd0ce34fe40
        r = <optimized out>
        __PRETTY_FUNCTION__ = "manager_dispatch_sigchld"
        __func__ = "manager_dispatch_sigchld"
#5  0x00007fd0cdae41cd in manager_process_signal_fd (m=<optimized out>) at ../src/core/manager.c:1642
        n = <optimized out>
        sfsi = {ssi_signo = 17, ssi_errno = 0, ssi_code = 1, ssi_pid = 32328, ssi_uid = 500, 
          ssi_fd = 0, ssi_tid = 0, ssi_band = 0, ssi_overrun = 0, ssi_trapno = 0, ssi_status = 0, 
          ssi_int = 0, ssi_ptr = 0, ssi_utime = 1387, ssi_stime = 326, ssi_addr = 0, 
          __pad = '\000' <repeats 47 times>}
        sigchld = true
#6  process_event (ev=0x7fff6c329d50, m=0x7fd0ce2bd100) at ../src/core/manager.c:1667
        r = <optimized out>
        w = <optimized out>
#7  manager_loop (m=0x7fd0ce2bd100) at ../src/core/manager.c:1864
        event = {events = 1, data = {ptr = 0x7fd0ce2bd1d0, fd = -835989040, u32 = 3458978256, 
            u64 = 140534788903376}}
        n = <optimized out>
        wait_msec = <optimized out>
        r = <optimized out>
        rl = {interval = 1000000, begin = 28707172784, burst = 50000, num = 17}
        __PRETTY_FUNCTION__ = "manager_loop"
        __func__ = "manager_loop"
#8  0x00007fd0cdada94e in main (argc=5, argv=0x7fff6c32a6b8) at ../src/core/main.c:1672
        m = 0x7fd0ce2bd100
        r = <optimized out>
        retval = 1
        before_startup = 2535502
        after_startup = <optimized out>
        timespan = "570.282ms\000\302\313\320\177\000\000P\302+\316\320\177\000\000\020\300+\316\320\177\000\000\000\000\000\000\000\000\000\000g\254&\315\320\177\000\000\000\000\000\000\000\000\000\000P\302+\316\320\177\000"
        fds = 0x0
        reexecute = false
        shutdown_verb = 0x0
        initrd_timestamp = {realtime = 0, monotonic = 0}
        userspace_timestamp = {realtime = 1394723031488972, monotonic = 2423982}
        kernel_timestamp = {realtime = 1394723029064992, monotonic = 0}
        systemd = "systemd"
---Type <return> to continue, or q <return> to quit---
        skip_setup = false
        j = <optimized out>
        loaded_policy = false
        arm_reboot_watchdog = false
        queue_default_job = <optimized out>
        switch_root_dir = 0x0
        switch_root_init = 0x0
        saved_rlimit_nofile = {rlim_cur = 1024, rlim_max = 4096}
        __func__ = "main"
        __PRETTY_FUNCTION__ = "main"
Comment 58 Michal Sekletar 2014-03-17 11:02:13 EDT
I think we now have all fixes in stable branch. I apologize for inconvenience caused by systemd-208-16. I can provide you with the test build, but I doubt you'd like to try one more.
Comment 59 Harald Reindl 2014-03-17 11:08:44 EDT
oh, as reporter i am fine with test another one, later that evening on my testserver-VM and after some hours with no issues as well on my phyiscal machine, that time the company workstation and after 2 days my homeserver-host 

my primary intention is get systemd in F20 clean because i plan to upgrade production servers in may from F19 to F20 and appreciate the larger time window having all development systems and production machines on the same level for cleanup rsyslog configs and whatever - that's way easier with having everywhere the same software versions for distribute tested configs
Comment 60 Harald Reindl 2014-04-03 08:36:09 EDT
is there any progress?

these systemd problems exists for more than 7 months now
and are show-stoppers for any admin taking serious his
responsibility for production machines in case of consider
upgrade to Fedora 20
Comment 61 Michal Jaegermann 2014-04-06 16:12:38 EDT
(In reply to Harald Reindl from comment #60)

> and are show-stoppers for any admin taking serious his
> responsibility for production machines 

This is clearly a workaround but you may run something in this style:

   ( reboot & ) ; exit

instead of a bare 'reboot'.

> in case of consider upgrade to Fedora 20

As you can see here the issue was first reported for Fedora 14 although it definitely got more annoying with Fedora 20.
Comment 62 Harald Reindl 2014-04-06 16:45:12 EDT
what i DO NOT UNDERSTAND is "I can provide you with the test build, but I doubt you'd like to try one more" floowed by a clear "yes i am willing to test" with no further response

> As you can see here the issue was first reported for Fedora 
> 14 although it definitely got more annoying with Fedora 20

don't get me wrong but:

* F14 had no systemd at all
* i am running Fedora servers starting with Fedora 9 in production
* before F20 it was *no* serious problem to type "reboot" in a ssh-session

soory, no, that is *clearly* a F20 problem and besides in the meantime 30 "real" production servers i maintain a few workstations as well as routers/firewalls/gateways running on Fedora over *many* years and reboot was never a issue expcet 1 out of 100 reboots on a single machine - what we have now is a *every reboot* hangs and all the worksarounds are not doing anything in cae of larger infrastructures where you ahve *scripted+ operations
Comment 63 Wolfgang Rupprecht 2014-04-06 17:46:57 EDT
The problem only happens to me when the machines are wifi connected.  Ethernet connected systems reboot just fine without the ssh hang on f20 (and have for the past 15 or so versions).   While annoying that nobody has fixed this bug yet, the simplest workaround is to just use "shutdown -r" or "shutdown" to add a 1 minute delay to the shutdown or reboot.  Ssh can then exit normally.  To reboot all my servers (25 less than yours ;-)) I just run a loop over "ssh $host shutdown -r".   It doesn't hang and the extra minute delay isn't going to kill me.
Comment 64 Harald Reindl 2014-04-06 18:18:41 EDT
it happens from the moment you are using network.service instead NM, only in F20 and machines with static network configuration especially servers don't need NM

the extra minute is killing me - my typical servers are rebooting within 8 and 15 seconds and so while i wait for some timeout i can check logs and typical admin tasks after a frsh reboot / kernel update instead just wait for nothing

that is why servers are still running F19 and it's a shame to be forced to
Comment 65 Michal Jaegermann 2014-04-06 18:30:22 EDT
> it happens from the moment you are using network.service instead NM, only in F20

I am seeing that on machines with NM handling networks and no wifi so both look to me coincidental.  OTOH I was hitting that condition for sure in Fedora 18 and earlier, although there only occasionally, while with F20 you can bank on it.
Comment 66 Michal Sekletar 2014-04-07 04:29:17 EDT
(In reply to Harald Reindl from comment #62)
> what i DO NOT UNDERSTAND is "I can provide you with the test build, but I
> doubt you'd like to try one more" floowed by a clear "yes i am willing to
> test" with no further response

I am sorry for not providing you with the test build earlier. Here it is [1]. Hope that helps. BTW, for the future, if you need something from me and I do not respond in BZ feel free to reach me on IRC (#fedora-devel or #systemd on freenode). Thanks!

[1] http://msekleta.fedorapeople.org/systemd/
Comment 67 Georg Müller 2014-04-07 05:49:17 EDT
(In reply to Michal Sekletar from comment #66)
> I am sorry for not providing you with the test build earlier. Here it is
> [1]. Hope that helps. BTW, for the future, if you need something from me and
> I do not respond in BZ feel free to reach me on IRC (#fedora-devel or
> #systemd on freenode). Thanks!
> 
> [1] http://msekleta.fedorapeople.org/systemd/

The rpms solve the issue in my VM.
Comment 68 Harald Reindl 2014-04-07 07:01:12 EDT
build installed on my testserver - looks fine but from the expierience with the scratch-build too soon to qualifiy

i will give it some hours in the VM and if it don't eat children at the evening or tomorrow morning try it on my physical machines - looking at the history if it don't segfault or start spam the syslog within two days we have a candidate for a stable update

thanks!
Comment 69 Harald Reindl 2014-04-07 10:12:50 EDT
until now no segfault - knock on wood

however, the messages below are still appearing and happened the first time with the segfaulting scratch-build and never before

Apr  7 14:30:01 testserver systemd[1]: Failed to mark scope session-42.scope as abandoned : Stale file handle
Apr  7 15:00:01 testserver systemd[1]: Failed to mark scope session-56.scope as abandoned : Stale file handle
Apr  7 15:00:02 testserver systemd[1]: Failed to mark scope session-57.scope as abandoned : Stale file handle
Apr  7 16:00:01 testserver systemd[1]: Failed to mark scope session-91.scope as abandoned : Stale file handle
Comment 70 Michal Sekletar 2014-04-07 11:12:01 EDT
Yes, I know about this issue and it is on my TODO list. 

FWIW...This happens because logind will try to call new AbandonScope() D-Bus API  of systemd but scope on which this is called is neither abandoned nor running. I have seen this happening when session and scope get created via pam_systemd, but scope goes away because stop on it is explicitly requested and after that app which created pam session tears it down but scope is already in stop-sigterm substate thus this message is logged.

Do you have an another reproducer?
Comment 71 Harald Reindl 2014-04-07 11:17:21 EDT
> Do you have an another reproducer?

i admit missing enough kowledge behind, in the meantime all my Fedora 20 machines including both of my phyiscal machines are upgraded to that build followed by "dracut -f", i see that on all of them

just a half-educated guess: https://bugzilla.redhat.com/show_bug.cgi?id=1010572
maybe that is related to these failing sockets we see for any user regards of cronjob or real user after the first session start
Comment 72 Ville Skyttä 2014-04-15 09:34:03 EDT
Based on a few quick tests, I'm no longer experiencing this issue with systemd-208-16.fc20.x86_64 even though its changelog doesn't refer to this bug. Is 208-16 expected to fix it?
Comment 73 Jason Tibbitts 2014-04-15 19:02:02 EDT
I'm also not seeing this with the -16 package.  Not sure what to make of it but I've tried a bunch of times to reproduce across several machines and can't.  I suspect a fix for this was fallout from some of the other restructuring as there doesn't appear to have been a general rollup of the other pending systemd 208 patches.
Comment 74 Jason Tibbitts 2014-04-15 19:09:40 EDT
OK, I'm completely wrong; it looks like everything from http://cgit.freedesktop.org/systemd/systemd-stable/log/?h=v208-stable (newer than 2014-03-05, beause those were in -15) is rolled into this update.  Which is odd because the bug where systemd --user and sd-pam processes hang around isn't fixed, and I thought that was supposed bo be in the stable tree as well.

Oh, well, at least this one does appear to be fixed.  I'd guess this ticket can be closed.
Comment 75 Harald Reindl 2014-04-15 19:11:16 EDT
also confirmed for systemd-208-16.fc20.x86_64 but now services with shell-scripts suspdning VMware guests are fucked up and killed, taht was the case a long time after the switch to systemd, worked the last year and now we have to choose between freezing ssh-lients or killed VMware guests - systemd after F19 is simply crap

there is no single reason to kill the unit below unexpected

____________________________________________

[root@srv-rhsoft:~]$ cat /etc/systemd/system/vmware.service
[Unit]
Description=VMware Workstation Service
After=network.service systemd-networkd.service kdm.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/systemctl start vmware.target
ExecStartPost=-/usr/bin/su -c "/scripts/vmware/vm-default-start.sh" vmware
ExecStop=-/scripts/vmware/vm-suspend-all.sh
ExecStop=/usr/bin/systemctl stop vmware.target
TimeoutSec=1800

[Install]
WantedBy=multi-user.target
Comment 76 Harald Reindl 2014-04-15 19:17:16 EDT
WHY in the world systemd of F20 is not thrown away and rebased against the rawhide version after explain some people like Lennart that only the DDOS-attack against central syslog-servers needs to be fixed and he is not in the position to rule the world in case of a central component like systemd because in doubt it is not relevant what he or Kay are thinking - nobody asked them to make systemd that worse as in F20 and so they have to accept that the ordinary sysadmin is not interested in systemd debug messages - the answe of Linus on the kernel-list in that contetx was pretty clear (https://bugzilla.redhat.com/show_bug.cgi?id=1072368) 

i do not make much friends because stating things as they are - who cares - i do not need friends, in eed proper working systems as before F15 and systemd and frustratd after thinking things goes wy better from F17 to F19 and now see we make a step backwards which is unacceptable
Comment 77 Zbigniew Jędrzejewski-Szmek 2014-04-15 20:47:46 EDT
(In reply to Jason Tibbitts from comment #74)
> Oh, well, at least this one does appear to be fixed.  I'd guess this ticket
> can be closed.
Yeah, I forgot to add this bug to the update comments.

Note You need to log in before you can comment on or make changes to this bug.