Description of problem:
After poweroff, ssh connections are left hanging and must be force-disconnected (~.) from the client side. The behavior one usually gets is "Connection to XXX closed by remote host. Connection to XXX closed." and the ssh connection dies.

Version-Release number of selected component (if applicable):
systemd-7-3.fc14.x86_64

How reproducible:
always

Steps to Reproduce:
1. ssh into the system
2. $ poweroff or $ systemctl poweroff

Actual results:
The server disappears, the ssh client hangs.

Expected results:
The connection is closed and ssh exits.

Additional info:
So my guess here is that this is caused by the network being shut down before ssh. There's probably some kind of ordering dep missing. But so far that's just a guess.
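For context, the kind of ordering dependency guessed at here is declared in a unit's [Unit] section: systemd stops units in the reverse of their start order, so a service ordered After= the network is also stopped before the network goes down. A minimal sketch (unit and target names are illustrative only; this is not the actual fix, which turned out to lie elsewhere):

```ini
# sshd.service (illustrative fragment, not the shipped unit)
[Unit]
Description=OpenSSH server daemon
# Started after the network is up, and therefore stopped before the
# network is torn down, letting client TCP connections close cleanly.
After=network.target
```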
Hmm, actually I think the guess was wrong. It probably is related to the fact that systemd <= 8 didn't kill daemon processes that escaped their original cgroup.
Fixed now in git upstream. Upload follows shortly.
systemd-9-2.fc14 has been submitted as an update for Fedora 14. https://admin.fedoraproject.org/updates/systemd-9-2.fc14
systemd-9-2.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update systemd'. You can provide feedback for this update here: https://admin.fedoraproject.org/updates/systemd-9-2.fc14
systemd-9-3.fc14, initscripts-9.18-1.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update systemd initscripts'. You can provide feedback for this update here: https://admin.fedoraproject.org/updates/systemd-9-3.fc14,initscripts-9.18-1.fc14
initscripts-9.19-1.fc14, systemd-9-3.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update initscripts systemd'. You can provide feedback for this update here: https://admin.fedoraproject.org/updates/systemd-9-3.fc14,initscripts-9.19-1.fc14
# rpm -q systemd
systemd-11-1.fc15.x86_64

Neither poweroff nor reboot sets the runlevel to 0 or 6, so /etc/init.d/sshd cannot figure out that the system is shutting down and leaves the actual ssh sessions running.

/etc/init.d/sshd:
 47     runlevel=$(set -- $(runlevel); eval "echo \$$#" )
149     # if we are in halt or reboot runlevel kill all running sessions
150     # so the TCP connections are closed cleanly
151     if [ "x$runlevel" = x0 -o "x$runlevel" = x6 ] ; then
152         trap '' TERM
153         killall $prog 2>/dev/null
154         trap TERM
155     fi
systemd now keeps track of user sessions via pam_systemd in cgroups and kills them on shutdown using this. A hack like the shell script you pasted is no longer necessary.
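As a quick way to see what systemd tracks here: every process records its cgroup membership under /proc, so one can check from inside an ssh session which group pam_systemd has placed it in (a sketch; the exact path format differs between the cgroup layout of that era and today's scope units):

```shell
# Print the cgroup(s) of the current shell. Under systemd, a login
# session shows up in a per-session group (e.g. .../session-1.scope),
# which is what gets killed on shutdown.
cat /proc/self/cgroup
```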
That's really nice, but it still doesn't work for me.

systemd-12-3.fc15.x86_64

All sshd processes have to be killed before the network is stopped. I've added this line to /etc/init.d/network as the first command for the stop) target:

ps xawf -eo pid,user,cgroup,args | grep ssh

and after shutdown I can see this log:

 1040 root name=systemd:/system/sshd.service sshd: root@pts/0
 1044 root name=systemd:/system/sshd.service  \_ -bash
Created attachment 461292 [details] console log after poweroff command
Unfortunately bash does not react to SIGTERM. I've just uploaded v13, which tries harder to get rid of all processes. This should fix this problem as well.
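The behavior referred to here can be demonstrated directly: interactive bash ignores SIGTERM (as documented in the bash manual), while a non-interactive bash dies from it. A small sketch:

```shell
# Interactive bash ignores SIGTERM, which is why a plain TERM at
# shutdown leaves login shells behind.
bash -i -c 'kill -TERM $$; echo "interactive bash survived SIGTERM"' 2>/dev/null

# A non-interactive bash keeps the default disposition and dies:
bash -c 'kill -TERM $$; echo "never reached"' 2>/dev/null
echo "non-interactive bash exited with status $?"   # 143 = 128 + SIGTERM(15)
```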
You haven't really tried it, have you? /etc/init.d/sshd calls 'killall sshd' in case of reboot/shutdown to kill all running ssh sessions. This has to be done before the network is stopped if you don't want the ssh client to hang. For now /etc/init.d/sshd determines reboot/shutdown through the runlevel, see #c8. systemd doesn't set this and tries to solve it otherwise. That's OK, but again, all sshd processes must be killed before the network is stopped. With this modification of /etc/init.d/network, ssh clients don't hang anymore:

--- /etc/init.d/network.orig	2010-11-19 09:32:08.938001916 +0100
+++ /etc/init.d/network	2010-11-19 09:33:43.691999956 +0100
@@ -172,6 +172,12 @@
 	;;
 
   stop)
+	set -x
+	ps xawf -eo pid,user,cgroup,args | grep sshd.service
+	killall sshd
+	ps xawf -eo pid,user,cgroup,args | grep sshd.service
+	set +x
+
 	[ "$EUID" != "0" ] && exit 4
 	# Don't shut the network down if root is on NFS or a network
 	# block device.
The patch from the previous comment is not a fix; it's just a demonstration of how it is supposed to work.
I have tried it. It works fine here. If this continues to be a problem for you, then most likely systemd-user-sessions gets stopped after the network for you, which means it probably just needs an ordering dep. Are you using NM or the old scripts?

("killall sshd" is an evil thing; the script shouldn't do that)
It's set up with network scripts. NM is not even installed.

# systemctl | grep -i network
network.service    loaded active exited   LSB: Bring up/down network
network.target     loaded active active   Network
I am also still getting ssh client hangs on poweroff and/or reboot (no NM installed, if that matters).

systemd-15-1.fc15.x86_64
systemd-units-15-1.fc15.x86_64
kernel-2.6.37-0.rc5.git2.1.fc15.x86_64

Just before running poweroff/reboot:

# systemctl | egrep -i 'network|ssh'
network.service    loaded active running  LSB: Bring up/down network
sshd.service       loaded active running  LSB: Start up the OpenSSH
network.target     loaded active active   Network
*** Bug 735508 has been marked as a duplicate of this bug. ***
This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.
I haven't seen any issues of this type in quite some time.
(In reply to comment #20)
> I haven't seen any issues of this type in quite some time.

This happens when a wifi-connected machine shuts down with an inward ssh still active. See Bug 735508 for details. Unfortunately that bug was closed and this one stayed open, even though it isn't really an ssh issue but a tcp-and-wifi issue when a system is shut down. Systemd needs to let the tcp queue drain for perhaps 10ms-100ms before shutting the wifi-managing daemons down.

yum list systemd kernel ssh wpa_supplicant NetworkManager
...
Installed Packages
NetworkManager.x86_64     1:0.9.1.90-3.git20110927.fc16    @fedora
kernel.x86_64             3.1.0-0.rc6.git0.3.fc16          @koji-override-0/$releasever
kernel.x86_64             3.1.0-0.rc9.git0.0.fc16          @fedora
kernel.x86_64             3.1.0-0.rc10.git0.1.fc16         @fedora
systemd.x86_64            37-2.fc16                        @updates-testing
wpa_supplicant.x86_64     1:0.7.3-11.fc16                  @fedora

# shutdown -r now
<hangs till reboot finishes and IP can send an ICMP>
<1-2 minutes later:>
Write failed: Broken pipe
What's the current status on this, as in: is this still relevant? Somehow this seems more like an initscript (network) issue, or a NetworkManager one, rather than systemd.
This still happens. Comment #21 still applies.
Are you using the native systemd units or the legacy sysv init script? If you are using the legacy sysv init script, try the units in the git repo for ssh: copy them to the /etc/systemd/system directory, run systemctl daemon-reload, and recreate your case.
I'm using the stock Fedora 16 setup from a clean, from-scratch install. Whatever F16 does is what I do; a bug report against Fedora 16 would be worthless otherwise. Looking at /etc/rc5.d/ I don't see any *ssh* files, so I assume stock F16 isn't using the old SysV init stuff for sshd. If you want me to install something non-stock, you'll have to be much more explicit than "get something from a git repo" (like which git repo and which command line args/tags to get the right thing).
The original report is against systemd on F14 (comment 1), then in comment 10 we are on F15 with the legacy sysv init script, and now you are on F16 which has native systemd units, so you can see this can be a bit confusing. Anyway, now that we have established that this is happening for you with the units, we can start trying a couple of things.

The problem here can be that the [Install] section of wpa_supplicant.service points to multi-user.target as opposed to network.target, which means that when openssh is stopped, wpa-supplicant.service may actually have been stopped before it. To test if that is the case, add wpa-supplicant.service to the After= line of openssh.service, reload the daemon (systemctl daemon-reload), and recreate the scene that is happening.
It still hangs. Was the edit correct?

[root@ancho ~]# locate wpa_supplicant.service
/lib/systemd/system/wpa_supplicant.service
[root@ancho ~]# emacs /lib/systemd/system/wpa_supplicant.service
[root@ancho ~]# systemctl daemon-reload
[root@ancho ~]# shutdown -r
Shutdown scheduled for Mon, 27 Feb 2012 09:55:05 -0800, use 'shutdown -c' to cancel.
[root@ancho ~]#
Broadcast message from root.com (Mon, 27 Feb 2012 09:54:05 -0800):

The system is going down for reboot at Mon, 27 Feb 2012 09:55:05 -0800!

<hang>
Write failed: Broken pipe

diff -c /lib/systemd/system/sshd.service\~ /lib/systemd/system/sshd.service
*** /lib/systemd/system/sshd.service~	2012-02-22 00:04:28.000000000 -0800
--- /lib/systemd/system/sshd.service	2012-02-27 09:52:10.952396541 -0800
***************
*** 1,6 ****
  [Unit]
  Description=OpenSSH server daemon
! After=syslog.target network.target auditd.service
  
  [Service]
  EnvironmentFile=/etc/sysconfig/sshd
--- 1,6 ----
  [Unit]
  Description=OpenSSH server daemon
! After=syslog.target network.target auditd.service wpa-supplicant.service
  
  [Service]
  EnvironmentFile=/etc/sysconfig/sshd

Diff finished.  Mon Feb 27 09:52:25 2012
(In reply to comment #27)
> It still hangs. Was the edit correct?

Yup, but you might have had to restart the service for it to take full effect. (That reboot has certainly taken care of that now.)

> The system is going down for reboot at Mon, 27 Feb 2012 09:55:05 -0800!
> <hang>

Is this the system hang that times out in 90s? If so, is there anything in the logs on why it's hanging?

> Write failed: Broken pipe

This usually indicates that you have to tweak ServerAliveInterval in the ssh client config. (If you want to tweak that at the network-stack level, do "echo <seconds> > /proc/sys/net/ipv4/tcp_keepalive_time".)
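For reference, ServerAliveInterval is a client-side option (it lives in ssh_config, not sshd_config); a sketch of where it would go, with illustrative values:

```
# ~/.ssh/config: declare the peer dead after ~3 missed keepalives,
# i.e. roughly 45 seconds of silence with these example values.
Host *
    ServerAliveInterval 15
    ServerAliveCountMax 3
```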
> Yup but you might have had to restart the service for it to take full effect.
> ( That reboot certanly has taken care of that now. )

I rebooted a few times after that just to make certain that a clean restart had the same problems. The same hang happened every time.

> Is this system hang that times out in 90s ? If so is there something in the
> logs on why it's hanging?

The hang appeared to be 63 seconds (timed by hand via typing "date" in another window and subtracting). I assume the timeout is on the local system: the local tcp trying to get the extra <cr>'s I typed (after the "shutdown -r<cr>") accepted by the remote system. 63 sec is also the expected number if one has a 1, 2, 4, 8, 16, 32 second timeout in succession.

I'm happy that this timeout is happening as a fallback. My ssh scripts that update the remote systems would hang indefinitely if this timeout didn't happen.
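The 63-second figure is consistent with a doubling retry timer; a trivial sanity check of the arithmetic in the comment above:

```shell
# Six successive timeouts of 1, 2, 4, 8, 16 and 32 seconds
echo $((1 + 2 + 4 + 8 + 16 + 32))   # prints 63
```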
IMO, sshd server processes that get killed should not exit immediately on SIGTERM but rather wait until every byte has been written to the network and the connection has been closed successfully. Would this fix the problem?
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could affect also pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19
*** Bug 949049 has been marked as a duplicate of this bug. ***
(In reply to Fedora End Of Life from comment #31)
> This bug appears to have been reported against 'rawhide' during the Fedora
> 19 development cycle.
> Changing version to '19'.

This bug, or at least something with the same effects, started to show up consistently on the current rawhide (this will be '21'). Previously it was hitting from time to time, but now it has become a "constant feature". So far I have not observed such a thing on Fedora 18 installations. If a new report is required, please leave a note saying so.
Comment 33 seconded, I didn't see this in Fedora 18 or 19. But now I see it every time when remotely shutting down a Fedora 20 box (no wifi involved here). systemd-208-9.fc20.x86_64
*** Bug 1023788 has been marked as a duplicate of this bug. ***
How can https://bugzilla.redhat.com/show_bug.cgi?id=1023788 be a duplicate of this 3-year-old bug report? This one was reported for F14 and 1023788 is for F20; there are *six* dist-upgrades and god knows how many systemd major versions in between.

systemd-210-2.fc21.x86_64 is fixing that behavior; well, hopefully F20 will get systemd-210 soon. Unsure how to verify whether it also fixes the "rely-on-typo-behavior" from https://bugzilla.redhat.com/show_bug.cgi?id=1023788#c12
I confirm the same problem for Fedora 20. systemd-208-14.fc20.x86_64
In rawhide this is fixed by 210. For 208 in Fedora 20 we should include:

http://cgit.freedesktop.org/systemd/systemd-stable/commit/?id=151079d27da70b27845569fce5856dfada8fd467
http://cgit.freedesktop.org/systemd/systemd-stable/commit/?id=d9eb14931b12ded622f5aa4bd547e6075d10100f
http://cgit.freedesktop.org/systemd/systemd-stable/commit/?id=634acdc887a967039233e40b96e4f6ff79f86b72
http://cgit.freedesktop.org/systemd/systemd-stable/commit/?id=1a6dce07f96e63891db22c10acf5b0edc5c4dc81
http://cgit.freedesktop.org/systemd/systemd-stable/commit/?id=a0ef58e3e421909661c615ee6b067a9c2cd9f955

plus one more upcoming patch from Michal which solves a bug introduced by these patches.
I've pushed patches to the stable branch (and a couple of follow-up fixups) which should solve the issue.
> For 208 in fedora 20 we should include

Sounds promising, thank you.

> I've pushed patches to the stable branch

For now on koji I only see a F21 build, which was already fixed, no F20:
http://koji.fedoraproject.org/koji/packageinfo?packageID=10477
(In reply to Harald Reindl from comment #40)
> for now on koji i only see a F21 build which was already fixed, no F20
> http://koji.fedoraproject.org/koji/packageinfo?packageID=10477

I haven't built anything yet. I've just pushed code to the stable branch; I believe Zbyszek will release it in the next F20 update. Let me know if you'd like to test out a scratch build. I could do a scratch build of v208-stable for you. Otherwise wait until the next systemd update in F20. Thanks!
A scratch build sounds fine; I can install that on my VMware test server and look for remaining issues and/or confirm fixed ones.
Please give it a try. Thanks! http://koji.fedoraproject.org/koji/taskinfo?taskID=6622144
[root@testserver:~]$ rpm -qa | grep systemd
systemd-libs-208-16.fc20.x86_64
systemd-208-16.fc20.x86_64

The issue of this bug report is solved and boot looks fine; dracut -f was also called to make sure the build sits in the initrd as well.

_____________________________________________

This issue still exists and should at least be fixed for the beauty of the eye, but it is not critical and another topic; I will take a closer look tomorrow after some hours of running:

https://bugzilla.redhat.com/show_bug.cgi?id=1010572
Failed to open private bus connection: Failed to connect to socket /run/user/0/dbus/user_bus_socket: No such file or directory
Seems to work fine so far, and I will now update both my physical machines to the scratch build. Maybe the messages below are worth noting; no idea where to report them, and whether a separate bug report in the context of a scratch build is a good idea.

[root@testserver:~]$ cat messages | grep "Failed to mark scope session"
Mar 11 18:45:02 testserver systemd[1]: Failed to mark scope session-25.scope as abandoned : Stale file handle
Mar 11 19:15:01 testserver systemd[1]: Failed to mark scope session-39.scope as abandoned : Stale file handle
Mar 11 20:10:01 testserver systemd[1]: Failed to mark scope session-63.scope as abandoned : Stale file handle
Mar 11 20:15:01 testserver systemd[1]: Failed to mark scope session-64.scope as abandoned : Stale file handle
Mar 11 21:45:01 testserver systemd[1]: Failed to mark scope session-102.scope as abandoned : Stale file handle
OK, *that* was bad: while updating samba from koji packages, a segfault in systemd itself, and after that re-exec doesn't work.

[root@srv-rhsoft:~]$ systemctl daemon-reexec
Failed to get D-Bus connection: Failed to connect to socket /run/systemd/private: Verbindungsaufbau abgelehnt

[root@srv-rhsoft:~]$ cat messages
Mar 12 14:00:03 srv-rhsoft systemd[1]: Failed to mark scope session-712.scope as abandoned : Stale file handle
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /sys/devices/virtual/block/md0.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/md0.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-name-localhost.localdomain:0.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-uuid-1d691642:baed26df:1d197496:4fb00ff8.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-label/boot.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /sys/devices/virtual/block/md2.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/md2.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-name-localhost.localdomain:2.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-id/md-uuid-ea253255:cb915401:f32794ad:ce0fe396.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Found device /dev/disk/by-label/data.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Stopping Samba SMB Daemon...
Mar 12 14:09:14 srv-rhsoft systemd[1]: Starting Samba SMB Daemon...
Mar 12 14:09:14 srv-rhsoft systemd[1]: smb.service failed to run 'start' task: File exists
Mar 12 14:09:14 srv-rhsoft kernel: systemd[1]: segfault at a8 ip 00007f8ea03d561e sp 00007fffdfa52180 error 4 in systemd[7f8ea0341000+10b000]
Mar 12 14:09:14 srv-rhsoft systemd[1]: Caught <SEGV>, dumped core as pid 23370.
Mar 12 14:09:14 srv-rhsoft systemd[1]: Freezing execution.
Can you provide a coredump or at least a backtrace?
Sadly no coredumps enabled, and I can't reproduce it. I noticed in the KDE notifications "window xyz sent signal" and thought "uhm, while a yum upgrade is running? why?". Finally I rebooted the machine with "sync; reboot -f" because systemctl no longer worked.

The only thing I can offer is to try to reproduce that with downgrades/upgrades and, if that is possible, enable specific debugging, or at least provide steps so that you or another fedora/systemd developer can trigger it too.
Uhm, recently my KDE desktop went away: the system switched to TTY1 with the boot messages while the graphical user session runs on TTY2. Originally I thought "WTF, my desktop is gone", but after CTRL+ALT+F2 everything was still there. The messages smell like the same problem. Not sure if the "systemd-logind[787]: Failed to store session release timer fd" below, shortly before that happened, has anything in common.

[root@srv-rhsoft:~]$
Message from syslogd@rh at Mar 12 17:30:37 ...
 kernel:systemd[1]: segfault at a8 ip 00007f4ece2ca61e sp 00007fffe460b110 error 4 in systemd[7f4ece236000+10b000]
______________________________________________________
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 292 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 292 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 293 of user root.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 293 of user root.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 294 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 294 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 295 of user root.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 295 of user root.
Mär 12 17:30:01 rh.thelounge.net CROND[10363]: (wwwcron) CMD ( umask 006; nice php /www/distribute/call_cronjob.php modules/podcast/cron.php passthru,popen,pclose)
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 296 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 296 of user wwwcron.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 298 of user apache.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 298 of user apache.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Starting Session 297 of user root.
Mär 12 17:30:01 rh.thelounge.net systemd[1]: Started Session 297 of user root.
Mär 12 17:30:01 rh.thelounge.net CROND[10375]: (root) CMD (/usr/lib64/sa/sa1 1 1)
Mär 12 17:30:01 rh.thelounge.net CROND[10382]: (wwwcron) CMD ( umask 006; nice php /www/distribute/call_cronjob.php modules/doc-managment/cron.php passthru)
Mär 12 17:30:01 rh.thelounge.net CROND[10391]: (wwwcron) CMD ( umask 006; nice php /www/distribute/call_cronjob.php schedule.php)
Mär 12 17:30:01 rh.thelounge.net CROND[10392]: (root) CMD ( php /usr/local/scripts/rh_watchdog.php)
Mär 12 17:30:01 rh.thelounge.net CROND[10395]: (apache) CMD ( nice -n 19 ionice -c 3 bash /scripts/cleanup-uploadtemp.cron)
Mär 12 17:30:01 rh.thelounge.net CROND[10396]: (root) CMD ( nice rsync --quiet --no-motd --times --force --recursive --delete-after --links --hard-links --devices --specials --perms --owner --group --executability --acls --xattrs root@adminserver:/var/lib/clamav/ /var/lib/clamav/ )
Mär 12 17:30:01 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:01 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:01 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:02 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:02 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:02 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:02 rh.thelounge.net systemd-logind[787]: Failed to store session release timer fd
Mär 12 17:30:37 rh.thelounge.net kernel: systemd[1]: segfault at a8 ip 00007f4ece2ca61e sp 00007fffe460b110 error 4 in systemd[7f4ece236000+10b000]
Mär 12 17:30:37 rh.thelounge.net systemd[1]: Caught <SEGV>, dumped core as pid 10912.
Mär 12 17:30:37 rh.thelounge.net systemd[1]: Freezing execution.
The segfault happened again on my home machine (right after coming home from the office). I managed to suspend my VMware guests, followed by "sync; reboot --force", and at boot time fsck cleaned orphaned inodes. Well, after that I decided to reboot my virtual machines after "touch /forcefsck" for safety; during shutdown the test server had a systemd crash while stopping services and needed a hard reset.

The scratch build solves the reboot problem but is not that stable. May I ask one of the developers to install it on an own-dogfood machine, because debugging systemd really feels outside my scope :-(
I downgraded to systemd-208-15.fc20.x86_64 because these segfaults happen randomly, leading to file-system errors at boot and damaged mysqld tables; if that happens at the wrong moment, working data is at risk.

It is *unacceptable* that systemd upstream has a 208-209-210-*BULLSHIT* release policy; Lennart happily makes the initial builds and no longer cares about anything that happens downstream. You can do that for a f**g media player but not for PID 1. Upstream needs to realize that POINT RELEASES are needed instead of this cherry-pick bullshit, and as long as we Fedora users are downstream testers for upstream RHEL, the project leader paid by Red Hat is much more responsible than changing some bug report subjects and making initial commits.

*WHERE IS* systemd 208.1, 208.2, 208.3...? What is that, "I build a new major and then someone bites the result"?
http://koji.fedoraproject.org/koji/packageinfo?packageID=10477
(In reply to Harald Reindl from comment #50)
> the segfault happened again on my home-machine (recently after come home
> from office) ....

So far I have failed to segfault systemd-208-16.fc20.x86_64 on my test system, although I tried. After 'daemon-reexec' I found the following AVC in my logs:

type=1403 audit(1394755727.603:3): policy loaded auid=0 ses=1
type=1107 audit(1394755735.582:4): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='avc: received policyload notice (seqno=2) /systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
type=1305 audit(1394755761.385:5): audit_pid=2397 old=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:auditd_t:s0 res=1

OTOH, selinux there is set to "permissive". Harald, are you running by any chance with "enforcing"? 0xffffffff for auid and ses does not look very convincing.

BTW, I do not have any VM guests here.
> Harald, are you running by any chance with "enforcing"?

No, all of my machines have selinux=0 as a kernel param, so no selinux.

> BTW - I do not have here any VM guests

I am talking about my *physical machine* running my daily workload and hosting development virtual machines. The segfault happened 3 times today and there is no chance to get services stopped cleanly.

Since this cherry-picking onto the broken systemd-208 in F20 leads nowhere, what about upgrading systemd in F20 to the latest upstream and only ruling out https://bugzilla.redhat.com/show_bug.cgi?id=1072368, because the other bugs are fixed, until upstream realizes that the "cool" Google-Chrome versioning is not the way to go for PID 1?
Harald Reindl, for your information, systemd upstream already maintains stable branches (currently "v208-stable" and "v210-stable"): http://www.freedesktop.org/wiki/Software/systemd/Backports/ Although there are currently no tagged point releases made from them, these git branches are meant for distros to base their updates on. I don't understand your objection to cherry-picking. How do you think any software project creates bugfix releases in the first place? Mostly by cherry-picking selected fixes, of course. Your innuendo that Lennart should himself be made responsible for maintaining bugfix releases shows an inflated sense of entitlement. Also, doing that would be counter-productive. Division of labour is awesome. Would you please stop posting such angry comments in BZs? They are demotivating. A productive contribution would be to provide the coredump or backtrace. Your logs show it did dump the core somewhere. What is your /proc/sys/kernel/core_pattern set to?
> for your information, systemd upstream already maintains stable branches

2014-02-28: [systemd-devel] [HEADSUP] Formalizing Backports
All the bugs and troubles in systemd on F20 have been there for months.

> I don't understand your objection to cherry-picking

Then subscribe to httpd-devel; there you can see how maintaining important software works, for even more than one stable release over many years, frankly longer than systemd has existed at all. Currently updates for 2.2 and 2.4 are in the vote state.

> Your innuendo that Lennart should himself be made responsible for
> maintaining bugfix releases shows an inflated sense of entitlement

I do not care whether Lennart or whoever does it, but that the current release strategy of systemd is broken becomes obvious when bugs reported before the F20 release are still there, and trying to fix them ends in completely unstable systems.

> Would you please stop posting such angry comments in BZs?
> They are demotivating

Reporting bugs months ago and seeing them go into a stable release is demotivating too. Seeing things like "Failed to open private bus connection: Failed to connect to socket /run/user/0/dbus/user_bus_socket" at every reboot and for every user is also demotivating, because it leaves the question whether the person who introduced the upgrade ever read any system logs.

> A productive contribution would be to provide the coredump or
> backtrace. Your logs show it did dump the core somewhere. What
> is your /proc/sys/kernel/core_pattern set to?

cat /proc/sys/kernel/core_pattern
core

"Somewhere" does not help me much. You see from my logs what is happening, while I, as a web developer, don't understand how to debug systemd; it's up to you to help me help, and to tell me where I find the information to submit. All the time we are wasting on the large amount of systemd-related bugs in F20 still leads to the question why an upgrade is not considered instead of spending another month on try-to-get-the-same-behavior.
Check private mail with the coredumps, found directly in the rootfs. That would have been an important piece of info, because for you it's obvious where to find such things; for a user it is not. Sent privately to avoid a leak of possibly sensitive data.
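The rootfs location follows from the core_pattern shown earlier: a bare relative name like "core" makes the kernel write the dump into the crashing process's current working directory, and for PID 1 that directory is /. The current setting can be inspected with:

```shell
# A pattern without a leading / is relative to the crashing process's
# cwd; a leading | would instead pipe the dump to a helper program.
cat /proc/sys/kernel/core_pattern
```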
Thank you. I received the coredumps and also forwarded them to Michal Sekletar. The stack traces in the four coredumps all point to unit_unwatch_pid(). That makes 0328-core-watch-SIGCHLD-more-closely-to-track-processes-o.patch the primary suspect. Here's one of the stack traces:

$ gdb -c cores/core.5677
[...]
(gdb) bt full
#0  0x00007fd0cbf9162b in raise () from /lib64/libpthread.so.0
No symbol table info available.
#1  0x00007fd0cdadca7e in crash (sig=11) at ../src/core/main.c:152
        rl = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        sa = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        pid = 0
        __func__ = "crash"
        __PRETTY_FUNCTION__ = "crash"
#2  <signal handler called>
No symbol table info available.
#3  0x00007fd0cdb5061e in unit_unwatch_pid (u=u@entry=0x7fd0ce34fe40, pid=32328) at ../src/core/unit.c:1686
        __PRETTY_FUNCTION__ = "unit_unwatch_pid"
#4  0x00007fd0cdade5a6 in manager_dispatch_sigchld (m=m@entry=0x7fd0ce2bd100) at ../src/core/manager.c:1392
        si = {si_signo = 17, si_errno = 0, si_code = 1, _sifields = {_pad = {32328, 500, 0 <repeats 26 times>}, _kill = {si_pid = 32328, si_uid = 500}, _timer = {si_tid = 32328, si_overrun = 500, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _rt = {si_pid = 32328, si_uid = 500, si_sigval = {sival_int = 0, sival_ptr = 0x0}}, _sigchld = {si_pid = 32328, si_uid = 500, si_status = 0, si_utime = 0, si_stime = 0}, _sigfault = {si_addr = 0x1f400007e48, si_addr_lsb = 0}, _sigpoll = {si_band = 2147483680328, si_fd = 0}, _sigsys = {_call_addr = 0x1f400007e48, _syscall = 0, _arch = 0}}}
        u = 0x7fd0ce34fe40
        r = <optimized out>
        __PRETTY_FUNCTION__ = "manager_dispatch_sigchld"
        __func__ = "manager_dispatch_sigchld"
#5  0x00007fd0cdae41cd in manager_process_signal_fd (m=<optimized out>) at ../src/core/manager.c:1642
        n = <optimized out>
        sfsi = {ssi_signo = 17, ssi_errno = 0, ssi_code = 1, ssi_pid = 32328, ssi_uid = 500, ssi_fd = 0, ssi_tid = 0, ssi_band = 0, ssi_overrun = 0, ssi_trapno = 0, ssi_status = 0, ssi_int = 0, ssi_ptr = 0, ssi_utime = 1387, ssi_stime = 326, ssi_addr = 0, __pad = '\000' <repeats 47 times>}
        sigchld = true
#6  process_event (ev=0x7fff6c329d50, m=0x7fd0ce2bd100) at ../src/core/manager.c:1667
        r = <optimized out>
        w = <optimized out>
#7  manager_loop (m=0x7fd0ce2bd100) at ../src/core/manager.c:1864
        event = {events = 1, data = {ptr = 0x7fd0ce2bd1d0, fd = -835989040, u32 = 3458978256, u64 = 140534788903376}}
        n = <optimized out>
        wait_msec = <optimized out>
        r = <optimized out>
        rl = {interval = 1000000, begin = 28707172784, burst = 50000, num = 17}
        __PRETTY_FUNCTION__ = "manager_loop"
        __func__ = "manager_loop"
#8  0x00007fd0cdada94e in main (argc=5, argv=0x7fff6c32a6b8) at ../src/core/main.c:1672
        m = 0x7fd0ce2bd100
        r = <optimized out>
        retval = 1
        before_startup = 2535502
        after_startup = <optimized out>
        timespan = "570.282ms\000\302\313\320\177\000\000P\302+\316\320\177\000\000\020\300+\316\320\177\000\000\000\000\000\000\000\000\000\000g\254&\315\320\177\000\000\000\000\000\000\000\000\000\000P\302+\316\320\177\000"
        fds = 0x0
        reexecute = false
        shutdown_verb = 0x0
        initrd_timestamp = {realtime = 0, monotonic = 0}
        userspace_timestamp = {realtime = 1394723031488972, monotonic = 2423982}
        kernel_timestamp = {realtime = 1394723029064992, monotonic = 0}
        systemd = "systemd"
        skip_setup = false
        j = <optimized out>
        loaded_policy = false
        arm_reboot_watchdog = false
        queue_default_job = <optimized out>
        switch_root_dir = 0x0
        switch_root_init = 0x0
        saved_rlimit_nofile = {rlim_cur = 1024, rlim_max = 4096}
        __func__ = "main"
        __PRETTY_FUNCTION__ = "main"
I think we now have all fixes in the stable branch. I apologize for the inconvenience caused by systemd-208-16. I can provide you with a test build, but I doubt you'd like to try one more.
Oh, as the reporter I am fine with testing another one: later this evening on my testserver VM, after some hours with no issues also on my physical machine (currently the company workstation), and after 2 days on my homeserver host. My primary intention is to get systemd in F20 clean, because I plan to upgrade production servers from F19 to F20 in May, and I appreciate the larger time window of having all development systems and production machines on the same level for cleaning up rsyslog configs and whatever else - that's way easier when the same software versions are everywhere for distributing tested configs.
Is there any progress? These systemd problems have existed for more than 7 months now and are show-stoppers for any admin who takes his responsibility for production machines seriously when considering an upgrade to Fedora 20.
(In reply to Harald Reindl from comment #60)
> and are show-stoppers for any admin taking serious his
> responsibility for production machines

This is clearly a workaround, but you may run something in this style:

  ( reboot & ) ; exit

instead of a bare 'reboot'.

> in case of consider upgrade to Fedora 20

As you can see here, the issue was first reported for Fedora 14, although it definitely got more annoying with Fedora 20.
What I DO NOT UNDERSTAND is "I can provide you with the test build, but I doubt you'd like to try one more" followed by a clear "yes, I am willing to test" with no further response.

> As you can see here the issue was first reported for Fedora
> 14 although it definitely got more annoying with Fedora 20

Don't get me wrong, but:
* F14 had no systemd at all
* I have been running Fedora servers in production since Fedora 9
* before F20 it was *no* serious problem to type "reboot" in an ssh session

Sorry, no, that is *clearly* an F20 problem. Besides the 30 "real" production servers, I have in the meantime maintained a few workstations as well as routers/firewalls/gateways running on Fedora over *many* years, and reboot was never an issue except maybe 1 out of 100 reboots on a single machine. What we have now is that *every* reboot hangs, and all the workarounds don't help in larger infrastructures where you have *scripted* operations.
The problem only happens to me when the machines are wifi connected. Ethernet-connected systems reboot just fine without the ssh hang on F20 (and have for the past 15 or so versions). While it's annoying that nobody has fixed this bug yet, the simplest workaround is to just use "shutdown -r" or "shutdown" to add a one-minute delay to the shutdown or reboot; ssh can then exit normally. To reboot all my servers (25 fewer than yours ;-)) I just run a loop over "ssh $host shutdown -r". It doesn't hang, and the extra minute's delay isn't going to kill me.
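The loop from the previous comment and the detach trick from comment 61 can be combined into one small helper. This is only a sketch: the host names are placeholders, `reboot_all` is a name invented here, and whether the detached form is needed depends on the setup.

```shell
# reboot_all HOST... : issue a detached reboot on each host so the
# local ssh client is not left hanging when the remote network goes
# down. The backgrounded subshell '( reboot & )' lets the remote
# shell exit (and sshd close the connection) before reboot tears
# the network down.
reboot_all() {
    for host in "$@"; do
        ssh "$host" '( reboot & ) ; exit'
    done
}

# Usage (host names are placeholders):
#   reboot_all web1 web2 db1
```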
It happens from the moment you are using network.service instead of NM, and only in F20. Machines with static network configuration, especially servers, don't need NM.

The extra minute IS killing me: my typical servers reboot within 8 to 15 seconds, so instead of waiting for some timeout I could be checking logs and doing typical admin tasks after a fresh reboot / kernel update, rather than just waiting for nothing. That is why the servers are still running F19, and it's a shame to be forced to.
> it happens from the moment you are using network.service instead NM, only in F20

I am seeing this on machines with NM handling the network and no wifi, so both look coincidental to me. OTOH I was hitting that condition for sure in Fedora 18 and earlier, though only occasionally there, while with F20 you can bank on it.
(In reply to Harald Reindl from comment #62) > what i DO NOT UNDERSTAND is "I can provide you with the test build, but I > doubt you'd like to try one more" floowed by a clear "yes i am willing to > test" with no further response I am sorry for not providing you with the test build earlier. Here it is [1]. Hope that helps. BTW, for the future, if you need something from me and I do not respond in BZ feel free to reach me on IRC (#fedora-devel or #systemd on freenode). Thanks! [1] http://msekleta.fedorapeople.org/systemd/
(In reply to Michal Sekletar from comment #66) > I am sorry for not providing you with the test build earlier. Here it is > [1]. Hope that helps. BTW, for the future, if you need something from me and > I do not respond in BZ feel free to reach me on IRC (#fedora-devel or > #systemd on freenode). Thanks! > > [1] http://msekleta.fedorapeople.org/systemd/ The rpms solve the issue in my VM.
Build installed on my testserver - looks fine, but from the experience with the scratch build it is too soon to qualify. I will give it some hours in the VM and, if it doesn't eat children, try it on my physical machines this evening or tomorrow morning. Looking at the history: if it doesn't segfault or start spamming the syslog within two days, we have a candidate for a stable update. Thanks!
Until now no segfault - knock on wood. However, the messages below are still appearing; they happened for the first time with the segfaulting scratch build and never before:

Apr 7 14:30:01 testserver systemd[1]: Failed to mark scope session-42.scope as abandoned : Stale file handle
Apr 7 15:00:01 testserver systemd[1]: Failed to mark scope session-56.scope as abandoned : Stale file handle
Apr 7 15:00:02 testserver systemd[1]: Failed to mark scope session-57.scope as abandoned : Stale file handle
Apr 7 16:00:01 testserver systemd[1]: Failed to mark scope session-91.scope as abandoned : Stale file handle
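As a small illustration (not part of the report), one way to see which sessions hit this is to pull the scope names out of a saved syslog excerpt and deduplicate them; the `abandoned_scopes` helper name and the log file path are assumptions made here.

```shell
# abandoned_scopes LOGFILE : list the unique session scopes that
# logged "Failed to mark scope ... as abandoned". Purely a log-grep
# helper; the matched text follows the message format quoted above.
abandoned_scopes() {
    grep 'Failed to mark scope' "$1" \
        | grep -oE 'session-[0-9]+\.scope' \
        | sort -u
}
```

This makes it easy to check whether the affected sessions correlate with cron jobs or interactive logins, which is the open question in the following comments.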
Yes, I know about this issue and it is on my TODO list. FWIW, this happens because logind tries to call the new AbandonScope() D-Bus API of systemd, but the scope on which it is called is neither abandoned nor running. I have seen this happen when a session and scope get created via pam_systemd, but the scope goes away because a stop on it is explicitly requested; after that, the app which created the PAM session tears it down, but the scope is already in the stop-sigterm substate, so this message is logged. Do you have another reproducer?
> Do you have an another reproducer?

I admit I'm missing enough knowledge of the background. In the meantime all my Fedora 20 machines, including both of my physical machines, have been upgraded to that build, followed by "dracut -f", and I see the messages on all of them.

Just a half-educated guess: https://bugzilla.redhat.com/show_bug.cgi?id=1010572 - maybe that is related to these failing sockets we see for any user, cronjob or real user alike, after the first session start.
Based on a few quick tests, I'm no longer experiencing this issue with systemd-208-16.fc20.x86_64 even though its changelog doesn't refer to this bug. Is 208-16 expected to fix it?
I'm also not seeing this with the -16 package. Not sure what to make of it but I've tried a bunch of times to reproduce across several machines and can't. I suspect a fix for this was fallout from some of the other restructuring as there doesn't appear to have been a general rollup of the other pending systemd 208 patches.
OK, I'm completely wrong; it looks like everything from http://cgit.freedesktop.org/systemd/systemd-stable/log/?h=v208-stable (newer than 2014-03-05, because those were in -15) is rolled into this update. Which is odd, because the bug where systemd --user and sd-pam processes hang around isn't fixed, and I thought that was supposed to be in the stable tree as well. Oh well, at least this one does appear to be fixed. I'd guess this ticket can be closed.
Also confirmed for systemd-208-16.fc20.x86_64. But now services with shell scripts suspending VMware guests are broken: the scripts get killed. That was not a problem for a long time after the switch to systemd; it worked all last year. Now we have to choose between freezing ssh clients or killed VMware guests - systemd after F19 is simply broken. There is no reason to unexpectedly kill the unit below:
____________________________________________
[root@srv-rhsoft:~]$ cat /etc/systemd/system/vmware.service

[Unit]
Description=VMware Workstation Service
After=network.service systemd-networkd.service kdm.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/systemctl start vmware.target
ExecStartPost=-/usr/bin/su -c "/scripts/vmware/vm-default-start.sh" vmware
ExecStop=-/scripts/vmware/vm-suspend-all.sh
ExecStop=/usr/bin/systemctl stop vmware.target
TimeoutSec=1800

[Install]
WantedBy=multi-user.target
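One direction to experiment in - purely a sketch, not a verified fix for the killed suspend scripts: order the unit explicitly against shutdown.target so its ExecStop= scripts run early in the shutdown transaction, and set the stop timeout explicitly. Conflicts=shutdown.target plus Before=shutdown.target is the standard systemd shutdown-ordering idiom; combined with After=network.service, the stop job is ordered to run before network.service is stopped, since stop order is the reverse of start order.

```ini
# Sketch only: the unit from above with explicit shutdown ordering.
[Unit]
Description=VMware Workstation Service
After=network.service systemd-networkd.service kdm.service
# Stop this unit (running its ExecStop= scripts) before
# shutdown.target is reached.
Before=shutdown.target
Conflicts=shutdown.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/systemctl start vmware.target
ExecStartPost=-/usr/bin/su -c "/scripts/vmware/vm-default-start.sh" vmware
ExecStop=-/scripts/vmware/vm-suspend-all.sh
ExecStop=/usr/bin/systemctl stop vmware.target
# Allow the suspend script up to 30 minutes on stop as well.
TimeoutStopSec=1800

[Install]
WantedBy=multi-user.target
```

Whether this actually prevents the scripts from being killed in this case is an open question; it only makes the intended ordering explicit instead of relying on defaults.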
WHY in the world is the systemd of F20 not thrown away and rebased against the rawhide version, after explaining to people like Lennart that only the DDOS attack against central syslog servers needs to be fixed, and that he is not in a position to rule the world in the case of a central component like systemd? In doubt it is not relevant what he or Kay think - nobody asked them to make systemd as bad as it is in F20, so they have to accept that the ordinary sysadmin is not interested in systemd debug messages. The answer of Linus on the kernel list in that context was pretty clear (https://bugzilla.redhat.com/show_bug.cgi?id=1072368).

I do not make many friends by stating things as they are - who cares - I do not need friends, I need properly working systems as before F15 and systemd. I am frustrated: after thinking things got way better from F17 to F19, we now see a step backwards, which is unacceptable.
(In reply to Jason Tibbitts from comment #74) > Oh, well, at least this one does appear to be fixed. I'd guess this ticket > can be closed. Yeah, I forgot to add this bug to the update comments.