Description of problem:
After a few hundred connections, a socket unit failed with the error:

    socket failed to queue socket startup job: Transport endpoint is not connected

There seems to be no way to tell systemd to restart a socket unit automatically.

Version-Release number of selected component (if applicable):
systemd-37-3.fc16.x86_64

How reproducible:
Sometimes

Steps to Reproduce:
1. Make a few hundred connections to a socket unit, closing a few soon after connecting.

Additional info:
Reported upstream but no response:
http://lists.freedesktop.org/archives/systemd-devel/2012-January/004246.html
I see that it is a socket unit with "Accept=yes". Does the corresponding *.service unit use the dash prefix in ExecStart?

    ExecStart=-/bin/foo ...

Does "systemctl --all" list a large number of service instances? If the problem is something else, please attach both unit files (foo.socket and foo@.service).
Yes, I have the dash.

    [Unit]
    Description=SSH Per-Connection Server

    [Service]
    ExecStart=-@/usr/sbin/sshd sshd_foo -ddd -i -f /etc/ssh/sshd_foo
    StandardInput=socket
    StandardOutput=socket
    StandardError=syslog
    SyslogFacility=local0
    SyslogLevel=info
    SyslogLevelPrefix=true
    SyslogIdentifier=custom

    [Install]
    Also=foo.socket

Looks somehow related to this problem:
http://lists.freedesktop.org/archives/systemd-devel/2011-February/001359.html
I also have /etc/pam.d/sshd_foo, which is what the sshd_foo argument after sshd is about. Right now it is the same as /etc/pam.d/sshd.
I managed to reproduce it.

foo.socket:

    [Unit]
    Description=foo socket

    [Socket]
    ListenStream=22222
    Accept=yes
    KeepAlive=yes

foo@.service:

    [Unit]
    Description=foo service

    [Service]
    ExecStart=-/bin/cat
    StandardInput=socket
    StandardOutput=socket

systemd was running in a virtual guest where TCP keep-alive was configured aggressively:

    echo 10 > /proc/sys/net/ipv4/tcp_keepalive_intvl
    echo 10 > /proc/sys/net/ipv4/tcp_keepalive_time

I attached to systemd with gdb and set a breakpoint on instance_from_socket(). Then, on the host:

    nc $IP_OF_GUEST 22222

In the guest, systemd hits the breakpoint. TCP keepalive packets can be seen with tcpdump. On the host:

    iptables -I INPUT 1 -i virbr0 -p tcp --sport 22222 -j REJECT --reject-with tcp-reset

Soon a keepalive packet hits this rule and causes a TCP RST to be sent back. It is the TCP RST that causes the previously connected socket to become disconnected. In the guest, resume the paused systemd: getpeername() returns ENOTCONN and the socket enters the 'failed' state.
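The VM/iptables reproduction above can also be demonstrated locally, without a guest or keepalives, by making the peer close with a TCP RST. This is a hypothetical Python sketch (not part of the original report): closing a socket with SO_LINGER set to a zero timeout sends a RST instead of a FIN, after which getpeername() on the accepted socket fails with ENOTCONN, the same condition that put the socket unit into the 'failed' state.

```python
import errno
import socket
import struct
import time

# Set up a loopback listener and one accepted connection.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
conn, _ = srv.accept()

# While the connection is established, getpeername() succeeds.
assert conn.getpeername()[0] == "127.0.0.1"

# SO_LINGER with l_onoff=1, l_linger=0 makes close() abort the
# connection with a RST rather than performing an orderly shutdown.
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
cli.close()
time.sleep(0.2)  # give the kernel time to process the incoming RST

recv_err = None
peer_err = None

# The first read reports the reset...
try:
    conn.recv(1)
except OSError as e:
    recv_err = e.errno

# ...and afterwards the socket is no longer connected, so
# getpeername() fails with ENOTCONN, as systemd observed.
try:
    conn.getpeername()
except OSError as e:
    peer_err = e.errno

print("recv:", errno.errorcode.get(recv_err, recv_err))
print("getpeername:", errno.errorcode.get(peer_err, peer_err))

conn.close()
srv.close()
```

On Linux this prints ECONNRESET for the read and ENOTCONN for getpeername(), matching the state the breakpoint in instance_from_socket() was catching.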
Fixed upstream: http://cgit.freedesktop.org/systemd/systemd/commit/?id=1a710b430b7e5fa036ee5c03e14e60f725df5baf
Any chance of including this patch in an update? Thanks.
It is already on its way through Fedora's update path (updates-testing --> updates). You can grab the RPMs from here if you can't wait:
http://koji.fedoraproject.org/koji/buildinfo?buildID=294582
systemd-37-10.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/FEDORA-2012-0409/systemd-37-10.fc16
Package systemd-37-10.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.

Update it with:

    # su -c 'yum update --enablerepo=updates-testing systemd-37-10.fc16'

as soon as you are able to. Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-0409/systemd-37-10.fc16
then log in and leave karma (feedback).
Package systemd-37-11.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.

Update it with:

    # su -c 'yum update --enablerepo=updates-testing systemd-37-11.fc16'

as soon as you are able to. Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-0409/systemd-37-11.fc16
then log in and leave karma (feedback).
systemd-37-11.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.