Description of problem:
After a few hundred connections, a socket unit failed with the error:

    socket failed to queue socket startup job: Transport endpoint is not connected

There seems to be no way to tell systemd to restart a socket unit automatically.

Version-Release number of selected component (if applicable):
systemd-37-3.fc16.x86_64

How reproducible:
Sometimes

Steps to Reproduce:
1. Make a few hundred connections to a socket unit, closing a few soon after connecting.

Additional info:
Reported upstream but no response:
http://lists.freedesktop.org/archives/systemd-devel/2012-January/004246.html
I see that it is a socket unit with "Accept=yes". Does the corresponding *.service unit use the dash prefix in ExecStart?

    ExecStart=-/bin/foo ...

Does "systemctl --all" list a large number of service instances? If the problem is something else, please attach both unit files (foo.socket and foo@.service).
Yes, I have the dash.

    [Unit]
    Description=SSH Per-Connection Server

    [Service]
    ExecStart=-@/usr/sbin/sshd sshd_foo -ddd -i -f /etc/ssh/sshd_foo
    StandardInput=socket
    StandardOutput=socket
    StandardError=syslog
    SyslogFacility=local0
    SyslogLevel=info
    SyslogLevelPrefix=true
    SyslogIdentifier=custom

    [Install]
    Also=foo.socket

Looks somehow related to this problem:
http://lists.freedesktop.org/archives/systemd-devel/2011-February/001359.html
I also have /etc/pam.d/sshd_foo, which is what the sshd_foo argument after sshd is about. Right now it is the same as /etc/pam.d/sshd.
I managed to reproduce it.

foo.socket:

    [Unit]
    Description=foo socket

    [Socket]
    ListenStream=22222
    Accept=yes
    KeepAlive=yes

foo@.service:

    [Unit]
    Description=foo service

    [Service]
    ExecStart=-/bin/cat
    StandardInput=socket
    StandardOutput=socket

systemd was running in a virtual guest where TCP keep-alive was configured aggressively:

    echo 10 > /proc/sys/net/ipv4/tcp_keepalive_intvl
    echo 10 > /proc/sys/net/ipv4/tcp_keepalive_time

I attached to systemd with gdb and set a breakpoint on instance_from_socket(). Then, on the host:

    nc $IP_OF_GUEST 22222

In the guest, systemd hits the breakpoint. TCP keepalive packets can be seen with tcpdump. On the host:

    iptables -I INPUT 1 -i virbr0 -p tcp --sport 22222 -j REJECT --reject-with tcp-reset

Soon a keepalive packet hits this rule and causes a TCP RST to be sent back. It is the TCP RST that causes the previously connected socket to become disconnected. In the guest, resume the paused systemd: getpeername() returns ENOTCONN and the socket enters the 'failed' state.
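The VM/iptables reproduction above can also be demonstrated locally, without a guest or keepalives, by making the peer close with a TCP RST. This is a hypothetical Python sketch (not part of the original report): closing a socket with SO_LINGER set to a zero timeout sends a RST instead of a FIN, after which getpeername() on the accepted socket fails with ENOTCONN, the same condition that put the socket unit into the 'failed' state.

```python
import errno
import socket
import struct
import time

# Set up a loopback listener and one accepted connection.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
conn, _ = srv.accept()

# While the connection is established, getpeername() succeeds.
assert conn.getpeername()[0] == "127.0.0.1"

# SO_LINGER with l_onoff=1, l_linger=0 makes close() abort the
# connection with a RST rather than performing an orderly shutdown.
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
cli.close()
time.sleep(0.2)  # give the kernel time to process the incoming RST

recv_err = None
peer_err = None

# The first read reports the reset...
try:
    conn.recv(1)
except OSError as e:
    recv_err = e.errno

# ...and afterwards the socket is no longer connected, so
# getpeername() fails with ENOTCONN, as systemd observed.
try:
    conn.getpeername()
except OSError as e:
    peer_err = e.errno

print("recv:", errno.errorcode.get(recv_err, recv_err))
print("getpeername:", errno.errorcode.get(peer_err, peer_err))

conn.close()
srv.close()
```

On Linux this prints ECONNRESET for the read and ENOTCONN for getpeername(), matching the state the breakpoint in instance_from_socket() was catching.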
Fixed upstream: http://cgit.freedesktop.org/systemd/systemd/commit/?id=1a710b430b7e5fa036ee5c03e14e60f725df5baf
Any chance of including this patch in an update? Thanks.
It is already on its way through Fedora's update path (updates-testing --> updates). You can grab the RPMs from here if you can't wait:
http://koji.fedoraproject.org/koji/buildinfo?buildID=294582
systemd-37-10.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/FEDORA-2012-0409/systemd-37-10.fc16
Package systemd-37-10.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.

Update it with:

    # su -c 'yum update --enablerepo=updates-testing systemd-37-10.fc16'

as soon as you are able to. Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-0409/systemd-37-10.fc16
then log in and leave karma (feedback).
Package systemd-37-11.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.

Update it with:

    # su -c 'yum update --enablerepo=updates-testing systemd-37-11.fc16'

as soon as you are able to. Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-0409/systemd-37-11.fc16
then log in and leave karma (feedback).
systemd-37-11.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.