Bug 1203000
| Field | Value |
|---|---|
| Summary | RFE: Seamless server restart, modify SO_REUSEPORT |
| Product | Red Hat Enterprise Linux 7 |
| Reporter | Jesper Brouer <jbrouer> |
| Component | kernel |
| Kernel sub component | tcp |
| Assignee | Jesper Brouer <jbrouer> |
| QA Contact | xmu |
| Docs Contact | |
| Status | CLOSED WONTFIX |
| Severity | medium |
| Priority | medium |
| CC | aloughla, atragler, ccui, fwestpha, haliu, hsowa, jbrouer, jeder, jialiu, kzhang, mleitner, mpatel, network-qe, rkhan |
| Version | 7.3 |
| Keywords | FutureFeature |
| Target Milestone | rc |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Linux |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Enhancement |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2016-08-26 11:33:37 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | 1151756 |
| Bug Blocks | |
Description
Jesper Brouer 2015-03-17 21:16:53 UTC
*** Bug 1030735 has been marked as a duplicate of this bug. ***

The reproducer described in bug 1030735 is fairly complicated, and involves 2x haproxy, 2x nodejs and apache-bench (ab). While troubleshooting bug 1030735 I developed two testing tools, tcp_sink [1] and tcp_sink_client [2], which made it easier for me to reproduce. These tools are available in the github repository:

  https://github.com/netoptimizer/network-testing/

[1] tcp_sink: https://github.com/netoptimizer/network-testing/blob/master/src/tcp_sink.c
[2] tcp_sink_client: https://github.com/netoptimizer/network-testing/blob/master/src/tcp_sink_client.c

Reproducer01: for issue-1 described in comment #0

(In reply to Jesper Brouer from comment #0)
> Issue-1) When removing a listen socket (e.g. app closing).
>
> The in-flight 3rd-ACK packets then cannot look up the corresponding
> request_sock. The 3rd-ACK will actually find another listen_sock, and
> try to look up the request_sock there. When that fails, it will send
> back a RST, resulting in the connect() call failing with "Connection
> reset by peer" (errno 104 / ECONNRESET).

Requires three shells.

Shell01: Start tcp_sink with a high connection count limit::

  ./tcp_sink --reuse --count 20000000

Shell02: Create a loop that restarts tcp_sink, limiting each tcp_sink instance to accepting 1000 connections::

  i=0; while (( i++ < 1000 )); do ./tcp_sink --reuse -c 1000; done

Shell03: Start a tcp_sink_client doing many connection attempts::

  ./tcp_sink_client -c 20000000 127.0.0.1

Notice the failure from tcp_sink_client::

  [...]
  count:8185 ERROR: Likely SO_REUSEPORT failed errno(104) - connect: Connection reset by peer

In this run it took 8185 connections to provoke the race with the 3WHS. Given the 1000-connection limit before each restart, this means the service managed to restart 8 times without hitting the 3WHS race condition.

Reproducer02: for issue-2 described in comment #0

(In reply to Jesper Brouer from comment #0)
> Issue-2) When adding a listen socket to the pool (e.g. new app starting)
>
> Why adding an extra listen socket is problematic is harder to explain.
> When several listen sockets exist, the selection among them (in
> __inet_lookup_listener()) is done by seeding a pseudo-random number
> generator (next_pseudo_random32()) with a hash including saddr+sport
> (for the first matching socket), and then selecting the socket based
> on a modulo-like operation (reciprocal_scale()) where the "modulo" is
> the "matches" count.
>
> Thus, adding a listen socket can "shift" this selection, and result in
> the 3rd-ACK packet getting matched against a wrong request_sock list
> in a different listen_sock.

Starting more and more LISTEN sockets should also trigger the issue. Requires two shells.

Shell01: Create a loop that starts-and-backgrounds tcp_sink 100 times, delaying each start by 1 sec::

  i=0
  while (( i++ < 100 )); do ./tcp_sink --reuse --write & echo $i; sleep 1; done
  killall tcp_sink

Shell02: Start a tcp_sink_client doing many connection attempts::

  ./tcp_sink_client -c 20000000 127.0.0.1

This also causes the issue, but it is harder to trigger (when many LISTEN sockets exist). The failure from tcp_sink_client looks like::

  [...]
  count:209044 ERROR: Likely SO_REUSEPORT failed errno(104) - connect: Connection reset by peer

For this run, as can be seen, the connection count was much higher (209044) before hitting the 3WHS race. This was because I allowed approximately 20 TCP listen sockets to start, which reduced the probability of hitting a wrong listen socket.
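For readers unfamiliar with SO_REUSEPORT itself, the sketch below shows the basic listener setup that tools like tcp_sink perform: several processes running this can bind the same addr+port, and the kernel distributes incoming connections among them. This is a minimal illustrative sketch, not the actual tcp_sink source (see the github URL above for that); the port number and backlog are arbitrary choices.

    /* Minimal SO_REUSEPORT listener sketch (illustrative only). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    int main(void)
    {
            int one = 1;
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            if (fd < 0) { perror("socket"); exit(1); }

            /* SO_REUSEPORT lets multiple sockets bind the same
             * addr+port; the kernel spreads connections among them. */
            if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT,
                           &one, sizeof(one)) < 0) {
                    perror("setsockopt(SO_REUSEPORT)");
                    exit(1);
            }

            struct sockaddr_in addr;
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_addr.s_addr = htonl(INADDR_ANY);
            addr.sin_port = htons(6666);  /* arbitrary example port */

            if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                    perror("bind"); exit(1);
            }
            if (listen(fd, 1024) < 0) { perror("listen"); exit(1); }

            for (;;) {
                    int c = accept(fd, NULL, NULL);
                    if (c < 0) { perror("accept"); continue; }
                    close(c);  /* sink behaviour: accept and close */
            }
    }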
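The client side can be approximated in the same spirit: a tight connect() loop that flags ECONNRESET, which is how the 3WHS race surfaces in the reproducer output above. Again a sketch, not the actual tcp_sink_client source; note that depending on timing the reset may be reported by connect() itself or only by the first read/write on the new socket.

    /* Sketch of a connect() loop that detects the ECONNRESET failure
     * mode (illustrative, not the real tcp_sink_client). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
            struct sockaddr_in addr;
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(6666);  /* must match the listener */
            inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

            for (long count = 0; count < 20000000; count++) {
                    int fd = socket(AF_INET, SOCK_STREAM, 0);
                    if (fd < 0) { perror("socket"); exit(1); }
                    if (connect(fd, (struct sockaddr *)&addr,
                                sizeof(addr)) < 0) {
                            /* ECONNRESET here indicates the 3WHS race:
                             * the final ACK hit a listener without a
                             * matching request_sock and drew a RST. */
                            printf("count:%ld ERROR errno(%d) - connect: %s\n",
                                   count, errno, strerror(errno));
                            close(fd);
                            exit(1);
                    }
                    close(fd);
            }
            return 0;
    }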
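To make the issue-2 selection "shift" concrete, here is a small userspace model of the selection math quoted above. The two helpers mirror the kernel's next_pseudo_random32() and reciprocal_scale() as found in kernels of this era, but this is a simplified model of the __inet_lookup_listener() path, not kernel code, and the flow hash value is a made-up stand-in for the real saddr+sport hash. It demonstrates that for a fixed flow hash, the selected listener index changes as the "matches" count changes, which is exactly why adding or removing a listen socket can strand an in-flight 3rd-ACK on the wrong listen_sock.

    /* Simplified model of the reuseport listener selection. */
    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    /* Same linear-congruential step as the kernel's
     * next_pseudo_random32(). */
    static uint32_t next_pseudo_random32(uint32_t seed)
    {
            return seed * 1664525 + 1013904223;
    }

    /* Kernel's reciprocal_scale(): map a 32-bit value into
     * the range [0, ep_ro) without a division. */
    static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
    {
            return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
    }

    int main(void)
    {
            /* Stand-in for the hash over saddr+sport etc. */
            uint32_t flow_hash = 0xdeadbeef;

            /* The same flow lands on a different listener as the
             * number of matching listen sockets changes -- the core
             * of issue-2. */
            for (uint32_t matches = 1; matches <= 5; matches++) {
                    uint32_t idx = reciprocal_scale(
                            next_pseudo_random32(flow_hash), matches);
                    printf("matches=%" PRIu32 " -> listener index %" PRIu32 "\n",
                           matches, idx);
            }
            return 0;
    }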