Bug 1203000 - RFE: Seamless server restart, modify SO_REUSEPORT
Summary: RFE: Seamless server restart, modify SO_REUSEPORT
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.3
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Jesper Brouer
QA Contact: xmu
URL:
Whiteboard:
Duplicates: 1030735
Depends On: 1151756
Blocks:
 
Reported: 2015-03-17 21:16 UTC by Jesper Brouer
Modified: 2016-08-26 11:33 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Cause: Consequence: Fix: Result:
Clone Of:
Environment:
Last Closed: 2016-08-26 11:33:37 UTC
Target Upstream Version:
Embargoed:



Description Jesper Brouer 2015-03-17 21:16:53 UTC
Description of problem:

 Several customers have use-cases that call for seamless restart of server services.  This is not currently supported in upstream kernels.

The socket feature SO_REUSEPORT is often misinterpreted as supporting seamless restart.  This could be because (a) some of the BSD implementations support this, or (b) with casual testing it might look like it works.

The focus of this bugzilla is to modify SO_REUSEPORT and the TCP/IP stack to support seamless restart.


What is the technical problem:

The problem only exists during the TCP three-way handshake (3WHS), and originates from where we store the request sockets.  When a SYN packet is received, a request_sock is created, which is later needed to match the 3rd-ACK packet of the 3WHS.  These request_socks are stored in the given listen socket.

 Thus, as SO_REUSEPORT allows creating several LISTEN sockets, we now have several lists of request_socks, one per listen_sock.  This creates two possible issues.
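
Before going into the two issues, here is a minimal, illustrative userspace sketch of the setup (an example written for this bugzilla, not one of the test tools referenced later; the port number 6666 is an arbitrary choice).  Each process running it gets its own LISTEN socket on the same addr+port, and hence its own request_sock list during the 3WHS::

  /* Minimal illustration only (not one of the bug's test tools): each
   * process that runs this gets its own LISTEN socket on the same
   * addr+port, and hence its own request_sock list during the 3WHS.
   * Port 6666 is an arbitrary choice for this sketch. */
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  #ifndef SO_REUSEPORT
  #define SO_REUSEPORT 15  /* value from <asm-generic/socket.h>; older glibc headers may lack it */
  #endif

  int main(void)
  {
      struct sockaddr_in addr;
      int one = 1;
      int fd = socket(AF_INET, SOCK_STREAM, 0);

      if (fd < 0) { perror("socket"); exit(1); }

      /* SO_REUSEPORT allows several sockets to bind the same addr+port */
      if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) < 0) {
          perror("setsockopt(SO_REUSEPORT)");
          exit(1);
      }

      memset(&addr, 0, sizeof(addr));
      addr.sin_family      = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port        = htons(6666);

      if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); exit(1); }
      if (listen(fd, 128) < 0) { perror("listen"); exit(1); }

      pause();    /* keep the listener open until killed */
      close(fd);
      return 0;
  }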

Issue-1) When removing a listen socket (e.g. app closing).

 Then the in-flight 3rd-ACK packets cannot look up the corresponding request_sock.  The 3rd-ACK will actually find another listen_sock, and try to look up the request_sock there.  When this fails it will send back a RST, resulting in the connect() call failing with "Connection reset by peer" (errno 104 / ECONNRESET).
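
As an illustration of what the client observes, a minimal client sketch could look like the following (again an example for this bugzilla, not the real tcp_sink_client referenced later; address and port match the listener sketch above)::

  /* Illustration only (not the real tcp_sink_client): an in-flight 3WHS
   * that raced with the closing listener is reported to the client as a
   * failed connect() with errno 104 / ECONNRESET. */
  #include <arpa/inet.h>
  #include <errno.h>
  #include <netinet/in.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
      struct sockaddr_in addr;
      int fd = socket(AF_INET, SOCK_STREAM, 0);

      if (fd < 0) { perror("socket"); return 1; }

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_port   = htons(6666);
      inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);

      if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          if (errno == ECONNRESET)   /* the RST described above */
              fprintf(stderr, "connect: Connection reset by peer - 3WHS race suspected\n");
          else
              perror("connect");
      }

      close(fd);
      return 0;
  }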

Issue-2) When adding a listen socket to the pool (e.g. new app starting)

 Why adding an extra listen socket is problematic is harder to explain.  When several listen sockets exist, the selection among the listen sockets (in __inet_lookup_listener()) is done by seeding a pseudo-random number generator (next_pseudo_random32()) with a hash including saddr+sport (computed at the first matching socket), and then selecting a socket via a modulo-like operation (reciprocal_scale()) where the "modulo" is the "matches" count.

 Thus, adding a listen socket can "shift" this selection, and result in the 3rd-ACK packet getting matched against the wrong request_sock list in a different listen_sock.
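
To make the "shift" concrete, below is a small standalone userspace model of the equal-score selection path described above.  This is a paraphrase for illustration only, not kernel code: reciprocal_scale() and next_pseudo_random32() reimplement the arithmetic of the kernel helpers of the same names, while select_listener() and the synthetic flow hashes are inventions of this sketch::

  /* Standalone model of the listener selection described above; a
   * paraphrase for illustration, not kernel code.  It only models the
   * case where all listen sockets match equally well (the SO_REUSEPORT
   * case), mimicking the "matches" counting in __inet_lookup_listener(). */
  #include <stdint.h>
  #include <stdio.h>

  /* Same arithmetic as the kernel helper reciprocal_scale(): map val into [0, ep_ro) */
  static uint32_t reciprocal_scale(uint32_t val, uint32_t ep_ro)
  {
      return (uint32_t)(((uint64_t)val * ep_ro) >> 32);
  }

  /* Same arithmetic as the kernel helper next_pseudo_random32() */
  static uint32_t next_pseudo_random32(uint32_t seed)
  {
      return seed * 1664525u + 1013904223u;
  }

  /* Walk the equally-matching listeners and pick one based on the flow hash */
  static int select_listener(uint32_t phash, int nr_listeners)
  {
      int result = 0;     /* first matching listener, matches = 1 */
      int matches = 1;

      for (int i = 1; i < nr_listeners; i++) {
          matches++;
          if (reciprocal_scale(phash, matches) == 0)
              result = i;
          phash = next_pseudo_random32(phash);
      }
      return result;
  }

  int main(void)
  {
      int moved = 0, total = 100000;

      /* Count how many of `total` synthetic flow hashes pick a different
       * listener once a 4th listen socket joins a pool of 3: those are the
       * flows whose in-flight 3WHS would hit the wrong request_sock list. */
      for (uint32_t h = 1; h <= (uint32_t)total; h++) {
          uint32_t phash = h * 2654435761u;   /* synthetic flow hash */

          if (select_listener(phash, 3) != select_listener(phash, 4))
              moved++;
      }
      printf("%d of %d flows moved to a different listener\n", moved, total);
      return 0;
  }

In this model roughly one in four flows ends up on a different listener when going from 3 to 4 listen sockets, matching the intuition that the extra iteration re-selects with probability 1/matches.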


Version-Release number of selected component (if applicable):

 SO_REUSEPORT has been backported to RHEL6 in kernel-2.6.32-417.el6 and later; see bug 991600 and bug 1030735.

 SO_REUSEPORT is already part of RHEL7, as it was introduced upstream in kernel 3.9.

Comment 3 Jesper Brouer 2015-03-17 21:29:41 UTC
*** Bug 1030735 has been marked as a duplicate of this bug. ***

Comment 4 Jesper Brouer 2015-03-17 21:52:47 UTC
The reproducer described in bug 1030735 is fairly complicated, and involves 2x haproxy, 2x nodejs and apache-bench (ab).

While troubleshooting (bug 1030735) I developed two testing tools, tcp_sink [1] and tcp_sink_client [2], which made it easier for me to reproduce the issue.

These tools are available in the GitHub repository:
 https://github.com/netoptimizer/network-testing/

tcp_sink
 [1] https://github.com/netoptimizer/network-testing/blob/master/src/tcp_sink.c

tcp_sink_client
 [2] https://github.com/netoptimizer/network-testing/blob/master/src/tcp_sink_client.c

Comment 5 Jesper Brouer 2015-03-17 22:05:29 UTC
Reproducer01: for issue-1 described in comment #0

(In reply to Jesper Brouer from comment #0)
> Issue-1) When removing a listen socket (e.g. app closing).
> 
>  Then the in-flight 3rd-ACK packets cannot lookup the corresponding
> request_sock.  The 3rd-ACK will actually find another listen_sock, and try
> to lookup the request_sock.  When failing it will send back a  RST,
> resulting in the connect() call failing with "Connection reset  by peer"
> (errno 104/ ECONNRESET).

Requires three shells.

Shell01: Start tcp_sink with high connection count limit ::

  ./tcp_sink --reuse --count 20000000

Shell02: Create a loop that will restart tcp_sink, and limit tcp_sink
to accept 1000 connections.

  i=0; while (( i++ < 1000 )); do  ./tcp_sink --reuse -c 1000; done

Shell03: Start a tcp_sink_client doing many conn attempts

  ./tcp_sink_client -c 20000000 127.0.0.1

Notice the failure from tcp_sink_client:

 [...]
 count:8185
 ERROR: Likely SO_REUSEPORT failed errno(104) - connect: Connection reset by peer

In this run it took 8185 connections before provoking the race with the 3WHS.  Given the 1000-connection limit before restart, this means that the service managed to restart 8 times without hitting the 3WHS race-condition.

Comment 6 Jesper Brouer 2015-03-17 22:28:18 UTC
Reproducer02: for issue-2 described in comment #0

(In reply to Jesper Brouer from comment #0)
> Issue-2) When adding a listen socket to the pool (e.g. new app starting)
> 
>  Why adding an extra listen socket is problematic is harder to  explain. 
> When several listen sockets exists, the selection among the  listen sockets
> (in __inet_lookup_listener()) is done via seeding a  pseudo random number
> generator (next_pseudo_random32()) with a hash  including saddr+sport (for
> the first matching socket), and then selecting the socket based on a modulo
> like functionality  (reciprocal_scale()) where the "modulo" is the "matches"
> count.
> 
>  Thus, adding a listen socket can "shift" this selection, and result  in the
> 3rd-ACK packet getting matched against a wrong request_sock  list in a
> different listen_sock.

Starting more and more LISTEN sockets should also cause the issue.

Requires two shells.

Shell01: Create a loop that will start-and-background tcp_sink 100 times, delaying the start of each by 1 sec.

  i=0
  while (( i++ < 100 )); do ./tcp_sink --reuse --write & echo $i; sleep 1; done
  killall tcp_sink

Shell02: Start a tcp_sink_client doing many conn attempts

 ./tcp_sink_client -c 20000000 127.0.0.1

This also causes the issue, but it is harder to trigger (when many LISTEN sockets exist).

The failure from tcp_sink_client looks like::

 [...]
 count:209044
 ERROR: Likely SO_REUSEPORT failed errno(104) - connect: Connection reset by peer

For this run, as can be seen, the connection count was much higher (209044) before hitting the 3WHS race.  This was because I allowed it to start approx. 20 TCP listen sockets, which reduced the probability of hitting a wrong listen socket.

