Bug 605259

Summary: tcp: sending reset to the already closed socket
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Konstantin Khorenko <khorenko>
Assignee: Neil Horman <nhorman>
QA Contact: Dayong Tian <dtian>
CC: dtian, hjia, jolsa, nhorman, tgraf
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2011-01-13 21:38:09 UTC

Attachments:
  suggested fix
  captures (4 pcap files)

Description Konstantin Khorenko 2010-06-17 14:18:29 UTC
Created attachment 424837 [details]
suggested fix

Description of problem:
Some time ago I found some strange packets reported in logs.
Investigation showed that they are caused by a kernel bug:

  tcp_close() can be called for an already closed socket, but it still sends a reset in this case (tcp_send_active_reset()), which seems incorrect.
  Moreover, the reset packet is sent with a different source port, because the original port number has already been cleared on the socket.
  Besides that, incrementing the LINUX_MIB_TCPABORTONCLOSE counter also does not look correct in this case.

Version-Release number of selected component (if applicable):
Originally found on 2.6.18-164.15.1.el5, but it seems all kernels are affected, including current mainline.

How reproducible:
from time to time

Steps to Reproduce:
Originally it was reproduced by quickly clicking links in a browser, without waiting for the pages to finish loading.
The extra reset packets were sent from the server (where Apache hosted the site shown in the browser) with a "random" source port, but with the correct source IP.

Additional info:
How that happens:
1) The server receives a packet for a socket in TCP_CLOSE_WAIT state, which triggers tcp_reset():

Call Trace:
 <IRQ>  [<ffffffff8025b9b9>] tcp_reset+0x12f/0x1e8
 [<ffffffff80046125>] tcp_rcv_state_process+0x1c0/0xa08
 [<ffffffff8003eb22>] tcp_v4_do_rcv+0x310/0x37a
 [<ffffffff80028bea>] tcp_v4_rcv+0x74d/0xb43
 [<ffffffff8024ef4c>] ip_local_deliver_finish+0x0/0x259
 [<ffffffff80037131>] ip_local_deliver+0x200/0x2f4
 [<ffffffff8003843c>] ip_rcv+0x64c/0x69f
 [<ffffffff80021d89>] netif_receive_skb+0x4c4/0x4fa
 [<ffffffff80032eca>] process_backlog+0x90/0xec
 [<ffffffff8000cc50>] net_rx_action+0xbb/0x1f1
 [<ffffffff80012d3a>] __do_softirq+0xf5/0x1ce
 [<ffffffff8001147a>] handle_IRQ_event+0x56/0xb0
 [<ffffffff8006334c>] call_softirq+0x1c/0x28
 [<ffffffff80070476>] do_softirq+0x2c/0x85
 [<ffffffff80070441>] do_IRQ+0x149/0x152
 [<ffffffff80062665>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80008a2e>] __handle_mm_fault+0x6cd/0x1303
 [<ffffffff80008903>] __handle_mm_fault+0x5a2/0x1303
 [<ffffffff80033a9d>] cache_free_debugcheck+0x21f/0x22e
 [<ffffffff8006a263>] do_page_fault+0x49a/0x7dc
 [<ffffffff80066487>] thread_return+0x89/0x174
 [<ffffffff800c5aee>] audit_syscall_exit+0x341/0x35c
 [<ffffffff80062e39>] error_exit+0x0/0x84

tcp_rcv_state_process()
...  // (sk_state == TCP_CLOSE_WAIT here)
...
        /* step 2: check RST bit */
        if(th->rst) {
                tcp_reset(sk);
                goto discard;
        }
...
---------------------------------
tcp_rcv_state_process
 tcp_reset
  tcp_done
   tcp_set_state(sk, TCP_CLOSE);
     inet_put_port
      __inet_put_port
       inet_sk(sk)->num = 0;

   sk->sk_shutdown = SHUTDOWN_MASK;
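
The important detail in this chain is that inet_put_port() releases the local port binding and zeroes the cached port number. A simplified sketch of __inet_put_port() (approximate, based on kernels of that era; names and details vary between versions):

static void __inet_put_port(struct inet_hashinfo *hashinfo, struct sock *sk)
{
        /* Locate the bind-hash bucket for the socket's current local port. */
        const int bhash = inet_bhashfn(inet_sk(sk)->num, hashinfo->bhash_size);
        struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
        struct inet_bind_bucket *tb;

        spin_lock(&head->lock);
        tb = inet_csk(sk)->icsk_bind_hash;
        __sk_del_bind_node(sk);
        inet_csk(sk)->icsk_bind_hash = NULL;
        inet_sk(sk)->num = 0;           /* local port number is cleared here */
        inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
        spin_unlock(&head->lock);
}

With inet_sk(sk)->num now 0, the stage is set for step 2 below, where inet_autobind() picks a brand-new port.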


2) After that the process (the socket owner) tries to write something to that socket, and inet_autobind() sets a _new_ port number (different from the original!) for the socket; see the simplified excerpt of inet_autobind() after the trace below:

 Call Trace:
  [<ffffffff80255a12>] inet_bind_hash+0x33/0x5f
  [<ffffffff80257180>] inet_csk_get_port+0x216/0x268
  [<ffffffff8026bcc9>] inet_autobind+0x22/0x8f
  [<ffffffff80049140>] inet_sendmsg+0x27/0x57
  [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
  [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
  [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
  [<ffffffff8001fb49>] __pollwait+0x0/0xdd
  [<ffffffff8008d533>] default_wake_function+0x0/0xe
  [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
  [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
  [<ffffffff80066538>] thread_return+0x13a/0x174
  [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
  [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
  [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
  [<ffffffff800622dd>] tracesys+0xd5/0xe0
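
For reference, inet_autobind() binds only when the cached port number is zero, which is exactly the state the reset above left the socket in, so it happily allocates a fresh port. Simplified excerpt (approximate, net/ipv4/af_inet.c of that era):

static int inet_autobind(struct sock *sk)
{
        struct inet_sock *inet;

        /* We may need to bind the socket. */
        lock_sock(sk);
        inet = inet_sk(sk);
        if (!inet->num) {                               /* cleared earlier by __inet_put_port() */
                if (sk->sk_prot->get_port(sk, 0)) {     /* allocates a new local port */
                        release_sock(sk);
                        return -EAGAIN;
                }
                inet->sport = htons(inet->num);         /* new source port != original one */
        }
        release_sock(sk);
        return 0;
}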


3) sendmsg() finally fails with -EPIPE (so 'write' returns -EPIPE in userspace):

F: tcp_sendmsg1 -EPIPE: sk=ffff81000bda00d0, sport=49847, old_state=7, new_state=7, sk_err=0, sk_shutdown=3

Call Trace:
 [<ffffffff80027557>] tcp_sendmsg+0xcb/0xe87
 [<ffffffff80033300>] release_sock+0x10/0xae
 [<ffffffff8016f20f>] vgacon_cursor+0x0/0x1a7
 [<ffffffff8026bd32>] inet_autobind+0x8b/0x8f
 [<ffffffff8003a9d9>] do_sock_write+0xae/0xea
 [<ffffffff80226ac7>] sock_writev+0xdc/0xf6
 [<ffffffff800680c7>] _spin_lock_irqsave+0x9/0xe
 [<ffffffff8001fb49>] __pollwait+0x0/0xdd
 [<ffffffff8008d533>] default_wake_function+0x0/0xe
 [<ffffffff800a4f10>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800f0b49>] do_readv_writev+0x163/0x274
 [<ffffffff80066538>] thread_return+0x13a/0x174
 [<ffffffff800145d8>] tcp_poll+0x0/0x1c9
 [<ffffffff800c56d3>] audit_syscall_entry+0x180/0x1b3
 [<ffffffff800f0dd0>] sys_writev+0x49/0xe4
 [<ffffffff800622dd>] tracesys+0xd5/0xe0

tcp_sendmsg()
...
        /* Wait for a connection to finish. */
        if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
                int old_state = sk->sk_state;
                if ((err = sk_stream_wait_connect(sk, &timeo)) != 0) {
if (f_d && (err == -EPIPE)) {
        printk("F: tcp_sendmsg1 -EPIPE: sk=%p, sport=%u, old_state=%d, new_state=%d, "
                "sk_err=%d, sk_shutdown=%d\n",
                sk, ntohs(inet_sk(sk)->sport), old_state, sk->sk_state,
                sk->sk_err, sk->sk_shutdown);
        dump_stack();
}
                        goto out_err;
                }
        }
...
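
The -EPIPE itself comes from sk_stream_wait_connect(): the socket is already in TCP_CLOSE, which is neither SYN_SENT nor SYN_RECV, so the helper bails out immediately. Relevant fragment (approximate, net/core/stream.c):

int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
{
...
        do {
                int err = sock_error(sk);
                if (err)
                        return err;
                /* sk_state is TCP_CLOSE here, so this test matches ... */
                if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV))
                        return -EPIPE;          /* ... and the write fails with -EPIPE */
...
        } while (!done);
        return 0;
}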

4) Then the process (the socket owner) decides it is time to close that socket and does so, which triggers sending the reset packet; the guilty branch of tcp_close() is shown after the trace below:

Call Trace:
...
 [<ffffffff80032077>] dev_queue_xmit+0x343/0x3d6
 [<ffffffff80034698>] ip_output+0x351/0x384
 [<ffffffff80251ae9>] dst_output+0x0/0xe
 [<ffffffff80036ec6>] ip_queue_xmit+0x567/0x5d2
 [<ffffffff80095700>] vprintk+0x21/0x33
 [<ffffffff800070f0>] check_poison_obj+0x2e/0x206
 [<ffffffff80013587>] poison_obj+0x36/0x45
 [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
 [<ffffffff80023481>] dbg_redzone1+0x1c/0x25
 [<ffffffff8025dea6>] tcp_send_active_reset+0x15/0x14d
 [<ffffffff8000ca94>] cache_alloc_debugcheck_after+0x189/0x1c8
 [<ffffffff80023405>] tcp_transmit_skb+0x764/0x786
 [<ffffffff8025df8a>] tcp_send_active_reset+0xf9/0x14d
 [<ffffffff80258ff1>] tcp_close+0x39a/0x960
 [<ffffffff8026be12>] inet_release+0x69/0x80
 [<ffffffff80059b31>] sock_release+0x4f/0xcf
 [<ffffffff80059d4c>] sock_close+0x2c/0x30
 [<ffffffff800133c9>] __fput+0xac/0x197
 [<ffffffff800252bc>] filp_close+0x59/0x61
 [<ffffffff8001eff6>] sys_close+0x85/0xc7
 [<ffffffff800622dd>] tracesys+0xd5/0xe0
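
The reset is sent from the unread-data branch of tcp_close(), which never checks whether the socket has already been reset and closed. Abridged excerpt (approximate, net/ipv4/tcp.c of that era):

void tcp_close(struct sock *sk, long timeout)
{
        struct sk_buff *skb;
        int data_was_unread = 0;
...
        /* Throw away any data still sitting unread in the receive queue. */
        while ((skb = __skb_dequeue(&sk->sk_receive_queue)) != NULL) {
                data_was_unread += TCP_SKB_CB(skb)->end_seq - TCP_SKB_CB(skb)->seq;
                __kfree_skb(skb);
        }
...
        if (data_was_unread) {
                /* Unread data was tossed, zap the connection. */
                NET_INC_STATS_USER(LINUX_MIB_TCPABORTONCLOSE);
                tcp_set_state(sk, TCP_CLOSE);
                /* Sent even though the socket is already TCP_CLOSE and now
                 * carries a freshly autobound source port. */
                tcp_send_active_reset(sk, GFP_KERNEL);
        }
...
}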


So, in brief:
* a received packet for a socket in TCP_CLOSE_WAIT state triggers tcp_reset(), which clears inet_sk(sk)->num and puts the socket into TCP_CLOSE state
* an attempt to write to that socket forces inet_autobind() to get a new port (but the write itself fails with -EPIPE)
* tcp_close(), called for a socket in TCP_CLOSE state, sends an active reset from the socket with the newly allocated port


Please take a look at this; if nothing has been missed, the attached patch contains the fix.
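
The essence of the suggested fix (a rough sketch only; see the attachment for the actual patch) is an early bail-out in tcp_close() for sockets that were already reset, so no extra reset is sent and TCPABORTONCLOSE is not bumped again:

void tcp_close(struct sock *sk, long timeout)
{
...
        /* If the socket has already been reset (e.g. in tcp_reset()),
         * just kill it without sending another reset. */
        if (unlikely(sk->sk_state == TCP_CLOSE))
                goto adjudge_to_death;
...
}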

Thank you.

--
Best regards,

Konstantin Khorenko,
PVCfL/OpenVZ developer,
Parallels

Comment 1 Konstantin Khorenko 2010-06-28 10:18:50 UTC
Sent the same to [linux-netdev]:
http://kerneltrap.org/mailarchive/linux-netdev/2010/6/17/6279612

Comment 2 Neil Horman 2010-07-02 10:49:08 UTC
Looks like this patch got accepted upstream. I'll backport it shortly.

Comment 3 RHEL Program Management 2010-07-02 10:53:27 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Jarod Wilson 2010-08-11 00:12:54 UTC
in kernel-2.6.18-211.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 8 Dayong Tian 2010-12-08 09:57:18 UTC
I tried the following steps:

1. ran a program as a socket server on machineS (kernel 2.6.18-194.el5); the program never closed the accepted sockets from clients.
2. ran another program as a socket client on machineC, connected to machineS and closed the socket only on machineC; the connection on machineS was then in CLOSE_WAIT state:
tcp        1      0 10.16.65.240:33456          10.16.65.45:33457           CLOSE_WAIT  21346/sserver
3. ran the same program with the same source port on machineC to connect to machineS, but it failed on machineC with "bind failed: Address already in use".

I set tcp_tw_reuse and tcp_tw_recycle to 1 on machineC; it still took about 60 seconds before I could re-bind the same source port on machineC. But by that time the new connection from machineC seemed to reset the previous CLOSE_WAIT socket on machineS, and no issue appeared.

Do you have any good ideas on how to trigger the issue? Thanks!

Comment 9 Konstantin Khorenko 2010-12-08 11:35:43 UTC
I believe you can trigger it if you establish a connection, put it into CLOSE_WAIT state and then send a manually crafted packet from machineC using, for example, "hping".

However, I never tried this; as I already wrote, I hit this issue by clicking links in a normal browser without waiting for the pages to load completely.

Comment 10 Dayong Tian 2010-12-09 06:24:56 UTC
I think that's what I did; the strace log of the server process on machineS may make it clearer:

------------------------------------
Process 23759 attached - interrupt to quit
01:03:25.926427 accept(3, {sa_family=AF_INET, sin_port=htons(42583), sin_addr=inet_addr("10.16.65.45")}, [14605617132835176464]) = 4
01:03:34.830869 recvfrom(4, "hello server!\0", 20, 0, NULL, NULL) = 14
01:03:34.830971 fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(4, 64), ...}) = 0
01:03:34.831043 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B115200 opost isig -icanon -echo ...}) = 0
01:03:34.831106 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b511ef1f000
01:03:34.831169 write(1, "received from 10.16.65.45 at POR"..., 40) = 40 
01:03:34.831217 write(1, "get hello server!\n", 18) = 18
01:03:34.831259 accept(3, {sa_family=AF_INET, sin_port=htons(42583), sin_addr=inet_addr("10.16.65.45")}, [3543822943798689808]) = 5 
01:05:09.271998 recvfrom(5, "reset test\n", 20, 0, NULL, NULL) = 11
01:05:09.274530 write(1, "received from 10.16.65.45 at POR"..., 40) = 40
01:05:09.274579 write(1, "get reset test\n", 15) = 15
01:05:09.274620 write(1, "r!\n", 3)     = 3
01:05:09.274666 accept(3,  <unfinished ...>
Process 23759 detached
------------------------------------

After 01:03:53, the connection was always in CLOSE_WAIT state:

[root@intel-s3e8132-01 SPECS]# date;netstat -anlp|grep CLOSE
Thu Dec  9 01:03:53 EST 2010                                
tcp        1      0 10.16.65.240:33456          10.16.65.45:42583           CLOSE_WAIT  23759/sserver

Did I miss something?

Comment 11 Konstantin Khorenko 2010-12-09 08:21:26 UTC
1) I wonder how you will tell that you have triggered the issue. Suppose you have a reproducer: what message (or other evidence) do you expect in order to prove that the issue has really been triggered?
I believe you need at least to use tcpdump to catch that "incorrect" packet.

2) I believe that to trigger this situation you need to send a packet from machineC to machineS when the socket is already in TCP_CLOSE_WAIT state. Did you do that? Did you check (with tcpdump) that the packet really arrived?

Comment 12 Dayong Tian 2010-12-09 09:05:57 UTC
I assumed a call trace or some panic would appear when I triggered the issue. With tcpdump, the server sent R (reset) first when I sent a packet to the connection that was already in CLOSE_WAIT state; that was the same with and without the patch.

The packet was received successfully by machineS; I printed the data to the console, and you can also see that in the strace messages.

-------------------------
03:43:49.446848 IP 10.16.65.45.54623 > 10.16.65.240.33456: S 1979644304:1979644304(0) win 5840 <mss 1460,sackOK,timestamp 93645990 0,nop,wscale 7>
03:43:49.446933 IP 10.16.65.240.33456 > 10.16.65.45.54623: . ack 1879703935 win 46 <nop,nop,timestamp 4294964843 93565014>
03:43:49.447011 IP 10.16.65.45.54623 > 10.16.65.240.33456: R 1879703935:1879703935(0) win 0
03:43:52.447282 IP 10.16.65.45.54623 > 10.16.65.240.33456: S 1979644304:1979644304(0) win 5840 <mss 1460,sackOK,timestamp 93648990 0,nop,wscale 7>
03:43:52.447299 IP 10.16.65.240.33456 > 10.16.65.45.54623: S 1034965990:1034965990(0) ack 1979644305 win 5792 <mss 1460,sackOK,timestamp 548 93648990,nop,wscale 7>
03:43:52.447432 IP 10.16.65.45.54623 > 10.16.65.240.33456: . ack 1 win 46 <nop,nop,timestamp 93648990 548>
03:43:52.449927 IP 10.16.65.45.54623 > 10.16.65.240.33456: P 1:12(11) ack 1 win 46 <nop,nop,timestamp 93648993 548>
03:43:52.449930 IP 10.16.65.45.54623 > 10.16.65.240.33456: F 12:12(0) ack 1 win 46 <nop,nop,timestamp 93648993 548>
-------------------------

Comment 13 Konstantin Khorenko 2010-12-09 10:03:02 UTC
1) I do not see here the moment when the socket went into TCP_CLOSE_WAIT state
2) I do not see here an (incoming!) packet received for that socket after moment 1)
3) I do not see here the reply to the packet received in 2)

Comment 14 Dayong Tian 2010-12-15 07:52:01 UTC
Created attachment 468786 [details]
captures(4 pcap files)

Hi Konstantin, I captured packets on the receiver with both kernel 2.6.18-194.el5 and 2.6.18-236.el5. After I put the connection into CLOSE_WAIT state, I sent packets with the RST/FIN bits set from the sender. Could you take a look?

Comment 15 Konstantin Khorenko 2010-12-15 09:52:09 UTC
Hi Dayong, here is a log left from my experiments:

12:29:38.209159 IP 10.30.8.65.52583 > 10.0.14.191.4643: S 3519276605:3519276605(0) win 5840 <mss 1460,sackOK,timestamp 257707695 0,nop,wscale 7>
12:29:38.209182 IP 10.0.14.191.4643 > 10.30.8.65.52583: S 2608604519:2608604519(0) ack 3519276606 win 5792 <mss 1460,sackOK,timestamp 2237497 257707695,nop,wscale 5>
12:29:38.209996 IP 10.30.8.65.52583 > 10.0.14.191.4643: . ack 1 win 46 <nop,nop,timestamp 257707696 2237497>
12:29:38.210468 IP 10.30.8.65.52583 > 10.0.14.191.4643: P 1:177(176) ack 1 win 46 <nop,nop,timestamp 257707696 2237497>
12:29:38.210483 IP 10.0.14.191.4643 > 10.30.8.65.52583: . ack 177 win 215 <nop,nop,timestamp 2237498 257707696>
12:29:38.210765 IP 10.0.14.191.4643 > 10.30.8.65.52583: P 1:139(138) ack 177 win 215 <nop,nop,timestamp 2237498 257707696>
12:29:38.211997 IP 10.30.8.65.52583 > 10.0.14.191.4643: . ack 139 win 54 <nop,nop,timestamp 257707698 2237498>
12:29:38.217393 IP 10.30.8.65.52583 > 10.0.14.191.4643: P 177:945(768) ack 139 win 54 <nop,nop,timestamp 257707703 2237498>
12:29:38.257420 IP 10.0.14.191.4643 > 10.30.8.65.52583: . ack 945 win 263 <nop,nop,timestamp 2237545 257707703>
12:29:38.491389 IP 10.30.8.65.52583 > 10.0.14.191.4643: P 945:982(37) ack 139 win 54 <nop,nop,timestamp 257707977 2237545>
12:29:38.491438 IP 10.0.14.191.4643 > 10.30.8.65.52583: . ack 982 win 263 <nop,nop,timestamp 2237779 257707977>
12:29:38.491403 IP 10.30.8.65.52583 > 10.0.14.191.4643: F 982:982(0) ack 139 win 54 <nop,nop,timestamp 257707977 2237545>
12:29:38.531428 IP 10.0.14.191.4643 > 10.30.8.65.52583: . ack 983 win 263 <nop,nop,timestamp 2237819 257707977>
12:29:38.538476 IP 10.0.14.191.4643 > 10.30.8.65.52583: . 139:1587(1448) ack 983 win 263 <nop,nop,timestamp 2237826 257707977>
12:29:38.538529 IP 10.0.14.191.4643 > 10.30.8.65.52583: . 1587:3035(1448) ack 983 win 263 <nop,nop,timestamp 2237826 257707977>
12:29:38.538552 IP 10.0.14.191.4643 > 10.30.8.65.52583: P 3035:4483(1448) ack 983 win 263 <nop,nop,timestamp 2237826 257707977>
12:29:38.539912 IP 10.30.8.65.52583 > 10.0.14.191.4643: R 3519277588:3519277588(0) win 0
12:29:38.539941 IP 10.30.8.65.52583 > 10.0.14.191.4643: R 3519277588:3519277588(0) win 0
12:29:38.539957 IP 10.30.8.65.52583 > 10.0.14.191.4643: R 3519277588:3519277588(0) win 0
12:29:38.540575 IP 10.0.14.191.52851 > 10.30.8.65.52583: R 2608609002:2608609002(0) ack 3519277588 win 263 <nop,nop,timestamp 2237828 257707977>

Comment 16 Dayong Tian 2010-12-15 10:39:18 UTC
Thank you, Konstantin.
In your log the receiver sent three segments with the same "ack 983" and got RSTs back from the sender, but I just used "hping" to manually send packets with the RST bit from the sender; it seems my approach cannot easily reach that situation.
Would you mind sharing your reproducer?

Comment 17 Konstantin Khorenko 2010-12-15 11:07:09 UTC
Dayong, unfortunately I cannot share the reproducer because it is not "clean": I ran the experiments on a Parallels Virtuozzo Containers node (with a PVC kernel, which is based on the RHEL kernel but still different), and I was able to reproduce the issue by quickly clicking links in a browser while connected to the Parallels Power Panel.

That's why I believe this "reproducer" cannot be used to verify the bug on the Red Hat kernel.

Comment 18 Dayong Tian 2010-12-20 01:00:34 UTC
I tried many ways but couldn't trigger the issue.
Konstantin has tested the patch, and I did sanity checks with kernel 2.6.18-236.el5: the patch is included and applied in kernel 2.6.18-236.el5, which also passed the related function tests:

[root@intel-s3e8132-01 SPECS]# grep "605259" kernel-2.6.spec
- [net] ip4v/tcp: no additional reset on closed sockets (Neil Horman) [605259]

[root@intel-s3e8132-01 SPECS]# grep -i "Patch25404" kernel-2.6.spec
Patch25404: linux-2.6-net-ip4v-tcp-no-additional-reset-on-closed-sockets.patch
%patch25404 -p1

Comment 20 errata-xmlrpc 2011-01-13 21:38:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html