Created attachment 480919 [details] Patch Description of problem: NFS client can read corrupt data after disconnecting from NFS server and re-connecting to it. Version-Release number of selected component (if applicable): 2.6.18-194.32.1.el5. Steps to Reproduce: 1. Stop NFS server while NFS client reads a large file. 2. After 7 minutes, start NFS server again. 3. NFS client will re-connect to the server and continue to read. 4. After 5 minutes, NFS client can read corrupt data from time to time. Additional info: After disconnecting from NFS server, NFS client tries to connect to the server. (xprt_connect() -> xs_connect() -> xs_tcp_connect_worker()) If the server is not ready, the client delays next connect work. Initial delay time is 3 secs(XS_TCP_INIT_REEST_TO@net/sunrpc/xprtsock.c) and next is 6s and next is 12s... 24s... 48s... Maximum delay time is 5 mins(XS_TCP_MAX_REEST_TO). Meanwhile the rpc task calling xprt_connect() will time out in 1 minute(XS_TCP_CONN_TO). After delay time reaches 5 minutes, delayed connect work and task's time out will occur at the same time. ----- Time : +00:00:00 : NFS client try to connect the server which was stopped. (rpciod) -> xprt_connect() -> xs_connect() -> queue_delayed_work( work.fn=xs_tcp_connect_worker, delay=3 secs ) +00:00:03 : (rpciod) schedule delayed connect work -> xs_tcp_connect_worker() -> xs_tcp_reuse_connection() -> kernel_connect() return -EINPROGRESS (from tcp) -> xs_tcp_state_change() sk_state = TCP_CLOSE -> xprt_wake_pending_tasks() (rpciod) retry to connect -> xprt_connect() -> xs_connect() -> queue_delayed_work( xs_tcp_connect_worker, 6 secs ) +00:00:09 : (rpciod) schedule delayed connect work -> xs_tcp_connect_worker() -> xs_tcp_reuse_connection() -> kernel_connect() return -EINPROGRESS (from tcp) -> xs_tcp_state_change() sk_state = TCP_CLOSE -> xprt_wake_pending_tasks() (rpciod) retry to connect -> xprt_connect() -> xs_connect() -> queue_delayed_work( xs_tcp_connect_worker, 12 secs ) <<snip>> +00:01:33 : (rpciod) schedule delayed connect work -> xs_tcp_connect_worker() -> xs_tcp_reuse_connection() -> kernel_connect() return -EINPROGRESS (from tcp) -> xs_tcp_state_change() sk_state = TCP_CLOSE -> xprt_wake_pending_tasks() (rpciod) retry to connect -> xprt_connect() -> xs_connect() -> queue_delayed_work( xs_tcp_connect_worker, 96 secs ) +00:02:33 : (from timer) -> __rpc_default_timer() == rpc task time out -> rpc_wake_up_task() (rpciod) retry to connect -> xprt_connect() -> xs_connect() xprt is "connecting" state, do nothing. +00:03:09 : (rpciod) schedule delayed connect work -> xs_tcp_connect_worker() -> xs_tcp_reuse_connection() -> kernel_connect() return -EINPROGRESS (from tcp) -> xs_tcp_state_change() sk_state = TCP_CLOSE -> xprt_wake_pending_tasks() (rpciod) retry to connect -> xprt_connect() -> xs_connect() -> queue_delayed_work( xs_tcp_connect_worker, 192 secs ) <<snip>> +00:06:21 : (rpciod) schedule delayed connect work -> xs_tcp_connect_worker() -> xs_tcp_reuse_connection() -> kernel_connect() return -EINPROGRESS (from tcp) -> xs_tcp_state_change() sk_state = TCP_CLOSE -> xprt_wake_pending_tasks() (rpciod) retry to connect -> xprt_connect() -> xs_connect() -> queue_delayed_work( xs_tcp_connect_worker, 300 secs ) *** Here, NFS server is ready again **** +00:07:21 : (from timer) -> __rpc_default_timer() == rpc task time out -> rpc_wake_up_task() (rpciod) retry to connect -> xprt_connect() -> xs_connect() xprt is "connecting" state, do nothing. +00:08:21 : (from timer) -> __rpc_default_timer() == rpc task time out -> rpc_wake_up_task() (rpciod) retry to connect -> xprt_connect() -> xs_connect() xprt is "connecting" state, do nothing. +00:09:21 : (from timer) -> __rpc_default_timer() == rpc task time out -> rpc_wake_up_task() (rpciod) retry to connect -> xprt_connect() -> xs_connect() xprt is "connecting" state, do nothing. +00:10:21 : (from timer) -> __rpc_default_timer() == rpc task time out -> rpc_wake_up_task() (rpciod) retry to connect -> xprt_connect() -> xs_connect() xprt is "connecting" state, do nothing. +00:11:21 : RPC Task time out and delayed connect work at the same time (rpciod) schedule delayed connect work -> xs_tcp_connect_worker() -> xs_tcp_reuse_connection() -> kernel_connect() return -EINPROGRESS * xs_tcp_connect_worker() clears xprt's "connecting" state. (from timer) -> __rpc_default_timer() == rpc task time out -> rpc_wake_up_task() (rpciod) retry to connect -> xprt_connect() -> xs_connect() -> queue_delayed_work( xs_tcp_connect_worker, 300 secs ) (from tcp) -> xs_tcp_state_change() sk_state = TCP_ESTABLISHED -> xprt_wake_pending_tasks() Now, TCP connection to the server is ready. But xs_tcp_connect_worker still in delayed work queue. ( If xs_tcp_state_change() called before xprt_connect(), xprt_connect() does not queue connect work. So problem will not occur. ) +00:16:21 : (rpciod) schedule delayed connect work -> xs_tcp_connect_worker() -> xs_tcp_reuse_connection() This disconnects active TCP session (sending RST to the server). But xprt state is still "connected". -> kernel_connect() return -EINPROGRESS (from tcp) -> xs_tcp_state_change() sk_state = TCP_ESTABLISHED Here, TCP record must be reset. But xprt do not so because xprt state is already "connected". Then, newer reply from the NFS server will be concatenated to old TCP stream. This cause read data corruption. "nfs server not responding" may also occur because xprt misses record marker. -> xprt_wake_pending_tasks() ----- I attached a patch to avoid race condition between task time out and delayed connect work.
Created attachment 481331 [details] Patch2 Previous patch I attached is not enough to avoid race, because it can take over 15 seconds from kernel_connect() to xs_tcp_state_change(). So I attached new one.
Nice detective work! The explanation looks plausible, but I need to go through it again to be sure (this code is pretty complex). I took a quick look but it's not immediately obvious so I'll ask: Is this also a problem upstream and/or in RHEL6?
I tested RHEL6 additionally. No problem. In RHEL6, even if the connect worker is in delayed work queue, active TCP connection will not be disconnected by xs_tcp_reuse_connection(). xs_tcp_reuse_connection() does not call kernel_connect() with AF_UNSPEC when sock state is ESTABLISHED. -- RHEL6 -- static void xs_tcp_reuse_connection(struct rpc_xprt *xprt, struct sock_xprt *transport) { unsigned int state = transport->inet->sk_state; if (state == TCP_CLOSE && transport->sock->state == SS_UNCONNECTED) return; if ((1 << state) & (TCPF_ESTABLISHED|TCPF_SYN_SENT)) return; xs_abort_connection(xprt, transport); } -- -- RHEL5 -- static void xs_tcp_reuse_connection(struct rpc_xprt *xprt) { int result; struct socket *sock = xprt->sock; struct sockaddr any; dprintk("RPC: disconnecting xprt %p to reuse port\n", xprt); /* * Disconnect the transport socket by doing a connect operation * with AF_UNSPEC. This should return immediately... */ memset(&any, 0, sizeof(any)); any.sa_family = AF_UNSPEC; result = kernel_connect(sock, &any, sizeof(any), 0); if (result) dprintk("RPC: AF_UNSPEC connect return code %d\n", result); } --
Would we better suited then to backport that change to RHEL5? Namely: commit 40d2549db5f515e415894def98b49db7d4c56714 Author: Trond Myklebust <Trond.Myklebust> Date: Wed Mar 11 14:37:58 2009 -0400 SUNRPC: Don't disconnect if a connection is still in progress. Signed-off-by: Trond Myklebust <Trond.Myklebust>
Yes. Thank you.
OK, I've backported the patch and put it into the test kernels on my people.redhat.com page: http://people.redhat.com/jlayton/ ...do you have a way to reproduce this reliably? If so can you test this kernel and see if it fixes the issue?
I can often reproduce this, but not 100%. I want to test with systemtap to check the trace. Could you up kernel-debuginfo and kernel-debuginfo-common, too?
(In reply to comment #8) > I can often reproduce this, but not 100%. > I want to test with systemtap to check the trace. > > Could you up kernel-debuginfo and kernel-debuginfo-common, too? Sadly, no. I don't have enough quota on people.redhat.com to make the debuginfo available. If you have a support case open for this, I can try to get them to you via our support channels.
Created attachment 483388 [details] systemtap script Then, could you compile my systemtap script for your i686 test kernel and up the .ko module?
Rather than that, I just removed the xen test kernels and put the i686 debuginfo up there. Let me know if you need other packages.
I confirm the issue is fixed and the new kernel works fine! Thank you for your work.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Patch(es) available in kernel-2.6.18-250.el5 Detailed testing feedback is always welcomed.
The bug is reproduced in 2.6.18-194.32.1.el5 and verified in 2.6.18-261.el5. Test step use instruction of Ichiko's Description and his systemtap script. Test result is as follows: 1. reproducer [******NFS-SERVER******] [root@hp-ml370g6-01 test]# exportfs /mnt/test <world> [root@hp-ml370g6-01 test]# ll total 51260 -rw-r--r-- 1 root root 52428800 May 19 00:40 b [[[[[[[[[CLIENT]]]]]]]]]] [root@ibm-ls22-01 ~]# uname -a Linux ibm-ls22-01.rhts.eng.nay.redhat.com 2.6.18-194.32.1.el5 #1 SMP Mon Dec 20 10:52:42 EST 2010 x86_64 x86_64 x86_64 GNU/Linux [root@ibm-ls22-01 ~]# mount | grep /mnt/test hp-ml370g6-01.rhts.eng.bos.redhat.com:/mnt/test on /mnt/test type nfs (rw,addr=10.16.66.124) [root@ibm-ls22-01 ~]# cp /mnt/test/b ./ [******NFS-SERVER******] [root@hp-ml370g6-01 test]# service nfs stop; sleep 420; service nfs start [[[[[[[[[CLIENT]]]]]]]]]] cp command got stuck Output of stap script: ====start Thu May 19 04:41:20 2011 --> xs_tcp_state_change() sock=0xffff81021faef380 sk_state=CLOSE Thu May 19 04:41:20 2011 <-- xs_tcp_state_change() Thu May 19 04:41:20 2011 --> xprt_connect() tk_pid=1691 xprt=0xffff81022d7afc00 (not connected) Thu May 19 04:41:20 2011 --> xs_connect() tk_pid=1691 socket=0xffff81022cef2840 (not connecting, queue delayed work) reestablish_timeout=3000 ====snip 3second-timeout (rpciod) Thu May 19 04:41:23 2011 --> xs_tcp_connect_worker() xprt=0xffff81022d7afc00 socket=0xffff81022cef2840 sk_state=CLOSE Thu May 19 04:41:23 2011 --> xs_tcp_reuse_connection() Thu May 19 04:41:23 2011 --> kernel_connect() socket=0xffff81022cef2840 sock=0xffff81021faef380 family=AF_UNSPEC Thu May 19 04:41:23 2011 <-- kernel_connect() return=0 Thu May 19 04:41:23 2011 --> kernel_connect() socket=0xffff81022cef2840 sock=0xffff81021faef380 family=AF_INET Thu May 19 04:41:23 2011 <-- kernel_connect() return=-EINPROGRESS Thu May 19 04:41:23 2011 <-- xs_tcp_connect_worker() Thu May 19 04:41:23 2011 --> xs_tcp_state_change() sock=0xffff81021faef380 sk_state=CLOSE Thu May 19 04:41:23 2011 <-- xs_tcp_state_change() Thu May 19 04:41:23 2011 --> xprt_connect() tk_pid=1691 xprt=0xffff81022d7afc00 (not connected) Thu May 19 04:41:23 2011 --> xs_connect() tk_pid=1691 socket=0xffff81022cef2840 (not connecting, queue delayed work) reestablish_timeout=6000 Thu May 19 04:41:23 2011 <-- xs_connect() tk_pid=1691 Thu May 19 04:41:23 2011 <-- xprt_connect() tk_pid=1691 ====snip 1second-timeout (timer on xprt->pending) Thu May 19 04:51:43 2011 --> __rpc_default_timer(): tk_pid=1691 Timed out Thu May 19 04:51:43 2011 <-- __rpc_default_timer() Thu May 19 04:51:43 2011 --> xprt_connect() tk_pid=1691 xprt=0xffff81022d7afc00 (not connected) Thu May 19 04:51:43 2011 --> xs_connect() tk_pid=1691 socket=0xffff81022cef2840 (connecting, do nothing) Thu May 19 04:51:43 2011 <-- xs_connect() tk_pid=1691 Thu May 19 04:51:43 2011 <-- xprt_connect() tk_pid=1691 ====5minute-timeout (rpciod) 04:41:20 + 0:11:21 Thu May 19 04:52:43 2011 --> __rpc_default_timer(): tk_pid=1691 Timed out Thu May 19 04:52:43 2011 <-- __rpc_default_timer() Thu May 19 04:52:43 2011 --> xs_tcp_connect_worker() xprt=0xffff81022d7afc00 socket=0xffff81022cef2840 sk_state=CLOSE Thu May 19 04:52:43 2011 --> xs_tcp_reuse_connection() Thu May 19 04:52:43 2011 --> kernel_connect() socket=0xffff81022cef2840 sock=0xffff81021faef380 family=AF_UNSPEC Thu May 19 04:52:43 2011 <-- kernel_connect() return=0 Thu May 19 04:52:43 2011 --> kernel_connect() socket=0xffff81022cef2840 sock=0xffff81021faef380 family=AF_INET Thu May 19 04:52:43 2011 <-- kernel_connect() return=-EINPROGRESS Thu May 19 04:52:43 2011 <-- xs_tcp_connect_worker() Thu May 19 04:52:43 2011 --> xprt_connect() tk_pid=1691 xprt=0xffff81022d7afc00 (not connected) Thu May 19 04:52:43 2011 --> xs_connect() tk_pid=1691 socket=0xffff81022cef2840 (not connecting, queue delayed work) reestablish_timeout=300000 Thu May 19 04:52:43 2011 <-- xs_connect() tk_pid=1691 Thu May 19 04:52:43 2011 <-- xprt_connect() tk_pid=1691 Thu May 19 04:52:43 2011 --> xs_tcp_state_change() sock=0xffff81021faef380 sk_state=ESTABLISHED Thu May 19 04:52:43 2011 <-- xs_tcp_state_change() ====5-minute-timeout (rpciod) 04:52:43+0:5:0 timer disabled, but rpciod's delayed work still exist Thu May 19 04:57:43 2011 --> xs_tcp_connect_worker() xprt=0xffff81022d7afc00 socket=0xffff81022cef2840 sk_state=ESTABLISHED Thu May 19 04:57:43 2011 --> xs_tcp_reuse_connection() Thu May 19 04:57:43 2011 --> kernel_connect() socket=0xffff81022cef2840 sock=0xffff81021faef380 family=AF_UNSPEC Thu May 19 04:57:43 2011 <-- kernel_connect() return=0 Thu May 19 04:57:43 2011 --> kernel_connect() socket=0xffff81022cef2840 sock=0xffff81021faef380 family=AF_INET Thu May 19 04:57:43 2011 <-- kernel_connect() return=-EINPROGRESS Thu May 19 04:57:43 2011 <-- xs_tcp_connect_worker() Thu May 19 04:57:46 2011 --> xs_tcp_state_change() sock=0xffff81021faef380 sk_state=ESTABLISHED Thu May 19 04:57:46 2011 <-- xs_tcp_state_change() 2. Verify test step is same as producer, and 'cp' command successes. [root@hp-ml370g6-01 ~]# uname -a Linux hp-ml370g6-01.rhts.eng.bos.redhat.com 2.6.18-261.el5 #1 SMP Thu May 12 16:47:18 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux Output of stap script: ====start Thu May 19 05:55:52 2011 --> xs_tcp_state_change() sock=0xffff810127145340 sk_state=CLOSE Thu May 19 05:55:52 2011 <-- xs_tcp_state_change() Thu May 19 05:55:52 2011 --> xprt_connect() tk_pid=455 xprt=0xffff810132318400 (not connected) Thu May 19 05:55:52 2011 --> xs_connect() tk_pid=455 socket=0xffff8101325fcd00 (not connecting, queue delayed work) reestablish_timeout=3000 ====snip 3-second-timeout (rpciod) Thu May 19 05:55:55 2011 --> xs_tcp_connect_worker() xprt=0xffff810132318400 socket=0xffff8101325fcd00 sk_state=CLOSE Thu May 19 05:55:55 2011 --> xs_tcp_reuse_connection() Thu May 19 05:55:55 2011 --> kernel_connect() socket=0xffff8101325fcd00 sock=0xffff810127145340 family=AF_UNSPEC Thu May 19 05:55:55 2011 <-- kernel_connect() return=0 Thu May 19 05:55:55 2011 --> kernel_connect() socket=0xffff8101325fcd00 sock=0xffff810127145340 family=AF_INET Thu May 19 05:55:55 2011 <-- kernel_connect() return=-EINPROGRESS Thu May 19 05:55:55 2011 <-- xs_tcp_connect_worker() Thu May 19 05:55:55 2011 --> xs_tcp_state_change() sock=0xffff810127145340 sk_state=CLOSE Thu May 19 05:55:55 2011 <-- xs_tcp_state_change() Thu May 19 05:55:55 2011 --> xprt_connect() tk_pid=455 xprt=0xffff810132318400 (not connected) Thu May 19 05:55:55 2011 --> xs_connect() tk_pid=455 socket=0xffff8101325fcd00 (not connecting, queue delayed work) reestablish_timeout=6000 Thu May 19 05:55:55 2011 <-- xs_connect() tk_pid=455 Thu May 19 05:55:55 2011 <-- xprt_connect() tk_pid=455 ====snip 1-minute-timeout of timer Thu May 19 06:07:15 2011 --> __rpc_default_timer(): tk_pid=455 Timed out Thu May 19 06:07:15 2011 <-- __rpc_default_timer() ====5-minute-timeout (rpciod) Thu May 19 06:07:15 2011 --> xs_tcp_connect_worker() xprt=0xffff810132318400 socket=0xffff8101325fcd00 sk_state=CLOSE Thu May 19 06:07:15 2011 --> xs_tcp_reuse_connection() Thu May 19 06:07:15 2011 --> kernel_connect() socket=0xffff8101325fcd00 sock=0xffff810127145340 family=AF_UNSPEC Thu May 19 06:07:15 2011 <-- kernel_connect() return=0 Thu May 19 06:07:15 2011 --> kernel_connect() socket=0xffff8101325fcd00 sock=0xffff810127145340 family=AF_INET Thu May 19 06:07:15 2011 <-- kernel_connect() return=-EINPROGRESS Thu May 19 06:07:15 2011 <-- xs_tcp_connect_worker() Thu May 19 06:07:15 2011 --> xprt_connect() tk_pid=455 xprt=0xffff810132318400 (not connected) Thu May 19 06:07:15 2011 --> xs_connect() tk_pid=455 socket=0xffff8101325fcd00 (not connecting, queue delayed work) reestablish_timeout=300000 Thu May 19 06:07:15 2011 <-- xs_connect() tk_pid=455 Thu May 19 06:07:15 2011 <-- xprt_connect() tk_pid=455 Thu May 19 06:07:15 2011 --> xs_tcp_state_change() sock=0xffff810127145340 sk_state=ESTABLISHED Thu May 19 06:07:15 2011 <-- xs_tcp_state_change() ====5-minute-timeout (rpciod) Thu May 19 06:12:15 2011 --> xs_tcp_connect_worker() xprt=0xffff810132318400 socket=0xffff8101325fcd00 sk_state=ESTABLISHED Thu May 19 06:12:15 2011 --> xs_tcp_reuse_connection() Thu May 19 06:12:15 2011 --> kernel_connect() socket=0xffff8101325fcd00 sock=0xffff810127145340 family=AF_INET Thu May 19 06:12:15 2011 <-- kernel_connect() return=0 Thu May 19 06:12:15 2011 <-- xs_tcp_connect_worker() ====connectint to AF_UNSPEC isn't invoked, which would close established socket.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2011-1065.html