Bug 77465

Summary: TCP reconnect timeout problem. Patch written by NetApp fixes the problem.
Product: Red Hat Enterprise Linux 2.1
Reporter: Michael Waite <mwaite>
Component: kernel
Assignee: Larry Woodman <lwoodman>
Status: CLOSED DEFERRED
QA Contact: Ben Levenson <benl>
Severity: medium
Priority: medium
Version: 2.1
CC: sct
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-10-14 10:54:48 UTC

Description Michael Waite 2002-11-07 15:15:22 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.6 (X11; Linux i686; U;) Gecko/20020830

Description of problem:
> We are using AS2.1 as our base OS to run Oracle 9i with a NetApp NFS
> server. When we failed over the NetApp cluster, the cluster completed takeover
> in about 30 seconds, including exports, IP, and MAC address. We could not "df -k"
> the NetApp-mounted file systems on the Linux server for about 10-12 minutes when
> we used the mount option tcp. However, with the mount option udp, "df -k" came
> back right away after the takeover. Is there any TCP setting to reduce the
> 10-minute delay?

> > The times for the test are:
> > 
> > 11:27:27 Takeover begun
> > 11:27:50 Takeover complete
> > 11:38:52 Oracle resumes processing load
> > 
> > To clarify:
> > 
> > "cf takeover" causes the problem
> > "cf giveback" causes the problem
> > "halt" (to induce a takeover) causes the problem
> > power off (to induce a takeover) does NOT cause the problem
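
As a quick check of the timestamps in the test above, the following sketch computes the two intervals involved: how long the takeover itself took, and how long the client stayed stalled after the takeover had completed. The function names are hypothetical and exist only for this illustration; the timestamps are the ones quoted in the report.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a time of day to seconds since midnight. */
static long hms_to_seconds(int h, int m, int s)
{
    assert(h >= 0 && h < 24 && m >= 0 && m < 60 && s >= 0 && s < 60);
    return (long)h * 3600 + (long)m * 60 + s;
}

/* Elapsed seconds between two timestamps on the same day. */
static long elapsed_seconds(int h1, int m1, int s1, int h2, int m2, int s2)
{
    return hms_to_seconds(h2, m2, s2) - hms_to_seconds(h1, m1, s1);
}

/* Print the intervals taken from the reported test times. */
static void print_takeover_timing(void)
{
    /* 11:27:27 takeover begun, 11:27:50 takeover complete */
    long takeover = elapsed_seconds(11, 27, 27, 11, 27, 50);
    /* 11:27:50 takeover complete, 11:38:52 Oracle resumes */
    long stall = elapsed_seconds(11, 27, 50, 11, 38, 52);

    printf("takeover took %ld s\n", takeover);
    printf("client stalled for %ld s (%ld min %ld s) after takeover\n",
           stall, stall / 60, stall % 60);
}
```

Calling print_takeover_timing() from a main() reports a 23-second takeover followed by a stall of 662 seconds (11 min 2 s), which matches the 10-12 minute delay described above.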

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Mount a filesystem with TCP.
2. Fail over the NFS server.
3. Run "df -k" on the mounted filesystem.

Actual Results:   We installed the patch today, and it worked. There is now
about a one-minute delay: Oracle starts processing load about one minute
after takeover or giveback completes. When we previously made this change
manually, we miscompiled the kernel, which is why it did not work before.

Additional info:

> > Attached is the rpc debug output.  No additional
> > activity was running this time.
> 
> looking through this i see the same pattern of behavior on
> one of the file systems -- there is significant activity
> during the ten minute takeover period which stops by itself,
> and the connection times out and closes.
> 
> however, another connection shows that it waits ten minutes
> for a reconnection that times out.  activity on that connection
> ends around 11:27:29, and picks up with the timeout again at
> 11:37:27, about when things get back to normal.
> 
> i've attached a patch that sets the reconnect timeout value
> to 1 minute (basically what the changes i sent friday were
> supposed to do).  given the rpc debug output, i don't
> understand why that change didn't fix your problem.
> 
> so take off the old patches i sent previously and apply this
> one.  if it doesn't fix the problem, send me rpc debug output.

Here is the patch that seems to fix the problem:

diff -ruN linux/net/sunrpc/xprt.c linux.maxval/net/sunrpc/xprt.c
--- linux/net/sunrpc/xprt.c     Fri Nov  1 13:09:46 2002
+++ linux.maxval/net/sunrpc/xprt.c      Mon Nov  4 15:54:35 2002
@@ -474,7 +474,7 @@
 
                spin_lock_bh(&xprt_sock_lock);
                if (!xprt_connected(xprt)) {
-                       task->tk_timeout = xprt->timeout.to_maxval;
+                       task->tk_timeout = 60 * HZ;
                        rpc_sleep_on(&xprt->reconn, task, xprt_reconn_status, NULL);
                        spin_unlock_bh(&xprt_sock_lock);
                        return;
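
The patch replaces the per-transport value xprt->timeout.to_maxval with a fixed 60 * HZ, that is, 60 seconds expressed in timer ticks (jiffies). The userspace sketch below only illustrates that unit arithmetic and compares it with the reconnect gap seen in the rpc debug output (11:27:29 to 11:37:27). It is not kernel code. The HZ value of 100 is an assumption (typical for i686 kernels of this era), the helper names are hypothetical, and the actual to_maxval on the reporter's system is not stated in this report.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* ASSUMPTION: timer ticks per second; 100 is typical for i686 2.4 kernels. */
#define HZ 100UL

/* Convert seconds to jiffies, as "60 * HZ" does in the patch. */
static unsigned long seconds_to_jiffies(unsigned long secs)
{
    assert(secs <= ULONG_MAX / HZ); /* guard against overflow */
    return secs * HZ;
}

/* Convert jiffies back to whole seconds. */
static unsigned long jiffies_to_seconds(unsigned long jiffies)
{
    return jiffies / HZ;
}

/* Gap between the last activity (11:27:29) and the timeout (11:37:27). */
static unsigned long observed_gap_seconds(void)
{
    unsigned long last_activity = 11UL * 3600 + 27 * 60 + 29;
    unsigned long timeout_fired = 11UL * 3600 + 37 * 60 + 27;
    return timeout_fired - last_activity;
}

/* Print the patched timeout next to the gap seen before the patch. */
static void print_timeout_comparison(void)
{
    unsigned long patched = seconds_to_jiffies(60);

    printf("patched reconnect timeout: %lu jiffies (%lu s)\n",
           patched, jiffies_to_seconds(patched));
    printf("gap observed before the patch: %lu s\n",
           observed_gap_seconds());
}
```

Under these assumptions the patched timeout is 6000 jiffies (60 s), against an observed gap of 598 s before the patch, which is consistent with the roughly one-minute delay reported after the patch was applied.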

Comment 1 Stephen Tweedie 2002-11-07 18:22:14 UTC
*** Bug 76942 has been marked as a duplicate of this bug. ***

Comment 2 Larry Woodman 2005-10-14 10:54:48 UTC
I don't think we can fix this in AS2.1 at this late date in the life cycle.  I
think everything is OK in RHEL3.

Larry Woodman