77465 – TCP reconnect timeout problem. Patch written by NetApp fixes the problem.

Summary: TCP reconnect timeout problem. Patch written by NetApp fixes the problem.

Keywords:
Status:	CLOSED DEFERRED
Alias:	None
Product:	Red Hat Enterprise Linux 2.1
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	2.1
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Larry Woodman
QA Contact:	Ben Levenson
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	76942
Depends On:
Blocks:

Reported:	2002-11-07 15:15 UTC by Michael Waite
Modified:	2007-11-30 22:06 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2005-10-14 10:54:48 UTC
Target Upstream Version:
Embargoed:

Description Michael Waite 2002-11-07 15:15:22 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.6 (X11; Linux i686; U;) Gecko/20020830

Description of problem:
> We are using AS2.1 as our base OS to run Oracle 9i with a NetApp NFS
> server. When we failed over the NetApp cluster, the cluster completed takeover
> in about 30 seconds, including exports, IP and MAC address. We could not "df -k"
> the NetApp-mounted file systems on the Linux server for about 10-12 minutes when
> we used the mount option tcp. However, if you use the mount option udp, "df -k"
> comes back right away after the takeover. Is there any TCP setting to reduce the
> 10-minute delay?

> > The times for the test are:
> > 
> > 11:27:27 Takeover begun
> > 11:27:50 Takeover complete
> > 11:38:52 Oracle resumes processing load
> > 
> > To clarify:
> > 
> > "cf takeover" causes the problem
> > "cf giveback" causes the problem
> > "halt" (to induce a takeover) causes the problem
> > power off (to induce a takeover) does NOT cause the problem

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Mount a filesystem with TCP
2. Fail over the NFS server
3. Run "df -k" on the mounted filesystem
	

Actual Results:   We installed the patch today.  It worked!  Now, there is
about a one-minute delay.  Oracle starts processing load about one minute
after takeover or giveback completes.  When we previously made this change
manually, we miscompiled the kernel, which is why it did not work before.

Additional info:

> > Attached is the rpc debug output.  No additional
> > activity was running this time.
> 
> looking through this i see the same pattern of behavior on
> one of the file systems -- there is significant activity
> during the ten minute takeover period which stops by itself,
> and the connection times out and closes.
> 
> however, another connection shows that it waits ten minutes
> for a reconnection that times out.  activity on that connection
> ends around 11:27:29, and picks up with the timeout again at
> 11:37:27, about when things get back to normal.
> 
> i've attached a patch that sets the reconnect timeout value
> to 1 minute (basically what the changes i sent friday were
> supposed to do).  given the rpc debug output, i don't
> understand why that change didn't fix your problem.
> 
> so take off the old patches i sent previously and apply this
> one.  if it doesn't fix the problem, send me rpc debug output.

Here is the patch that seems to fix the problem:

diff -ruN linux/net/sunrpc/xprt.c linux.maxval/net/sunrpc/xprt.c
--- linux/net/sunrpc/xprt.c     Fri Nov  1 13:09:46 2002
+++ linux.maxval/net/sunrpc/xprt.c      Mon Nov  4 15:54:35 2002
@@ -474,7 +474,7 @@
 
                spin_lock_bh(&xprt_sock_lock);
                if (!xprt_connected(xprt)) {
-                       task->tk_timeout = xprt->timeout.to_maxval;
+                       task->tk_timeout = 60 * HZ;
                        rpc_sleep_on(&xprt->reconn, task, xprt_reconn_status, NULL);
                        spin_unlock_bh(&xprt_sock_lock);
                        return;

Comment 1 Stephen Tweedie 2002-11-07 18:22:14 UTC

*** Bug 76942 has been marked as a duplicate of this bug. ***

Comment 2 Larry Woodman 2005-10-14 10:54:48 UTC

I don't think we can fix this in AS2.1 at this late date in the life cycle.  I
think everything is OK in RHEL3.

Larry Woodman