Bug 213624 - LVS - IPVSADM and TCP connection re-use - failure of service
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: ipvsadm
Version: 4.0
Hardware: i686
OS: Linux
Priority: medium
Severity: high
Assignee: Andy Gospodarek
Reported: 2006-11-02 11:38 UTC by Ubiquity Software
Modified: 2014-06-29 22:58 UTC

Doc Type: Bug Fix
Last Closed: 2007-01-29 17:05:13 UTC


Attachments
This file contains the configuration files of the LVS directors (5.00 KB, application/gzip)
2006-11-02 11:38 UTC, Ubiquity Software

Description Ubiquity Software 2006-11-02 11:38:05 UTC
Hi Redhat Support,

We have come across a situation which we suspect is a bug in LVS, whilst doing
some integration tests.

The problem only arises when we attempt to re-use a TCP connection after a
failover. During redundancy testing we place a call over a TCP connection and
then, after failover, attempt to re-use that connection (also re-using the same
source port). The connection is not established, and the LVS director does not
time out the failed connection in the ipvsadm connection table. The timer
counts down and resets but never expires.
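
One way to capture the "counts down and resets" behaviour as evidence is to take
two snapshots of the director's connection listing a few seconds apart and
compare the expire column. The sketch below is illustrative only and is not part
of the original report. It assumes the usual `ipvsadm -L -c -n` layout (pro,
expire as mm:ss, state, source, virtual, destination); the function names are
hypothetical. A timer that goes up is also normal for a connection that is still
carrying traffic, so this only narrows down candidates.

```python
import re

# Assumed layout of one connection line from `ipvsadm -L -c -n`:
#   pro expire state source virtual destination
LINE_RE = re.compile(
    r"^(?P<pro>TCP|UDP)\s+(?P<expire>\d+:\d+)\s+(?P<state>\S+)\s+"
    r"(?P<source>\S+)\s+(?P<virtual>\S+)\s+(?P<dest>\S+)\s*$"
)


def parse_connections(text):
    """Map (pro, source, virtual, dest) -> (state, seconds until expiry)."""
    entries = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip the header lines and anything unrecognised
        minutes, seconds = m.group("expire").split(":")
        key = (m.group("pro"), m.group("source"),
               m.group("virtual"), m.group("dest"))
        entries[key] = (m.group("state"), int(minutes) * 60 + int(seconds))
    return entries


def timers_that_reset(earlier, later):
    """Return entries present in both snapshots whose expire value went up."""
    before = parse_connections(earlier)
    after = parse_connections(later)
    return sorted(
        key for key in before
        if key in after and after[key][1] > before[key][1]
    )
```

Feeding it the saved output of two listings taken shortly before and shortly
after an entry was due to expire would list the entries whose timer was
refreshed instead of the entry being removed.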

Our system details.

I have attached an archive containing all the information that I consider
helpful to your investigation.

LVS Director A (primary) : 

Redhat ES 4

uname -a : Linux gbcdff1275l.eu.ubiquity.net 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux

ipvsadm --version : ipvsadm v1.24 2003/06/07 (compiled with popt and IPVS v1.2.0)

pulse --version :

Program Version:        pulse 1.56
Built:                  03/Mar/2006
A component of:         piranha-0.8.2-1

LVS Director B (backup):

uname -a : Linux gbcdff1379l.eu.ubiquity.net 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux

ipvsadm --version : ipvsadm v1.24 2003/06/07 (compiled with popt and IPVS v1.2.0)

Comment 1 Ubiquity Software 2006-11-02 11:38:05 UTC
Created attachment 140106
This file contains the configuration files of the LVS directors

Comment 3 Lon Hohberger 2006-11-03 15:42:34 UTC
(FYI, Red Hat Support is here: http://www.redhat.com/apps/support/ )

Andy Gospodarek proposed a patch to linux-kernel a while ago to fix this, but it
was declined:

http://www.mail-archive.com/netdev@vger.kernel.org/msg12010.html

Comment 4 Andy Gospodarek 2006-11-06 16:04:22 UTC
I'm not sure this will solve your problem completely, but the patch posted above
is included in builds located here:

http://people.redhat.com/agospoda/#rhel4

These builds allow connections that are failed-over to use the configured
timeout value instead of the hard-coded timeout value of 3 minutes.  I noticed
that you have a timeout value of 6 seconds, so having a 3 minute timeout is
probably not helping your situation.

Comment 5 Ubiquity Software 2006-11-07 09:54:08 UTC
Hi Redhat,

I agree that this solution does not appear to completely solve the issue. I will
apply the patch and respond with the result.

Regards

Osman Marks

Ubiquity Support.

Comment 6 Ubiquity Software 2006-11-09 15:00:08 UTC
Hi Redhat,

We are still experiencing the issue of connections not timing out. The problem
does not seem to stop the initiation of new calls, but we do see a period of
about 3 minutes where we cannot establish a new call.

The connection then seems to reset and new calls can be initiated again, where
previously the calls would not work until we rebooted the box.

So the problem is better but not totally fixed.

Regards

Osman Marks

Ubiquity Support

Comment 7 Andy Gospodarek 2006-11-09 16:10:41 UTC
(In reply to comment #6)
> Hi Redhat,
> 
> We are still experiencing the issue of connections not timing out. The problem
> does not seem to stop the initiation of new calls, but we do see a period of
> about 3 minutes where we cannot establish a new call.

I'm confused about the above statement.  I read "initiation of new calls" and
"establish a new call" as the same thing.  Can you please clarify?

> 
> The connection then seems to reset and new calls can be initiated again, where
> previously the calls would not work until we rebooted the box.
> 

So 3 minutes after the timeout you can then initiate new calls again?  Did you
install the test kernel on both boxes doing LVS?

Comment 8 Ubiquity Software 2006-11-09 16:26:38 UTC
Even if we can reduce the amount of time that a TCP connection is allowed to
hang, this still seems like more of a workaround than a real fix.

Yes, I did install the kernel on both boxes.

The connections are still showing in the ipvsadm -Lc output 30 minutes after
the call has been killed.

Please can you advise what the ipvsadm -L --timeout parameter is for, and
whether it could help our tests by making the connections time out faster?
Currently the TCP connections hang, so I am not sure this would even be an
active setting.

The solution, though, should be a fix to LVS rather than workarounds that allow
us to mask the problem, i.e. that LVS loses the state of the connection very
easily when re-using TCP connections.

To clarify your comment above: "establish a new call" and "initiation of new
calls" are the same thing. The scenario is that for about 3 minutes after the
failover we cannot make any call. After approximately 3 minutes the call can be
established, and we can keep creating calls again as if the bug did not exist.
For the period in which the bug appears we cannot make any calls until the
connection times out (I am guessing).

Regards

Osman Marks

Ubiquity Support
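
For reference on the ipvsadm -L --timeout question above: per the ipvsadm man
page, `ipvsadm -L --timeout` prints the three global IPVS timeouts (TCP, TCP
after FIN, UDP) in seconds, and `ipvsadm --set tcp tcpfin udp` changes them,
with 0 meaning "keep the current value". The sketch below only builds those
command lines; the helper name is hypothetical. Whether changing these values
affects failed-over entries is not established in this report (comment 4 says
stock kernels use a hard-coded 3 minute value for them).

```python
# Command that lists the current global timeouts (tcp, tcpfin, udp).
SHOW_TIMEOUTS_CMD = ["ipvsadm", "-L", "--timeout"]


def build_set_timeouts_cmd(tcp=0, tcpfin=0, udp=0):
    """Build argv for `ipvsadm --set tcp tcpfin udp`.

    Values are in seconds; 0 leaves the corresponding timeout unchanged.
    """
    for name, value in (("tcp", tcp), ("tcpfin", tcpfin), ("udp", udp)):
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
    return ["ipvsadm", "--set", str(tcp), str(tcpfin), str(udp)]
```

The resulting list could be passed to subprocess.run as root on a test
director, for example build_set_timeouts_cmd(tcp=120) to change only the TCP
timeout.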


Comment 10 Andy Gospodarek 2006-12-05 15:38:58 UTC
Were you able to move the LVS directors off the servers and make this work?

Comment 11 Ubiquity Software 2006-12-05 15:43:19 UTC
Hi Andy,

We have moved the LVS directors away from the real servers, but have not had a
chance to test this.

We are awaiting a third party to schedule testing time.

Regards

Osman Marks

Comment 12 Andy Gospodarek 2007-01-09 22:24:21 UTC
Osman, did you have a chance to test this after moving the LVS directors off the
real servers?

Comment 13 Ubiquity Software 2007-01-29 12:43:50 UTC
Hi Andy,

Thanks for the follow up. We did separate the Real Servers and the Linux
Directors. I don't believe this fixed the issue, but we've put it down to the
fact that our customer is using a strange configuration (retaining same source
and destination port for subsequent connections). Feel free to set the case
resolved/insufficient_data, as I don't think we'll be re-visiting this scenario
in the near future.

Comment 14 Andy Gospodarek 2007-01-29 17:05:13 UTC
Thanks for the update -- I'll close this one.  If you are able to track down the
problem, please open another bugzilla so we can track it.

