Hi Red Hat Support,

We have come across a situation that we suspect is a bug in LVS while running integration tests. The problem only arises when we attempt to re-use a TCP connection after a failover. During redundancy testing we place a call via a TCP connection and then, after failover, attempt to re-use that connection (also re-using the same source port). The connection is not established, and the LVS director never times out the failed connection from the ipvsadm route table. The timer counts down and resets but never expires.

Our system details are below. I have attached a rar file containing all the information that I considered helpful to your investigation.

LVS Director A (primary): Redhat ES 4
uname -a : Linux gbcdff1275l.eu.ubiquity.net 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux
ipvsadm --version : ipvsadm v1.24 2003/06/07 (compiled with popt and IPVS v1.2.0)
pulse --version : Program Version: pulse 1.56 Built: 03/Mar/2006 A component of: piranha-0.8.2-1

LVS Director B (backup):
uname -a : Linux gbcdff1379l.eu.ubiquity.net 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 18:00:32 EDT 2006 i686 i686 i386 GNU/Linux
ipvsadm --version : ipvsadm v1.24 2003/06/07 (compiled with popt and IPVS v1.2.0)
Created attachment 140106 [details] This file contains the configuration files of the LVS directors
(FYI, Red Hat Support is here: http://www.redhat.com/apps/support/ ) Andy Gospodarek proposed a patch to linux-kernel a while ago to fix this, but it was declined: http://www.mail-archive.com/netdev@vger.kernel.org/msg12010.html
I'm not sure this will solve your problem completely, but the patch posted above is included in builds located here: http://people.redhat.com/agospoda/#rhel4 These builds allow connections that are failed over to use the configured timeout value instead of the hard-coded timeout value of 3 minutes. I noticed that you have a timeout value of 6 seconds, so having a 3 minute timeout is probably not helping your situation.
Hi Red Hat, I agree that this solution does not appear to completely solve the issue. I will apply the patch and respond with the result. Regards, Osman Marks, Ubiquity Support.
Hi Red Hat, we are still experiencing the issue of connections not timing out. The problem does not seem to stop the initiation of new calls, but we do see a period of about 3 minutes where we cannot establish a new call. The connection then seems to reset and new calls can be initiated again, where previously the calls would not work until we rebooted the box. So the problem is better but not totally fixed. Regards, Osman Marks, Ubiquity Support
(In reply to comment #6)
> Hi Redhat,
>
> we still are experiencing the issue of connections not timing out, the problem
> does not seem to stop the initiation of new calls, but we do see a period of
> about 3 minutes where we cannot establish a new call.

I'm confused about the above statement. I read "initiation of new calls" and "establish a new call" as the same thing. Can you please clarify?

> the connection then seems to reset and new calls can be initiated again, where
> previously the calls would not work until we rebooted the box.

So 3 minutes after the timeout you can then initiate new calls again? Did you install the test kernel on both boxes doing LVS?
Even if we can reduce the amount of time that a TCP connection is allowed to hang, this still seems like more of a workaround than a real fix.

Yes, I did install the kernel on both boxes. The connections still show in ipvsadm -Lc 30 minutes after the call has been killed. Please can you advise what the ipvsadm -L --timeout parameter is for, and whether it could help our tests by making the connections time out faster? Currently the TCP connections hang, so I am not sure this would even be an active setting. The solution, though, should be a fix to LVS rather than workarounds that allow us to mask the problem, i.e. that LVS loses the state of the connection very easily when re-using TCP connections.

To clarify on your comment above: "establish a new call" and "initiation of new calls" are the same thing. The scenario is that for about 3 minutes after the failover we cannot make any calls. After approximately 3 minutes a call can be established, and we can keep creating calls again as if the bug were non-existent. I am guessing that the calls are blocked until the connection times out.

Regards, Osman Marks, Ubiquity Support
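(Note on the question above: as far as the ipvsadm man page describes it, ipvsadm -L --timeout prints the current IPVS timeouts for TCP sessions, TCP sessions after a FIN, and UDP packets, and ipvsadm --set tcp tcpfin udp changes them. Whether changing them affects the hung entries in this case is not established in this report.) To check the "timer counts down and resets" symptom, two ipvsadm -Lc snapshots taken a few seconds apart can be compared. The following is only a minimal sketch: the sample output, its addresses, and the helper names parse_connections and reset_flows are hypothetical, and it assumes the usual six-column listing with an mm:ss expire field.

```python
import re

# Hypothetical `ipvsadm -Lc` output; addresses and timers are made up.
SAMPLE = """\
IPVS connection entries
pro expire state       source             virtual            destination
TCP 02:47  ESTABLISHED 10.0.0.5:5060      10.0.0.100:5060    10.0.0.11:5060
TCP 00:06  FIN_WAIT    10.0.0.6:40000     10.0.0.100:5060    10.0.0.12:5060
"""


def parse_connections(text):
    """Parse connection rows, skipping the banner and column header."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has six columns and an mm:ss expire value.
        if len(fields) != 6 or not re.fullmatch(r"\d+:\d{2}", fields[1]):
            continue
        minutes, seconds = fields[1].split(":")
        entries.append({
            "proto": fields[0],
            "expire_s": int(minutes) * 60 + int(seconds),
            "state": fields[2],
            "source": fields[3],
            "virtual": fields[4],
            "destination": fields[5],
        })
    return entries


def reset_flows(earlier, later):
    """Return (source, virtual) pairs whose expire timer went up between snapshots."""
    old = {(e["source"], e["virtual"]): e["expire_s"] for e in parse_connections(earlier)}
    new = {(e["source"], e["virtual"]): e["expire_s"] for e in parse_connections(later)}
    return sorted(key for key in new if key in old and new[key] > old[key])


if __name__ == "__main__":
    # Simulate a later snapshot in which the second entry's timer was reset.
    later = SAMPLE.replace("00:06", "02:59")
    for source, virtual in reset_flows(SAMPLE, later):
        print(f"timer reset for {source} -> {virtual}")
```

In practice the two snapshots would come from running ipvsadm -Lc twice on the active director; an entry whose timer keeps going up long after the call has ended matches the behaviour described in this report.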
Were you able to move the LVS directors off the servers and make this work?
Hi Andy, We have moved the LVS directors away from the real servers, but have not had a chance to test this. We are awaiting a third party to schedule testing time. Regards, Osman Marks
Osman, Did you have a chance to test this after moving the LVS directors off the real servers?
Hi Andy, Thanks for the follow up. We did separate the Real Servers and the Linux Directors. I don't believe this fixed the issue, but we've put it down to the fact that our customer is using a strange configuration (retaining same source and destination port for subsequent connections). Feel free to set the case resolved/insufficient_data, as I don't think we'll be re-visiting this scenario in the near future.
Thanks for the update -- I'll close this one. If you are able to track down the problem, please open another bugzilla so we can track it.