Bug 206435

Summary:	Application stuck in recv() when cluster member crashes and IP address relocates
Product:	Red Hat Enterprise Linux 4	Reporter:	Eric Z. Ayers <eric.ayers>
Component:	kernel	Assignee:	Neil Horman <nhorman>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.4	CC:	davem, jbaron, jesse.marlin, tgraf
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-04-17 10:49:26 UTC	Type:	---

Description Eric Z. Ayers 2006-09-14 12:45:17 UTC

Description of problem:

Our application uses TCP sockets to communicate between members of the cluster
and the member hosting one of the clustered IP addresses.  After the node hosting
the IP address is 'fenced' (turned off), the client applications stay hung in a
blocking recv() indefinitely.

On other cluster architectures, we are used to the sockets returning EOF or some
kind of error.

If the node were turned off and no other communication occurred, I could
understand the hang.  But this IP address was relocated to another node and our
application successfully restarted.  I don't know what mechanism triggers the
socket close on other OSes.  Re-ARP maybe?  Should TCP retransmits from the
client have detected a problem and closed the socket on the client side?  Maybe
there is a way to tune the TCP stack so that it doesn't hang indefinitely?
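
For illustration only, one per-socket way to keep a blocking recv() from
waiting forever is the SO_RCVTIMEO socket option.  The sketch below is not taken
from the application in this report; the helper names set_recv_timeout() and
recv_or_timeout() and the RECV_TIMED_OUT value are hypothetical.  It only bounds
the wait; deciding whether to close and reconnect is left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical sentinel: the timeout expired before any data arrived. */
#define RECV_TIMED_OUT (-2)

/* Hypothetical helper: limit how long recv() on fd may block.
 * A timeout of 0 ms restores the default (block indefinitely).
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
int set_recv_timeout(int fd, long timeout_ms)
{
    struct timeval tv;

    assert(timeout_ms >= 0);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Hypothetical wrapper: returns the number of bytes read, 0 on EOF,
 * -1 on error, or RECV_TIMED_OUT if the receive timeout expired. */
ssize_t recv_or_timeout(int fd, void *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, 0);

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return RECV_TIMED_OUT;
    return n;
}
```

A caller would set the timeout once after connect() and treat RECV_TIMED_OUT as
a cue to check whether the peer is still there.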


Version-Release number of selected component (if applicable):

$ rpm -q cman
cman-1.0.11-0
-bash-3.00$ rpm -q kernel
kernel-2.6.9-42.0.2.EL


How reproducible:

Create a client/server relationship between one node in a cluster and a
clustered IP address.
Fence off the node hosting the clustered IP address.
The client application will hang, even after the IP address relocates to another
node.
  
Actual results:

client application hangs in recv()

Expected results:

recv() should return an error, and the remote end of the socket should be
treated as closed, after the IP address is re-hosted on another node (or after
a reboot).

Additional info:

This process has been hung for over 10 hours:

cgi      11389  0.0  0.0 17104 3128 ?        SN   Sep13   0:00 dcs_operation_mgr
-oper EPSREJTEST1 -index 32763 -nice 10 -plan EPSTST1 -master
PYGEPS4_INPUT_7_038166.dat -reprocess_id 0 -file PYGEPS4_INPUT_7_038166.dat
-sanity 300 -service blade01-1 25383 UDP

(gdb) where
#0  0xffffe405 in __kernel_vsyscall ()
#1  0x00276d51 in recv () from /lib/tls/libc.so.6
#2  0xf778f6d8 in cgipc_recv (serv=0x80971e0 "10.0.0.240 24037 TCP",
    rserv=0xffff9310 "", cgipc=0x808c308, buf=0x808f240 "",
    pkt_bytes=0xffff930e) at cgipc_recv.c:838
#3  0xf7706d99 in receive_pkt (packet=0xffff93b0) at uio_file_serv_api.c:1874
#4  0xf7704b0c in uiofs_open_isam (handle=0xffffa688,
    file_name=0xf7d62638 "DCS_DB_PATH:dcs_rtgreq_file", info=0xffffbac1,
    options=3) at uio_file_serv_api.c:450
#5  0xf7710f3e in uio_open_isam (handle=0xf7fe0788,
    name=0xf7d62638 "DCS_DB_PATH:dcs_rtgreq_file", options=3)
    at uio_index_api.c:624
#6  0xf7cd511e in dcs_rtgreq_init () at dcs_plan_internal.c:1733
#7  0xf7cd5d5d in dcs_rtgreq_get (rtgreq=0x8084a70, key_type=0, lock=524288,
    option=1) at dcs_plan_internal.c:2347
#8  0xf7cb7a4e in dcs_oper_build_rtgreq (plan=0x8063fec,
    master_file=0x8064000, operation=0x8063fe0, file=0x8064280,
    reprocess_id=0, request=0x8084a70) at dcs_plan_api.c:7893
#9  0x0804f075 in build_req_list (event_driven=15) at dcs_operation_mgr.c:1768
#10 0x0804eee2 in get_command_line_options (argc=1, argv=0xffffad34)
    at dcs_operation_mgr.c:1687
#11 0x0804d151 in init (argc=19, argv=0xffffad34) at dcs_operation_mgr.c:1057
#12 0x0804ce52 in main (argc=19, argv=0xffffad34) at dcs_operation_mgr.c:936
(gdb)

10.0.0.240 is our clustered IP address.  The node blade04 was fenced off 
yesterday at 18:30 (kernel panic).

Comment 1 Christine Caulfield 2006-09-14 12:58:20 UTC

I suspect this needs to be assigned to the kernel, maybe.
 
It's certainly nothing to do with cman.

Comment 2 Eric Z. Ayers 2006-09-14 13:26:29 UTC

I thought maybe some extra step might need to be taken when relocating an IP 
address, but probably this is an issue in the kernel.

Our kernel version:

$ uname -a
Linux blade01-1 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux


We have been running our product on Digital/Compaq/HP TruCluster types of
clusters, HP Service Guard, and Sun clusters using SunCluster and Veritas without
this issue.  We have been running Linux 2.2 and Linux 2.4 servers for many years
that communicate with a clustered server as well and I've never noticed this
issue.  So my hunch on this is that there must be something different about
Linux 2.6 or Linux 2.6 on x86_64 that is causing this situation.

Patrick asked me if we had turned on SO_KEEPALIVE with setsockopt().  The answer
is no.  We used to use keepalives for these types of connections (pre 1999), but
they were apparently detrimental to server restarts (at least, that is what my
comment from sometime in 2000 says.)
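
For reference, a minimal sketch of what turning keepalives on looks like on
Linux.  This is not code from our application: the helper name
enable_keepalive() and the example timings are hypothetical.  TCP_KEEPIDLE,
TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific per-socket overrides of the
system-wide keepalive defaults.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: enable TCP keepalive probing on a socket.
 *   idle_sec  - seconds of idleness before the first probe is sent
 *   intvl_sec - seconds between unanswered probes
 *   count     - unanswered probes before the connection is dropped
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
int enable_keepalive(int fd, int idle_sec, int intvl_sec, int count)
{
    int on = 1;

    assert(idle_sec > 0 && intvl_sec > 0 && count > 0);

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_sec, sizeof(idle_sec)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_sec, sizeof(intvl_sec)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
        return -1;
    return 0;
}
```

With, say, enable_keepalive(fd, 60, 10, 5), an idle connection to a dead peer
should be dropped after roughly 60 + 5 * 10 seconds, at which point a blocked
recv() returns an error instead of waiting forever.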

Comment 3 Eric Z. Ayers 2006-10-09 20:58:11 UTC

This problem is a showstopper for us.  We don't have it in production anywhere,
but we won't be able to use clustering with our APP on Linux until we can find
out why it behaves differently from HP, OSF, Sun in clusters.  And we only
rarely install our application (telecom industry) without clustering.

Comment 4 Neil Horman 2006-11-06 13:51:52 UTC

I think it would be best to provide tcpdumps from the hanging machine (binary
format please)  of a hanging and a non-hanging instance.  The comparison of the
two dumps will help us track down where the problem is occurring.

Comment 5 Eric Z. Ayers 2006-11-06 15:55:09 UTC

I assume you want the TCP dumps while this is happening.  The situation is that
we have 3 nodes in a cluster, A B C

A is the 'master' node running a TCP server task, and A, B and C are running code
that executes TCP clients.  A shuts down due to a crash and panic (but I assume it
could be for any reason) and then B and C are left with hanging processes, even
while the IP address that A was acting as a server on behalf of migrates to node
B or C.

I'm guessing you want me to run tcpdump on B and C and then shut off A.  How long
should I keep dumping?  The processes will hang indefinitely.


Note that the client processes on A, B and C are transient worker tasks and
there will be many megabytes of data if I run this test.  I'll try to limit it
as much as possible and get you dumps ASAP.  

Also note that we plan to update to RHEL 5 Beta sometime soon to test GFS issues.

Comment 6 Neil Horman 2006-11-30 19:20:41 UTC

Are you using LVS or some similar cluster package to cluster these nodes?  If
so, you could probably just get a tcpdump from the client, and that would be
sufficient.  You only need to dump until you determine that the connection is
hung.  I understand that they will be large, but if you can do a capture filter
that selects only the traffic to and from the client node that should help a
little, and I'll just manage whatever size they wind up being from there.

Although, now that I think about it,  I'm a little curious about your cluster
setup.  How do you expect TCP sockets to migrate between nodes in a cluster? 
Reading your problem a little more closely (the cluster angle didn't occur to me
earlier), I don't see how socket state for tcp can be migrated between nodes. 
To make that work, the server cluster is going to have to go through either a reset
cycle on the connection, or a graceful fin/ack shutdown and connection
re-establishment with the client, which means your server and client code will
have to be prepared for that.
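
As a rough illustration of what "prepared for that" could mean on the client
side (a sketch only, with hypothetical names, not the reporter's code): a FIN
from the peer makes recv() return 0, while a reset makes it fail with
ECONNRESET, and the client can treat either as the cue to close the socket and
reconnect.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical classification of what one recv() call observed. */
enum conn_event {
    CONN_DATA,    /* payload bytes arrived                     */
    CONN_CLOSED,  /* orderly shutdown: peer sent FIN (EOF)     */
    CONN_RESET,   /* connection reset by peer (RST)            */
    CONN_ERROR    /* any other failure; inspect errno          */
};

/* Hypothetical helper: perform one recv() and classify the result.
 * *nread receives the raw recv() return value. */
enum conn_event read_event(int fd, void *buf, size_t len, ssize_t *nread)
{
    ssize_t n;

    assert(buf != NULL && nread != NULL && len > 0);

    n = recv(fd, buf, len, 0);
    *nread = n;

    if (n > 0)
        return CONN_DATA;
    if (n == 0)
        return CONN_CLOSED;
    if (errno == ECONNRESET)
        return CONN_RESET;
    return CONN_ERROR;
}
```

A client loop would close the socket and re-establish the connection on
CONN_CLOSED or CONN_RESET rather than assuming the old connection survives a
failover.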

Well, the dumps will tell us more about what exactly is happening during the
failover, and we can know more about exactly what is going wrong.

Comment 7 Jesse Marlin 2006-11-30 19:47:49 UTC

(In reply to comment #6)
> Although, now that I think about it,  I'm a little curious about your cluster
> setup.  How do you expect TCP sockets to migrate between nodes in a cluster?
> Reading your problem a little more closely (the cluster angle didn't occur to
> me earlier), I don't see how socket state for tcp can be migrated between
> nodes.  To make that work, the server cluster is going to have to go through
> either a reset cycle on the connection, or a graceful fin/ack shutdown and
> connection re-establishment with the client, which means your server and
> client code will have to be prepared for that.

Eric is out of the office for the rest of the week.  From what I understand we
don't want to migrate the status of the socket.  What's happening right now is
that the sockets are hung after the service is migrated to another node.  So
the processes never exit.  Other Unices return EOF or some other error in this
case.  We were thinking that the socket should no longer be valid anyway since
the endpoint has moved to another node (same IP, different node).

Comment 8 Neil Horman 2006-11-30 20:35:25 UTC

That's good if you don't want to migrate the status of the socket, since you
can't do it anyway (that's the reset cycle, or the fin/ack and re-establish cycle
I referred to previously).  What a socket read will return (or whether it will
return) depends on exactly what packets are exchanged between the client and the
server during and after the failover.

The fact that you don't expect the socket to migrate is good, though.

And as I think about it more, while a trace from the client will be really
helpful, additional traces from cluster nodes A, B, and C (using a mirrored port
for traffic to A if need be, to capture after the fencing) would also be good.

Also, if you could describe the cluster setup a little more, I would appreciate
it (are you using LVS, or cluster suite, or another utility to do your
clustering?).  Knowing what the network topology between the client and the
cluster nodes looks like would also help me when I read the traces.

Thanks!

Comment 9 Jesse Marlin 2006-11-30 20:58:41 UTC

Sorry.  We are using RH cluster suite.  The traces will have to wait for Eric
to get back next week since he had a bunch of stuff set up before when we were
working with Wendy.

Internode communication is on a LAN connected to a switch.  There are 4 blades 
and a single console machine.  There were some issues with our software causing 
a kernel panic on one of the nodes which in turn caused the service to 
migrate.  Clients on other nodes were connected to the service and these are 
the processes that are actually hanging.

Comment 10 Neil Horman 2007-08-08 19:47:23 UTC

ping, any update?

Comment 11 Eric Z. Ayers 2007-08-08 19:53:50 UTC

We have been waiting on a stable version of GFS before getting back to testing
the Red Hat cluster.  We upgraded to RHEL5 and all the old bugs we had were
back, so we can't even run our application anymore.

Comment 12 Neil Horman 2007-08-08 19:55:56 UTC

ok, let me know when you get back to it then.

Comment 13 Neil Horman 2009-02-20 19:38:21 UTC

ping?

Comment 14 Neil Horman 2009-04-17 10:49:26 UTC

closing due to inactivity.