Red Hat Bugzilla – Bug 206435
Application stuck in recv() when cluster member crashes and IP address relocates
Last modified: 2009-04-17 06:49:26 EDT
Description of problem:
Our application uses TCP sockets to communicate between cluster members and the
member hosting one of the clustered IP addresses. After the node hosting
the IP address is 'fenced' (turned off), the client applications stay hung in a
blocking recv() indefinitely.
On other cluster architectures, we are used to the sockets returning EOF or some
kind of error.
If the node were turned off and no other communication occurred, I could
understand the hang. But this IP address was relocated to another node and our
application successfully restarted. I don't know what mechanism triggers the
socket close on other OS's. Re-arp maybe? Should TCP retransmits from the
client have detected a problem and closed the socket on the client side? Maybe
there is a way to tune the TCP stack so that it doesn't hang indefinitely?
Version-Release number of selected component (if applicable):
$ rpm -q cman
-bash-3.00$ rpm -q kernel
Steps to Reproduce:
1. Create a client/server relationship between one node in a cluster and a
clustered IP address.
2. Fence off the node hosting the clustered IP address.
3. The client application will hang, even after the IP address relocates to
another node.

Actual results:
client application hangs in recv()

Expected results:
recv() should return an error and the remote end of the socket should be closed
after the IP address is re-hosted on another node (or after a reboot).
This process has been hung for over 10 hours:
cgi 11389 0.0 0.0 17104 3128 ? SN Sep13 0:00 dcs_operation_mgr
-oper EPSREJTEST1 -index 32763 -nice 10 -plan EPSTST1 -master
PYGEPS4_INPUT_7_038166.dat -reprocess_id 0 -file PYGEPS4_INPUT_7_038166.dat
-sanity 300 -service blade01-1 25383 UDP
#0 0xffffe405 in __kernel_vsyscall ()
#1 0x00276d51 in recv () from /lib/tls/libc.so.6
#2 0xf778f6d8 in cgipc_recv (serv=0x80971e0 "10.0.0.240 24037 TCP",
rserv=0xffff9310 "", cgipc=0x808c308, buf=0x808f240 "",
pkt_bytes=0xffff930e) at cgipc_recv.c:838
#3 0xf7706d99 in receive_pkt (packet=0xffff93b0) at uio_file_serv_api.c:1874
#4 0xf7704b0c in uiofs_open_isam (handle=0xffffa688,
file_name=0xf7d62638 "DCS_DB_PATH:dcs_rtgreq_file", info=0xffffbac1,
options=3) at uio_file_serv_api.c:450
#5 0xf7710f3e in uio_open_isam (handle=0xf7fe0788,
name=0xf7d62638 "DCS_DB_PATH:dcs_rtgreq_file", options=3)
#6 0xf7cd511e in dcs_rtgreq_init () at dcs_plan_internal.c:1733
#7 0xf7cd5d5d in dcs_rtgreq_get (rtgreq=0x8084a70, key_type=0, lock=524288,
option=1) at dcs_plan_internal.c:2347
#8 0xf7cb7a4e in dcs_oper_build_rtgreq (plan=0x8063fec,
master_file=0x8064000, operation=0x8063fe0, file=0x8064280,
reprocess_id=0, request=0x8084a70) at dcs_plan_api.c:7893
#9 0x0804f075 in build_req_list (event_driven=15) at dcs_operation_mgr.c:1768
#10 0x0804eee2 in get_command_line_options (argc=1, argv=0xffffad34)
#11 0x0804d151 in init (argc=19, argv=0xffffad34) at dcs_operation_mgr.c:1057
#12 0x0804ce52 in main (argc=19, argv=0xffffad34) at dcs_operation_mgr.c:936
10.0.0.240 is our clustered IP address. The node blade04 was fenced off
yesterday at 18:30 (kernel panic).
I suspect this needs to be assigned to the kernel. It's certainly nothing to
do with cman. I thought maybe some extra step would need to be taken when
relocating an IP address, but this is probably an issue in the kernel.
Our kernel version:
$ uname -a
Linux blade01-1 2.6.9-42.0.2.ELsmp #1 SMP Thu Aug 17 17:57:31 EDT 2006 x86_64
x86_64 x86_64 GNU/Linux
We have been running our product on Digital/Compaq/HP TruCluster types of
clusters, HP Service Guard, and Sun clusters using SunCluster and Veritas without
this issue. We have also been running Linux 2.2 and Linux 2.4 servers for many
years that communicate with a clustered server, and I've never noticed this
issue. So my hunch is that there must be something different about
Linux 2.6, or Linux 2.6 on x86_64, that is causing this situation.
Patrick asked me if we had set setsockopt(SO_KEEPALIVE) on. The answer is no.
We used to use KEEPALIVE for these types of connections (pre 1999), but they
were apparently detrimental to server restarts (at least, that is what my
comment from sometime in 2000 says.)
This problem is a showstopper for us. We don't have it in production anywhere,
but we won't be able to use clustering with our application on Linux until we
can find out why it behaves differently from the HP, OSF, and Sun clusters. And
we only rarely install our application (telecom industry) without clustering.
I think it would be best to provide tcpdumps from the hanging machine (binary
format please) of a hanging and a non-hanging instance. The comparison of the
two dumps will help us track down where the problem is occurring.
I assume you want the TCP dumps while this is happening. The situation is that
we have 3 nodes in a cluster, A B C
A is the 'master' node running a TCP server task, and A, B, and C are running
code that executes TCP clients. A shuts down due to a crash and panic (but I
assume it could be for any reason), and then B and C are left with hanging
processes, even while the IP address that A was acting as a server on behalf of
migrates to node B or C.
I'm guessing you want me to run tcpdump on B and C and then shut off A. How
long should I keep dumping? The processes will hang indefinitely.
Note that the client processes on A, B and C are transient worker tasks and
there will be many megabytes of data if I run this test. I'll try to limit it
as much as possible and get you dumps ASAP.
Also note that we plan to update to RHEL 5 Beta sometime soon to test GFS issues.
Are you using LVS or some similar cluster package to cluster these nodes? If
so, you could probably just get a tcpdump from the client, and that would be
sufficient. You only need to dump until you determine that the connection is
hung. I understand that the dumps will be large, but if you can use a capture
filter that selects only the traffic to and from the client node, that should
help a little, and I'll manage whatever size they wind up being from there.
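A capture filter along these lines would keep the dump to the relevant traffic. This is a hypothetical invocation (the interface name is a placeholder; 10.0.0.240 is the clustered IP from the backtrace above), run until the connection is confirmed hung:

```shell
# Hypothetical invocation (interface name is a placeholder): capture only
# traffic to and from the clustered IP, full packets, in binary pcap form.
tcpdump -i eth0 -s 0 -w client-hang.pcap host 10.0.0.240
```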
Although, now that I think about it, I'm a little curious about your cluster
setup. How do you expect TCP sockets to migrate between nodes in a cluster?
Reading your problem a little more closely (the cluster angle didn't occur to me
earlier), I don't see how socket state for tcp can be migrated between nodes.
To make that work, the server cluster is going to have to go through either a
reset cycle on the connection, or a graceful FIN/ACK shutdown and connection
re-establishment with the client, which means your server and client code will
have to be prepared for that.
Well, the dumps will tell us more about what exactly is happening during the
failover, and we can know more about exactly what is going wrong.
(In reply to comment #6)
> Although, now that I think about it, I'm a little curious about your cluster
> setup. How do you expect TCP sockets to migrate between nodes in a cluster?
> Reading your problem a little more closely (the cluster angle didn't occur to
> me earlier), I don't see how socket state for tcp can be migrated between
> nodes. To make that work, the server cluster is going to have to go through
> either a reset cycle on the connection, or a graceful FIN/ACK shutdown and
> connection re-establishment with the client, which means your server and
> client code will have to be prepared for that.
Eric is out of the office for the rest of the week. From what I understand, we
don't want to migrate the state of the socket. What's happening right now is
that the sockets are hung after the service is migrated to another node, so
the processes never exit. Other Unixes return EOF or some other error in this
case. We were thinking that the socket should no longer be valid anyway, since
the endpoint has moved to another node (same IP, different node).
That's good, because you can't migrate the state of the socket anyway (that's
the reset cycle, or the FIN/ACK and re-establish cycle, I referred to
previously). What a socket read will return (or whether it will return at all)
depends on exactly what packets are exchanged between the client and the
server during and after the failover.
The fact that you don't expect the socket to migrate is good, though.
And as I think about it more, while a trace from the client will be really
helpful, additional traces from cluster nodes A, B, and C (using a mirrored
port for traffic to A if need be, to capture traffic after the fencing) would
also be good.
Also, if you could describe the cluster setup a little more, I would appreciate
it (are you using LVS, or cluster suite, or another utility to do your
clustering?), and a description of the network topology between the client and
the cluster nodes would help me when I read the traces.
Sorry. We are using RH Cluster Suite. The traces will have to wait for Eric
to get back next week, since he had a bunch of stuff set up before when we were
working with Wendy.
Internode communication is on a LAN connected to a switch. There are 4 blades
and a single console machine. There were some issues with our software causing
a kernel panic on one of the nodes which in turn caused the service to
migrate. Clients on other nodes were connected to the service and these are
the processes that are actually hanging.
ping, any update?
We have been waiting on a stable version of GFS before getting back to testing
the Red Hat cluster. We upgraded to RHEL5 and all the old bugs we had were
back, so we can't even run our application anymore.
ok, let me know when you get back to it then.
closing due to inactivity.