Description of problem:
ccsd uses reserved ports to authenticate that the local user is, in fact, root. This is good for security purposes. A client handshake / set of gets operates like this:

    foo = ccs_connect();
    while (ccs_get(foo, "query", &response) == 0) {
        handle_response(response);
    }
    ccs_disconnect(foo);

For large numbers of queries, however, connect() sometimes blocks for a long time -- several seconds. My guess is that this is related to the fact that for each ccs_connect(), ccs_disconnect() and ccs_get() call, we bind to a reserved port and then connect() to ccsd.

My simple cluster configuration does 531 connect() calls on reserved ports when starting up, and it pauses every few seconds. During startup, the setup_socket_ipv6() call hangs several times for around 3 seconds each.

Version-Release number of selected component (if applicable):
RHEL4 GA

How reproducible:
Sometimes.

Steps to Reproduce:
1. Create a cluster with lots of services.
2. Start rgmanager with "clurgmgrd -fd".

Sometimes, it can take whole minutes to "build resource trees". In this instance, it's simply querying ccsd for information in a systematic fashion.

Actual results:
rgmanager (and probably other apps) take a long time to read the configuration information from ccsd.

Expected results:
Fast response time from ccsd.

Known workarounds:
* This does not happen with "ccsd -4". Rgmanager starts up *very* quickly with the -4 option.

Additional info:
* There's no consistent pattern to how often the connect code hangs. Sometimes it's after 20 connections, sometimes it's after 300. I suspect it's related to running out of reserved ports.
* This might be a case of the socket getting SO_REUSEADDR in libccs for ipv4, but not ipv6.
Correction: SO_REUSEADDR is set, but the way we do port selection might not be appropriate.
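For anyone following along, here is a minimal sketch of the general "set SO_REUSEADDR, then scan a port range until bind() succeeds" pattern that a client performs before connect(). This is not the libccs code; bind_port_in_range() is a hypothetical helper, and it uses IPv4 loopback only to keep the example short.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from libccs): bind fd to the first free
 * loopback port in [lo, hi], scanning downward from hi.
 * Returns the bound port, or -1 with errno set.
 */
int bind_port_in_range(int fd, int lo, int hi)
{
    struct sockaddr_in addr;
    int one = 1;
    int port;

    /* Allow reuse of local addresses still held by old connections. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    for (port = hi; port >= lo; port--) {
        addr.sin_port = htons((unsigned short)port);
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
            return port;
        /* Only "port busy" is worth retrying; EACCES (not root on a
         * reserved port) and other errors end the scan. */
        if (errno != EADDRINUSE)
            return -1;
    }

    errno = EADDRINUSE;
    return -1;
}
```

Run as root, a call such as bind_port_in_range(fd, 512, 1023) would scan part of the reserved range (the bounds here are illustrative). Since fewer than 1024 ports are reserved and closed TCP connections linger in TIME_WAIT, a few hundred short-lived connections could plausibly strain a pool of that size, which would fit the symptoms above.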
Created attachment 116155
ccsd local socket patch

This patch allows libccs/ccsd to use local (UNIX domain) sockets for communication, which obviates the TIME_WAIT and limited count of available ports we have with IP protocols. The permissions on the socket are &~077 when created, so only root should be allowed to communicate over that socket.

This patch is compatible with existing installations:
* All applications built statically against the older libccs.a (which only uses IP for communications) are forward-compatible with the new ccsd, and
* All apps built against the new libccs (with UNIX domain socket support) will fall back to IPv6/IPv4 if local socket communication with ccsd is unavailable.
* Administrators may disable ccsd's use of UNIX domain sockets by running it with the new -I option.
Note: Existing users of linux-cluster will only benefit from this patch after a rebuild of each affected application, as most are (currently) statically built against libccs.
In RHEL4 U2