Bug 162078 - ccsd performance problems
Summary: ccsd performance problems
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: ccs   
Version: 4
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Lon Hohberger
QA Contact: Cluster QE
Reported: 2005-06-29 18:44 UTC by Lon Hohberger
Modified: 2009-04-16 20:17 UTC (History)

Fixed In Version: RHEL4 U2
Doc Type: Bug Fix
Last Closed: 2005-10-04 17:33:48 UTC

Attachments
ccsd local socket patch (11.83 KB, patch)
2005-06-29 23:00 UTC, Lon Hohberger

Description Lon Hohberger 2005-06-29 18:44:37 UTC
Description of problem:

ccsd uses reserved ports to authenticate that the local user is, in fact, root.
 This is good for security purposes.

A client handshake / set of gets operates like this:

        int foo;
        char *response;

        foo = ccs_connect();
        while (ccs_get(foo, "query", &response) == 0) {
                handle_response(response);
                free(response);  /* ccs_get() hands back an allocated string */
        }
        ccs_disconnect(foo);

For large numbers of queries, however, connect() sometimes blocks for several
seconds.  My guess is that this happens because each ccs_connect(), ccs_get()
and ccs_disconnect() call binds to a reserved port and then connect()s to
ccsd.  My simple cluster configuration makes 531 connect() calls on reserved
ports when starting up, and it pauses every few seconds.  During startup, the
setup_socket_ipv6() call hangs several times for around 3 seconds each.
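
To illustrate the per-call cost described above, here is a minimal sketch of
binding a socket to a reserved port before connecting.  It is not the libccs
code: bind_reserved_port() and the RESV_PORT_* names are hypothetical, the
512 to 1023 range is an assumption, and IPv4 loopback is used only to keep the
example short.  Binding below port 1024 normally requires root.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define RESV_PORT_MIN 512   /* assumed lower bound of the scan */
#define RESV_PORT_MAX 1023  /* highest privileged port */

/*
 * Bind fd to the first free reserved port on the IPv4 loopback address,
 * scanning downward.  Returns the port number, or -1 with errno set.
 */
int bind_reserved_port(int fd)
{
        struct sockaddr_in sin;
        int port;

        assert(fd >= 0);
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        for (port = RESV_PORT_MAX; port >= RESV_PORT_MIN; port--) {
                sin.sin_port = htons((unsigned short)port);
                if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) == 0)
                        return port;
                if (errno != EADDRINUSE)
                        return -1;  /* e.g. EACCES when not root */
        }
        errno = EADDRINUSE;         /* every candidate port was busy */
        return -1;
}
```

Every request that opens a fresh connection repeats this scan, and the pool of
candidate ports is small compared with the 531 connections counted above.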

Version-Release number of selected component (if applicable): RHEL4 GA


How reproducible: Sometimes.

Steps to Reproduce:
1. Create a cluster with lots of services.
2. Start rgmanager with "clurgmgrd -fd".  Sometimes it takes minutes to
"build resource trees".  During that phase, rgmanager is simply querying ccsd
for information in a systematic fashion.
  
Actual results:
rgmanager (and probably other apps) takes a long time to read the configuration
information from ccsd.

Expected results:
Fast response time from ccsd.


Known workarounds:

* This does not happen with "ccsd -4".  Rgmanager starts up *very* quickly with
the -4 option.


Additional info:

* There is no consistent pattern to how often the connect code hangs.
Sometimes it's after 20 connections, sometimes it's after 300.  I suspect it's
related to running out of reserved ports.
* This might be a case of the socket getting SO_REUSEADDR in libccs for IPv4,
but not IPv6.

Comment 1 Lon Hohberger 2005-06-29 18:46:41 UTC
Correction: SO_REUSEADDR is set, but the way we do port selection might not be
appropriate.
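
For reference, a minimal sketch of setting SO_REUSEADDR and reading it back.
The helper names set_reuseaddr() and get_reuseaddr() are hypothetical and do
not come from libccs.  The option has to be set before bind() to affect that
bind; it relaxes the address-in-use check for ports held in TIME_WAIT, but it
does not change which ports are tried.

```c
#include <assert.h>
#include <sys/socket.h>

/* Enable SO_REUSEADDR on fd.  Returns 0 on success, -1 on error. */
int set_reuseaddr(int fd)
{
        int on = 1;

        assert(fd >= 0);
        return setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
}

/* Read the option back.  Returns 1 if set, 0 if clear, -1 on error. */
int get_reuseaddr(int fd)
{
        int val = 0;
        socklen_t len = sizeof(val);

        assert(fd >= 0);
        if (getsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &val, &len) < 0)
                return -1;
        return val != 0;
}
```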

Comment 3 Lon Hohberger 2005-06-29 23:00:29 UTC
Created attachment 116155
ccsd local socket patch

This patch allows libccs/ccsd to use local (UNIX domain) sockets for
communication, which avoids the TIME_WAIT state and the limited number of
available ports that constrain the IP protocols.  The socket is created with
its mode masked by ~077 (no group or other access), so only root should be
allowed to communicate over that socket.

This patch is compatible with existing installations:

* All applications built statically against the older libccs.a (which only uses
IP for communications) are forward-compatible with the new ccsd, and
* All apps built against the new libccs (with UNIX domain socket support) will
fall back to IPv6/IPv4 if local socket communication with ccsd is unavailable.
* Administrators may disable ccsd's use of UNIX domain sockets by running it
with the new -I option.
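
The client-side fallback order described in the list above can be sketched as
follows.  This is an illustration under assumptions, not the patch itself:
connect_with_fallback(), try_connect() and the transport enum are hypothetical
names, the socket path and port are supplied by the caller, and the
reserved-port bind that the IP paths would need is omitted for brevity.

```c
#include <assert.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

enum transport { TRANSPORT_NONE, TRANSPORT_LOCAL, TRANSPORT_IPV6, TRANSPORT_IPV4 };

/* Connect fd to addr.  On failure, close fd (if valid) and return -1. */
int try_connect(int fd, const struct sockaddr *addr, socklen_t len)
{
        if (fd < 0)
                return -1;
        if (connect(fd, addr, len) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}

/*
 * Try the UNIX domain socket first, then IPv6 loopback, then IPv4 loopback.
 * Returns a connected fd and records the transport in *used, or -1.
 */
int connect_with_fallback(const char *path, unsigned short port,
                          enum transport *used)
{
        struct sockaddr_un sun;
        struct sockaddr_in6 sin6;
        struct sockaddr_in sin;
        int fd;

        assert(path != NULL && used != NULL);
        *used = TRANSPORT_NONE;

        if (strlen(path) < sizeof(sun.sun_path)) {
                memset(&sun, 0, sizeof(sun));
                sun.sun_family = AF_UNIX;
                strcpy(sun.sun_path, path);
                fd = try_connect(socket(AF_UNIX, SOCK_STREAM, 0),
                                 (struct sockaddr *)&sun, sizeof(sun));
                if (fd >= 0) {
                        *used = TRANSPORT_LOCAL;
                        return fd;
                }
        }

        memset(&sin6, 0, sizeof(sin6));
        sin6.sin6_family = AF_INET6;
        sin6.sin6_addr = in6addr_loopback;
        sin6.sin6_port = htons(port);
        fd = try_connect(socket(AF_INET6, SOCK_STREAM, 0),
                         (struct sockaddr *)&sin6, sizeof(sin6));
        if (fd >= 0) {
                *used = TRANSPORT_IPV6;
                return fd;
        }

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        sin.sin_port = htons(port);
        fd = try_connect(socket(AF_INET, SOCK_STREAM, 0),
                         (struct sockaddr *)&sin, sizeof(sin));
        if (fd >= 0)
                *used = TRANSPORT_IPV4;
        return fd;
}
```

A daemon started with local sockets disabled simply never creates the socket
file, so a client following this order lands on one of the IP paths.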

Comment 4 Lon Hohberger 2005-06-29 23:03:13 UTC
Note: Existing users of linux-cluster will only benefit from this patch after a
rebuild of each affected application, as most are (currently) statically built
against libccs.

Comment 6 Jonathan Earl Brassow 2005-10-04 17:33:48 UTC
Fixed in RHEL4 U2.

