Bug 418961 - Mixed endian clusters on same subnet can cause ccsd to consume 90+% memory and cpu during startup.
Summary: Mixed endian clusters on same subnet can cause ccsd to consume 90+% memory an...
Alias: None
Product: Red Hat Cluster Suite
Classification: Retired
Component: ccs
Version: 4
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Ryan O'Hara
QA Contact: Cluster QE
Depends On: 418741
Blocks: 431575
TreeView+ depends on / blocked
Reported: 2007-12-10 22:53 UTC by Nate Straz
Modified: 2009-04-16 19:48 UTC (History)
2 users (show)

Fixed In Version: RHBA-2008-0795
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2008-07-25 19:08:16 UTC

Attachments (Terms of Use)

System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0795 normal SHIPPED_LIVE ccs bug fix update 2008-07-25 19:08:13 UTC

Description Nate Straz 2007-12-10 22:53:57 UTC
+++ This bug was initially created as a clone of Bug #418741 +++

Description of problem:
With a quorate PPC cluster on the same subnet as my 3 node x86_64 cluster I was
unable to get the 3 node x86_64 cluster to form.  cman_tool join would fail due
to a timeout and ccsd was spinning and eating all the memory it could.

Tracked this down to broadcast message from the ppc with the following comm header:
[cnx_mgr.c:327] Checking broadcast response.
recvfrom  (A PPC node)
ch->comm_type: 0x8000000
ch->comm_payload_size: 704905216

ccsd was trying to malloc 704905216 bytes every time it got a response from a
ppc node during the broadcast phase.

The values above are still in big endian byte order, process_broadcast() did not
properly call swab_header() prior to generating the buffer it was going to send
on the wire.

The following patch (against CVS HEAD) allows big/little endian clusters on the
same network.  

Index: cnx_mgr.c
RCS file: /cvs/cluster/cluster/ccs/daemon/cnx_mgr.c,v
retrieving revision 1.43
diff -u -b -B -r1.43 cnx_mgr.c
--- cnx_mgr.c   8 May 2007 14:25:41 -0000      1.43
+++ cnx_mgr.c   10 Dec 2007 20:17:59 -0000
@@ -1365,12 +1365,13 @@
     ch->comm_flags |= COMM_BROADCAST_FROM_QUORATE;
+  swab_header(ch);
   memcpy(buffer, ch, sizeof(comm_header_t));
+  swab_header(ch);  /* Swab back to dip into ch for payload_size */
   memcpy(buffer+sizeof(comm_header_t), payload, ch->comm_payload_size);
   log_dbg("Sending cluster.conf (version %d)...\n",
   sendlen = ch->comm_payload_size + sizeof(comm_header_t);
-  swab_header(ch);
   if(sendto(sfd, buffer, sendlen, 0,
            (struct sockaddr *)&addr, (socklen_t)len) < 0){
     log_sys_err("Sendto failed");

Version-Release number of selected component (if applicable):

cman-2.0.73-1.el5, but this was introduced in revision 1.37 of cnx_mgr.c

How reproducible:
Every time

Steps to Reproduce:
1. Start quorate ppc cluster
2. Attempt to start little endian cluster on same subnet.
   ccsd will either slow your nodes so cman_tool join fails,
   or you will see ccsd alloc a ton of mem, which it does free after the
   cman_tool join works.

Actual results:

On my cluster -- cman_tool join fails due to nodes becoming so sluggish.
On clusters with more mem, the join will work.

Comment 1 Ryan O'Hara 2007-12-11 20:59:01 UTC

Comment 6 errata-xmlrpc 2008-07-25 19:08:16 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.


Note You need to log in before you can comment on or make changes to this bug.