Bug 418741 - Mixed endian clusters on same subnet can cause ccsd to consume 90+% memory and cpu during startup.
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Hardware: All Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Assigned To: Ryan O'Hara
QA Contact: GFS Bugs
Keywords: ZStream
Depends On:
Blocks: 418961 428325 Cluster5-ppc
Reported: 2007-12-10 15:24 EST by Dean Jansa
Modified: 2009-04-16 18:19 EDT
Fixed In Version: RHBA-2008-0347
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:58:36 EDT

Description Dean Jansa 2007-12-10 15:24:07 EST
Description of problem:
With a quorate ppc cluster on the same subnet as my 3-node x86_64 cluster, I was
unable to get the x86_64 cluster to form.  cman_tool join failed with a timeout,
and ccsd was spinning and consuming all the memory it could.

I tracked this down to a broadcast message from a ppc node with the following comm header:
[cnx_mgr.c:327] Checking broadcast response.
recvfrom  (A PPC node)
ch->comm_type: 0x8000000
ch->comm_payload_size: 704905216

ccsd was trying to malloc 704905216 bytes every time it got a response from a
ppc node during the broadcast phase.
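
As a quick sanity check on the numbers above, the two logged values can be
byte-swapped by hand. The sketch below assumes the header fields are 32-bit
integers (the report does not show the comm_header_t layout); swap32() and the
other names are illustrative and are not taken from ccsd.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit value (portable, no compiler builtins). */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Values logged by ccsd on the x86_64 node, as quoted in this report. */
static const uint32_t observed_type         = 0x08000000u;
static const uint32_t observed_payload_size = 704905216u;   /* 0x2A040000 */

/* What the ppc sender presumably meant, ASSUMING 32-bit header fields. */
uint32_t intended_type(void)         { return swap32(observed_type); }
uint32_t intended_payload_size(void) { return swap32(observed_payload_size); }

/* Self-check: swapping twice must give back the original value. */
void check_round_trip(uint32_t v)
{
    assert(swap32(swap32(v)) == v);
}
```

Under that assumption, 0x8000000 decodes to a comm_type of 8 and 704905216
(0x2A040000) decodes to a payload size of 1066 bytes, which is a far more
plausible size than roughly 672 MB.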

The values above are still in big-endian byte order.  process_broadcast() called
swab_header() only after it had already copied the header into the buffer it was
going to send on the wire, so the header went out unswapped.

The following patch (against CVS HEAD) allows big-endian and little-endian
clusters to coexist on the same network.

Index: cnx_mgr.c
RCS file: /cvs/cluster/cluster/ccs/daemon/cnx_mgr.c,v
retrieving revision 1.43
diff -u -b -B -r1.43 cnx_mgr.c
--- cnx_mgr.c   8 May 2007 14:25:41 -0000      1.43
+++ cnx_mgr.c   10 Dec 2007 20:17:59 -0000
@@ -1365,12 +1365,13 @@
     ch->comm_flags |= COMM_BROADCAST_FROM_QUORATE;
+  swab_header(ch);
   memcpy(buffer, ch, sizeof(comm_header_t));
+  swab_header(ch);  /* Swab back to dip into ch for payload_size */
   memcpy(buffer+sizeof(comm_header_t), payload, ch->comm_payload_size);
   log_dbg("Sending cluster.conf (version %d)...\n",
   sendlen = ch->comm_payload_size + sizeof(comm_header_t);
-  swab_header(ch);
   if(sendto(sfd, buffer, sendlen, 0,
            (struct sockaddr *)&addr, (socklen_t)len) < 0){
     log_sys_err("Sendto failed");
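
The ordering issue the patch addresses can also be avoided by swapping a copy
of the header, so the caller's header stays in host order and payload_size can
be read without swapping back.  This is only a sketch of that idea, not the
ccsd code: sketch_header_t, sketch_swab_header() and sketch_pack() are
hypothetical stand-ins, the field layout is assumed, and the swap here is
unconditional, whereas the real swab_header() presumably swaps only on
big-endian hosts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for comm_header_t; the real layout is not shown here. */
typedef struct {
    uint32_t comm_type;
    uint32_t comm_flags;
    uint32_t comm_payload_size;
} sketch_header_t;

uint32_t sketch_swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

/* Hypothetical stand-in for swab_header(): swaps every field in place. */
void sketch_swab_header(sketch_header_t *h)
{
    h->comm_type         = sketch_swap32(h->comm_type);
    h->comm_flags        = sketch_swap32(h->comm_flags);
    h->comm_payload_size = sketch_swap32(h->comm_payload_size);
}

/*
 * Build the wire buffer: swapped header first, then the payload.
 * Returns the number of bytes written, or 0 if buf is too small.
 */
size_t sketch_pack(unsigned char *buf, size_t buflen,
                   const sketch_header_t *h, const void *payload)
{
    assert(buf != NULL && h != NULL && payload != NULL);

    sketch_header_t wire = *h;   /* work on a copy; *h stays in host order */
    size_t total = sizeof(wire) + h->comm_payload_size;

    if (buflen < total)
        return 0;

    sketch_swab_header(&wire);   /* swap BEFORE copying into the buffer */
    memcpy(buf, &wire, sizeof(wire));
    memcpy(buf + sizeof(wire), payload, h->comm_payload_size);
    return total;
}
```

The patch above reaches the same result by swapping ch in place, copying it,
and swapping it back; the copy-based variant simply removes the need for the
second swab_header() call.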

Version-Release number of selected component (if applicable):

cman-2.0.73-1.el5; the bug was introduced in revision 1.37 of cnx_mgr.c

How reproducible:
Every time

Steps to Reproduce:
1. Start a quorate ppc cluster.
2. Attempt to start a little-endian cluster on the same subnet.
   ccsd will either slow your nodes enough that cman_tool join fails,
   or you will see ccsd allocate a large amount of memory, which it frees
   after cman_tool join succeeds.

Actual results:

On my cluster, cman_tool join fails because the nodes become so sluggish.
On clusters with more memory, the join succeeds.
Comment 1 Ryan O'Hara 2007-12-11 15:58:35 EST
Comment 6 errata-xmlrpc 2008-05-21 11:58:36 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.