Bug 418961 - Mixed endian clusters on same subnet can cause ccsd to consume 90+% memory and cpu during startup.
Status: CLOSED ERRATA
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: ccs
Version: 4
Hardware: All Linux
Priority: urgent
Severity: urgent
Assigned To: Ryan O'Hara
QA Contact: Cluster QE
Depends On: 418741
Blocks: 431575
Reported: 2007-12-10 17:53 EST by Nate Straz
Modified: 2009-04-16 15:48 EDT
Fixed In Version: RHBA-2008-0795
Doc Type: Bug Fix
Last Closed: 2008-07-25 15:08:16 EDT
Description Nate Straz 2007-12-10 17:53:57 EST
+++ This bug was initially created as a clone of Bug #418741 +++

Description of problem:
With a quorate PPC cluster on the same subnet as my 3-node x86_64 cluster, I was
unable to get the x86_64 cluster to form.  cman_tool join would fail due to a
timeout, and ccsd was spinning and consuming all the memory it could.

I tracked this down to a broadcast message from a PPC node with the following comm header:
[cnx_mgr.c:327] Checking broadcast response.
recvfrom 10.15.89.60  (A PPC node)
ch->comm_type: 0x8000000
ch->comm_payload_size: 704905216

ccsd was trying to malloc 704905216 bytes every time it got a response from a
ppc node during the broadcast phase.

The values above are still in big-endian byte order: process_broadcast() did not
call swab_header() before building the buffer it was going to send on the wire.
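
To make the arithmetic concrete, here is a minimal sketch of how the same four
bytes decode under each byte order. The helpers load_le32(), load_be32() and
check_report_values() are hypothetical names for this illustration and are not
part of ccsd. The byte sequences are inferred from the values logged above,
assuming the header fields are 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Decode four bytes as a little-endian 32-bit integer (how x86_64 reads them). */
static uint32_t load_le32(const unsigned char b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Decode the same four bytes as a big-endian 32-bit integer (how PPC reads them). */
static uint32_t load_be32(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Self-check against the numbers in this report (assumed 32-bit fields). */
static void check_report_values(void)
{
    /* Bytes as a big-endian sender would emit them without swabbing. */
    const unsigned char type_bytes[4] = { 0x00, 0x00, 0x00, 0x08 };
    const unsigned char size_bytes[4] = { 0x00, 0x00, 0x04, 0x2A };

    /* What the PPC sender meant. */
    assert(load_be32(type_bytes) == 8u);
    assert(load_be32(size_bytes) == 1066u);

    /* What the x86_64 receiver saw. */
    assert(load_le32(type_bytes) == 0x8000000u);
    assert(load_le32(size_bytes) == 704905216u);
}
```

Read this way, the 704905216-byte allocation corresponds to an intended payload
of 1066 bytes, and comm_type 0x8000000 corresponds to the value 8.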

The following patch (against CVS HEAD) allows big/little endian clusters on the
same network.  

Index: cnx_mgr.c
===================================================================
RCS file: /cvs/cluster/cluster/ccs/daemon/cnx_mgr.c,v
retrieving revision 1.43
diff -u -b -B -r1.43 cnx_mgr.c
--- cnx_mgr.c   8 May 2007 14:25:41 -0000      1.43
+++ cnx_mgr.c   10 Dec 2007 20:17:59 -0000
@@ -1365,12 +1365,13 @@
     ch->comm_flags |= COMM_BROADCAST_FROM_QUORATE;
   }
 
+  swab_header(ch);
   memcpy(buffer, ch, sizeof(comm_header_t));
+  swab_header(ch);  /* Swab back to dip into ch for payload_size */
   memcpy(buffer+sizeof(comm_header_t), payload, ch->comm_payload_size);
 
   log_dbg("Sending cluster.conf (version %d)...\n", get_doc_version(master_doc->od_doc));
   sendlen = ch->comm_payload_size + sizeof(comm_header_t);
-  swab_header(ch);
   if(sendto(sfd, buffer, sendlen, 0,
            (struct sockaddr *)&addr, (socklen_t)len) < 0){
     log_sys_err("Sendto failed");
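
The ordering the patch establishes (swab, copy the header, swab back, then use
payload_size in host order) can be sketched in isolation as follows. This is
only an illustration under stated assumptions: demo_header_t, demo_swab32(),
demo_swab_header() and demo_build_packet() are hypothetical stand-ins, the real
comm_header_t layout is not shown in this report, and the swap here is
unconditional, whereas the real swab_header() presumably swaps only on
big-endian hosts.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical two-field stand-in for ccsd's comm_header_t. */
typedef struct {
    uint32_t comm_type;
    uint32_t comm_payload_size;
} demo_header_t;

/* Reverse the byte order of a 32-bit value. */
static uint32_t demo_swab32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

/* Unconditional swap, for demonstration only (assumption, see lead-in). */
static void demo_swab_header(demo_header_t *h)
{
    h->comm_type = demo_swab32(h->comm_type);
    h->comm_payload_size = demo_swab32(h->comm_payload_size);
}

/*
 * Build header + payload into buf. Returns the number of bytes written,
 * or 0 if buf is too small. The header is left in host order on return.
 */
static size_t demo_build_packet(unsigned char *buf, size_t buflen,
                                demo_header_t *h, const void *payload)
{
    size_t total = sizeof(*h) + h->comm_payload_size;

    if (total > buflen)
        return 0;

    demo_swab_header(h);             /* host order -> wire order */
    memcpy(buf, h, sizeof(*h));      /* the copy now carries wire order */
    demo_swab_header(h);             /* back to host order for payload_size */
    memcpy(buf + sizeof(*h), payload, h->comm_payload_size);

    return total;
}
```

The point of the second swab is that comm_payload_size is read again after the
header copy, both for the payload memcpy and for the send length, so the header
must be back in host order by then.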



Version-Release number of selected component (if applicable):

cman-2.0.73-1.el5, but this was introduced in revision 1.37 of cnx_mgr.c


How reproducible:
Every time


Steps to Reproduce:
1. Start a quorate PPC cluster.
2. Attempt to start a little-endian cluster on the same subnet.
   ccsd will either slow your nodes enough that cman_tool join fails,
   or allocate a large amount of memory, which it frees once
   cman_tool join succeeds.

  
Actual results:

On my cluster, cman_tool join fails because the nodes become too sluggish.
On clusters with more memory, the join succeeds.
Comment 1 Ryan O'Hara 2007-12-11 15:59:01 EST
Fixed.
Comment 6 errata-xmlrpc 2008-07-25 15:08:16 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0795.html
