Bug 418741 - Mixed endian clusters on same subnet can cause ccsd to consume 90+% memory and cpu during startup.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: cman
Version: 5.1
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assigned To: Ryan O'Hara
QA Contact: GFS Bugs
Keywords: ZStream
Depends On:
Blocks: 418961 428325 Cluster5-ppc
Reported: 2007-12-10 15:24 EST by Dean Jansa
Modified: 2009-04-16 18:19 EDT (History)

See Also:
Fixed In Version: RHBA-2008-0347
Doc Type: Bug Fix
Last Closed: 2008-05-21 11:58:36 EDT
Attachments: None

Description Dean Jansa 2007-12-10 15:24:07 EST
Description of problem:
With a quorate PPC cluster on the same subnet as my 3-node x86_64 cluster, I was
unable to get the x86_64 cluster to form.  cman_tool join would fail with a
timeout while ccsd spun and consumed all the memory it could.

I tracked this down to a broadcast message from a PPC node with the following
comm header:
[cnx_mgr.c:327] Checking broadcast response.
recvfrom 10.15.89.60  (A PPC node)
ch->comm_type: 0x8000000
ch->comm_payload_size: 704905216

ccsd was trying to malloc 704905216 bytes (about 672 MiB) every time it got a
response from a PPC node during the broadcast phase.

The values above are still in big-endian byte order.  process_broadcast() called
swab_header() only after it had already copied the header into the buffer it
was going to send, so the header went out on the wire unswapped.
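
As a quick check on the header values above: 704905216 is 0x2A040000, which
byte-swaps to 0x0000042A (1066), a far more plausible payload size, and
0x8000000 byte-swaps to 0x8.  A minimal sketch of that 32-bit swap follows;
swab32 is an illustrative helper name used here, not a function taken from the
ccsd source.

```c
#include <stdint.h>

/* Reverse the byte order of a 32-bit value.
 * Illustrative helper only; the name and signature are assumptions,
 * not the ccsd implementation. */
uint32_t swab32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

With this helper, swab32(704905216u) evaluates to 1066 and swab32(0x8000000u)
evaluates to 8.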

The following patch (against CVS HEAD) allows big-endian and little-endian
clusters to coexist on the same network.

Index: cnx_mgr.c
===================================================================
RCS file: /cvs/cluster/cluster/ccs/daemon/cnx_mgr.c,v
retrieving revision 1.43
diff -u -b -B -r1.43 cnx_mgr.c
--- cnx_mgr.c   8 May 2007 14:25:41 -0000      1.43
+++ cnx_mgr.c   10 Dec 2007 20:17:59 -0000
@@ -1365,12 +1365,13 @@
     ch->comm_flags |= COMM_BROADCAST_FROM_QUORATE;
   }
 
+  swab_header(ch);
   memcpy(buffer, ch, sizeof(comm_header_t));
+  swab_header(ch);  /* Swab back to dip into ch for payload_size */
   memcpy(buffer+sizeof(comm_header_t), payload, ch->comm_payload_size);
 
   log_dbg("Sending cluster.conf (version %d)...\n", get_doc_version(master_doc->od_doc));
   sendlen = ch->comm_payload_size + sizeof(comm_header_t);
-  swab_header(ch);
   if(sendto(sfd, buffer, sendlen, 0,
            (struct sockaddr *)&addr, (socklen_t)len) < 0){
     log_sys_err("Sendto failed");
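
The ordering the patch establishes (swap the header, copy it into the send
buffer, swap it back, then use the host-order payload size) can be sketched in
isolation.  This is only an illustration under stated assumptions:
demo_header_t, demo_swab32, demo_swab_header and demo_pack are hypothetical
stand-ins, the two-field layout is not the real comm_header_t, and the swap
here is unconditional, whereas the real swab_header() presumably swaps only on
hosts whose byte order differs from the wire order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for comm_header_t; the real layout differs. */
typedef struct {
    uint32_t comm_type;
    uint32_t comm_payload_size;
} demo_header_t;

/* Reverse the byte order of a 32-bit value. */
uint32_t demo_swab32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

/* Swap every field of the header in place (unconditional in this sketch). */
void demo_swab_header(demo_header_t *h)
{
    h->comm_type = demo_swab32(h->comm_type);
    h->comm_payload_size = demo_swab32(h->comm_payload_size);
}

/* Pack header and payload into buf.  The header is copied while swapped,
 * then swapped back so comm_payload_size is in host order when it is used
 * as the payload length.  Returns the number of bytes written, or 0 if buf
 * is too small. */
size_t demo_pack(unsigned char *buf, size_t buflen,
                 demo_header_t *h, const void *payload)
{
    size_t total = sizeof(*h) + h->comm_payload_size;

    if (total > buflen)
        return 0;

    demo_swab_header(h);                 /* to wire order */
    memcpy(buf, h, sizeof(*h));
    demo_swab_header(h);                 /* back to host order */
    memcpy(buf + sizeof(*h), payload, h->comm_payload_size);
    return total;
}
```

After demo_pack() returns, the caller's header is back in host order, while
the copy in buf carries the swapped fields followed by the untouched payload.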



Version-Release number of selected component (if applicable):

cman-2.0.73-1.el5, but this was introduced in revision 1.37 of cnx_mgr.c


How reproducible:
Every time


Steps to Reproduce:
1. Start a quorate PPC cluster.
2. Attempt to start a little-endian cluster on the same subnet.
   ccsd will either slow your nodes enough that cman_tool join fails,
   or you will see ccsd allocate a large amount of memory, which it frees
   once cman_tool join succeeds.

  
Actual results:

On my cluster, cman_tool join fails because the nodes become too sluggish.
On clusters with more memory, the join succeeds.
Comment 1 Ryan O'Hara 2007-12-11 15:58:35 EST
Fixed.
Comment 6 errata-xmlrpc 2008-05-21 11:58:36 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html