418741 – Mixed endian clusters on same subnet can cause ccsd to consume 90+% memory and cpu during startup.

Bug 418741 - Mixed endian clusters on same subnet can cause ccsd to consume 90+% memory and cpu during startup.

Summary: Mixed endian clusters on same subnet can cause ccsd to consume 90+% memory an...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	cman
Sub Component:
Version:	5.1
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	---
Assignee:	Ryan O'Hara
QA Contact:	GFS Bugs
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	418961 428325 Cluster5-ppc
TreeView+	depends on / blocked

Reported:	2007-12-10 20:24 UTC by Dean Jansa
Modified:	2009-04-16 22:19 UTC (History)
CC List:	1 user (show)
Fixed In Version:	RHBA-2008-0347
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-21 15:58:36 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0347	0	normal	SHIPPED_LIVE	cman bug fix and enhancement update	2008-05-20 12:39:41 UTC

Description Dean Jansa 2007-12-10 20:24:07 UTC

Description of problem:
With a quorate PPC cluster on the same subnet as my 3 node x86_64 cluster I was
unable to get the 3 node x86_64 cluster to form.  cman_tool join would fail due
to a timeout and ccsd was spinning and eating all the memory it could.

Tracked this down to broadcast message from the ppc with the following comm header:
[cnx_mgr.c:327] Checking broadcast response.
recvfrom 10.15.89.60  (A PPC node)
ch->comm_type: 0x8000000
ch->comm_payload_size: 704905216

ccsd was trying to malloc 704905216 bytes every time it got a response from a
ppc node during the broadcast phase.

The values above are still in big endian byte order, process_broadcast() did not
properly call swab_header() prior to generating the buffer it was going to send
on the wire.

The following patch (against CVS HEAD) allows big/little endian clusters on the
same network.  

Index: cnx_mgr.c
===================================================================
RCS file: /cvs/cluster/cluster/ccs/daemon/cnx_mgr.c,v
retrieving revision 1.43
diff -u -b -B -r1.43 cnx_mgr.c
--- cnx_mgr.c   8 May 2007 14:25:41 -0000      1.43
+++ cnx_mgr.c   10 Dec 2007 20:17:59 -0000
@@ -1365,12 +1365,13 @@
     ch->comm_flags |= COMM_BROADCAST_FROM_QUORATE;
   }
 
+  swab_header(ch);
   memcpy(buffer, ch, sizeof(comm_header_t));
+  swab_header(ch);  /* Swab back to dip into ch for payload_size */
   memcpy(buffer+sizeof(comm_header_t), payload, ch->comm_payload_size);
 
   log_dbg("Sending cluster.conf (version %d)...\n",
get_doc_version(master_doc->od_doc));
   sendlen = ch->comm_payload_size + sizeof(comm_header_t);
-  swab_header(ch);
   if(sendto(sfd, buffer, sendlen, 0,
            (struct sockaddr *)&addr, (socklen_t)len) < 0){
     log_sys_err("Sendto failed");



Version-Release number of selected component (if applicable):

cman-2.0.73-1.el5, but this was introduced in revision 1.37 of cnx_mgr.c


How reproducible:
Every time


Steps to Reproduce:
1. Start quorate ppc cluster
2. Attempt to start little endian cluster on same subnet.
   ccsd will either slow your nodes so cman_tool join fails,
   or you will see ccsd alloc a ton of mem, which it does free after the
   cman_tool join works.

  
Actual results:

On my cluster -- cman_tool join fails due to nodes becoming so sluggish.
On clusters with more mem, the join will work.

Comment 1 Ryan O'Hara 2007-12-11 20:58:35 UTC

Fixed.

Comment 6 errata-xmlrpc 2008-05-21 15:58:36 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0347.html

Note You need to log in before you can comment on or make changes to this bug.