+++ This bug was initially created as a clone of Bug #418741 +++

Description of problem:
With a quorate PPC cluster on the same subnet as my 3 node x86_64 cluster I was unable to get the 3 node x86_64 cluster to form. cman_tool join would fail due to a timeout and ccsd was spinning and eating all the memory it could.

Tracked this down to broadcast message from the ppc with the following comm header:

[cnx_mgr.c:327] Checking broadcast response.
recvfrom 10.15.89.60 (A PPC node)
ch->comm_type: 0x8000000
ch->comm_payload_size: 704905216

ccsd was trying to malloc 704905216 bytes every time it got a response from a ppc node during the broadcast phase. The values above are still in big endian byte order, process_broadcast() did not properly call swab_header() prior to generating the buffer it was going to send on the wire.

The following patch (against CVS HEAD) allows big/little endian clusters on the same network.

Index: cnx_mgr.c
===================================================================
RCS file: /cvs/cluster/cluster/ccs/daemon/cnx_mgr.c,v
retrieving revision 1.43
diff -u -b -B -r1.43 cnx_mgr.c
--- cnx_mgr.c 8 May 2007 14:25:41 -0000 1.43
+++ cnx_mgr.c 10 Dec 2007 20:17:59 -0000
@@ -1365,12 +1365,13 @@
     ch->comm_flags |= COMM_BROADCAST_FROM_QUORATE;
   }
 
+  swab_header(ch);
   memcpy(buffer, ch, sizeof(comm_header_t));
+  swab_header(ch); /* Swab back to dip into ch for payload_size */
   memcpy(buffer+sizeof(comm_header_t), payload, ch->comm_payload_size);
   log_dbg("Sending cluster.conf (version %d)...\n",
           get_doc_version(master_doc->od_doc));
   sendlen = ch->comm_payload_size + sizeof(comm_header_t);
-  swab_header(ch);
   if(sendto(sfd, buffer, sendlen, 0,
             (struct sockaddr *)&addr, (socklen_t)len) < 0){
     log_sys_err("Sendto failed");

Version-Release number of selected component (if applicable):
cman-2.0.73-1.el5, but this was introduced in revision 1.37 of cnx_mgr.c

How reproducible:
Every time

Steps to Reproduce:
1. Start quorate ppc cluster
2. Attempt to start little endian cluster on same subnet. ccsd will either slow your nodes so cman_tool join fails, or you will see ccsd alloc a ton of mem, which it does free after the cman_tool join works.

Actual results:
On my cluster -- cman_tool join fails due to nodes becoming so sluggish. On clusters with more mem, the join will work.
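
For anyone checking the numbers in the description, here is a minimal standalone sketch of the byte-order problem and of the call ordering the patch uses. It is not the ccsd code: demo_header_t, swap32(), demo_swab_header() and demo_pack() are hypothetical names made up for illustration, the 32-bit field widths are an assumption, and the stand-in swaps unconditionally because the report does not say which direction the real swab_header() converts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ccsd's comm_header_t. Only the two fields
 * quoted in the report are modelled; their 32-bit width is assumed. */
typedef struct {
    uint32_t comm_type;
    uint32_t comm_payload_size;
} demo_header_t;

/* Reverse the byte order of a 32-bit value. */
static uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Stand-in for swab_header(). It always swaps, to model a host whose
 * byte order differs from the wire order. */
static void demo_swab_header(demo_header_t *h)
{
    h->comm_type = swap32(h->comm_type);
    h->comm_payload_size = swap32(h->comm_payload_size);
}

/* Pack header and payload in the order the patch uses:
 * swap to wire order, copy the header, swap back, and only then read
 * comm_payload_size (now in host order again) for the payload copy.
 * The caller must supply a buffer large enough for both parts.
 * Returns the number of bytes written. */
static size_t demo_pack(unsigned char *buffer, demo_header_t *ch,
                        const void *payload)
{
    demo_swab_header(ch);
    memcpy(buffer, ch, sizeof(*ch));
    demo_swab_header(ch);
    memcpy(buffer + sizeof(*ch), payload, ch->comm_payload_size);
    return sizeof(*ch) + ch->comm_payload_size;
}

/* Self-check against the values logged in the description. */
static void demo_check_logged_values(void)
{
    assert(swap32(704905216u) == 1066u); /* 0x2A040000 -> 0x0000042A */
    assert(swap32(0x8000000u) == 8u);    /* 0x08000000 -> 0x00000008 */
}
```

Byte-swapping the logged values gives a comm_payload_size of 1066 bytes and a comm_type of 8, which is consistent with the header having gone out without being swabbed. The sketch also shows why the second swab is needed: if the header were left in wire order, the payload memcpy() and the sendlen calculation would read the swapped size.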
Fixed.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0795.html