Bug 418741
| Summary: | Mixed endian clusters on same subnet can cause ccsd to consume 90+% memory and cpu during startup. | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Dean Jansa <djansa> |
| Component: | cman | Assignee: | Ryan O'Hara <rohara> |
| Status: | CLOSED ERRATA | QA Contact: | GFS Bugs <gfs-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | ||
| Version: | 5.1 | CC: | cluster-maint |
| Target Milestone: | rc | Keywords: | ZStream |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | RHBA-2008-0347 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2008-05-21 15:58:36 UTC | Type: | --- |
| Bug Blocks: | 418961, 428325, 429587 | ||
Fixed. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0347.html
Description of problem:

With a quorate PPC cluster on the same subnet as my 3 node x86_64 cluster, I was unable to get the 3 node x86_64 cluster to form. cman_tool join would fail due to a timeout, and ccsd was spinning and eating all the memory it could.

I tracked this down to a broadcast message from the ppc with the following comm header:

    [cnx_mgr.c:327] Checking broadcast response.
    recvfrom 10.15.89.60 (A PPC node)
    ch->comm_type: 0x8000000
    ch->comm_payload_size: 704905216

ccsd was trying to malloc 704905216 bytes every time it got a response from a ppc node during the broadcast phase.

The values above are still in big endian byte order: process_broadcast() did not call swab_header() before generating the buffer it was going to send on the wire.

The following patch (against CVS HEAD) allows big/little endian clusters on the same network.

    Index: cnx_mgr.c
    ===================================================================
    RCS file: /cvs/cluster/cluster/ccs/daemon/cnx_mgr.c,v
    retrieving revision 1.43
    diff -u -b -B -r1.43 cnx_mgr.c
    --- cnx_mgr.c 8 May 2007 14:25:41 -0000 1.43
    +++ cnx_mgr.c 10 Dec 2007 20:17:59 -0000
    @@ -1365,12 +1365,13 @@
         ch->comm_flags |= COMM_BROADCAST_FROM_QUORATE;
       }
     
    +  swab_header(ch);
       memcpy(buffer, ch, sizeof(comm_header_t));
    +  swab_header(ch); /* Swab back to dip into ch for payload_size */
       memcpy(buffer+sizeof(comm_header_t), payload, ch->comm_payload_size);
     
       log_dbg("Sending cluster.conf (version %d)...\n", get_doc_version(master_doc->od_doc));
       sendlen = ch->comm_payload_size + sizeof(comm_header_t);
    -  swab_header(ch);
     
       if(sendto(sfd, buffer, sendlen, 0, (struct sockaddr *)&addr, (socklen_t)len) < 0){
         log_sys_err("Sendto failed");

Version-Release number of selected component (if applicable):
cman-2.0.73-1.el5, but this was introduced in revision 1.37 of cnx_mgr.c

How reproducible:
Every time

Steps to Reproduce:
1. Start quorate ppc cluster
2. Attempt to start little endian cluster on same subnet.
ccsd will either slow your nodes enough that cman_tool join fails, or you will see ccsd allocate a large amount of memory, which it frees once cman_tool join succeeds.

Actual results:
On my cluster, cman_tool join fails because the nodes become so sluggish. On clusters with more memory, the join will work.
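
As a sanity check on the header values quoted in the description, here is a minimal C sketch that byte-swaps them and mirrors the swab, copy, swab-back ordering used in the patch. It assumes both header fields are 32-bit. demo_header_t, swab32(), demo_swab_header() and demo_pack() are hypothetical stand-ins, not the real ccsd definitions, and the demo swaps unconditionally, whereas the real swab_header() may only swap on big endian hosts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-field stand-in; the real comm_header_t has more fields. */
typedef struct {
    uint32_t comm_type;
    uint32_t comm_payload_size;
} demo_header_t;

/* Reverse the byte order of a 32-bit value. */
static uint32_t swab32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

/* Assumption: swaps every field unconditionally, to emulate a sender
 * whose host byte order differs from the wire byte order. */
static void demo_swab_header(demo_header_t *h)
{
    h->comm_type = swab32(h->comm_type);
    h->comm_payload_size = swab32(h->comm_payload_size);
}

/* Same ordering as the patch: swab the header, copy it into the wire
 * buffer, then swab it back so comm_payload_size can be read in host
 * order for the payload copy and the send length. */
static size_t demo_pack(unsigned char *buffer, size_t buflen,
                        demo_header_t *ch, const void *payload)
{
    size_t sendlen = sizeof(*ch) + ch->comm_payload_size;

    assert(buflen >= sendlen);

    demo_swab_header(ch);
    memcpy(buffer, ch, sizeof(*ch));
    demo_swab_header(ch); /* back to host order */
    memcpy(buffer + sizeof(*ch), payload, ch->comm_payload_size);

    return sendlen;
}
```

Under those assumptions, swab32(704905216) is 1066 and swab32(0x8000000) is 8, so the 704905216 byte allocation is consistent with a small header field that was read in the wrong byte order.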