Bug 494977

Summary: segfault in check_rdomain_crash() during failover
Product: Red Hat Enterprise Linux 5
Component: rgmanager
Version: 5.3
Reporter: Lon Hohberger <lhh>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint, cward, djansa, dsulliva, edamato, tao
Status: CLOSED ERRATA
Severity: high
Priority: low
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-09-02 11:05:06 UTC
Attachments:
  Patch #1 as attachment. (flags: none)
  Patch #2 as attachment. (flags: none)

Description Lon Hohberger 2009-04-08 21:11:57 UTC
Description of problem: rgmanager-

When rgmanager is used with a restricted failover domain and some of the domain's nodes are offline during a failover event, rgmanager can crash with signal 11 (segmentation fault).

Program terminated with signal 11, Segmentation fault.
#0  0x000000000042c1b1 in s_intersection (left=0xff44020, ll=3, right=0x0, 
    rl=1, ret=0x41dd5eb8, retl=0x41dd5eb4) at sets.c:135
135				if (left[l] != right[r])
(gdb) p right
$1 = (set_type_t *) 0x0
(gdb) p r
$2 = 0
(gdb) p rl
$3 = 1
(gdb) list check_rdomain_crash
427	}
428	
429	
430	int
431	check_rdomain_crash(char *svcName)
432	{
433		int *nodes = NULL, nodecount;
434		int *fd_nodes = NULL, fd_nodecount, fl;
435		int *isect = NULL, icount;
436		char fd_name[256];
(gdb) bt
#0  0x000000000042c1b1 in s_intersection (left=0xff44020, ll=3, right=0x0, 
    rl=1, ret=0x41dd5eb8, retl=0x41dd5eb4) at sets.c:135
#1  0x000000000040d31e in check_rdomain_crash (
    svcName=0x41dd6030 "service:mail3.epbfi.com") at groups.c:448
#2  0x000000000040d799 in consider_start (node=0xff3a2f0, 
    svcName=0x41dd6030 "service:mail3.epbfi.com", svcStatus=0x41dd5fd0, 
    membership=0xff426f0) at groups.c:585
#3  0x000000000040dd24 in eval_groups (local=1, nodeid=1, nodeStatus=1)
    at groups.c:765
#4  0x0000000000419b0e in node_event (local=1, nodeID=1, nodeStatus=1, clean=1)
    at rg_event.c:130
#5  0x000000000041a54f in _event_thread_f (arg=0x0) at rg_event.c:489
#6  0x00000039e3c06367 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb) up
#1  0x000000000040d31e in check_rdomain_crash (
    svcName=0x41dd6030 "service:mail3.epbfi.com") at groups.c:448
448		if (s_intersection(fd_nodes, fd_nodecount, nodes, nodecount, 
(gdb) list
443			goto out_free;
444	
445		if (!(fl & FOD_RESTRICTED))
446			goto out_free;
447		
448		if (s_intersection(fd_nodes, fd_nodecount, nodes, nodecount, 
449			    &isect, &icount) < 0)
450			goto out_free;
451	
452		if (icount == 0) {
(gdb) p nodes
$4 = (int *) 0x0
(gdb) p nodecount
$5 = 1

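This happens either when malloc() fails or when rgmanager's check_rdomain_crash() function does not correctly form the nodes/nodecount values. In the core above, nodes is NULL while nodecount is 1, so s_intersection() dereferences a NULL pointer at sets.c:135.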
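
To illustrate the invariant that was violated (a NULL array paired with a non-zero count), here is a minimal sketch of a caller-side helper that keeps the pointer and the count in sync. It is not rgmanager code: build_online_nodes() and its parameters are hypothetical names, and plain int is assumed in place of set_type_t.

```c
#include <stdlib.h>

/*
 * Hypothetical helper, not taken from rgmanager: copy the IDs of the
 * online members into a newly allocated array.
 *
 * On every failure path *out is NULL and *count is 0, so a caller can
 * never end up with a NULL array and a non-zero count.
 */
static int
build_online_nodes(const int *ids, const int *online, int total,
		   int **out, int *count)
{
	int i, n = 0;
	int *list;

	if (!out || !count)
		return -1;

	/* Start from a consistent "empty" state. */
	*out = NULL;
	*count = 0;

	if (!ids || !online || total <= 0)
		return -1;

	list = malloc(sizeof(*list) * (size_t)total);
	if (!list)
		return -1;	/* *out/*count are still NULL/0 */

	for (i = 0; i < total; i++) {
		if (online[i])
			list[n++] = ids[i];
	}

	if (n == 0) {
		/* No online nodes: report an empty list, not an error. */
		free(list);
		return 0;
	}

	/* Publish the array and its length together. */
	*out = list;
	*count = n;
	return 0;
}
```

With a helper shaped like this, the state seen in the core (nodes == NULL, nodecount == 1) cannot be produced, because the count is only written once the array exists.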

There are two patches which are really necessary:

Patch 1 fixes a log message and also corrects the wrong nodes/nodecount values (on their own, the wrong values cause no real problem other than erroneous log messages):

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=587a851e9b8d13e36c17b44b607d1b7fdd2e4840

Patch 2 fixes the segfault in all cases (even if malloc() fails):

http://git.fedorahosted.org/git/?p=cluster.git;a=commit;h=d5dba53a3bc5c20629b449bee5e3b0be4c71b538
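
The actual change is in the commit linked above. As a rough illustration only of the kind of defensive check that avoids the crash at sets.c:135, the sketch below shows an intersection routine that refuses to dereference a NULL array. safe_intersection() is a hypothetical stand-in for s_intersection(), it uses plain int rather than set_type_t, and it should not be read as the contents of Patch 2.

```c
#include <stdlib.h>

/*
 * Hypothetical stand-in for s_intersection(); NOT the actual patch.
 * Returns 0 on success and -1 on bad input or allocation failure.
 * On success, *ret holds *retl matching elements (NULL when empty)
 * and must be released with free().
 */
static int
safe_intersection(const int *left, int ll, const int *right, int rl,
		  int **ret, int *retl)
{
	int l, r, n = 0;
	int *out;

	if (!ret || !retl)
		return -1;
	*ret = NULL;
	*retl = 0;

	/* A NULL array with a positive count is the crash case: reject it. */
	if (ll < 0 || rl < 0)
		return -1;
	if ((ll > 0 && !left) || (rl > 0 && !right))
		return -1;

	/* Either side empty: the intersection is empty, which is not an error. */
	if (ll == 0 || rl == 0)
		return 0;

	/* Each element of left contributes at most one result entry. */
	out = malloc(sizeof(*out) * (size_t)ll);
	if (!out)
		return -1;

	for (l = 0; l < ll; l++) {
		for (r = 0; r < rl; r++) {
			if (left[l] == right[r]) {
				out[n++] = left[l];
				break;
			}
		}
	}

	if (n == 0) {
		free(out);
		return 0;
	}

	*ret = out;
	*retl = n;
	return 0;
}
```

Called with the arguments from the backtrace (right == NULL, rl == 1), a routine like this returns -1, which the existing `< 0` check in check_rdomain_crash() would route to out_free instead of crashing.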


Version-Release number of selected component (if applicable): rgmanager-2.0.46-1

How reproducible: Occasionally

Comment 1 Lon Hohberger 2009-04-08 21:18:25 UTC
Created attachment 338803 [details]
Patch #1 as attachment.

Comment 2 Lon Hohberger 2009-04-08 21:18:49 UTC
Created attachment 338804 [details]
Patch #2 as attachment.

Comment 3 Lon Hohberger 2009-04-08 21:22:03 UTC
You can also work around this bug by enabling central_processing in cluster.conf:

  <rm central_processing="1" ... >

Comment 7 Issue Tracker 2009-04-16 17:21:11 UTC
The associated IT has been closed.

Internal Status set to 'Resolved'
Status set to: Closed by Tech

This event sent from IssueTracker by jleddy 
 issue 284271

Comment 12 Chris Ward 2009-07-03 18:29:40 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 16 errata-xmlrpc 2009-09-02 11:05:06 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1339.html