Bug 250085

Summary: failed assertion in rg_thread.c
Product: [Retired] Red Hat Cluster Suite
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Lon Hohberger <lhh>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Fixed In Version: RHBA-2007-0149
Doc Type: Bug Fix
Last Closed: 2007-07-30 13:43:38 UTC

Description Lon Hohberger 2007-07-30 13:43:02 UTC
When a customer fails over services at random, the node will panic with the
following error: failed assertion in rg_thread.c

The customer identified this as matching an existing bugzilla (181539), which
recommended installing rgmanager-1.9.53. They were still able to reproduce the
issue on that version as well as on the latest: rgmanager-1.9.54-1

They've also provided a core, which they've analyzed; their comments are below:

I attached a core dump from clurgmgrd from rgmanager-1.9.54-1.  When I
do a "gdb -c core.19334 /usr/lib/debug/usr/sbin/clurgmgrd.debug" I still get the
same results. 
 
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at rg_thread.c:118 
118     rg_thread.c: No such file or directory. 
        in rg_thread.c 

Not sure if this is the exact matching bugzilla; however, their core is attached.
An updated sysreport will be provided once it is received back from the customer.

(gdb) bt
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at
rg_thread.c:118
#1  0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
#2  0x004b5371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x001d8ffe in phys_pages_info () from /lib/tls/libc.so.6
Previous frame inner to this frame (corrupt stack?)
(gdb) frame 0
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at
rg_thread.c:118
118     in rg_thread.c
(gdb) list
113     in rg_thread.c
(gdb) print list
$1 = (request_t **) 0xee39d3ec
(gdb) print sizeof(list)
$2 = 4
(gdb) print *0xee39d3ec
$3 = 0
(gdb) print *list
$6 = (request_t *) 0x0
(gdb) print *list
$6 = (request_t *) 0x0
(gdb) bt
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at
rg_thread.c:118
#1  0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
#2  0x004b5371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x001d8ffe in phys_pages_info () from /lib/tls/libc.so.6
(gdb) frame 1
#1  0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
331     in rg_thread.c
(gdb) print myname
$7 = "oraDmxp", '\\0' <repeats 248 times>
(gdb) print my_queue
$8 = (request_t *) 0x0
(gdb) print my_queue_mutex
$9 = {__m_reserved = 1, __m_count = 0, __m_owner = 0x6174, __m_kind = 0,
__m_lock = {__status = 1, __spinlock = 0}}
(gdb) print *my_queue
$10 = {_list_head = {le_next = 0x0, le_prev = 0x0}, rr_group = '\0'
<repeats 63 times>, rr_request = 0, rr_errorcode = 0, 
  rr_orig_request = 0, rr_resp_fd = 0, rr_target = 0, rr_arg0 = 0, rr_arg1
= 0, rr_line = 0, _pad_ = 0, rr_file = 0x0, rr_when = 0}
(gdb) print &my_queue
$11 = (request_t **) 0xee39d3ec
(gdb) print my_queue->rr_request
$12 = 0

So it looks like my_queue was null, but it keeled over in this area:

static void
purge_status_checks(request_t **list)
{
        request_t *curr;

        if (!list)
                return;

        list_do(list, curr) {
                if (curr->rr_request != RG_STATUS)   <-- right here
                        continue;

but list_do is

#define list_do(list, curr) \
        if (*list && (curr = *list)) do

so we check to see if *list is null, and it wasn't, which would make me
think it's getting trampled over somewhere else, but there's no place
where the mutex is not held when accessing the queue, so I'm not quite
sure how this happened. 
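
For illustration only, here is a minimal, self-contained sketch of why the
one-time check in list_do does not protect the loop body. This is not the
rgmanager source: the request_t layout, the list_done closer, and the removal
code below are simplified assumptions, but the failure mode (the loop
advancing through a node that was freed inside the body) matches the frame 0
crash above.

#include <stdio.h>
#include <stdlib.h>

#define RG_STATUS 1                  /* placeholder value, not the real constant */

typedef struct request {             /* simplified stand-in for request_t */
        struct request *next;
        int rr_request;
} request_t;

/* list_do as quoted above only checks *list once, on entry. */
#define list_do(list, curr) \
        if (*(list) && ((curr) = *(list))) do
/* hypothetical closer, just so the sketch compiles */
#define list_done(list, curr) \
        while (((curr) = (curr)->next) != NULL)

static void
purge_status_checks_sketch(request_t **list)
{
        request_t *curr;

        if (!list)
                return;

        list_do(list, curr) {
                if (curr->rr_request != RG_STATUS)
                        continue;
                /* DELIBERATE BUG ("incorrect loop semantics"): freeing curr
                 * here means list_done() advances through freed memory. */
                free(curr);
        } list_done(list, curr);
}

int
main(void)
{
        request_t *b = calloc(1, sizeof(*b));   /* ordinary request */
        request_t *a = calloc(1, sizeof(*a));   /* status check at the head */

        a->rr_request = RG_STATUS;
        a->next = b;

        purge_status_checks_sketch(&a);   /* valgrind/ASan flag the use-after-free */
        printf("done\n");
        free(b);
        return 0;
}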

-- Additional comment from lhh on 2006-10-31 17:09 EST --
Created an attachment (id=139908)
Fixes segmentation fault due to incorrect loop semantics
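
The attached patch itself is not reproduced here, but a hedged sketch of what
corrected loop semantics look like, reusing the simplified request_t from the
sketch above (so again an assumption, not the actual fix): capture the
successor and unlink the node before freeing it, so the loop never advances
through freed memory.

static void
purge_status_checks_fixed(request_t **list)
{
        request_t *curr, *next, *prev = NULL;

        if (!list)
                return;

        for (curr = *list; curr; curr = next) {
                next = curr->next;              /* capture before any free() */
                if (curr->rr_request != RG_STATUS) {
                        prev = curr;
                        continue;
                }
                if (prev)                       /* unlink curr, then free it */
                        prev->next = next;
                else
                        *list = next;
                free(curr);
        }
}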


-- Additional comment from [] on 2006-11-03 11:30 EST --
lon,

customer verified that the patch works.

Comment 1 Lon Hohberger 2007-07-30 13:59:47 UTC
Depending on timing, this can cause an assertion failure or a crash.
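
Purely as an illustration of that timing dependence (the function below is
hypothetical, not taken from rg_thread.c, and reuses the simplified request_t
from the sketches above): whether the stale queue pointer hits an assertion
or a dereference first decides which symptom shows up.

#include <assert.h>

static int
check_request(request_t *req)
{
        /* If the corrupted queue pointer reaches a check like this one,
         * the assert fires: "failed assertion in rg_thread.c". */
        assert(req != NULL);
        /* If it reaches a dereference first (or the build defines NDEBUG),
         * the same pointer produces a plain segfault, as in the core above. */
        return req->rr_request;
}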