250085 – failed assertion in rg_thread.c

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Cluster Suite
Classification:	Retired
Component:	rgmanager
Sub Component:
Version:	4
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Lon Hohberger
QA Contact:	Cluster QE
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-07-30 13:43 UTC by Lon Hohberger
Modified:	2009-04-16 20:22 UTC
Fixed In Version:	RHBA-2007-0149
Clone Of:
Environment:
Last Closed:	2007-07-30 13:43:38 UTC
Embargoed:

Description Lon Hohberger 2007-07-30 13:43:02 UTC

When a customer fails over services at random, the node will panic with the
following error: failed assertion in rg_thread.c

The customer identified this as matching an existing Bugzilla entry (181539),
which recommended installing rgmanager-1.9.53. They were still able to
reproduce the issue on that version as well as on the latest, rgmanager-1.9.54-1.

They have also provided a core, which they have analyzed; their comments are below:

I attached a core dump from the clurgmgrd from the rgmanager-1.9.54-1.  When I
do a "gdb -c core.19334 /usr/lib/debug/usr/sbin/clurgmgrd.debug" I still get the
same results. 
 
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at rg_thread.c:118 
118     rg_thread.c: No such file or directory. 
        in rg_thread.c 

Not sure whether this exactly matches that Bugzilla entry; however, their core
is attached, and an updated sysreport will be provided once it is received from
the customer.

(gdb) bt
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at
rg_thread.c:118
#1  0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
#2  0x004b5371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x001d8ffe in phys_pages_info () from /lib/tls/libc.so.6
Previous frame inner to this frame (corrupt stack?)
(gdb) frame 0
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at
rg_thread.c:118
118     in rg_thread.c
(gdb) list
113     in rg_thread.c
(gdb) print list
$1 = (request_t **) 0xee39d3ec
(gdb) print sizeof(list)
$2 = 4
(gdb) print *0xee39d3ec
$3 = 0
(gdb) print *list
$6 = (request_t *) 0x0
(gdb) bt
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at
rg_thread.c:118
#1  0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
#2  0x004b5371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x001d8ffe in phys_pages_info () from /lib/tls/libc.so.6
(gdb) frame 1
#1  0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
331     in rg_thread.c
(gdb) print myname
$7 = "oraDmxp", '\0' <repeats 248 times>
(gdb) print my_queue
$8 = (request_t *) 0x0
(gdb) print my_queue_mutex
$9 = {__m_reserved = 1, __m_count = 0, __m_owner = 0x6174, __m_kind = 0,
__m_lock = {__status = 1, __spinlock = 0}}
(gdb) print *my_queue
$10 = {_list_head = {le_next = 0x0, le_prev = 0x0}, rr_group = '\0'
<repeats 63 times>, rr_request = 0, rr_errorcode = 0, 
  rr_orig_request = 0, rr_resp_fd = 0, rr_target = 0, rr_arg0 = 0, rr_arg1
= 0, rr_line = 0, _pad_ = 0, rr_file = 0x0, rr_when = 0}
(gdb) print &my_queue
$11 = (request_t **) 0xee39d3ec
(gdb) print my_queue->rr_request
$12 = 0

So it looks like my_queue was NULL, but it crashed in this area:

static void
purge_status_checks(request_t **list)
{
        request_t *curr;

        if (!list)
                return;

        list_do(list, curr) {
                if (curr->rr_request != RG_STATUS)   <-- right here
                        continue;

but list_do is

#define list_do(list, curr) \
        if (*list && (curr = *list)) do

so we check to see if *list is null, and it wasn't, which would make me
think it's getting trampled over somewhere else, but there's no place
where the mutex is not held when accessing the queue, so I'm not quite
sure how this happened.
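
The patch itself is not reproduced in this report, so the following is only a sketch of one way a purge loop over a circular list can go wrong and be made safe. It is not the rgmanager code. If entries are removed while the list is being walked, the head pointer can become NULL in the middle of the traversal, and any later dereference of the iterator crashes even though the head was non-NULL when the loop started. The names req_t, req_push, req_unlink, req_count and purge_status, and the values of RG_STATUS and RG_START, are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RG_STATUS 1   /* hypothetical value, for illustration only */
#define RG_START  2   /* hypothetical value, for illustration only */

/* Hypothetical stand-in for request_t: a node on a circular list. */
typedef struct req {
    struct req *next, *prev;
    int rr_request;
} req_t;

/* Append a request at the tail of the circular list; 0 on success. */
static int req_push(req_t **list, int request)
{
    req_t *n = malloc(sizeof(*n));
    if (!n)
        return -1;
    n->rr_request = request;
    if (!*list) {
        n->next = n->prev = n;
        *list = n;
    } else {
        n->next = *list;
        n->prev = (*list)->prev;
        (*list)->prev->next = n;
        (*list)->prev = n;
    }
    return 0;
}

/* Unlink n; *list becomes NULL when the last node is removed. */
static void req_unlink(req_t **list, req_t *n)
{
    if (n->next == n) {
        *list = NULL;
    } else {
        n->prev->next = n->next;
        n->next->prev = n->prev;
        if (*list == n)
            *list = n->next;
    }
}

/* Number of nodes currently on the list. */
static size_t req_count(const req_t *list)
{
    size_t n = 0;
    const req_t *c = list;
    if (!list)
        return 0;
    do { n++; c = c->next; } while (c != list);
    return n;
}

/* Remove every RG_STATUS request; return how many were removed. */
static int purge_status(req_t **list)
{
    int removed = 0;
    size_t n, i;
    req_t *curr, *next;

    if (!list || !*list)
        return 0;

    /* Bound the walk by the initial length, so termination does not
     * depend on comparing against a head pointer that may move. */
    n = req_count(*list);
    curr = *list;
    for (i = 0; i < n; i++) {
        next = curr->next;        /* saved before curr may be freed */
        if (curr->rr_request == RG_STATUS) {
            req_unlink(list, curr);
            free(curr);
            removed++;
        }
        curr = next;              /* not dereferenced after the last pass */
    }
    return removed;
}
```

With a queue holding only status requests, purge_status leaves the head pointer NULL and returns without touching freed memory; a caller would then need to re-check the head before using it.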

-- Additional comment from lhh on 2006-10-31 17:09 EST --
Created an attachment (id=139908)
Fixes segmentation fault due to incorrect loop semantics


-- Additional comment from [] on 2006-11-03 11:30 EST --
Lon,

The customer verified that the patch works.

Comment 1 Lon Hohberger 2007-07-30 13:59:47 UTC

Depending on timing, this can cause an assertion failure or a crash.
