Bug 250085

Summary: failed assertion in rg_thread.c
Product: [Retired] Red Hat Cluster Suite
Component: rgmanager
Version: 4
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Reporter: Lon Hohberger <lhh>
Assignee: Lon Hohberger <lhh>
QA Contact: Cluster QE <mspqa-list>
CC: cluster-maint
Fixed In Version: RHBA-2007-0149
Doc Type: Bug Fix
Last Closed: 2007-07-30 13:43:38 UTC

Description Lon Hohberger 2007-07-30 13:43:02 UTC
When a customer fails over services at random, the node will panic with the
following error: failed assertion in rg_thread.c

The customer identified this as matching an existing bugzilla (181539), which
recommended installing rgmanager-1.9.53. They were still able to reproduce the
issue on that version as well as on the latest: rgmanager-1.9.54-1

They've also provided a core, which they've analyzed; their comments are below:

I attached a core dump from clurgmgrd from rgmanager-1.9.54-1.  When I
do a "gdb -c core.19334 /usr/lib/debug/usr/sbin/clurgmgrd.debug" I still get the
same results. 
 
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at rg_thread.c:118 
118     rg_thread.c: No such file or directory. 
        in rg_thread.c 

Not sure if this is the exact matching bugzilla; however, their core is attached.
An updated sysreport will be provided once it is received back from the customer.

(gdb) bt
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at
rg_thread.c:118
#1  0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
#2  0x004b5371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x001d8ffe in phys_pages_info () from /lib/tls/libc.so.6
Previous frame inner to this frame (corrupt stack?)
(gdb) frame 0
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at
rg_thread.c:118
118     in rg_thread.c
(gdb) list
113     in rg_thread.c
(gdb) print list
$1 = (request_t **) 0xee39d3ec
(gdb) print sizeof(list)
$2 = 4
(gdb) print *0xee39d3ec
$3 = 0
(gdb) print *list
$6 = (request_t *) 0x0
(gdb) print *list
$6 = (request_t *) 0x0
(gdb) bt
#0  0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at
rg_thread.c:118
#1  0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
#2  0x004b5371 in start_thread () from /lib/tls/libpthread.so.0
#3  0x001d8ffe in phys_pages_info () from /lib/tls/libc.so.6
(gdb) frame 1
#1  0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
331     in rg_thread.c
(gdb) print myname
$7 = "oraDmxp", '\\0' <repeats 248 times>
(gdb) print my_queue
$8 = (request_t *) 0x0
(gdb) print my_queue_mutex
$9 = {__m_reserved = 1, __m_count = 0, __m_owner = 0x6174, __m_kind = 0,
__m_lock = {__status = 1, __spinlock = 0}}
(gdb) print *my_queue
$10 = {_list_head = {le_next = 0x0, le_prev = 0x0}, rr_group = '\0'
<repeats 63 times>, rr_request = 0, rr_errorcode = 0, 
  rr_orig_request = 0, rr_resp_fd = 0, rr_target = 0, rr_arg0 = 0, rr_arg1
= 0, rr_line = 0, _pad_ = 0, rr_file = 0x0, rr_when = 0}
(gdb) print &my_queue
$11 = (request_t **) 0xee39d3ec
(gdb) print my_queue->rr_request
$12 = 0

So it looks like my_queue was null, but it keeled over in this area:

static void
purge_status_checks(request_t **list)
{
        request_t *curr;

        if (!list)
                return;

        list_do(list, curr) {
                if (curr->rr_request != RG_STATUS)   <-- right here
                        continue;

but list_do is

#define list_do(list, curr) \
        if (*list && (curr = *list)) do

so we check to see if *list is null, and it wasn't, which would make me
think it's getting trampled over somewhere else, but there's no place
where the mutex is not held when accessing the queue, so I'm not quite
sure how this happened. 
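
For illustration only, here is a minimal, self-contained sketch of why the
one-time check in list_do does not protect the loop body. This is not the
rgmanager source: the request_t layout, the list_done closer, and the removal
code below are simplified assumptions, but the failure mode (the loop
advancing through a node that was freed inside the body) matches the frame 0
crash above.

#include <stdio.h>
#include <stdlib.h>

#define RG_STATUS 1                  /* placeholder value, not the real constant */

typedef struct request {             /* simplified stand-in for request_t */
        struct request *next;
        int rr_request;
} request_t;

/* list_do as quoted above only checks *list once, on entry. */
#define list_do(list, curr) \
        if (*(list) && ((curr) = *(list))) do
/* hypothetical closer, just so the sketch compiles */
#define list_done(list, curr) \
        while (((curr) = (curr)->next) != NULL)

static void
purge_status_checks_sketch(request_t **list)
{
        request_t *curr;

        if (!list)
                return;

        list_do(list, curr) {
                if (curr->rr_request != RG_STATUS)
                        continue;
                /* DELIBERATE BUG ("incorrect loop semantics"): freeing curr
                 * here means list_done() advances through freed memory. */
                free(curr);
        } list_done(list, curr);
}

int
main(void)
{
        request_t *b = calloc(1, sizeof(*b));   /* ordinary request */
        request_t *a = calloc(1, sizeof(*a));   /* status check at the head */

        a->rr_request = RG_STATUS;
        a->next = b;

        purge_status_checks_sketch(&a);   /* valgrind/ASan flag the use-after-free */
        printf("done\n");
        free(b);
        return 0;
}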

-- Additional comment from lhh on 2006-10-31 17:09 EST --
Created an attachment (id=139908)
Fixes segmentation fault due to incorrect loop semantics
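
The attached patch itself is not reproduced here, but a hedged sketch of what
corrected loop semantics look like, reusing the simplified request_t from the
sketch above (so again an assumption, not the actual fix): capture the
successor and unlink the node before freeing it, so the loop never advances
through freed memory.

static void
purge_status_checks_fixed(request_t **list)
{
        request_t *curr, *next, *prev = NULL;

        if (!list)
                return;

        for (curr = *list; curr; curr = next) {
                next = curr->next;              /* capture before any free() */
                if (curr->rr_request != RG_STATUS) {
                        prev = curr;
                        continue;
                }
                if (prev)                       /* unlink curr, then free it */
                        prev->next = next;
                else
                        *list = next;
                free(curr);
        }
}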


-- Additional comment from [] on 2006-11-03 11:30 EST --
lon,

customer verified that the patch works.

Comment 1 Lon Hohberger 2007-07-30 13:59:47 UTC
Depending on timing, this can cause an assertion failure or a crash.
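
Purely as an illustration of that timing dependence (the function below is
hypothetical, not taken from rg_thread.c, and reuses the simplified request_t
from the sketches above): whether the stale queue pointer hits an assertion
or a dereference first decides which symptom shows up.

#include <assert.h>

static int
check_request(request_t *req)
{
        /* If the corrupted queue pointer reaches a check like this one,
         * the assert fires: "failed assertion in rg_thread.c". */
        assert(req != NULL);
        /* If it reaches a dereference first (or the build defines NDEBUG),
         * the same pointer produces a plain segfault, as in the core above. */
        return req->rr_request;
}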