When a customer fails over services randomly, the node will panic with the following error:

failed assertion in rg_thread.c

The customer has identified this as matching an existing bugzilla (181539), which recommended that rgmanager-1.9.53 be installed. They were still able to reproduce the issue on that version as well as on the latest, rgmanager-1.9.54-1.

They have also provided a core, which they have analyzed as well. Below are their comments:

I attached a core dump from the clurgmgrd from the rgmanager-1.9.54-1. When I do a "gdb -c core.19334 /usr/lib/debug/usr/sbin/clurgmgrd.debug" I still get the same results.

#0 0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at rg_thread.c:118
118 rg_thread.c: No such file or directory.
in rg_thread.c

Not sure if this is the exact matching bugzilla; however, their core is attached. I will provide an updated sysreport once it is received back from the customer.

(gdb) bt
#0 0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at rg_thread.c:118
#1 0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
#2 0x004b5371 in start_thread () from /lib/tls/libpthread.so.0
#3 0x001d8ffe in phys_pages_info () from /lib/tls/libc.so.6
Previous frame inner to this frame (corrupt stack?)
(gdb) frame 0
#0 0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at rg_thread.c:118
118 in rg_thread.c
(gdb) list
113 in rg_thread.c
(gdb) print list
$1 = (request_t **) 0xee39d3ec
(gdb) print sizeof(list)
$2 = 4
(gdb) print *0xee39d3ec
$3 = 0
(gdb) print *list
$6 = (request_t *) 0x0
(gdb) bt
#0 0x0804a9b0 in purge_status_checks (list=0xee39d3ec) at rg_thread.c:118
#1 0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
#2 0x004b5371 in start_thread () from /lib/tls/libpthread.so.0
#3 0x001d8ffe in phys_pages_info () from /lib/tls/libc.so.6
(gdb) frame 1
#1 0x0804b064 in resgroup_thread_main (arg=0x8f137a0) at rg_thread.c:331
331 in rg_thread.c
(gdb) print myname
$7 = "oraDmxp", '\0' <repeats 248 times>
(gdb) print my_queue
$8 = (request_t *) 0x0
(gdb) print my_queue_mutex
$9 = {__m_reserved = 1, __m_count = 0, __m_owner = 0x6174, __m_kind = 0, __m_lock = {__status = 1, __spinlock = 0}}
(gdb) print *my_queue
$10 = {_list_head = {le_next = 0x0, le_prev = 0x0}, rr_group = '\0' <repeats 63 times>, rr_request = 0, rr_errorcode = 0, rr_orig_request = 0, rr_resp_fd = 0, rr_target = 0, rr_arg0 = 0, rr_arg1 = 0, rr_line = 0, _pad_ = 0, rr_file = 0x0, rr_when = 0}
(gdb) print &my_queue
$11 = (request_t **) 0xee39d3ec
(gdb) print my_queue->rr_request
$12 = 0

So it looks like my_queue was NULL, but it keeled over in this area:

static void
purge_status_checks(request_t **list)
{
        request_t *curr;

        if (!list)
                return;

        list_do(list, curr) {
                if (curr->rr_request != RG_STATUS)   <-- right here
                        continue;

but list_do is

#define list_do(list, curr) \
        if (*list && (curr = *list)) do

so we check to see if *list is NULL, and it wasn't, which would make me think it's getting trampled over somewhere else. But there's no place where the mutex is not held when accessing the queue, so I'm not quite sure how this happened.

-- Additional comment from lhh on 2006-10-31 17:09 EST --
Created an attachment (id=139908)
Fixes segmentation fault due to incorrect loop semantics

-- Additional comment from [] on 2006-11-03 11:30 EST --
lon, customer verified that the patch works.

Depending on timing, the incorrect loop semantics can cause either an assertion failure or a crash (segmentation fault).
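
For illustration only, here is a minimal sketch of a removal loop that re-checks the list on every iteration, rather than guarding only at loop entry the way the list_do macro quoted above does. It is not the attached patch and does not use rgmanager's list.h. The type req_t, the constants REQ_STATUS and REQ_OTHER, and the helpers push, count and purge_status are all hypothetical names, and the list is assumed to be a simple NULL-terminated singly linked list.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; REQ_STATUS plays the role of RG_STATUS. */
#define REQ_STATUS 1
#define REQ_OTHER  2

/* Hypothetical request node, only loosely modelled on request_t. */
typedef struct req {
    struct req *next;
    int kind;
} req_t;

/* Push a new request onto the front of the list. Returns 0 on success. */
static int push(req_t **list, int kind)
{
    req_t *r;

    assert(list != NULL);
    r = malloc(sizeof(*r));
    if (!r)
        return -1;
    r->kind = kind;
    r->next = *list;
    *list = r;
    return 0;
}

/* Count the entries in the list. */
static int count(const req_t *list)
{
    int n = 0;

    for (; list; list = list->next)
        n++;
    return n;
}

/*
 * Remove every REQ_STATUS entry and return how many were removed.
 * The "*list" test runs before each iteration, so an empty (or newly
 * emptied) list is never dereferenced, and a node is unlinked before
 * it is freed, so freed memory is never read.
 */
static int purge_status(req_t **list)
{
    int removed = 0;

    if (!list)
        return 0;

    while (*list) {
        req_t *curr = *list;

        if (curr->kind == REQ_STATUS) {
            *list = curr->next;   /* unlink first */
            free(curr);           /* then release */
            removed++;
        } else {
            list = &curr->next;   /* keep this node, advance */
        }
    }
    return removed;
}
```

Walking the list through a pointer-to-pointer keeps the emptiness check and the unlink step in one place, which is one way to avoid depending on a guard that is evaluated only once.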