Bug 619476

Summary: clurgmgrd segfaults with error 6
Product: [Retired] Red Hat Cluster Suite
Reporter: Shane Bradley <sbradley>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED DUPLICATE
QA Contact: Cluster QE <mspqa-list>
Severity: medium
Priority: low
Version: 4
CC: cluster-maint, djansa, edamato, kurt, tao
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2010-10-22 14:32:44 UTC

Description Shane Bradley 2010-07-29 15:56:32 UTC
Description of problem:
rgmanager is generating segfaults during service state changes:


Jul  7 14:10:04 nodeA clurgmgrd[500]: <notice> Service Q01STRS_TH started 
....
Jul  8 05:39:27 nodeA clurgmgrd[500]: <notice> Service Q01STRS_MY started
Jul  8 05:44:45 nodeA clurgmgrd[500]: <err> #48: Unable to obtain cluster lock: Unknown error 65539
Jul  8 05:44:45 nodeA clurgmgrd[500]: <notice> Stopping service Q01STRS_AU
Jul  8 05:44:55 nodeA clurgmgrd[500]: <notice> Service Q01STRS_AU is recovering
Jul  8 05:44:55 nodeA clurgmgrd[500]: <notice> Recovering failed service Q01STRS_AU
Jul  8 05:45:16 nodeA clurgmgrd[500]: <notice> Service Q01STRS_AU started
Jul  8 05:56:06 nodeA kernel: clurgmgrd[15838]: segfault at 000000c000000010 rip 0000003000269b40 rsp 000000007204e900 error 6
Jul  8 05:56:06 nodeA clurgmgrd[499]: <crit> Watchdog: Daemon died, rebooting...
Jul  8 05:56:06 nodeA kernel: md: stopping all md devices.
Jul  8 05:56:06 nodeA kernel: md: md0 switched to read-only mode.
Jul  8 05:59:25 nodeA syslogd 1.4.1: restart (remote reception).
....
Jul  8 06:01:12 nodeA clurgmgrd[506]: <notice> Starting stopped service Q01STRS_TH 
Jul  8 06:01:33 nodeA clurgmgrd[506]: <notice> Service Q01STRS_TH started 

The segfault backtrace looks like:
Program terminated with signal 11, Segmentation fault.
#0  _int_malloc (av=0x3000434640, bytes=<value optimized out>) at malloc.c:4181
4181            bck->fd = bin;

Thread 1 (process 15838):
#0  _int_malloc (av=0x3000434640, bytes=<value optimized out>) at malloc.c:4181
#1  0x000000300026b6d2 in *__GI___libc_malloc (bytes=32) at malloc.c:3346
#2  0x0000000000425028 in clist_insert ()
#3  0x00000000004216bf in msg_open ()
#4  0x000000000041efc6 in vf_write (membership=0x657850, flags=2, keyid=0x7204ec60 "usrm::rg=\"Q01STRS_TH\"", data=0x7204ef20, datalen=104) at vft.c:1315
#5  0x000000000040b515 in set_rg_state (rgname=0x7204efd8 "Q01STRS_TH", svcblk=0x7204ef20) at rg_state.c:306
#6  0x000000000040b595 in init_rg (name=0x7204efd8 "Q01STRS_TH", svcblk=0x7204ef20) at rg_state.c:323
#7  0x000000000040b688 in get_rg_state (rgname=0x7204efd0 "service:Q01STRS_TH", svcblk=0x7204ef20) at rg_state.c:353
#8  0x000000000040c3c7 in svc_status (svcName=0x7204efd0 "service:Q01STRS_TH") at rg_state.c:877
#9  0x0000000000404f10 in resgroup_thread_main (arg=0x414620c0) at rg_thread.c:384
#10 0x0000003527d06137 in start_thread (arg=<value optimized out>) at pthread_create.c:274
#11 0x00000030002c9883 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112 from /lib64/tls/libc.so.6
Current language:  auto; currently c
#1  0x000000300026b6d2 in *__GI___libc_malloc (bytes=32) at malloc.c:3346
3346      victim = _int_malloc(ar_ptr, bytes);
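
A crash at malloc.c:4181 (bck->fd = bin) is the write glibc performs while
unlinking a chunk from its free lists, so faulting there almost always
means the heap was corrupted earlier by unrelated code (for example, a
write past the end of an allocation, or a write to memory that was already
freed) rather than a bug in malloc itself. As a minimal illustrative
sketch, not the actual rgmanager defect, the following self-contained C
program shows how a write-after-free that tramples a freed chunk's fd/bk
list pointers surfaces later as a segfault inside _int_malloc:

#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *a = malloc(128);
    char *b = malloc(128); /* keeps 'a' from merging with the top chunk */

    /* 'a' goes onto a free list; glibc now keeps its fd/bk list
       pointers in the first 16 bytes of the old user data. */
    free(a);

    /* Bug: write-after-free tramples those fd/bk pointers. */
    memset(a, 0, 16);

    /* A later, unrelated allocation walks the free list, follows the
       poisoned bk pointer, and faults on the bck->fd write, the same
       crash site as frame #0 above. */
    char *c = malloc(128);

    (void)b;
    (void)c;
    return 0;
}

The exact symptom varies with the glibc build: the RHEL 4-era glibc here
typically dies with exactly this kind of user-mode write fault, while
newer glibc releases may abort with a heap-corruption diagnostic instead.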

Version-Release number of selected component (if applicable):
rgmanager-1.9.87-1.el4_8.1-x86_64 


How reproducible:
Not easily; it has only happened a couple of times.

Steps to Reproduce:
1. Appears to happen when a service is changing states
  
Actual results:
clurgmgrd segfaults with error 6

Expected results:
No segfault

Additional info:
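For reference, the "error 6" in the kernel's segfault message is the x86
page-fault error code: bit 0 set means a protection violation (clear means
the page was not present), bit 1 set means the faulting access was a
write, and bit 2 set means it happened in user mode. 6 (binary 110) is
therefore a user-mode write to an unmapped page, which fits malloc writing
through a corrupted free-list pointer at bck->fd = bin. A small decoder
sketch:

#include <stdio.h>

/* Decode the x86 page-fault error code that the kernel prints in
   "segfault at <addr> rip <rip> rsp <rsp> error <N>" messages. */
static void decode_segv_error(unsigned int err)
{
    printf("error %u: %s, %s access, %s mode\n", err,
           (err & 1) ? "protection violation" : "page not present",
           (err & 2) ? "write" : "read",
           (err & 4) ? "user" : "kernel");
}

int main(void)
{
    decode_segv_error(6); /* prints: error 6: page not present, write access, user mode */
    return 0;
}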

Comment 3 Lon Hohberger 2010-09-28 16:10:01 UTC
*** Bug 637263 has been marked as a duplicate of this bug. ***

Comment 5 Lon Hohberger 2010-10-22 14:32:44 UTC
This was fixed some time ago by bug 572695.

Furthermore, it was copied into the z-stream (EUS) as bug 572792.

https://rhn.redhat.com/errata/RHBA-2010-0404.html

*** This bug has been marked as a duplicate of bug 572695 ***