Bug 261381

Summary:	openais segfault while attempting to start cman during revolver testing
Product:	Red Hat Enterprise Linux 5	Reporter:	Corey Marthaler <cmarthal>
Component:	openais	Assignee:	Steven Dake <sdake>
Status:	CLOSED ERRATA	QA Contact:
Severity:	low	Docs Contact:
Priority:	urgent
Version:	5.0	CC:	ccaulfie, cfeist, cluster-maint, iboverma, nstraz, sghosh
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:	openais-0.80.5-2	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-09-02 11:29:10 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	486382, 486386

Description Corey Marthaler 2007-08-28 18:42:30 UTC

While the three nodes listed below were coming back up after being shot,
openais on taft-03 died, which caused revolver to fail.

Description of problem:
================================================================================
Senario iteration 5.3 started at Mon Aug 27 17:56:08 CDT 2007
Sleeping 1 minute(s) to let the I/O get its lock count up...
Senario: DLM kill Quorum plus one

Those picked to face the revolver... taft-03 taft-04 taft-01
Feeling lucky taft-03? Well do ya? Go'head make my day...
Feeling lucky taft-04? Well do ya? Go'head make my day...
Feeling lucky taft-01? Well do ya? Go'head make my day...

Verify that taft-03 has been removed from cluster on remaining nodes
Verify that taft-04 has been removed from cluster on remaining nodes
Verify that taft-01 has been removed from cluster on remaining nodes
Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
Didn't receive heartbeat for 120 seconds
child taft-03_1 (15552) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-01_0 (16025) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-01_1 (16026) exited with status 127
still not all alive, sleeping another 10 seconds
Didn't receive heartbeat for 120 seconds
child taft-04_0 (14082) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-04_1 (14084) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-03_0 (15546) exited with status 127
still not all alive, sleeping another 10 seconds
All killed nodes are back up (able to be pinged), making sure they're qarshable...
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
All killed nodes are now qarshable

Verifying that recovery properly took place (on the nodes that stayed in the
cluster)
checking that all of the cluster nodes are now/still cman members...
checking fence recovery (state of each service)...
checking dlm recovery (state of each service)...
checking gfs recovery (state of each service)...
checking gfs2 recovery (state of each service)...
checking fence recovery (node membership of each service)...
checking dlm recovery (node membership of each service)...
checking gfs recovery (node membership of each service)...
checking gfs2 recovery (node membership of each service)...

Verifying that clvmd was started properly on the dueler(s)
  connect() failed on local socket: Connection refused
  Skipping clustered volume group taft
  connect() failed on local socket: Connection refused
  Skipping clustered volume group taft

mounting /dev/mapper/taft-lv1 on /mnt/gfs1 on taft-03
/sbin/mount.gfs: invalid device path "/dev/mapper/taft-lv1"
couldn't mount /dev/mapper/taft-lv1 on /mnt/gfs1 on taft-03
Halting nanny due to failure
run_iteration() method failed at
/home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.


[root@taft-01 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   2   M   2492   2007-08-27 17:58:24  taft-02.lab.msp.redhat.com
   3   X      0                        taft-03.lab.msp.redhat.com
   4   M   2496   2007-08-27 17:58:25  taft-04.lab.msp.redhat.com
   5   M   2488   2007-08-27 17:58:24  taft-01.lab.msp.redhat.com


[root@taft-02 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   2   M   2456   2007-08-27 17:44:46  taft-02.lab.msp.redhat.com
   3   X   2476                        taft-03.lab.msp.redhat.com
   4   M   2496   2007-08-27 17:58:21  taft-04.lab.msp.redhat.com
   5   M   2492   2007-08-27 17:58:20  taft-01.lab.msp.redhat.com


[root@taft-03 tmp]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?

[root@taft-04 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   2   M   2496   2007-08-27 17:58:20  taft-02.lab.msp.redhat.com
   3   X      0                        taft-03.lab.msp.redhat.com
   4   M   2496   2007-08-27 17:58:20  taft-04.lab.msp.redhat.com
   5   M   2496   2007-08-27 17:58:20  taft-01.lab.msp.redhat.com

Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] New Configuration:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] Members Left:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] Members Joined:
Aug 27 17:58:22 taft-03 openais[5920]: [SYNC ] This node is within the primary component and will provide service.
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] New Configuration:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.67)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.68)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.69)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.70)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] Members Left:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] Members Joined:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.67)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.68)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.69)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.70)
Aug 27 17:58:22 taft-03 openais[5920]: [SYNC ] This node is within the primary component and will provide service.
Aug 27 17:58:22 taft-03 openais[5920]: [TOTEM] entering OPERATIONAL state.
Aug 27 17:58:22 taft-03 openais[5920]: [TOTEM] Message continuation doesn't match previous frag e: 0 - a: 63
Aug 27 17:58:22 taft-03 openais[5920]: [TOTEM] Throwing away broken message: continuation 0, index 0
Aug 27 17:58:22 taft-03 ccsd[5911]: Initial status:: Quorate
Aug 27 17:58:23 taft-03 clvmd: Can't open cluster manager socket: Connection refused



Version-Release number of selected component (if applicable):
2.6.18-40.el5
openais-0.80.3-3.el5

Comment 1 Steven Dake 2007-08-28 18:57:27 UTC

Changed the word "assertion" to "segfault" in the summary, since the failure
was not caused by an assertion.

Comment 2 Steven Dake 2007-08-28 19:07:20 UTC

stack backtrace:
#0  group_matches (iovec=0x7fffeb333490, iov_len=<value optimized out>, 
    groups_b=0x1e8968e0, group_b_cnt=1, adjust_iovec=0x7fffeb3334c4)
    at totempg.c:364
#1  0x000000000041444b in app_deliver_fn (nodeid=2, 
    iovec=<value optimized out>, iov_len=1, endian_conversion_required=0)
    at totempg.c:414
#2  0x00000000004147b8 in totempg_deliver_fn (nodeid=2, 
    iovec=<value optimized out>, iov_len=<value optimized out>, 
    endian_conversion_required=0) at totempg.c:591
#3  0x000000000040e19f in messages_deliver_to_app (instance=0x2aaaac54d010, 
    skip=0, end_point=<value optimized out>) at totemsrp.c:3480
#4  0x000000000040eda7 in message_handler_mcast (instance=0x2aaaac54d010, 
    msg=0x4, msg_len=1436, endian_conversion_needed=<value optimized out>)
    at totemsrp.c:3617
#5  0x000000000040987e in rrp_deliver_fn (context=0x1e8980c0, msg=0x1e8ba130, 
    msg_len=1436) at totemrrp.c:1319
364                     *adjust_iovec += group_len[i];
(gdb) print i
$1 = <value optimized out>
(gdb) print group_len[i]
$2 = 16034
(gdb) print *adjust_iovec
$3 = 118989103
(gdb) print adjust_iovec
$4 = (unsigned int *) 0x7fffeb3334c4
(gdb) print group_len[0]
$5 = 16034
(gdb)
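
The crashing line adds group_len[i] to *adjust_iovec, and gdb reports
group_len[0] = 16034, which looks implausibly large for a value read from a
1436-byte message. As a minimal sketch of the kind of bounds check that would
reject such a header, the helper below assumes the message starts with an
unsigned short count followed by that many unsigned short lengths and then the
names themselves. That layout, the function name checked_group_header_size,
and its parameters are assumptions made for illustration; this is not the fix
shipped in openais-0.80.5-2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of openais.
 *
 * Assumed layout of buf (an assumption, not confirmed by this report):
 *   unsigned short group_cnt;
 *   unsigned short group_len[group_cnt];
 *   char           group_names[];   concatenated, sizes given by group_len
 *
 * Returns the number of header bytes to skip before the payload, or -1 if
 * the header described by the counts cannot fit inside buf_len bytes.
 */
long checked_group_header_size(const void *buf, size_t buf_len)
{
    unsigned short cnt;
    unsigned short len;
    size_t total;
    size_t i;

    assert(buf != NULL);

    /* The count field itself must be present. */
    if (buf_len < sizeof cnt) {
        return -1;
    }
    memcpy(&cnt, buf, sizeof cnt);   /* memcpy avoids unaligned reads */

    /* The count plus the table of lengths must fit in the buffer. */
    total = sizeof(unsigned short) * ((size_t)cnt + 1);
    if (total > buf_len) {
        return -1;
    }

    /* Each name length must fit in what is left of the buffer. */
    for (i = 1; i <= cnt; i++) {
        memcpy(&len, (const char *)buf + i * sizeof len, sizeof len);
        if (len > buf_len - total) {
            return -1;
        }
        total += len;
    }
    return (long)total;
}
```

A caller would treat a return value of -1 as a corrupt message and drop it
instead of walking the length table. With a count of 16034 in a short buffer,
the second check fails before any length entry is read.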

Comment 3 Steven Dake 2007-10-08 21:45:27 UTC

This is a new problem that I have never seen before, and it is apparently very rare.

I'll see what I can do about making some test cases to duplicate it.

Comment 4 Steven Dake 2007-10-08 22:03:22 UTC

Do you have /var/log/messages from this crash?

Comment 8 Steven Dake 2008-08-26 16:22:02 UTC

Chrissie,

Have you seen this bug in the last year?

Thanks
-steve

Comment 10 Steven Dake 2008-09-03 22:05:23 UTC

This hasn't been reproduced in over a year, and I believe I fixed the problem along the way in one of the patches.  Putting in needinfo for further verification from QE that this problem has been resolved.

Comment 11 Steven Dake 2008-12-05 20:58:16 UTC

have test case to reproduce this issue now.  Investigating wrt qpidd development.

Comment 12 Steven Dake 2008-12-05 21:01:59 UTC

*** Bug 472815 has been marked as a duplicate of this bug. ***

Comment 13 Steven Dake 2009-01-20 20:49:56 UTC

*** Bug 450161 has been marked as a duplicate of this bug. ***

Comment 15 Steven Dake 2009-02-18 05:32:09 UTC

fixed in openais-0.80.5-2

Comment 20 errata-xmlrpc 2009-09-02 11:29:10 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1366.html