Bug 261381 - openais segfault while attempting to start cman during revolver testing
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.0
Hardware: All
OS: All
Priority: urgent
Severity: low
Target Milestone: rc
Target Release: ---
Assigned To: Steven Dake
Keywords: ZStream
Duplicates: 450161 472815
Depends On:
Blocks: 486382 486386
Reported: 2007-08-28 14:42 EDT by Corey Marthaler
Modified: 2016-04-26 11:53 EDT
Fixed In Version: openais-0.80.5-2
Doc Type: Bug Fix
Last Closed: 2009-09-02 07:29:10 EDT
Description Corey Marthaler 2007-08-28 14:42:30 EDT
While the following three nodes were coming back up after being shot, openais
on taft-03 died, which caused revolver to fail.

Description of problem:
================================================================================
Senario iteration 5.3 started at Mon Aug 27 17:56:08 CDT 2007
Sleeping 1 minute(s) to let the I/O get its lock count up...
Senario: DLM kill Quorum plus one

Those picked to face the revolver... taft-03 taft-04 taft-01
Feeling lucky taft-03? Well do ya? Go'head make my day...
Feeling lucky taft-04? Well do ya? Go'head make my day...
Feeling lucky taft-01? Well do ya? Go'head make my day...

Verify that taft-03 has been removed from cluster on remaining nodes
Verify that taft-04 has been removed from cluster on remaining nodes
Verify that taft-01 has been removed from cluster on remaining nodes
Verifying that the dueler(s) are alive
still not all alive, sleeping another 10 seconds
Didn't receive heartbeat for 120 seconds
child taft-03_1 (15552) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-01_0 (16025) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-01_1 (16026) exited with status 127
still not all alive, sleeping another 10 seconds
Didn't receive heartbeat for 120 seconds
child taft-04_0 (14082) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-04_1 (14084) exited with status 127
Didn't receive heartbeat for 120 seconds
child taft-03_0 (15546) exited with status 127
still not all alive, sleeping another 10 seconds
All killed nodes are back up (able to be pinged), making sure they're qarshable...
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
still not all qarshable, sleeping another 10 seconds
All killed nodes are now qarshable

Verifying that recovery properly took place (on the nodes that stayed in the cluster)
checking that all of the cluster nodes are now/still cman members...
checking fence recovery (state of each service)...
checking dlm recovery (state of each service)...
checking gfs recovery (state of each service)...
checking gfs2 recovery (state of each service)...
checking fence recovery (node membership of each service)...
checking dlm recovery (node membership of each service)...
checking gfs recovery (node membership of each service)...
checking gfs2 recovery (node membership of each service)...

Verifying that clvmd was started properly on the dueler(s)
  connect() failed on local socket: Connection refused
  Skipping clustered volume group taft
  connect() failed on local socket: Connection refused
  Skipping clustered volume group taft

mounting /dev/mapper/taft-lv1 on /mnt/gfs1 on taft-03
/sbin/mount.gfs: invalid device path "/dev/mapper/taft-lv1"
couldn't mount /dev/mapper/taft-lv1 on /mnt/gfs1 on taft-03
Halting nanny due to failure
run_iteration() method failed at /home/msp/cmarthal/work/rhel5/sts-root/lib/FI_engine.pm line 21.


[root@taft-01 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   2   M   2492   2007-08-27 17:58:24  taft-02.lab.msp.redhat.com
   3   X      0                        taft-03.lab.msp.redhat.com
   4   M   2496   2007-08-27 17:58:25  taft-04.lab.msp.redhat.com
   5   M   2488   2007-08-27 17:58:24  taft-01.lab.msp.redhat.com


[root@taft-02 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   2   M   2456   2007-08-27 17:44:46  taft-02.lab.msp.redhat.com
   3   X   2476                        taft-03.lab.msp.redhat.com
   4   M   2496   2007-08-27 17:58:21  taft-04.lab.msp.redhat.com
   5   M   2492   2007-08-27 17:58:20  taft-01.lab.msp.redhat.com


[root@taft-03 tmp]# cman_tool nodes
cman_tool: Cannot open connection to cman, is it running ?

[root@taft-04 tmp]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   2   M   2496   2007-08-27 17:58:20  taft-02.lab.msp.redhat.com
   3   X      0                        taft-03.lab.msp.redhat.com
   4   M   2496   2007-08-27 17:58:20  taft-04.lab.msp.redhat.com
   5   M   2496   2007-08-27 17:58:20  taft-01.lab.msp.redhat.com

Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] New Configuration:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] Members Left:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] Members Joined:
Aug 27 17:58:22 taft-03 openais[5920]: [SYNC ] This node is within the primary component and will provide service.
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] CLM CONFIGURATION CHANGE
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] New Configuration:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.67)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.68)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.69)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.70)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] Members Left:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ] Members Joined:
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.67)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.68)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.69)
Aug 27 17:58:22 taft-03 openais[5920]: [CLM  ]  r(0) ip(10.15.89.70)
Aug 27 17:58:22 taft-03 openais[5920]: [SYNC ] This node is within the primary component and will provide service.
Aug 27 17:58:22 taft-03 openais[5920]: [TOTEM] entering OPERATIONAL state.
Aug 27 17:58:22 taft-03 openais[5920]: [TOTEM] Message continuation doesn't match previous frag e: 0 - a: 63
Aug 27 17:58:22 taft-03 openais[5920]: [TOTEM] Throwing away broken message: continuation 0, index 0
Aug 27 17:58:22 taft-03 ccsd[5911]: Initial status:: Quorate
Aug 27 17:58:23 taft-03 clvmd: Can't open cluster manager socket: Connection refused



Version-Release number of selected component (if applicable):
2.6.18-40.el5
openais-0.80.3-3.el5
Comment 1 Steven Dake 2007-08-28 14:57:27 EDT
Changed the word "assertion" to "segfault" in the summary, since the failure
was not caused by an assertion.
Comment 2 Steven Dake 2007-08-28 15:07:20 EDT
stack backtrace:
#0  group_matches (iovec=0x7fffeb333490, iov_len=<value optimized out>, 
    groups_b=0x1e8968e0, group_b_cnt=1, adjust_iovec=0x7fffeb3334c4)
    at totempg.c:364
#1  0x000000000041444b in app_deliver_fn (nodeid=2, 
    iovec=<value optimized out>, iov_len=1, endian_conversion_required=0)
    at totempg.c:414
#2  0x00000000004147b8 in totempg_deliver_fn (nodeid=2, 
    iovec=<value optimized out>, iov_len=<value optimized out>, 
    endian_conversion_required=0) at totempg.c:591
#3  0x000000000040e19f in messages_deliver_to_app (instance=0x2aaaac54d010, 
    skip=0, end_point=<value optimized out>) at totemsrp.c:3480
#4  0x000000000040eda7 in message_handler_mcast (instance=0x2aaaac54d010, 
    msg=0x4, msg_len=1436, endian_conversion_needed=<value optimized out>)
    at totemsrp.c:3617
#5  0x000000000040987e in rrp_deliver_fn (context=0x1e8980c0, msg=0x1e8ba130, 
    msg_len=1436) at totemrrp.c:1319
364                     *adjust_iovec += group_len[i];
(gdb) print i
$1 = <value optimized out>
(gdb) print group_len[i]
$2 = 16034
(gdb) print *adjust_iovec
$3 = 118989103
(gdb) print adjust_iovec
$4 = (unsigned int *) 0x7fffeb3334c4
(gdb) print group_len[0]
$5 = 16034
(gdb) 
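
The values printed above (group_len[0] = 16034, *adjust_iovec = 118989103) look implausible for a message header, which suggests the loop at totempg.c:364 may have been walking data that was not a well-formed header. As an illustration only, here is a minimal sketch of a bounds-checked version of that kind of length accumulation. The header layout (a 16-bit count followed by that many 16-bit lengths, then the names) and the function name checked_payload_offset are assumptions made for this example; they are not taken from the openais source or from the eventual fix.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical header layout (an assumption, not taken from this report):
 *   uint16_t count;        number of group names
 *   uint16_t len[count];   length of each group name
 *   char     names[];      the names, back to back, followed by the payload
 *
 * On success, returns 0 and stores the offset of the payload in *payload_off.
 * Returns -1 if the length table or any name would extend past buf_len.
 */
static int checked_payload_offset(const unsigned char *buf, size_t buf_len,
                                  size_t *payload_off)
{
    uint16_t count;
    size_t off;
    size_t i;

    assert(buf != NULL && payload_off != NULL);

    if (buf_len < sizeof(uint16_t))
        return -1;
    /* memcpy avoids unaligned 16-bit reads from the buffer */
    memcpy(&count, buf, sizeof(count));

    /* The count field plus the table of lengths must fit in the buffer. */
    off = sizeof(uint16_t) * ((size_t)count + 1);
    if (off > buf_len)
        return -1;

    for (i = 1; i <= count; i++) {
        uint16_t len;
        memcpy(&len, buf + i * sizeof(uint16_t), sizeof(len));
        /* Reject a name that would run past the end of the buffer. */
        if (len > buf_len - off)
            return -1;
        off += len;
    }

    *payload_off = off;
    return 0;
}
```

With a check like this, a header whose first field holds 16034 in a buffer of a few bytes is rejected before any length entry is read, instead of being used to index past the end of the message.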
Comment 3 Steven Dake 2007-10-08 17:45:27 EDT
This is a new problem that I have never seen before, and it is apparently very rare.

I'll see what I can do about making some test cases to duplicate it.
Comment 4 Steven Dake 2007-10-08 18:03:22 EDT
Do you have /var/log/messages from this crash?
Comment 8 Steven Dake 2008-08-26 12:22:02 EDT
Chrissie,

Have you seen this bug in the last year?

Thanks
-steve
Comment 10 Steven Dake 2008-09-03 18:05:23 EDT
This hasn't been reproduced in over a year, and I believe I fixed the problem along the way in one of the patches. Setting needinfo so that QE can verify that the problem has been resolved.
Comment 11 Steven Dake 2008-12-05 15:58:16 EST
I now have a test case that reproduces this issue. Investigating with respect to qpidd development.
Comment 12 Steven Dake 2008-12-05 16:01:59 EST
*** Bug 472815 has been marked as a duplicate of this bug. ***
Comment 13 Steven Dake 2009-01-20 15:49:56 EST
*** Bug 450161 has been marked as a duplicate of this bug. ***
Comment 15 Steven Dake 2009-02-18 00:32:09 EST
Fixed in openais-0.80.5-2.
Comment 20 errata-xmlrpc 2009-09-02 07:29:10 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1366.html