Bug 432531

Summary: aisexec dies during recovery when plock_ownership is enabled
Product: Red Hat Enterprise Linux 5
Reporter: Nate Straz <nstraz>
Component: openais
Assignee: Steven Dake <sdake>
Status: CLOSED ERRATA
QA Contact: GFS Bugs <gfs-bugs>
Severity: low
Priority: high
Version: 5.2
CC: cluster-maint, edamato, jbrassow, kanderso, rkenna, sdake, teigland
Target Milestone: rc
Hardware: All
OS: Linux
Fixed In Version: RHBA-2008-0411
Doc Type: Bug Fix
Last Closed: 2008-05-21 14:31:21 UTC
Bug Blocks: 441413    
Attachments:
  aisexec core file from morph-01, compressed
  aisexec core file from morph-03, compressed
  aisexec core file from morph-04, compressed
  aisexec core file from tank-04, compressed
  aisexec core file from morph-02, compressed
  aisexec core file from morph-03, compressed
  aisexec core file from morph-04, compressed

Description Nate Straz 2008-02-12 17:16:39 UTC
Description of problem:

While running recovery testing with revolver against a cluster with one GFS
file system and plock_ownership enabled, the entire cluster dies after one
node is shot.  The node that was shot reboots and comes back to find it is
the only remaining cluster member.

I was able to get aisexec core files from the systems.  I will attach those shortly.
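
A backtrace can be pulled from one of these cores roughly as follows; this is
only a sketch, assuming the matching openais-debuginfo package is installed
and that the daemon binary is /usr/sbin/aisexec (the core path below is a
placeholder):

# install debug symbols matching the running openais build (from yum-utils)
debuginfo-install openais
# load the core against the daemon binary and dump the crashing stack
gdb /usr/sbin/aisexec /path/to/aisexec.core
(gdb) bt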

Version-Release number of selected component (if applicable):
openais-0.80.3-10.el5
cman-2.0.80-1.el5

How reproducible:
Within 3 iterations of revolver.

Steps to Reproduce:
1. create a cluster with plock_ownership enabled and one GFS file system mounted
2. run revolver
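
For completeness, plock_ownership is set in /etc/cluster/cluster.conf; the
element shown in the comment below is an assumption based on the gfs_controld
documentation, not copied from this cluster's config:

# confirm plock ownership is enabled in the cluster config; the setting is
# expected to look roughly like <gfs_controld plock_ownership="1"/>
grep plock_ownership /etc/cluster/cluster.conf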
  
Actual results:

Feb 11 18:33:04 morph-03 openais[2355]: [TOTEM] Did not need to originate any
messages in recovery. 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] CLM CONFIGURATION CHANGE 
Feb 11 18:33:04 morph-03 kernel: dlm: closing connection to node 2
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] New Configuration: 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ]         r(0) ip(10.15.89.61)  
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ]         r(0) ip(10.15.89.63)  
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ]         r(0) ip(10.15.89.64)  
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] Members Left: 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ]         r(0) ip(10.15.89.62)  
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] Members Joined: 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] CLM CONFIGURATION CHANGE 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] New Configuration: 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ]         r(0) ip(10.15.89.61)  
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ]         r(0) ip(10.15.89.63)  
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ]         r(0) ip(10.15.89.64)  
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] Members Left: 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] Members Joined: 
Feb 11 18:33:04 morph-03 openais[2355]: [SYNC ] This node is within the primary
component and will provide service. 
Feb 11 18:33:04 morph-03 openais[2355]: [TOTEM] entering OPERATIONAL state. 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] got nodejoin message 10.15.89.61 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] got nodejoin message 10.15.89.63 
Feb 11 18:33:04 morph-03 openais[2355]: [CLM  ] got nodejoin message 10.15.89.64 
Feb 11 18:33:05 morph-03 fenced[2371]: cluster is down, exiting
Feb 11 18:33:05 morph-03 groupd[2363]: cpg_mcast_joined error 2 handle
6b8b456700000000
Feb 11 18:33:05 morph-03 groupd[2363]: cpg_mcast_joined error 2 handle
6b8b456700000000
Feb 11 18:33:05 morph-03 gfs_controld[2383]: groupd_dispatch error -1 errno 11
Feb 11 18:33:05 morph-03 gfs_controld[2383]: groupd connection died
Feb 11 18:33:05 morph-03 gfs_controld[2383]: cluster is down, exiting
Feb 11 18:33:05 morph-03 dlm_controld[2377]: cluster is down, exiting


Expected results:
Recovery should work with plock_ownership enabled.

Additional info:

Comment 1 Nate Straz 2008-02-12 17:25:26 UTC
Created attachment 294679 [details]
aisexec core file from morph-01, compressed

Comment 2 Nate Straz 2008-02-12 17:28:44 UTC
Created attachment 294680 [details]
aisexec core file from morph-03, compressed

Comment 3 Nate Straz 2008-02-12 17:29:06 UTC
Created attachment 294681 [details]
aisexec core file from morph-04, compressed

Comment 4 Steven Dake 2008-02-12 17:40:57 UTC
backtrace:
#0  0x0050a402 in __kernel_vsyscall ()
#1  0x00be5c50 in raise () from /lib/libc.so.6
#2  0x00be7561 in abort () from /lib/libc.so.6
#3  0x00bdf26b in __assert_fail () from /lib/libc.so.6
#4  0x003d8683 in message_handler_req_exec_ckpt_sync_checkpoint_refcount (
    message=0xbfdfa4b0, nodeid=1) at ckpt.c:3844
#5  0x0806114b in deliver_fn (nodeid=1, iovec=0xbfdfd788, iov_len=1, 
    endian_conversion_required=0) at main.c:383
#6  0x0805d892 in app_deliver_fn (nodeid=1, iovec=0x807bce8, iov_len=1, 
    endian_conversion_required=0) at totempg.c:432
#7  0x0805dbdb in totempg_deliver_fn (nodeid=1, iovec=0x89e30c0, iov_len=3, 
    endian_conversion_required=0) at totempg.c:591
#8  0x0805c275 in totemmrp_deliver_fn (nodeid=1, iovec=0x89e30c0, iov_len=3, 
    endian_conversion_required=0) at totemmrp.c:82
#9  0x0805818a in messages_deliver_to_app (instance=0xb7531008, skip=0, 
    end_point=<value optimized out>) at totemsrp.c:3516
#10 0x0805a820 in message_handler_orf_token (instance=0xb7531008, 
    msg=0x89e6420, msg_len=70, endian_conversion_needed=0) at totemsrp.c:3388
#11 0x08053562 in main_deliver_fn (context=0xb7531008, msg=0x89e6420, 
    msg_len=2499) at totemsrp.c:4099
#12 0x08050d00 in none_token_recv (rrp_instance=0x89d16d0, iface_no=0, 
    context=0xb7531008, msg=0x89e6420, msg_len=70, token_seq=164)
    at totemrrp.c:506
#13 0x08050dd6 in rrp_deliver_fn (context=0x89d0d08, msg=0x89e6420, msg_len=70)
    at totemrrp.c:1308
#14 0x0804f04b in net_deliver_fn (handle=0, fd=8, revents=1, data=0x89e5dd0)
    at totemnet.c:676
#15 0x0804c5fe in poll_run (handle=0) at aispoll.c:382
#16 0x080617d5 in main () at main.c:587
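
For anyone else digging through these cores, the state at the failing
assertion can be inspected from frame 4 with standard gdb commands (local
variable names depend on the ckpt.c source, so treat this as a sketch):

(gdb) frame 4        # jump to message_handler_req_exec_ckpt_sync_checkpoint_refcount
(gdb) info locals    # dump the local variables at the assert site
(gdb) list           # show the source around ckpt.c:3844 (requires debuginfo)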

Comment 5 Steven Dake 2008-02-12 17:41:49 UTC
reassigning bug to sdake


Comment 6 Steven Dake 2008-02-12 17:44:21 UTC
Nate will try openais 0.80.3-7 to determine whether this is a regression or a
preexisting issue.  The test case that generates this failure is new
(implemented last week), so it could be an existing bug that simply had not
been exercised before.
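
A minimal sketch of the downgrade step for that check, assuming the older
i386 package file is already on the node (the file name is illustrative):

# roll openais back to the earlier build for the regression check
rpm -Uvh --oldpackage openais-0.80.3-7.el5.i386.rpm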

Comment 7 Nate Straz 2008-02-12 19:07:35 UTC
I was able to reproduce the backtrace from comment #4 with openais-0.80.3-7.el5.

Comment 8 Steven Dake 2008-02-12 19:21:04 UTC
This bug is not a regression but a preexisting problem with checkpoint
synchronization under high load that has probably always existed in openais.
The exact cause is not yet clear.  Current status: setting up load on the
smoke cluster to duplicate the failure in a debuggable environment.

Comment 10 RHEL Program Management 2008-02-18 20:18:58 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 11 Nate Straz 2008-03-10 16:41:11 UTC
Created attachment 297458 [details]
aisexec core file from tank-04, compressed

I was able to recreate the bug with openais-0.80.3-13.el5.  

(gdb) bt
#0  0x0076f402 in __kernel_vsyscall ()
#1  0x00a1bc50 in raise () from /lib/libc.so.6
#2  0x00a1d561 in abort () from /lib/libc.so.6
#3  0x00a1526b in __assert_fail () from /lib/libc.so.6
#4  0x00922683 in message_handler_req_exec_ckpt_sync_checkpoint_refcount (
    message=0xbfcbe030, nodeid=1) at ckpt.c:3836
#5  0x0806118b in deliver_fn (nodeid=1, iovec=0xbfcc1308, iov_len=1, 
    endian_conversion_required=0) at main.c:381
#6  0x0805d8f2 in app_deliver_fn (nodeid=1, iovec=0x807b188, iov_len=1, 
    endian_conversion_required=0) at totempg.c:432
#7  0x0805dc2b in totempg_deliver_fn (nodeid=1, iovec=0x990c99c, iov_len=1, 
    endian_conversion_required=0) at totempg.c:591
#8  0x0805c2f5 in totemmrp_deliver_fn (nodeid=1, iovec=0x990c99c, iov_len=1, 
    endian_conversion_required=0) at totemmrp.c:82
#9  0x080581d1 in messages_deliver_to_app (instance=0xb74ec008, skip=0, 
    end_point=<value optimized out>) at totemsrp.c:3525
#10 0x08058584 in message_handler_mcast (instance=0xb74ec008, msg=0x9912460, 
    msg_len=927, endian_conversion_needed=0) at totemsrp.c:3656
#11 0x08053602 in main_deliver_fn (context=0xb74ec008, msg=0x9912460, 
    msg_len=2274) at totemsrp.c:4099
#12 0x08050d80 in none_mcast_recv (rrp_instance=0x98fd6d0, iface_no=0, 
    context=0xb74ec008, msg=0x9912460, msg_len=927) at totemrrp.c:476
#13 0x08050eb4 in rrp_deliver_fn (context=0x98fcd08, msg=0x9912460, 
    msg_len=927) at totemrrp.c:1319
#14 0x0804f11b in net_deliver_fn (handle=0, fd=6, revents=1, data=0x9911e10)
    at totemnet.c:676
#15 0x0804c64e in poll_run (handle=0) at aispoll.c:382
#16 0x080618c8 in main (argc=Cannot access memory at address 0x8e2
) at main.c:603

Comment 13 Steven Dake 2008-03-18 15:29:48 UTC
fixed in openais-0.80.3-14

Comment 14 Steven Dake 2008-03-18 16:45:19 UTC
*** Bug 432877 has been marked as a duplicate of this bug. ***

Comment 15 Nate Straz 2008-03-18 17:29:10 UTC
While running -14 I was able to hit the same symptom: one node gets shot and
aisexec on all the other nodes core dumps.  Here is the backtrace from one of
them; they all look similar.  I'll attach the core dumps shortly.

(gdb) bt
#0  0x00b6a402 in __kernel_vsyscall ()
#1  0x00138c50 in raise () from /lib/libc.so.6
#2  0x0013a561 in abort () from /lib/libc.so.6
#3  0x0013226b in __assert_fail () from /lib/libc.so.6
#4  0x00d043a2 in message_handler_req_exec_ckpt_sync_checkpoint_section (
    message=0xbfc1c8f0, nodeid=2) at ckpt.c:3722
#5  0x0806119b in deliver_fn (nodeid=2, iovec=0xbfc1cc18, iov_len=1, 
    endian_conversion_required=0) at main.c:381
#6  0x0805d8e2 in app_deliver_fn (nodeid=2, iovec=0x807b188, iov_len=1, 
    endian_conversion_required=0) at totempg.c:432
#7  0x0805dc2b in totempg_deliver_fn (nodeid=2, iovec=0x94817d0, iov_len=3, 
    endian_conversion_required=0) at totempg.c:591
#8  0x0805c2c5 in totemmrp_deliver_fn (nodeid=2, iovec=0x94817d0, iov_len=3, 
    endian_conversion_required=0) at totemmrp.c:82
#9  0x080581da in messages_deliver_to_app (instance=0xb74a1008, skip=0, 
    end_point=<value optimized out>) at totemsrp.c:3516
#10 0x0805a870 in message_handler_orf_token (instance=0xb74a1008, 
    msg=0x9487418, msg_len=70, endian_conversion_needed=0) at totemsrp.c:3388
#11 0x080535b2 in main_deliver_fn (context=0xb74a1008, msg=0x9487418, 
    msg_len=2356) at totemsrp.c:4099
#12 0x08050d50 in none_token_recv (rrp_instance=0x94726d0, iface_no=0, 
    context=0xb74a1008, msg=0x9487418, msg_len=70, token_seq=77)
    at totemrrp.c:506
#13 0x08050e26 in rrp_deliver_fn (context=0x9471d08, msg=0x9487418, msg_len=70)
    at totemrrp.c:1308
#14 0x0804f09b in net_deliver_fn (handle=0, fd=8, revents=1, data=0x9486dc8)
    at totemnet.c:676
#15 0x0804c64e in poll_run (handle=0) at aispoll.c:382
#16 0x080618d8 in main (argc=Cannot access memory at address 0x934
) at main.c:603

Reproduced on morph-0{1,2,3,4}.

Comment 16 Nate Straz 2008-03-18 17:32:44 UTC
Created attachment 298423 [details]
aisexec core file from morph-02, compressed

Comment 17 Nate Straz 2008-03-18 17:33:24 UTC
Created attachment 298424 [details]
aisexec core file from morph-03, compressed

Comment 18 Nate Straz 2008-03-18 17:33:47 UTC
Created attachment 298425 [details]
aisexec core file from morph-04, compressed

Comment 20 Nate Straz 2008-03-19 17:49:40 UTC
Found a new way for aisexec to fail before this got to ON_QA.

Comment 23 Steven Dake 2008-04-01 23:12:18 UTC
Update:
I have been investigating further why the cluster seems to lock up with
revolver, and also with my own test case, which doesn't take several hours to
run.  The other bugs I have fixed with the patches applied against this
bugzilla appear to have resolved the segfault: I can no longer duplicate it
with the test program I wrote for this issue, and revolver no longer
reproduces the segfault either.  However, the machines now eventually OOM
when running revolver, and the OOM killer goes wild, which effectively causes
revolver to fail.

The machines also OOMed when I was running my test program, but at the time I
thought that was simply because I had created too many checkpoints.  This OOM
issue is hopefully relatively simple to solve, after which we may be good to
go on this bug.

Essentially, my belief is that somewhere in the checkpoint recovery algorithm
checkpoint data should be freed but is not being freed.  That is relatively
easy to find with tools such as valgrind, and I'm about to get on that right
now.

Thanks
-steve
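
The kind of valgrind run meant above would look roughly like the following
sketch; it assumes the daemon can be kept in the foreground under valgrind on
a test node and that the binary is /usr/sbin/aisexec (both assumptions, not
details taken from this bug):

# run aisexec under valgrind, drive checkpoint load, then stop the daemon;
# unfreed checkpoint data should show up in the leak records in the log
valgrind --leak-check=full --log-file=/tmp/aisexec-valgrind.log /usr/sbin/aisexec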

Comment 25 Steven Dake 2008-04-07 21:56:21 UTC
Many issues relating to checkpoint synchronization have been fixed under this
bugzilla id, but the original issue still occurs infrequently.  Since this
isn't a regression (it is triggered by a new load originated by QE),
engineering, QE, and PM have decided to clone this bug into a 5.3 bugzilla
and let the fixes that have already been made on this bug id go out in the
errata.

The memory leak issue discussed above was fixed by reloading the OS with a
later version of a RHEL build.

Comment 27 errata-xmlrpc 2008-05-21 14:31:21 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0411.html