Description of problem:

While running recovery testing with revolver, one file system, and plock_ownership enabled in the cluster, the entire cluster dies after one node is shot. The node which was shot reboots and returns to find it is the only cluster node. I was able to get aisexec core files from the systems; I will attach those shortly.

Version-Release number of selected component (if applicable):
openais-0.80.3-10.el5
cman-2.0.80-1.el5

How reproducible:
Within 3 iterations of revolver.

Steps to Reproduce:
1. Create a cluster with one GFS file system mounted.
2. Run revolver.

Actual results:

Feb 11 18:33:04 morph-03 openais[2355]: [TOTEM] Did not need to originate any messages in recovery.
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] CLM CONFIGURATION CHANGE
Feb 11 18:33:04 morph-03 kernel: dlm: closing connection to node 2
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] New Configuration:
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] r(0) ip(10.15.89.61)
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] r(0) ip(10.15.89.63)
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] r(0) ip(10.15.89.64)
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] Members Left:
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] r(0) ip(10.15.89.62)
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] Members Joined:
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] CLM CONFIGURATION CHANGE
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] New Configuration:
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] r(0) ip(10.15.89.61)
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] r(0) ip(10.15.89.63)
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] r(0) ip(10.15.89.64)
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] Members Left:
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] Members Joined:
Feb 11 18:33:04 morph-03 openais[2355]: [SYNC ] This node is within the primary component and will provide service.
Feb 11 18:33:04 morph-03 openais[2355]: [TOTEM] entering OPERATIONAL state.
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] got nodejoin message 10.15.89.61
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] got nodejoin message 10.15.89.63
Feb 11 18:33:04 morph-03 openais[2355]: [CLM ] got nodejoin message 10.15.89.64
Feb 11 18:33:05 morph-03 fenced[2371]: cluster is down, exiting
Feb 11 18:33:05 morph-03 groupd[2363]: cpg_mcast_joined error 2 handle 6b8b456700000000
Feb 11 18:33:05 morph-03 groupd[2363]: cpg_mcast_joined error 2 handle 6b8b456700000000
Feb 11 18:33:05 morph-03 gfs_controld[2383]: groupd_dispatch error -1 errno 11
Feb 11 18:33:05 morph-03 gfs_controld[2383]: groupd connection died
Feb 11 18:33:05 morph-03 gfs_controld[2383]: cluster is down, exiting
Feb 11 18:33:05 morph-03 dlm_controld[2377]: cluster is down, exiting

Expected results:
Recovery should work with plock_ownership enabled.

Additional info:
Created attachment 294679 [details] aisexec core file from morph-01, compressed
Created attachment 294680 [details] aisexec core file from morph-03, compressed
Created attachment 294681 [details] aisexec core file from morph-04, compressed
backtrace:

#0  0x0050a402 in __kernel_vsyscall ()
#1  0x00be5c50 in raise () from /lib/libc.so.6
#2  0x00be7561 in abort () from /lib/libc.so.6
#3  0x00bdf26b in __assert_fail () from /lib/libc.so.6
#4  0x003d8683 in message_handler_req_exec_ckpt_sync_checkpoint_refcount (message=0xbfdfa4b0, nodeid=1) at ckpt.c:3844
#5  0x0806114b in deliver_fn (nodeid=1, iovec=0xbfdfd788, iov_len=1, endian_conversion_required=0) at main.c:383
#6  0x0805d892 in app_deliver_fn (nodeid=1, iovec=0x807bce8, iov_len=1, endian_conversion_required=0) at totempg.c:432
#7  0x0805dbdb in totempg_deliver_fn (nodeid=1, iovec=0x89e30c0, iov_len=3, endian_conversion_required=0) at totempg.c:591
#8  0x0805c275 in totemmrp_deliver_fn (nodeid=1, iovec=0x89e30c0, iov_len=3, endian_conversion_required=0) at totemmrp.c:82
#9  0x0805818a in messages_deliver_to_app (instance=0xb7531008, skip=0, end_point=<value optimized out>) at totemsrp.c:3516
#10 0x0805a820 in message_handler_orf_token (instance=0xb7531008, msg=0x89e6420, msg_len=70, endian_conversion_needed=0) at totemsrp.c:3388
#11 0x08053562 in main_deliver_fn (context=0xb7531008, msg=0x89e6420, msg_len=2499) at totemsrp.c:4099
#12 0x08050d00 in none_token_recv (rrp_instance=0x89d16d0, iface_no=0, context=0xb7531008, msg=0x89e6420, msg_len=70, token_seq=164) at totemrrp.c:506
#13 0x08050dd6 in rrp_deliver_fn (context=0x89d0d08, msg=0x89e6420, msg_len=70) at totemrrp.c:1308
#14 0x0804f04b in net_deliver_fn (handle=0, fd=8, revents=1, data=0x89e5dd0) at totemnet.c:676
#15 0x0804c5fe in poll_run (handle=0) at aispoll.c:382
#16 0x080617d5 in main () at main.c:587
reassigning bug to sdake
Nate to try openais 0.80.3-7 to determine whether this is a regression or a preexisting issue. The test case that generates this issue is new (implemented last week), so it could be an existing bug.
I was able to reproduce the backtrace from comment #4 with openais-0.80.3-7.el5.
This bug is not a regression but a preexisting problem with checkpoint synchronization under high load that has probably always existed in openais. I'm not sure what the exact issue is yet. Current status: setting up load on the smoke cluster to duplicate this in a debuggable environment.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 297458 [details]
aisexec core file from tank-04, compressed

I was able to recreate the bug with openais-0.80.3-13.el5.

(gdb) bt
#0  0x0076f402 in __kernel_vsyscall ()
#1  0x00a1bc50 in raise () from /lib/libc.so.6
#2  0x00a1d561 in abort () from /lib/libc.so.6
#3  0x00a1526b in __assert_fail () from /lib/libc.so.6
#4  0x00922683 in message_handler_req_exec_ckpt_sync_checkpoint_refcount (message=0xbfcbe030, nodeid=1) at ckpt.c:3836
#5  0x0806118b in deliver_fn (nodeid=1, iovec=0xbfcc1308, iov_len=1, endian_conversion_required=0) at main.c:381
#6  0x0805d8f2 in app_deliver_fn (nodeid=1, iovec=0x807b188, iov_len=1, endian_conversion_required=0) at totempg.c:432
#7  0x0805dc2b in totempg_deliver_fn (nodeid=1, iovec=0x990c99c, iov_len=1, endian_conversion_required=0) at totempg.c:591
#8  0x0805c2f5 in totemmrp_deliver_fn (nodeid=1, iovec=0x990c99c, iov_len=1, endian_conversion_required=0) at totemmrp.c:82
#9  0x080581d1 in messages_deliver_to_app (instance=0xb74ec008, skip=0, end_point=<value optimized out>) at totemsrp.c:3525
#10 0x08058584 in message_handler_mcast (instance=0xb74ec008, msg=0x9912460, msg_len=927, endian_conversion_needed=0) at totemsrp.c:3656
#11 0x08053602 in main_deliver_fn (context=0xb74ec008, msg=0x9912460, msg_len=2274) at totemsrp.c:4099
#12 0x08050d80 in none_mcast_recv (rrp_instance=0x98fd6d0, iface_no=0, context=0xb74ec008, msg=0x9912460, msg_len=927) at totemrrp.c:476
#13 0x08050eb4 in rrp_deliver_fn (context=0x98fcd08, msg=0x9912460, msg_len=927) at totemrrp.c:1319
#14 0x0804f11b in net_deliver_fn (handle=0, fd=6, revents=1, data=0x9911e10) at totemnet.c:676
#15 0x0804c64e in poll_run (handle=0) at aispoll.c:382
#16 0x080618c8 in main (argc=Cannot access memory at address 0x8e2) at main.c:603
fixed in openais-0.80.3-14
*** Bug 432877 has been marked as a duplicate of this bug. ***
While running -14 I was able to hit the same symptom: one node gets shot and aisexec on all other nodes core dumps. Here is the backtrace from one of them; they all look similar. I'll attach core dumps shortly.

(gdb) bt
#0  0x00b6a402 in __kernel_vsyscall ()
#1  0x00138c50 in raise () from /lib/libc.so.6
#2  0x0013a561 in abort () from /lib/libc.so.6
#3  0x0013226b in __assert_fail () from /lib/libc.so.6
#4  0x00d043a2 in message_handler_req_exec_ckpt_sync_checkpoint_section (message=0xbfc1c8f0, nodeid=2) at ckpt.c:3722
#5  0x0806119b in deliver_fn (nodeid=2, iovec=0xbfc1cc18, iov_len=1, endian_conversion_required=0) at main.c:381
#6  0x0805d8e2 in app_deliver_fn (nodeid=2, iovec=0x807b188, iov_len=1, endian_conversion_required=0) at totempg.c:432
#7  0x0805dc2b in totempg_deliver_fn (nodeid=2, iovec=0x94817d0, iov_len=3, endian_conversion_required=0) at totempg.c:591
#8  0x0805c2c5 in totemmrp_deliver_fn (nodeid=2, iovec=0x94817d0, iov_len=3, endian_conversion_required=0) at totemmrp.c:82
#9  0x080581da in messages_deliver_to_app (instance=0xb74a1008, skip=0, end_point=<value optimized out>) at totemsrp.c:3516
#10 0x0805a870 in message_handler_orf_token (instance=0xb74a1008, msg=0x9487418, msg_len=70, endian_conversion_needed=0) at totemsrp.c:3388
#11 0x080535b2 in main_deliver_fn (context=0xb74a1008, msg=0x9487418, msg_len=2356) at totemsrp.c:4099
#12 0x08050d50 in none_token_recv (rrp_instance=0x94726d0, iface_no=0, context=0xb74a1008, msg=0x9487418, msg_len=70, token_seq=77) at totemrrp.c:506
#13 0x08050e26 in rrp_deliver_fn (context=0x9471d08, msg=0x9487418, msg_len=70) at totemrrp.c:1308
#14 0x0804f09b in net_deliver_fn (handle=0, fd=8, revents=1, data=0x9486dc8) at totemnet.c:676
#15 0x0804c64e in poll_run (handle=0) at aispoll.c:382
#16 0x080618d8 in main (argc=Cannot access memory at address 0x934) at main.c:603

Reproduced on morph-0{1,2,3,4}.
Created attachment 298423 [details] aisexec core file from morph-02, compressed
Created attachment 298424 [details] aisexec core file from morph-03, compressed
Created attachment 298425 [details] aisexec core file from morph-04, compressed
Found a new way for aisexec to fail before this got to ON_QA.
Update:

I have been investigating why the cluster seems to lock up with revolver, and also with my test case, which doesn't take several hours to run. The other bugs fixed by the patches applied against this bugzilla have, I think, resolved the segfault: I can no longer duplicate it with the test program I wrote for that purpose, and revolver doesn't duplicate it either. However, the machines now eventually OOM when running revolver, and the OOM killer goes wild, basically causing revolver to fail. The machines also OOMed when I was running my test program, but I thought that was just because I created too many checkpoints.

This OOM issue is hopefully relatively simple to solve, and then we may be good to go on this bug. My belief is that somewhere in the checkpoint recovery algorithm, checkpoint data should be freed but is not being freed. This is relatively easy to find with tools such as valgrind, and I'm about to get on that right now.

Thanks
-steve
Many issues relating to checkpoint synchronization have been fixed under this bugzilla id, but the original issue still occurs infrequently. Since this isn't a regression (it is triggered by a new load originated by QE), engineering, QE, and PM have decided to clone this bug into a 5.3 bugzilla and let the fixes already made on this bug id go out in the errata. The memory leak issue discussed above was fixed by reloading the OS with a later version of a RHEL build.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0411.html