Bug 251082

Summary: Openais: Assertion in checkpoint synchronization with revolver
Product: Red Hat Enterprise Linux 5
Reporter: Robert Peterson <rpeterso>
Component: openais
Assignee: Steven Dake <sdake>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Version: 5.0
CC: cluster-maint, swhiteho
Hardware: All
OS: Linux
Fixed In Version: RHBA-2007-0599
Doc Type: Bug Fix
Last Closed: 2007-11-07 17:00:13 UTC
Attachments:
  Syslog from roth-01 starting with the most recent reboot (flags: none)
  Syslog from roth-02 starting with the most recent reboot (flags: none)
  Syslog from roth-03 starting with the most recent reboot (flags: none)

Description Robert Peterson 2007-08-06 23:22:00 UTC
Description of problem:
aisexec hits an assertion failure in checkpoint synchronization (ckpt.c:3727)
during revolver recovery testing on a gfs2 file system, bringing the whole
cluster down.

Version-Release number of selected component (if applicable):
RHEL 5.1 beta

How reproducible:
Rare

Steps to Reproduce:
1. Run the revolver test with a gfs2 file system for eight hours.
  
Actual results:
The cluster comes crashing to the ground.  Core files are left behind by aisexec (openais).

Expected results:
Revolver should keep running.

Additional info:
I'll attach backtraces from the cores and syslogs.

Comment 1 Robert Peterson 2007-08-06 23:25:06 UTC
Created attachment 160782 [details]
Syslog from roth-01 starting with the most recent reboot

Comment 2 Robert Peterson 2007-08-06 23:25:50 UTC
Created attachment 160783 [details]
Syslog from roth-02 starting with the most recent reboot

Comment 3 Robert Peterson 2007-08-06 23:26:15 UTC
Created attachment 160784 [details]
Syslog from roth-03 starting with the most recent reboot

Comment 4 Robert Peterson 2007-08-06 23:32:07 UTC
This is the output of gdb's "where" command on the core file produced
on roth-02.

Core was generated by `aisexec'.
Program terminated with signal 6, Aborted.
#0  0x0000003f34430045 in raise () from /lib64/libc.so.6
(gdb) where
#0  0x0000003f34430045 in raise () from /lib64/libc.so.6
#1  0x0000003f34431ae0 in abort () from /lib64/libc.so.6
#2  0x0000003f34429756 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaabd09b37 in message_handler_req_exec_ckpt_sync_checkpoint_section (
    message=<value optimized out>, nodeid=<value optimized out>) at ckpt.c:3727
#4  0x00000000004144a5 in app_deliver_fn (nodeid=2, iovec=<value optimized out>,
    iov_len=1, endian_conversion_required=0) at totempg.c:432
#5  0x00000000004147b8 in totempg_deliver_fn (nodeid=2, iovec=<value optimized out>,
    iov_len=<value optimized out>, endian_conversion_required=0) at totempg.c:591
#6  0x000000000040e254 in messages_deliver_to_app (instance=0x2aaaac54d010, skip=0,
    end_point=<value optimized out>) at totemsrp.c:3471
#7  0x0000000000411a40 in message_handler_orf_token (instance=0x2aaaac54d010,
    msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3348
#8  0x0000000000409853 in rrp_deliver_fn (context=0x1754bf20, msg=0x1756bc68,
    msg_len=70) at totemrrp.c:1308
#9  0x0000000000407cb6 in net_deliver_fn (handle=<value optimized out>,
    fd=<value optimized out>, revents=<value optimized out>, data=0x1756b530)
    at totemnet.c:676
#10 0x0000000000405822 in poll_run (handle=0) at aispoll.c:382
#11 0x0000000000418423 in main (argc=<value optimized out>,
    argv=<value optimized out>) at main.c:586


Comment 5 Robert Peterson 2007-08-06 23:33:23 UTC
This is the output from "where" in gdb from roth-03's core file:

(gdb) where
#0  0x0000003650c30045 in raise () from /lib64/libc.so.6
#1  0x0000003650c31ae0 in abort () from /lib64/libc.so.6
#2  0x0000003650c29756 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaabd09b37 in message_handler_req_exec_ckpt_sync_checkpoint_section (
    message=<value optimized out>, nodeid=<value optimized out>) at ckpt.c:3727
#4  0x00000000004144a5 in app_deliver_fn (nodeid=2, iovec=<value optimized out>,
    iov_len=1, endian_conversion_required=0) at totempg.c:432
#5  0x00000000004147b8 in totempg_deliver_fn (nodeid=2, iovec=<value optimized out>,
    iov_len=<value optimized out>, endian_conversion_required=0) at totempg.c:591
#6  0x000000000040e19f in messages_deliver_to_app (instance=0x2aaaac54d010, skip=0,
    end_point=<value optimized out>) at totemsrp.c:3480
#7  0x000000000040eda7 in message_handler_mcast (instance=0x2aaaac54d010, msg=0x30,
    msg_len=1436, endian_conversion_needed=<value optimized out>) at totemsrp.c:3617
#8  0x000000000040987e in rrp_deliver_fn (context=0x2aaab0018430, msg=0x2aaab0018ba8,
    msg_len=1436) at totemrrp.c:1319
#9  0x0000000000407cb6 in net_deliver_fn (handle=<value optimized out>,
    fd=<value optimized out>, revents=<value optimized out>, data=0x2aaab0018470)
    at totemnet.c:676
#10 0x0000000000405822 in poll_run (handle=0) at aispoll.c:382
#11 0x0000000000418423 in main (argc=<value optimized out>,
    argv=<value optimized out>) at main.c:586


Comment 6 Robert Peterson 2007-08-06 23:39:54 UTC
Basically, this is a rare recovery situation.  The revolver test started
up a whole bunch of IO on a gfs2 file system.  Then it chose to shoot
node roth-01, leaving nodes roth-02 and roth-03 to pick up the pieces.
Instead of recovering the state of roth-01 when it came back, aisexec
dumped core on both of the recovering nodes and the cluster went down
(though the machines themselves stayed up).

This is a very rare problem: it occurred only after about eight hours of
recovery testing.  That corresponds to revolver iteration 36.2, which
tested 108 node crashes in different combinations.


Comment 7 Robert Peterson 2007-08-07 17:35:02 UTC
I recreated this again today after only 4.1 iterations of revolver
on roth-02:

(gdb) where
#0  0x0000003f34430045 in raise () from /lib64/libc.so.6
#1  0x0000003f34431ae0 in abort () from /lib64/libc.so.6
#2  0x0000003f34429756 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaabd09b37 in message_handler_req_exec_ckpt_sync_checkpoint_section
    (message=<value optimized out>, nodeid=<value optimized out>)
    at ckpt.c:3727
#4  0x00000000004144a5 in app_deliver_fn (nodeid=1, 
    iovec=<value optimized out>, iov_len=1, endian_conversion_required=0)
    at totempg.c:432
#5  0x00000000004147b8 in totempg_deliver_fn (nodeid=1, 
    iovec=<value optimized out>, iov_len=<value optimized out>, 
    endian_conversion_required=0) at totempg.c:591
#6  0x000000000040e19f in messages_deliver_to_app (instance=0x2aaaac54d010, 
    skip=0, end_point=<value optimized out>) at totemsrp.c:3480
#7  0x000000000040eda7 in message_handler_mcast (instance=0x2aaaac54d010, 
    msg=0xf, msg_len=1436, endian_conversion_needed=<value optimized out>)
    at totemsrp.c:3617
#8  0x000000000040987e in rrp_deliver_fn (context=0x2aaab0006a40, 
    msg=0x2aaab00071b8, msg_len=1436) at totemrrp.c:1319
#9  0x0000000000407cb6 in net_deliver_fn (handle=<value optimized out>, 
    fd=<value optimized out>, revents=<value optimized out>, 
    data=0x2aaab0006a80) at totemnet.c:676
#10 0x0000000000405822 in poll_run (handle=0) at aispoll.c:382


Comment 8 Robert Peterson 2007-08-08 14:03:28 UTC
I've had about four more occurrences of this.  I left the core files
on the respective machines for you to look at.

I'm guessing this has something to do with the amount of network traffic.
Overnight, revolver ran successfully for 48 iterations while there was
little network traffic other than my cluster's.

It hit this problem in the morning, soon after I started a large scp
from a machine in the Minneapolis branch office to my computer across
the Internet, which increased the network traffic.


Comment 9 Robert Peterson 2007-08-09 21:57:14 UTC
I had revolver crash again at 2:00 in the morning.  This time it had
different symptoms, and aisexec was still running, so I asked Dave
Teigland to take a look.  According to Dave:

<dct> bob: openais/cpg seems to be the source of the problem with the roth nodes
<dct> the problem seems to be in syncing cpg state when nodes join

The scenario was this:
roth-01 and -03 were shot, and -02 was left to pick up the pieces.
Now roth-02 claims:
fence            0     default  00010002 JOIN_STOP_WAIT 1 100020001 1
[1 2]

But roth-01 and -03 both claim:
fence            0     default  00010003 JOIN_START_WAIT 2 200030001 1
[1 2 3]

which seems wrong to me.  I'll keep the cluster in this state so you
can look at it with gdb, since it's still running.  However, I need
my cluster back soon in order to work on other problems.


Comment 10 Steven Dake 2007-08-14 00:32:30 UTC
Suggested as a blocker since it results in total cluster failure.

I have been unable to reproduce this using mp5 and a simulated test environment.
I am still working on a good test case to make debugging and resolving the problem
simpler than running the full test suite with the entire gfs stack.

Comment 11 Steven Dake 2007-08-14 00:33:38 UTC
My current theory is that a checkpoint is unlinked during recovery and then
expired as a result of some bug in the checkpoint recovery code.
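
To help with the stand-alone test case mentioned in comment #10, something along
these lines might exercise the suspected path: repeatedly create, write and unlink
a short-retention checkpoint while another node is rebooted, so that checkpoint
synchronization on the rejoining node races against unlink/expiry. This is only a
rough sketch against the SA Forum Checkpoint API shipped with openais, not the
actual test we will use; the checkpoint name, sizes, retention time and delay are
arbitrary, and the header/library paths may differ on your install.

/* ckpt-churn.c - rough sketch only: churn checkpoint create/write/unlink so that
 * recovery/sync on a rejoining node overlaps with unlink and retention expiry.
 * Build (paths may differ): gcc ckpt-churn.c -o ckpt-churn -lSaCkpt
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <saAis.h>
#include <saCkpt.h>

int main(void)
{
        SaCkptHandleT handle;
        SaVersionT version = { 'B', 1, 1 };
        SaNameT name;
        SaCkptCheckpointCreationAttributesT attrs;
        SaCkptCheckpointHandleT ckpt;
        SaCkptSectionIdT section_id = { 4, (SaUint8T *)"sect" };
        SaCkptSectionCreationAttributesT section_attrs = { &section_id, SA_TIME_END };
        char data[128];
        SaAisErrorT res;
        int i;

        memset(data, 'x', sizeof(data));
        strcpy((char *)name.value, "churn_ckpt");
        name.length = strlen("churn_ckpt");

        attrs.creationFlags = SA_CKPT_WR_ALL_REPLICAS;
        attrs.checkpointSize = sizeof(data);
        attrs.retentionDuration = 5000000000ULL;   /* 5 seconds, arbitrary */
        attrs.maxSections = 1;
        attrs.maxSectionSize = sizeof(data);
        attrs.maxSectionIdSize = 16;

        res = saCkptInitialize(&handle, NULL, &version);
        if (res != SA_AIS_OK) {
                fprintf(stderr, "saCkptInitialize failed: %d\n", res);
                return 1;
        }

        for (i = 0; ; i++) {
                res = saCkptCheckpointOpen(handle, &name, &attrs,
                        SA_CKPT_CHECKPOINT_CREATE | SA_CKPT_CHECKPOINT_WRITE,
                        SA_TIME_END, &ckpt);
                if (res != SA_AIS_OK) {
                        fprintf(stderr, "open failed: %d (iteration %d)\n", res, i);
                        break;
                }

                /* Ignore "already exists" if the section survived a previous pass. */
                saCkptSectionCreate(ckpt, &section_attrs, data, sizeof(data));

                /* Unlink while the checkpoint is still open; retention then
                 * decides when the replicas actually disappear. */
                saCkptCheckpointUnlink(handle, &name);
                saCkptCheckpointClose(ckpt);
                usleep(10000);
        }

        saCkptFinalize(handle);
        return 0;
}

Run it on one or two nodes while revolver (or a manual reboot) takes another node
down and brings it back; if the theory is right, sync on the rejoining node should
occasionally see a section for a checkpoint that has already been unlinked or
expired.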

Comment 12 Steven Dake 2007-08-14 00:35:25 UTC
Comment #9 is invalid; it is related to a separate issue, bz 251966.

Comment 13 Corey Marthaler 2007-08-23 16:26:40 UTC
FYI - QA is hitting this bug now too.

Comment 14 Corey Marthaler 2007-08-23 16:32:55 UTC
Here is the iteration I hit this at:

================================================================================
Senario iteration 2.1 started at Wed Aug 22 17:25:00 CDT 2007
Sleeping 1 minute(s) to let the I/O get its lock count up...
Senario: DLM kill one node

Those picked to face the revolver... taft-02


Comment 17 errata-xmlrpc 2007-11-07 17:00:13 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0599.html