Bug 251082
Summary: | Openais: Assertion in checkpoint synchronization with revolver | |
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Robert Peterson <rpeterso>
Component: | openais | Assignee: | Steven Dake <sdake>
Status: | CLOSED ERRATA | QA Contact: |
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 5.0 | CC: | cluster-maint, swhiteho
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | RHBA-2007-0599 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2007-11-07 17:00:13 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Attachments: | | |
Description by Robert Peterson, 2007-08-06 23:22:00 UTC:
Created attachment 160782 [details]
Syslog from roth-01 starting with the most recent reboot
Created attachment 160783 [details]
Syslog from roth-02 starting with the most recent reboot
Created attachment 160784 [details]
Syslog from roth-03 starting with the most recent reboot
This is the output of the "where" command in gdb on the core file produced on roth-02. Core was generated by `aisexec'. Program terminated with signal 6, Aborted.

```
(gdb) where
#0  0x0000003f34430045 in raise () from /lib64/libc.so.6
#1  0x0000003f34431ae0 in abort () from /lib64/libc.so.6
#2  0x0000003f34429756 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaabd09b37 in message_handler_req_exec_ckpt_sync_checkpoint_section (
    message=<value optimized out>, nodeid=<value optimized out>) at ckpt.c:3727
#4  0x00000000004144a5 in app_deliver_fn (nodeid=2, iovec=<value optimized out>,
    iov_len=1, endian_conversion_required=0) at totempg.c:432
#5  0x00000000004147b8 in totempg_deliver_fn (nodeid=2, iovec=<value optimized out>,
    iov_len=<value optimized out>, endian_conversion_required=0) at totempg.c:591
#6  0x000000000040e254 in messages_deliver_to_app (instance=0x2aaaac54d010, skip=0,
    end_point=<value optimized out>) at totemsrp.c:3471
#7  0x0000000000411a40 in message_handler_orf_token (instance=0x2aaaac54d010,
    msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at totemsrp.c:3348
#8  0x0000000000409853 in rrp_deliver_fn (context=0x1754bf20, msg=0x1756bc68,
    msg_len=70) at totemrrp.c:1308
#9  0x0000000000407cb6 in net_deliver_fn (handle=<value optimized out>,
    fd=<value optimized out>, revents=<value optimized out>, data=0x1756b530)
    at totemnet.c:676
#10 0x0000000000405822 in poll_run (handle=0) at aispoll.c:382
#11 0x0000000000418423 in main (argc=<value optimized out>,
    argv=<value optimized out>) at main.c:586
```

This is the output from "where" in gdb on roth-03's core file:

```
(gdb) where
#0  0x0000003650c30045 in raise () from /lib64/libc.so.6
#1  0x0000003650c31ae0 in abort () from /lib64/libc.so.6
#2  0x0000003650c29756 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaabd09b37 in message_handler_req_exec_ckpt_sync_checkpoint_section (
    message=<value optimized out>, nodeid=<value optimized out>) at ckpt.c:3727
#4  0x00000000004144a5 in app_deliver_fn (nodeid=2, iovec=<value optimized out>,
    iov_len=1, endian_conversion_required=0) at totempg.c:432
#5  0x00000000004147b8 in totempg_deliver_fn (nodeid=2, iovec=<value optimized out>,
    iov_len=<value optimized out>, endian_conversion_required=0) at totempg.c:591
#6  0x000000000040e19f in messages_deliver_to_app (instance=0x2aaaac54d010, skip=0,
    end_point=<value optimized out>) at totemsrp.c:3480
#7  0x000000000040eda7 in message_handler_mcast (instance=0x2aaaac54d010, msg=0x30,
    msg_len=1436, endian_conversion_needed=<value optimized out>) at totemsrp.c:3617
#8  0x000000000040987e in rrp_deliver_fn (context=0x2aaab0018430, msg=0x2aaab0018ba8,
    msg_len=1436) at totemrrp.c:1319
#9  0x0000000000407cb6 in net_deliver_fn (handle=<value optimized out>,
    fd=<value optimized out>, revents=<value optimized out>, data=0x2aaab0018470)
    at totemnet.c:676
#10 0x0000000000405822 in poll_run (handle=0) at aispoll.c:382
#11 0x0000000000418423 in main (argc=<value optimized out>,
    argv=<value optimized out>) at main.c:586
```

Basically, this is a rare recovery situation. The revolver test started a large amount of I/O on a gfs2 file system, then chose to shoot node roth-01, leaving nodes roth-02 and roth-03 to pick up the pieces. Instead of recovering the state of roth-01 when it came back, aisexec dumped core on both of the recovering nodes and the cluster was down (though the machines remained up). This is a very rare problem: it occurred only after about eight hours of recovery testing. That corresponds to revolver iteration 36.2, which tested 108 node crashes in different combinations.
I recreated this again today after only 4.1 iterations of revolver, on roth-02:

```
(gdb) where
#0  0x0000003f34430045 in raise () from /lib64/libc.so.6
#1  0x0000003f34431ae0 in abort () from /lib64/libc.so.6
#2  0x0000003f34429756 in __assert_fail () from /lib64/libc.so.6
#3  0x00002aaaabd09b37 in message_handler_req_exec_ckpt_sync_checkpoint_section (
    message=<value optimized out>, nodeid=<value optimized out>) at ckpt.c:3727
#4  0x00000000004144a5 in app_deliver_fn (nodeid=1, iovec=<value optimized out>,
    iov_len=1, endian_conversion_required=0) at totempg.c:432
#5  0x00000000004147b8 in totempg_deliver_fn (nodeid=1, iovec=<value optimized out>,
    iov_len=<value optimized out>, endian_conversion_required=0) at totempg.c:591
#6  0x000000000040e19f in messages_deliver_to_app (instance=0x2aaaac54d010, skip=0,
    end_point=<value optimized out>) at totemsrp.c:3480
#7  0x000000000040eda7 in message_handler_mcast (instance=0x2aaaac54d010, msg=0xf,
    msg_len=1436, endian_conversion_needed=<value optimized out>) at totemsrp.c:3617
#8  0x000000000040987e in rrp_deliver_fn (context=0x2aaab0006a40, msg=0x2aaab00071b8,
    msg_len=1436) at totemrrp.c:1319
#9  0x0000000000407cb6 in net_deliver_fn (handle=<value optimized out>,
    fd=<value optimized out>, revents=<value optimized out>, data=0x2aaab0006a80)
    at totemnet.c:676
#10 0x0000000000405822 in poll_run (handle=0) at aispoll.c:382
```

I've had about four more occurrences of this. I left the core files on the respective machines for you to look at.

I'm guessing this has something to do with the amount of network traffic. Last night, revolver ran successfully for 48 iterations overnight, when there was little network traffic other than my cluster's. It hit this problem in the morning, soon after I started a large scp from a machine in the Minneapolis branch office to my computer across the Internet, thus increasing the network traffic.

I had revolver crash again at 2:00 in the morning.
This time it had different symptoms, and aisexec was still running, so I asked Dave Teigland to take a look. According to Dave:

```
<dct> bob: openais/cpg seems to be the source of the problem with the roth nodes
<dct> the problem seems to be in syncing cpg state when nodes join
```

The scenario was this: roth-01 and -03 were shot, and -02 was left to pick up the pieces. Now roth-02 claims:

```
fence 0 default 00010002 JOIN_STOP_WAIT 1 100020001 1 [1 2]
```

But roth-01 and -03 both claim:

```
fence 0 default 00010003 JOIN_START_WAIT 2 200030001 1 [1 2 3]
```

which seems wrong to me. I'll keep the cluster in this state so you can look at it with gdb, since it's still running. However, I need my cluster back soon in order to work on other problems.

Suggested blocker, since total cluster failure results.

Have been unable to reproduce this using mp5 and a simulated test environment. Still working on a good test case to make debugging and resolving the problem simpler than running the full test suite with the entire gfs stack. My current theory is that an unlink of a checkpoint occurs during recovery, and the checkpoint is expired as a result of some bug in the checkpoint recovery code.

Comment #9 is invalid and related to some other issue, bz 251966.

FYI - QA is hitting this bug now too. Here is the iteration I hit this at:

```
================================================================================
Senario iteration 2.1 started at Wed Aug 22 17:25:00 CDT 2007

Sleeping 1 minute(s) to let the I/O get its lock count up...
Senario: DLM kill one node
Those picked to face the revolver... taft-02
```

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0599.html