Bug 432531
| Summary: | aisexec dies during recovery when plock_ownership is enabled | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Nate Straz <nstraz> |
| Component: | openais | Assignee: | Steven Dake <sdake> |
| Status: | CLOSED ERRATA | QA Contact: | GFS Bugs <gfs-bugs> |
| Severity: | low | Docs Contact: | |
| Priority: | high | | |
| Version: | 5.2 | CC: | cluster-maint, edamato, jbrassow, kanderso, rkenna, sdake, teigland |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | RHBA-2008-0411 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2008-05-21 14:31:21 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 441413 | | |
| Attachments: | | | |
Description
Nate Straz
2008-02-12 17:16:39 UTC
Created attachment 294679 [details]
aisexec core file from morph-01, compressed
Created attachment 294680 [details]
aisexec core file from morph-03, compressed
Created attachment 294681 [details]
aisexec core file from morph-04, compressed
backtrace:
#0  0x0050a402 in __kernel_vsyscall ()
#1  0x00be5c50 in raise () from /lib/libc.so.6
#2  0x00be7561 in abort () from /lib/libc.so.6
#3  0x00bdf26b in __assert_fail () from /lib/libc.so.6
#4  0x003d8683 in message_handler_req_exec_ckpt_sync_checkpoint_refcount (
    message=0xbfdfa4b0, nodeid=1) at ckpt.c:3844
#5  0x0806114b in deliver_fn (nodeid=1, iovec=0xbfdfd788, iov_len=1,
    endian_conversion_required=0) at main.c:383
#6  0x0805d892 in app_deliver_fn (nodeid=1, iovec=0x807bce8, iov_len=1,
    endian_conversion_required=0) at totempg.c:432
#7  0x0805dbdb in totempg_deliver_fn (nodeid=1, iovec=0x89e30c0, iov_len=3,
    endian_conversion_required=0) at totempg.c:591
#8  0x0805c275 in totemmrp_deliver_fn (nodeid=1, iovec=0x89e30c0, iov_len=3,
    endian_conversion_required=0) at totemmrp.c:82
#9  0x0805818a in messages_deliver_to_app (instance=0xb7531008, skip=0,
    end_point=<value optimized out>) at totemsrp.c:3516
#10 0x0805a820 in message_handler_orf_token (instance=0xb7531008,
    msg=0x89e6420, msg_len=70, endian_conversion_needed=0) at totemsrp.c:3388
#11 0x08053562 in main_deliver_fn (context=0xb7531008, msg=0x89e6420,
    msg_len=2499) at totemsrp.c:4099
#12 0x08050d00 in none_token_recv (rrp_instance=0x89d16d0, iface_no=0,
    context=0xb7531008, msg=0x89e6420, msg_len=70, token_seq=164)
    at totemrrp.c:506
#13 0x08050dd6 in rrp_deliver_fn (context=0x89d0d08, msg=0x89e6420,
    msg_len=70) at totemrrp.c:1308
#14 0x0804f04b in net_deliver_fn (handle=0, fd=8, revents=1, data=0x89e5dd0)
    at totemnet.c:676
#15 0x0804c5fe in poll_run (handle=0) at aispoll.c:382
#16 0x080617d5 in main () at main.c:587

Reassigning bug to sdake.

Nate to try openais 0.80.3-7 to determine whether this is a regression or a preexisting issue. The test case that generates this issue is new (implemented last week), so it could be an existing bug.

I was able to reproduce the backtrace from comment #4 with openais-0.80.3-7.el5.
This bug is not a regression but a preexisting problem with checkpoint synchronization under high load that has probably always existed in openais. I am not sure what the issue is exactly. Current status: getting a load setup on the smoke cluster to duplicate this in a debuggable environment.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 297458 [details]
aisexec core file from tank-04, compressed
I was able to recreate the bug with openais-0.80.3-13.el5.
(gdb) bt
#0 0x0076f402 in __kernel_vsyscall ()
#1 0x00a1bc50 in raise () from /lib/libc.so.6
#2 0x00a1d561 in abort () from /lib/libc.so.6
#3 0x00a1526b in __assert_fail () from /lib/libc.so.6
#4 0x00922683 in message_handler_req_exec_ckpt_sync_checkpoint_refcount (
message=0xbfcbe030, nodeid=1) at ckpt.c:3836
#5 0x0806118b in deliver_fn (nodeid=1, iovec=0xbfcc1308, iov_len=1,
endian_conversion_required=0) at main.c:381
#6 0x0805d8f2 in app_deliver_fn (nodeid=1, iovec=0x807b188, iov_len=1,
endian_conversion_required=0) at totempg.c:432
#7 0x0805dc2b in totempg_deliver_fn (nodeid=1, iovec=0x990c99c, iov_len=1,
endian_conversion_required=0) at totempg.c:591
#8 0x0805c2f5 in totemmrp_deliver_fn (nodeid=1, iovec=0x990c99c, iov_len=1,
endian_conversion_required=0) at totemmrp.c:82
#9 0x080581d1 in messages_deliver_to_app (instance=0xb74ec008, skip=0,
end_point=<value optimized out>) at totemsrp.c:3525
#10 0x08058584 in message_handler_mcast (instance=0xb74ec008, msg=0x9912460,
msg_len=927, endian_conversion_needed=0) at totemsrp.c:3656
#11 0x08053602 in main_deliver_fn (context=0xb74ec008, msg=0x9912460,
msg_len=2274) at totemsrp.c:4099
#12 0x08050d80 in none_mcast_recv (rrp_instance=0x98fd6d0, iface_no=0,
context=0xb74ec008, msg=0x9912460, msg_len=927) at totemrrp.c:476
#13 0x08050eb4 in rrp_deliver_fn (context=0x98fcd08, msg=0x9912460,
msg_len=927) at totemrrp.c:1319
#14 0x0804f11b in net_deliver_fn (handle=0, fd=6, revents=1, data=0x9911e10)
at totemnet.c:676
#15 0x0804c64e in poll_run (handle=0) at aispoll.c:382
#16 0x080618c8 in main (argc=Cannot access memory at address 0x8e2
) at main.c:603
fixed in openais-0.80.3-14

*** Bug 432877 has been marked as a duplicate of this bug. ***

While running -14 I was able to hit the same symptom: one node gets shot and aisexec on all the other nodes core dumps. Here is the backtrace from one of them; they all look similar. I'll attach core dumps shortly.

(gdb) bt
#0  0x00b6a402 in __kernel_vsyscall ()
#1  0x00138c50 in raise () from /lib/libc.so.6
#2  0x0013a561 in abort () from /lib/libc.so.6
#3  0x0013226b in __assert_fail () from /lib/libc.so.6
#4  0x00d043a2 in message_handler_req_exec_ckpt_sync_checkpoint_section (
    message=0xbfc1c8f0, nodeid=2) at ckpt.c:3722
#5  0x0806119b in deliver_fn (nodeid=2, iovec=0xbfc1cc18, iov_len=1,
    endian_conversion_required=0) at main.c:381
#6  0x0805d8e2 in app_deliver_fn (nodeid=2, iovec=0x807b188, iov_len=1,
    endian_conversion_required=0) at totempg.c:432
#7  0x0805dc2b in totempg_deliver_fn (nodeid=2, iovec=0x94817d0, iov_len=3,
    endian_conversion_required=0) at totempg.c:591
#8  0x0805c2c5 in totemmrp_deliver_fn (nodeid=2, iovec=0x94817d0, iov_len=3,
    endian_conversion_required=0) at totemmrp.c:82
#9  0x080581da in messages_deliver_to_app (instance=0xb74a1008, skip=0,
    end_point=<value optimized out>) at totemsrp.c:3516
#10 0x0805a870 in message_handler_orf_token (instance=0xb74a1008,
    msg=0x9487418, msg_len=70, endian_conversion_needed=0) at totemsrp.c:3388
#11 0x080535b2 in main_deliver_fn (context=0xb74a1008, msg=0x9487418,
    msg_len=2356) at totemsrp.c:4099
#12 0x08050d50 in none_token_recv (rrp_instance=0x94726d0, iface_no=0,
    context=0xb74a1008, msg=0x9487418, msg_len=70, token_seq=77)
    at totemrrp.c:506
#13 0x08050e26 in rrp_deliver_fn (context=0x9471d08, msg=0x9487418,
    msg_len=70) at totemrrp.c:1308
#14 0x0804f09b in net_deliver_fn (handle=0, fd=8, revents=1, data=0x9486dc8)
    at totemnet.c:676
#15 0x0804c64e in poll_run (handle=0) at aispoll.c:382
#16 0x080618d8 in main (argc=Cannot access memory at address 0x934
    ) at main.c:603

Reproduced on morph-0{1,2,3,4}.
Created attachment 298423 [details]
aisexec core file from morph-02, compressed
Created attachment 298424 [details]
aisexec core file from morph-03, compressed
Created attachment 298425 [details]
aisexec core file from morph-04, compressed
Found a new way for aisexec to fail before this got to ON_QA.

Update: I have been investigating further why the cluster seems to lock up with revolver, and also with my test case that doesn't take several hours to run. I believe the other bugs I have fixed with the patches applied against this bugzilla have resolved the segfault, because I can no longer duplicate it with the test program I wrote for that purpose, and revolver doesn't duplicate the segfault either. However, the machines now eventually OOM when running revolver, and the OOM killer goes wild, basically causing revolver to fail. The machines OOMed when I was running my test program too, but I thought it was just because I created too many checkpoints. This OOM issue is hopefully relatively simple to solve, and then we may be good to go on this bug. Essentially my belief is that somewhere in the checkpoint recovery algorithm, checkpoint data should be freed but is not being freed. This is relatively easy to find with tools such as valgrind, and I'm about to get on that right now. Thanks -steve

Many issues relating to checkpoint synchronization have been fixed using this bugzilla id, but the original issue still occurs infrequently. Since this isn't a regression (it is a new load originated by QE), engineering, QE, and PM have decided to clone this bug into a 5.3 bugzilla and let the fixes that have been made on this bug id go out in the errata.

The memory leak issue discussed above was fixed by reloading the OS with a later version of a RHEL build.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0411.html