Bug 474277 - aisexec core dumps under load
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.2
Platform: All Linux
Severity: urgent
Target Milestone: rc
Assigned To: Steven Dake
QA Contact: Cluster QE
Keywords: ZStream
Duplicates: 485754
Blocks: 474173, 486108
Reported: 2008-12-02 21:56 EST by Alan Conway
Modified: 2016-04-26 11:02 EDT

Fixed In Version: openais-0.80.5-2
Doc Type: Bug Fix
Last Closed: 2009-09-02 07:29:14 EDT


Attachments:
core file (13.02 MB, application/octet-stream) - 2008-12-02 21:56 EST, Alan Conway
Description Alan Conway 2008-12-02 21:56:12 EST
Created attachment 325469 [details]
core file

Description of problem:

Tests with an MRG qpid cluster are seeing corrupt messages accompanied by an
aisexec core dump. The situation is fairly repeatable but the test setup is complicated. Occurs with the RHEL 5.2 openais version and with 0.80.3-19.

How reproducible:

Fairly easy.

Steps to Reproduce:
1. Start a 4 node qpidd cluster
2. Run tsxtest --messages 10000 --rate 0
  
Additional info:

I'll try to extract a simpler test.
Comment 2 mick 2008-12-03 14:36:31 EST
Just tested on 3-22 and saw aisexec crash again (after four or five tries).
Did not get a core file this time...
Comment 3 Nate Straz 2008-12-03 14:55:29 EST
Core files from openais usually land in /var/lib/openais.
Comment 4 Lon Hohberger 2008-12-03 14:57:36 EST
Need a core file before doing anything.
Comment 7 Lon Hohberger 2008-12-03 15:04:15 EST
So...

* 4 node cluster.
* Doing stress tests of qpid

Results:

* aisexec dies on one node after delivering a corrupted message
* qpid daemons die on all nodes from receiving the corrupted message
* aisexec is still running on the other 3 nodes.
Comment 9 Steven Dake 2008-12-03 17:23:37 EST
Ok here is my analysis of the three separate problems.

aisexec segfaults in message free operation.  Alan and I were not able to reproduce this issue after 10+ runs.  Alan is using the cluster at the moment but tonight I will continue to try to reproduce the segfault.  This appears rare.

qpidd fails 100% of the time when running the test case. The cause is a corrupted message. What isn't known is whether aisexec delivers a corrupted message or qpidd formulates one.

I have run tests in the past which generate a random data area and sha1-hash it. Once a message is delivered, it is verified for accuracy. I ran this test case for 2+ weeks on a 3-node cluster with zero failures, including node reconfigs etc.
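The random-payload/SHA-1 check described above can be sketched like this (a hypothetical stand-alone Python sketch, not openais test code; the function names and payload size are mine):

```python
import hashlib
import os

DIGEST_LEN = 20  # SHA-1 produces a 20-byte digest

def make_message(payload_len=256):
    """Random payload followed by its SHA-1 digest."""
    payload = os.urandom(payload_len)
    return payload + hashlib.sha1(payload).digest()

def verify_message(msg):
    """Recompute the digest on delivery; False means corruption in transit."""
    payload, digest = msg[:-DIGEST_LEN], msg[-DIGEST_LEN:]
    return hashlib.sha1(payload).digest() == digest

msg = make_message()
assert verify_message(msg)

# A single flipped byte is caught:
corrupt = bytes([msg[0] ^ 0xFF]) + msg[1:]
assert not verify_message(corrupt)
```

This kind of end-to-end check distinguishes "aisexec delivered corrupt bytes" from "the application formed a corrupt message", which is exactly the open question above.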

Finally, there appears to be a flow control issue, which we encountered once, where flow control isn't being propagated up to the library properly.
Comment 11 Steven Dake 2008-12-03 17:26:00 EST
Please note also that the qpidd message corruption and the core dumps appear to be unrelated.

Alan is going to generate an input validator to make certain the messages sent are validly formed.
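Such a validator might look like the following sketch (hypothetical: the real qpid frame layout isn't shown in this report, so a simple big-endian length-prefixed framing and size cap are assumed):

```python
import struct

MAX_FRAME = 64 * 1024  # assumed upper bound on a sane frame

def validate_frame(buf):
    """Return True if buf is a single well-formed length-prefixed frame.

    Layout assumed: 4-byte big-endian total length (header included),
    followed by the payload.
    """
    if len(buf) < 4:
        return False  # too short to even hold the length header
    (length,) = struct.unpack_from(">I", buf, 0)
    if length < 4 or length > MAX_FRAME:
        return False  # nonsense or oversized length field
    return len(buf) == length  # header must agree with actual size

good = struct.pack(">I", 11) + b"payload"   # 4-byte header + 7 bytes = 11
assert validate_frame(good)
assert not validate_frame(struct.pack(">I", 10) + b"payload")  # header lies
assert not validate_frame(b"\x00")                             # truncated
```

Running every outbound message through a check like this before handing it to the cluster layer would show whether the sender is already emitting malformed frames.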
Comment 12 Alan Conway 2008-12-04 12:18:45 EST
I can reliably reproduce the bug on Fedora 9 using openais-0.80.3-22.el5.i386 built from the brewroot SRPM as follows:

1. start aisexec on a single node.
2. start qpidd on the same node in cluster mode.
3. run the test in a loop: while ./perftest --count 5000 --nsubs=4 --npubs=4 -s ; do true; done
4. while that's running, start openais on a second node.

qpidd on the first node crashes almost immediately. I've seen a couple of scenarios for what aisexec does:
1. it also crashes;
2. it keeps running, but crashes if you restart qpidd;
3. it keeps running and functioning.

Here's a glibc dump produced by one crash, I have some core files as well.

*** glibc detected *** aisexec: corrupted double-linked list: 0xb7428980 ***
======= Backtrace: =========
/lib/libc.so.6[0x17e874]
/lib/libc.so.6[0x180687]
/lib/libc.so.6(cfree+0x96)[0x1808d6]
aisexec[0x805a190]
aisexec(main_deliver_fn+0x92)[0x8053622]
aisexec[0x8050dd0]
aisexec(rrp_deliver_fn+0x56)[0x8050ea6]
aisexec[0x804f13b]
aisexec(poll_run+0x22e)[0x804c64e]
aisexec(main+0x568)[0x8061ca8]
/lib/libc.so.6(__libc_start_main+0xe6)[0x1275d6]
aisexec[0x804c361]
======= Memory map: ========
00110000-00111000 r-xp 00110000 00:00 0          [vdso]
00111000-00274000 r-xp 00000000 fd:00 2687097    /lib/libc-2.8.so
00274000-00276000 r--p 00163000 fd:00 2687097    /lib/libc-2.8.so
00276000-00277000 rw-p 00165000 fd:00 2687097    /lib/libc-2.8.so
00277000-0027a000 rw-p 00277000 00:00 0 
0027a000-0027c000 r-xp 00000000 fd:00 2656064    /usr/libexec/lcrso/objdb.lcrso
0027c000-0027d000 rw-p 00001000 fd:00 2656064    /usr/libexec/lcrso/objdb.lcrso
0027d000-0027e000 r-xp 00000000 fd:00 2656063    /usr/libexec/lcrso/aisparser.lcrso
0027e000-0027f000 rw-p 00001000 fd:00 2656063    /usr/libexec/lcrso/aisparser.lcrso
0027f000-00283000 r-xp 00000000 fd:00 2656071    /usr/libexec/lcrso/service_cpg.lcrso
00283000-00284000 rw-p 00003000 fd:00 2656071    /usr/libexec/lcrso/service_cpg.lcrso
00284000-00286000 r-xp 00000000 fd:00 2656066    /usr/libexec/lcrso/service_cfg.lcrso
00286000-00287000 rw-p 00001000 fd:00 2656066    /usr/libexec/lcrso/service_cfg.lcrso
00287000-0028b000 r-xp 00000000 fd:00 2656077    /usr/libexec/lcrso/service_msg.lcrso
0028b000-0028c000 rw-p 00004000 fd:00 2656077    /usr/libexec/lcrso/service_msg.lcrso
0028c000-00290000 r-xp 00000000 fd:00 2656075    /usr/libexec/lcrso/service_lck.lcrso
00290000-00291000 rw-p 00004000 fd:00 2656075    /usr/libexec/lcrso/service_lck.lcrso
00291000-0029d000 r-xp 00000000 fd:00 2656074    /usr/libexec/lcrso/service_evt.lcrso
0029d000-0029e000 rw-p 0000c000 fd:00 2656074    /usr/libexec/lcrso/service_evt.lcrso
0029e000-002a9000 r-xp 00000000 fd:00 2656068    /usr/libexec/lcrso/service_ckpt.lcrso
002a9000-002aa000 rw-p 0000b000 fd:00 2656068    /usr/libexec/lcrso/service_ckpt.lcrso
002aa000-002bd000 r-xp 00000000 fd:00 2656065    /usr/libexec/lcrso/service_amf.lcrso
002bd000-002be000 rw-p 00013000 fd:00 2656065    /usr/libexec/lcrso/service_amf.lcrso
002be000-002c1000 r-xp 00000000 fd:00 2656069    /usr/libexec/lcrso/service_clm.lcrso
002c1000-002c2000 rw-p 00003000 fd:00 2656069    /usr/libexec/lcrso/service_clm.lcrso
002c2000-002e5000 rw-p 002c2000 00:00 0 
002e5000-002e7000 r-xp 00000000 fd:00 2656072    /usr/libexec/lcrso/service_evs.lcrso
002e7000-002e8000 rw-p 00001000 fd:00 2656072    /usr/libexec/lcrso/service_evs.lcrso
002e8000-002e9000 rw-p 002e8000 00:00 0 
002e9000-002f3000 r-xp 00000000 fd:00 2687096    /lib/libnss_files-2.8.so
002f3000-002f4000 r--p 0000a000 fd:00 2687096    /lib/libnss_files-2.8.so
002f4000-002f5000 rw-p 0000b000 fd:00 2687096    /lib/libnss_files-2.8.so
00d84000-00da0000 r-xp 00000000 fd:00 2687095    /lib/ld-2.8.so
00da0000-00da1000 r--p 0001c000 fd:00 2687095    /lib/ld-2.8.so
00da1000-00da2000 rw-p 0001d000 fd:00 2687095    /lib/ld-2.8.so
00dcf000-00dd2000 r-xp 00000000 fd:00 2687173    /lib/libdl-2.8.so
00dd2000-00dd3000 r--p 00002000 fd:00 2687173    /lib/libdl-2.8.so
00dd3000-00dd4000 rw-p 00003000 fd:00 2687173    /lib/libdl-2.8.so
00dd6000-00deb000 r-xp 00000000 fd:00 2687210    /lib/libpthread-2.8.so
00deb000-00dec000 r--p 00014000 fd:00 2687210    /lib/libpthread-2.8.so
00dec000-00ded000 rw-p 00015000 fd:00 2687210    /lib/libpthread-2.8.so
00ded000-00def000 rw-p 00ded000 00:00 0 
049c5000-049d2000 r-xp 00000000 fd:00 2687164    /lib/libgcc_s-4.3.0-20080428.so.1
049d2000-049d3000 rw-p 0000c000 fd:00 2687164    /lib/libgcc_s-4.3.0-20080428.so.1
08048000-08073000 r-xp 00000000 fd:00 2548504    /usr/sbin/aisexec
08073000-08074000 rw-p 0002a000 fd:00 2548504    /usr/sbin/aisexec
08074000-08180000 rw-p 08074000 00:00 0 
09c4c000-09d88000 rw-p 09c4c000 00:00 0          [heap]
b70cc000-b70cd000 ---p b70cc000 00:00 0
Comment 13 Steven Dake 2008-12-04 12:47:36 EST
glibc traces like that are useless. We always need a backtrace from the core dump (a "where" in gdb). It may be that something in recovery isn't working properly in your test cluster.
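A minimal way to get that, assuming gdb is available and a core in /var/lib/openais (per comment 3; the core filename below is an assumption):

```shell
# Hypothetical sketch: pull per-thread backtraces out of an aisexec core.
CORE=/var/lib/openais/core.24508
BIN=/usr/sbin/aisexec
CMD="gdb $BIN $CORE -batch -ex 'thread apply all bt full'"
echo "$CMD"
# Only run it when both the binary and the core actually exist:
if [ -r "$CORE" ] && [ -x "$BIN" ]; then
    eval "$CMD"
fi
```

`thread apply all bt full` dumps every thread's stack with locals, which is the form the backtraces in comment 14 take.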

This is a totally different test case and likely a different bug, unless your earlier segfault test case was caused by nodes joining.

The original bug was a segfault "during normal operation, no config changes".

This one is a segfault during a config change.

The code handling each of these is totally different.

Regardless, can this be done on RHEL 5? I am happy to fix any segfault on your test cluster given an easy reproduction case.
Comment 14 Alan Conway 2008-12-04 13:22:25 EST
Here's a stack trace from the config change case, I'll generate one for the normal-operation case and attach that shortly.

    Thread 3 (process 24510):
    #0  0x049cc631 in ?? () from /lib/libgcc_s.so.1
    #1  0x049cca45 in ?? () from /lib/libgcc_s.so.1
    #2  0x049ccbaf in _Unwind_ForcedUnwind () from /lib/libgcc_s.so.1
    #3  0x00de4a76 in _Unwind_ForcedUnwind (exc=Could not find the frame base for
    "_Unwind_ForcedUnwind".
    ) at ../nptl/sysdeps/pthread/unwind-forcedunwind.c:100
    #4  0x00de24a1 in __pthread_unwind (buf=<value optimized out>) at unwind.c:130
    #5  0x00dda33b in sigcancel_handler (sig=<value optimized out>, si=<value
    optimized out>, ctx=Could not find the frame base for "sigcancel_handler".
    )
        at ../nptl/pthreadP.h:264
    #6  <signal handler called>
    #7  0x00110416 in __kernel_vsyscall ()
    #8  0x00ddfba5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread-2.8.so
    #9  0x0806104e in worker_thread (thread_data_in=0x9c52c54) at wthread.c:73
    #10 0x00ddc32f in start_thread (arg=<value optimized out>) at
    pthread_create.c:297
    #11 0x001ef20e in clone () from /lib/libc-2.8.so

    Thread 2 (process 24509):
    #0  0x00110416 in __kernel_vsyscall ()
    #1  0x001e4a57 in __poll (fds=<value optimized out>, nfds=<value optimized
    out>, 
        timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:87
    #2  0x0806799e in prioritized_timer_thread (data=0x0) at timer.c:125
    #3  0x00ddc32f in start_thread (arg=<value optimized out>) at
    pthread_create.c:297
    #4  0x001ef20e in clone () from /lib/libc-2.8.so

    Thread 1 (process 24508):
    #0  0x00110416 in __kernel_vsyscall ()
    #1  0x0013b660 in raise (sig=<value optimized out>) at
    ../nptl/sysdeps/unix/sysv/linux/raise.c:64
    #2  0x0013d028 in abort () at abort.c:88
    #3  0x0017864d in __libc_message (do_abort=<value optimized out>, fmt=<value
    optimized out>)
        at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
    #4  0x0017e874 in malloc_printerr (action=<value optimized out>, str=<value
    optimized out>, 
        ptr=<value optimized out>) at malloc.c:5949
    #5  0x00180687 in _int_free (av=<value optimized out>, mem=<value optimized
    out>) at malloc.c:4672
    #6  0x001808d6 in __libc_free (mem=<value optimized out>) at malloc.c:3625
    #7  0x0805a190 in message_handler_orf_token (instance=0xb756e008,
    msg=0xb7406fdc, msg_len=70, 
        endian_conversion_needed=0) at totemsrp.c:2185
    #8  0x08053622 in main_deliver_fn (context=0xb756e008, msg=0xb7406fdc,
    msg_len=24508) at totemsrp.c:4132
    #9  0x08050dd0 in none_token_recv (rrp_instance=0xb7406518, iface_no=0,
    context=0xb756e008, msg=0xb7406fdc, 
        msg_len=70, token_seq=1659) at totemrrp.c:506
    #10 0x08050ea6 in rrp_deliver_fn (context=0xb7406990, msg=0xb7406fdc,
    msg_len=70) at totemrrp.c:1308
    #11 0x0804f13b in net_deliver_fn (handle=0, fd=7, revents=1, data=0xb74069b0)
    at totemnet.c:676
    #12 0x0804c64e in poll_run (handle=0) at aispoll.c:382
    #13 0x08061ca8 in main (argc=Cannot access memory at address 0x5fbc
    ) at main.c:603
Comment 15 Steven Dake 2008-12-04 14:49:21 EST
That stack backtrace shows the same error condition that is occurring in the original bz. Perhaps you had config changes before and weren't aware of it.

So it looks to me like the issue is that, on occasion, aisexec segfaults when a config change happens.

Alan is currently setting up his software on my test cluster so I can have a look at fixing this issue since his mrg cluster is under test.
Comment 19 Alan Conway 2008-12-11 08:01:43 EST
Trying to do a distributed test on my cluster (previously I've just been hammering on one node), I get a new symptom. I start a 4-node cluster of nodes 20.0.10.7,8,9,10.

Mid-test I see the log output below on node 7 - it looks like node 9 has left the cluster. However, aisexec is still running on all nodes. I restarted qpidd on all nodes; the nodes appear to have an inconsistent view of the cluster state, based on their initial config change messages:

2008-dec-11 07:48:20 debug 20.0.10.7:6295(INIT) config change: 20.0.10.7:6295 20.0.10.9:6849 (left)
2008-dec-11 07:48:21 debug 20.0.10.8:8319(INIT) config change: 20.0.10.8:8319 20.0.10.9:6849 (left) 20.0.10.7:6295
9 failed to start at all
2008-dec-11 08:51:30 debug 20.0.10.10:2787(INIT) config change: 20.0.10.10:2787 20.0.10.9:6849 20.0.10.7:6295 20.0.10.8:8319 (joined)

So it does seem like there are bogus config changes being generated, which would explain why I'm seeing the crash even when I'm not causing config changes. 

Steve, if you want, I think I can set up a test that reproduces this after lunch today. Remind me the details to log into your boxes.

Here's the log output:

Dec 11  7:43:54.969889 [TOTEM] entering GATHER state from 12.
Dec 11  7:43:55.772481 [TOTEM] entering GATHER state from 11.
Dec 11  7:43:55.830470 [TOTEM] Creating commit token because I am the rep.
Dec 11  7:43:55.830496 [TOTEM] Saving state aru 8535 high seq received 8535
Dec 11  7:43:55.830534 [TOTEM] Storing new sequence id for ring d24
Dec 11  7:43:55.830624 [TOTEM] entering COMMIT state.
Dec 11  7:43:55.831963 [TOTEM] entering RECOVERY state.
Dec 11  7:43:55.832040 [TOTEM] position [0] member 20.0.10.7:
Dec 11  7:43:55.832062 [TOTEM] previous ring seq 3360 rep 20.0.10.7
Dec 11  7:43:55.832072 [TOTEM] aru 8535 high delivered 8535 received flag 1
Dec 11  7:43:55.832082 [TOTEM] position [1] member 20.0.10.8:
Dec 11  7:43:55.832139 [TOTEM] previous ring seq 3360 rep 20.0.10.7
Dec 11  7:43:55.832178 [TOTEM] aru 8535 high delivered 8535 received flag 1
Dec 11  7:43:55.832192 [TOTEM] position [2] member 20.0.10.10:
Dec 11  7:43:55.832202 [TOTEM] previous ring seq 3360 rep 20.0.10.7
Dec 11  7:43:55.832210 [TOTEM] aru 8535 high delivered 8535 received flag 1
Dec 11  7:43:55.832219 [TOTEM] Did not need to originate any messages in recovery.
Dec 11  7:43:55.832258 [TOTEM] Sending initial ORF token
Dec 11  7:43:55.833272 [CLM  ] CLM CONFIGURATION CHANGE
Dec 11  7:43:55.833304 [CLM  ] New Configuration:
Dec 11  7:43:55.833319 [CLM  ]  r(0) ip(20.0.10.7)
Dec 11  7:43:55.833329 [CLM  ]  r(0) ip(20.0.10.8)
Dec 11  7:43:55.833343 [CLM  ]  r(0) ip(20.0.10.10)
Dec 11  7:43:55.833382 [CLM  ] Members Left:
Dec 11  7:43:55.833393 [CLM  ]  r(0) ip(20.0.10.9)
Dec 11  7:43:55.833428 [CLM  ] Members Joined:
Dec 11  7:43:55.833460 [CLM  ] CLM CONFIGURATION CHANGE
Dec 11  7:43:55.833469 [CLM  ] New Configuration:
Dec 11  7:43:55.833479 [CLM  ]  r(0) ip(20.0.10.7)
Dec 11  7:43:55.833488 [CLM  ]  r(0) ip(20.0.10.8)
Dec 11  7:43:55.833497 [CLM  ]  r(0) ip(20.0.10.10)
Dec 11  7:43:55.833505 [CLM  ] Members Left:
Dec 11  7:43:55.833513 [CLM  ] Members Joined:
Dec 11  7:43:55.833525 [SYNC ] This node is within the primary component and will provide service.
Dec 11  7:43:55.833557 [TOTEM] entering OPERATIONAL state.
Dec 11  7:43:55.851479 [CLM  ] got nodejoin message 20.0.10.8
Dec 11  7:43:55.851560 [CLM  ] got nodejoin message 20.0.10.10
Dec 11  7:43:55.851617 [CLM  ] got nodejoin message 20.0.10.7
Dec 11  7:43:55.855316 [CPG  ] got joinlist message from node 118095892
Dec 11  7:43:55.855467 [CPG  ] got joinlist message from node 134873108
Dec 11  7:43:55.855559 [CPG  ] got joinlist message from node 168427540
Comment 22 Steven Dake 2008-12-17 14:08:09 EST
If you have a reproducible test case for comment #19, file a separate bug report and I will address it once I resolve the two segfaults I am currently working on.  Please do not put a ton of bug reports in one bugzilla; it makes it extremely difficult to track what is fixed and what is not.

Regards
-steve
Comment 23 Steven Dake 2009-01-20 09:38:25 EST
I believe the ipc rewrite has fixed comment #19 (pthread deadlock condition). If you can reproduce it with the latest patch, please file a separate bug report.
Comment 24 Alan Conway 2009-01-20 09:56:06 EST
Testing with the ipc rewrite, I have not seen any issues in 2 days, so it looks good.
Comment 26 Steven Dake 2009-02-17 18:32:03 EST
*** Bug 485754 has been marked as a duplicate of this bug. ***
Comment 27 Steven Dake 2009-02-18 00:30:20 EST
resolved in version 0.80.5-2
Comment 32 errata-xmlrpc 2009-09-02 07:29:14 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1366.html
