Bug 490856 - openais-0.80.3-22.el5_3.3 exits with assertion `sq->items_inuse[sq_position] == 0' failed
Status: CLOSED DUPLICATE of bug 598689
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.3
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Assignee: Jan Friesse
QA Contact: Cluster QE
Duplicates: 543364 548798
 
Reported: 2009-03-18 10:51 UTC by Frantisek Reznicek
Modified: 2016-04-26 15:11 UTC (History)
CC List: 7 users

Doc Type: Bug Fix
Last Closed: 2011-05-03 14:59:08 UTC



Description Frantisek Reznicek 2009-03-18 10:51:22 UTC
Description of problem:
This BZ is the openais counterpart of bug 490855.

When running the MRG clustering test attached to bug 490855, openais's aisexec unexpectedly exited twice with the assertion:
aisexec: ../include/sq.h:171: sq_item_add: Assertion `sq->items_inuse[sq_position] == 0' failed

One occurrence was on RHTS RHEL 5.3 i386 and the other on x86_64.

No aisexec core files were found, but the MRG broker application most probably crashes as a consequence of the aisexec assertion. See the qpidd details in bug 490855.
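
For context, sq.h implements the sorted queue that the totem protocol code uses to hold messages indexed by sequence number. Below is a minimal, self-contained sketch of that bookkeeping (the struct layout, the field names other than items_inuse, and the modulo arithmetic are assumptions for illustration, not the actual openais code): each sequence number maps to a circular slot, and adding an item whose slot is already marked in use trips exactly this kind of assertion and aborts the process.

/* Simplified illustration of a sorted-queue slot check (assumed layout,
   not the actual openais sq.h). sq_item_add() asserts the target slot is
   free before storing the item; a duplicate sequence number trips it. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct sq {
    unsigned int head_seqid;    /* sequence number of the queue head */
    unsigned int size;          /* number of slots */
    unsigned int item_size;     /* bytes per item */
    void *items;                /* size * item_size bytes of storage */
    unsigned int *items_inuse;  /* 1 if the slot holds a live item */
};

static void sq_init(struct sq *sq, unsigned int size, unsigned int item_size,
                    unsigned int head_seqid)
{
    sq->head_seqid = head_seqid;
    sq->size = size;
    sq->item_size = item_size;
    sq->items = calloc(size, item_size);
    sq->items_inuse = calloc(size, sizeof(unsigned int));
}

static void sq_item_add(struct sq *sq, const void *item, unsigned int seqid)
{
    unsigned int sq_position = (seqid - sq->head_seqid) % sq->size;

    /* The assertion from this report: the target slot must be empty. */
    assert(sq->items_inuse[sq_position] == 0);

    memcpy((char *)sq->items + sq_position * sq->item_size, item, sq->item_size);
    sq->items_inuse[sq_position] = 1;
}

int main(void)
{
    struct sq sq;
    int msg = 42;

    sq_init(&sq, 256, sizeof(msg), 1);
    sq_item_add(&sq, &msg, 10);
    printf("added seq 10\n");
    sq_item_add(&sq, &msg, 10);   /* same seqid again -> assertion failure */
    return 0;
}

The backtraces later in this report show the real check firing from message_handler_orf_token(), i.e. while the totem token is being processed.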

Version-Release number of selected component (if applicable):
[root@hp-ml370g4-01 fsoak]# rpm -qa | egrep '(ais|qpid|rhm)'
openais-0.80.3-22.el5_3.3
qpidc-0.5.752581-1.el5
qpidd-acl-0.5.752581-1.el5
rhm-docs-0.5.753238-1.el5
qpid-java-common-0.5.751061-1.el5
qpidc-rdma-0.5.752581-1.el5
qpidd-ssl-0.5.752581-1.el5
qpidc-perftest-0.5.752581-1.el5
qpidd-0.5.752581-1.el5
qpidc-devel-0.5.752581-1.el5
qpidd-rdma-0.5.752581-1.el5
rhm-0.5.3153-1.el5
python-qpid-0.5.752581-1.el5
qpid-java-client-0.5.751061-1.el5
qpidd-cluster-0.5.752581-1.el5
qpidc-ssl-0.5.752581-1.el5
qpidd-xml-0.5.752581-1.el5
qpidd-devel-0.5.752581-1.el5


How reproducible:
>50% (sometimes it shows up after ~30 runs, sometimes after ~800 runs)


Steps to Reproduce:
1. Run the failover_soak test in a loop (see the attachment for the reproducer)
2. Watch the results
  
Actual results:
aisexec exits with the above-mentioned assertion
(the clustered qpidd crashes)


Expected results:
Both aisexec and qpidd should continue running without any issues.


Additional info:
See the reproducer attached to bug 490855, or here:
https://bugzilla.redhat.com/attachment.cgi?id=335676

Comment 2 Steven Dake 2009-03-18 21:06:52 UTC
There is no core file because you likely did not set 

ulimit -c unlimited
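
For reference, "ulimit -c unlimited" simply raises the shell's RLIMIT_CORE soft limit so that the child process (aisexec) is allowed to write a core file. A minimal sketch of the same thing done programmatically (a stand-alone illustration, unrelated to the openais code itself) is:

/* Illustration only: raise RLIMIT_CORE for the current process, the
   programmatic equivalent of running "ulimit -c unlimited" in the shell. */
#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
    struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };

    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return 1;
    }
    printf("core dumps enabled for this process\n");
    return 0;
}

Note that a non-root process cannot raise the limit above its hard limit, so setting "ulimit -c unlimited" in the shell that starts aisexec remains the simplest way to get the core file.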

Comment 3 Irina Boverman 2009-03-24 19:56:00 UTC
Is there an RPM with the fix? The MRG group would like to test it.

Comment 4 Irina Boverman 2009-03-25 14:07:35 UTC
The MRG team is working on getting the core file.

Comment 5 Frantisek Reznicek 2009-03-26 09:43:20 UTC
One more occurrence, including a backtrace, was triggered yesterday when the failover soak test was run on dell-pesc1425-01.rhts.bos.redhat.com.

There were two machines in the ais cluster at that moment: dell-pesc1425-01.rhts.bos.redhat.com and ibm-maple.rhts.bos.redhat.com.

[root@dell-pesc1425-01 ~]# gdb /usr/sbin/aisexec /var/lib/openais/core.10366
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...

warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /usr/libexec/lcrso/objdb.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/objdb.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/objdb.lcrso
Reading symbols from /usr/libexec/lcrso/aisparser.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/aisparser.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/aisparser.lcrso
Reading symbols from /usr/libexec/lcrso/service_cpg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cpg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cpg.lcrso
Reading symbols from /usr/libexec/lcrso/service_cfg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cfg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cfg.lcrso
Reading symbols from /usr/libexec/lcrso/service_msg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_msg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_msg.lcrso
Reading symbols from /usr/libexec/lcrso/service_lck.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_lck.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_lck.lcrso
Reading symbols from /usr/libexec/lcrso/service_evt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evt.lcrso
Reading symbols from /usr/libexec/lcrso/service_ckpt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_ckpt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_ckpt.lcrso
Reading symbols from /usr/libexec/lcrso/service_amf.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_amf.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_amf.lcrso
Reading symbols from /usr/libexec/lcrso/service_clm.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_clm.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_clm.lcrso
Reading symbols from /usr/libexec/lcrso/service_evs.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evs.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evs.lcrso
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1

warning: Can't read pathname for load map: Input/output error.

warning: Can't read pathname for load map: Input/output error.
Core was generated by `/usr/sbin/aisexec'.
Program terminated with signal 6, Aborted.
[New process 10366]
[New process 21770]
[New process 21752]
[New process 21734]
[New process 10369]
[New process 10367]
#0  0x0000003c13030215 in raise () from /lib64/libc.so.6
(gdb) backtrace
#0  0x0000003c13030215 in raise () from /lib64/libc.so.6
#1  0x0000003c13031cc0 in abort () from /lib64/libc.so.6
#2  0x0000003c13029696 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000411cdc in message_handler_orf_token (instance=0x2aaaac523010, msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at ../include/sq.h:171
#4  0x0000000000409c03 in rrp_deliver_fn (context=0x2aaab0000960, msg=0x137dece4, msg_len=70) at totemrrp.c:1308
#5  0x0000000000408096 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>,
    data=0x137de660) at totemnet.c:676
#6  0x0000000000405ade in poll_run (handle=0) at aispoll.c:409
#7  0x00000000004183d8 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:599
(gdb) thread apply all bt

Thread 6 (process 10367):
#0  0x0000003c130ca566 in poll () from /lib64/libc.so.6
#1  0x000000000041c0d5 in prioritized_timer_thread (data=<value optimized out>) at timer.c:125
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 5 (process 10369):
#0  0x0000003c1380a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000041793f in worker_thread (thread_data_in=<value optimized out>) at wthread.c:73
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 4 (process 21734):
#0  0x0000003c130d4477 in semop () from /lib64/libc.so.6
#1  0x000000000041ade1 in pthread_ipc_consumer (conn=0x2aaab0018f40) at ipc.c:281
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 3 (process 21752):
#0  0x0000003c130d4477 in semop () from /lib64/libc.so.6
#1  0x000000000041ade1 in pthread_ipc_consumer (conn=0x2aaab0018240) at ipc.c:281
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 2 (process 21770):
#0  0x0000003c130d4477 in semop () from /lib64/libc.so.6
#1  0x000000000041ade1 in pthread_ipc_consumer (conn=0x2aaab0018e30) at ipc.c:281
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 1 (process 10366):
#0  0x0000003c13030215 in raise () from /lib64/libc.so.6
#1  0x0000003c13031cc0 in abort () from /lib64/libc.so.6
#2  0x0000003c13029696 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000411cdc in message_handler_orf_token (instance=0x2aaaac523010, msg=<value optimized out>, msg_len=<value optimized out>,
    endian_conversion_needed=<value optimized out>) at ../include/sq.h:171
#4  0x0000000000409c03 in rrp_deliver_fn (context=0x2aaab0000960, msg=0x137dece4, msg_len=70) at totemrrp.c:1308
#5  0x0000000000408096 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>,
    data=0x137de660) at totemnet.c:676
#6  0x0000000000405ade in poll_run (handle=0) at aispoll.c:409
#7  0x00000000004183d8 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:599


Trying to reproduce it once more on a single machine (a single-node ais cluster); so far no success, so it seems it shows up only in a multi-node ais cluster... to be confirmed in the next hours (test running now).

Comment 6 Frantisek Reznicek 2009-03-26 15:00:46 UTC
The test ran for more than 10x longer before it exited for another reason.
So it is now 99.9% certain that this BZ case (the assertion crash) happens only when multiple ais nodes are set up for the test.

Let me know if you want to do another test case.

Comment 7 Steven Dake 2009-03-27 01:36:02 UTC
Something is terribly wrong with the network configuration between these two nodes.

Starting aisexec on two nodes with logging enabled shows that each node cannot see the other's UDP multicast traffic (verified with tcpdump). After about two hours of running, the switch apparently lets some multicast traffic through, which results in a validly formed configuration:


Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] Members Joined:
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] CLM CONFIGURATION CHANGE 
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] New Configuration:
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ]         r(0) ip(10.16.64.52)
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ]         r(0) ip(10.16.64.145)
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] Members Left:
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] Members Joined:
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ]         r(0) ip(10.16.64.145)
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [SYNC ] This node is within the primary component and will provide service.
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] entering OPERATIONAL state.
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] FAILED TO RECEIVE
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] entering GATHER state from 6.
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] FAILED TO RECEIVE
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] entering GATHER state from 6.
Mar 26 13:51:21 dell-pesc1425-01 kernel: qpidd[29802]: segfault at 0000000002a53000 rip 0000003c1307b83b rsp 00007fffdb237758 error 4

Then the multicast traffic is cut off again, which results in the FAILED TO RECEIVE error state.

The logic in the FAILED TO RECEIVE state has never really been tested, because exercising it requires faulty hardware, and it is apparently broken. The state indicates that multicast traffic cannot be received during normal operation.

Note that at 13:51:21 openais segfaults after the FAILED TO RECEIVE happens.

I'll come up with a test case that models this real-world network failure.

I recommend making sure your network is functional when running openais.
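
As an aside on verifying the network independently of openais: a small multicast receiver like the sketch below, run on each node while the other node sends to the cluster's multicast group, shows directly whether datagrams make it through the switch. The group and port here are placeholders, not the values from this cluster; substitute the ones from the cluster's openais.conf.

/* Minimal multicast receive test (illustrative; the group and port are
   placeholders). Prints a line for every datagram received. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
    const char *group = "226.94.1.1";   /* placeholder multicast group */
    const int port = 5405;              /* placeholder port */

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); return 1;
    }

    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(group);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP"); return 1;
    }

    char buf[1500];
    for (;;) {
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0) { perror("recv"); return 1; }
        printf("received %zd bytes\n", n);
    }
}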

Also ibm-maple is spamming the following errors:
Mar 26 13:51:41 ibm-maple kernel: ide-cd: cmd 0x3 timed out
Mar 26 13:51:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:52:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:53:41 ibm-maple kernel: ide-cd: cmd 0x3 timed out
Mar 26 13:53:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:54:41 ibm-maple kernel: ide-cd: cmd 0x3 timed out
Mar 26 13:54:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:55:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:56:41 ibm-maple kernel: ide-cd: cmd 0x3 timed out

I think you will find this is extremely rare in the field and probably not worth the urgency applied to the bugzilla.

Regards
-steve

Comment 8 Frantisek Reznicek 2009-04-07 14:51:01 UTC
Currently retesting on machines in a cleaner network environment.
Results coming soon.

Comment 10 Frantisek Reznicek 2009-04-14 07:00:54 UTC
Retested in a clean network environment and no such case was triggered, which confirms this issue is very rare and was most probably caused by a multicast traffic issue in the RHTS lab.

Comment 12 Steven Dake 2009-12-03 17:14:44 UTC
*** Bug 543364 has been marked as a duplicate of this bug. ***

Comment 13 Steven Dake 2009-12-21 17:55:17 UTC
*** Bug 548798 has been marked as a duplicate of this bug. ***

Comment 27 RHEL Program Management 2011-01-11 20:07:41 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated in the
current release, Red Hat is unfortunately unable to address this
request at this time. Red Hat invites you to ask your support
representative to propose this request, if appropriate and relevant,
in the next release of Red Hat Enterprise Linux.

Comment 28 RHEL Program Management 2011-01-11 23:15:15 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 29 Jan Friesse 2011-05-03 14:59:08 UTC
This bug is partly fixed in 0.80.6-19 by SVN revision 2145 (rhbz#598689), so I am closing it as a duplicate of that bug.

During testing I was able to hit rhbz#671575, but never the original bug.

*** This bug has been marked as a duplicate of bug 598689 ***

Comment 30 Steven Dake 2012-06-07 20:09:41 UTC
Removing the flags as this is a duplicate.

