Bug 490856
| Field | Value |
|---|---|
| Summary | openais-0.80.3-22.el5_3.3 exits with assertion `sq->items_inuse[sq_position] == 0' failed |
| Product | Red Hat Enterprise Linux 5 |
| Component | openais |
| Version | 5.3 |
| Hardware | All |
| OS | Linux |
| Status | CLOSED DUPLICATE |
| Severity | low |
| Priority | low |
| Reporter | Frantisek Reznicek <freznice> |
| Assignee | Jan Friesse <jfriesse> |
| QA Contact | Cluster QE <mspqa-list> |
| CC | cluster-maint, esammons, gsim, iannis, iboverma, sdake, syeghiay |
| Target Milestone | rc |
| Doc Type | Bug Fix |
| Last Closed | 2011-05-03 14:59:08 UTC |
Description (Frantisek Reznicek, 2009-03-18 10:51:22 UTC)
There is no core file because you likely did not set `ulimit -c unlimited`.

Is there an rpm with the fix? The MRG group would like to test it.

The MRG team is working on getting the core file.

One more occurrence, including a backtrace, was triggered yesterday while the failover soak test was running on dell-pesc1425-01.rhts.bos.redhat.com. There were two machines in the ais cluster at that moment: dell-pesc1425-01.rhts.bos.redhat.com and ibm-maple.rhts.bos.redhat.com.

```
[root@dell-pesc1425-01 ~]# gdb /usr/sbin/aisexec /var/lib/openais/core.10366
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /usr/libexec/lcrso/objdb.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/objdb.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/objdb.lcrso
Reading symbols from /usr/libexec/lcrso/aisparser.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/aisparser.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/aisparser.lcrso
Reading symbols from /usr/libexec/lcrso/service_cpg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cpg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cpg.lcrso
Reading symbols from /usr/libexec/lcrso/service_cfg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cfg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cfg.lcrso
Reading symbols from /usr/libexec/lcrso/service_msg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_msg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_msg.lcrso
Reading symbols from /usr/libexec/lcrso/service_lck.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_lck.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_lck.lcrso
Reading symbols from /usr/libexec/lcrso/service_evt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evt.lcrso
Reading symbols from /usr/libexec/lcrso/service_ckpt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_ckpt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_ckpt.lcrso
Reading symbols from /usr/libexec/lcrso/service_amf.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_amf.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_amf.lcrso
Reading symbols from /usr/libexec/lcrso/service_clm.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_clm.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_clm.lcrso
Reading symbols from /usr/libexec/lcrso/service_evs.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evs.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evs.lcrso
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
warning: Can't read pathname for load map: Input/output error.
warning: Can't read pathname for load map: Input/output error.
Core was generated by `/usr/sbin/aisexec'.
Program terminated with signal 6, Aborted.
[New process 10366]
[New process 21770]
[New process 21752]
[New process 21734]
[New process 10369]
[New process 10367]
#0  0x0000003c13030215 in raise () from /lib64/libc.so.6
(gdb) backtrace
#0  0x0000003c13030215 in raise () from /lib64/libc.so.6
#1  0x0000003c13031cc0 in abort () from /lib64/libc.so.6
#2  0x0000003c13029696 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000411cdc in message_handler_orf_token (instance=0x2aaaac523010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at ../include/sq.h:171
#4  0x0000000000409c03 in rrp_deliver_fn (context=0x2aaab0000960, msg=0x137dece4, msg_len=70) at totemrrp.c:1308
#5  0x0000000000408096 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x137de660) at totemnet.c:676
#6  0x0000000000405ade in poll_run (handle=0) at aispoll.c:409
#7  0x00000000004183d8 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:599
(gdb) thread apply all bt

Thread 6 (process 10367):
#0  0x0000003c130ca566 in poll () from /lib64/libc.so.6
#1  0x000000000041c0d5 in prioritized_timer_thread (data=<value optimized out>) at timer.c:125
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 5 (process 10369):
#0  0x0000003c1380a899 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000000000041793f in worker_thread (thread_data_in=<value optimized out>) at wthread.c:73
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 4 (process 21734):
#0  0x0000003c130d4477 in semop () from /lib64/libc.so.6
#1  0x000000000041ade1 in pthread_ipc_consumer (conn=0x2aaab0018f40) at ipc.c:281
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 3 (process 21752):
#0  0x0000003c130d4477 in semop () from /lib64/libc.so.6
#1  0x000000000041ade1 in pthread_ipc_consumer (conn=0x2aaab0018240) at ipc.c:281
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 2 (process 21770):
#0  0x0000003c130d4477 in semop () from /lib64/libc.so.6
#1  0x000000000041ade1 in pthread_ipc_consumer (conn=0x2aaab0018e30) at ipc.c:281
#2  0x0000003c13806367 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c130d30ad in clone () from /lib64/libc.so.6

Thread 1 (process 10366):
#0  0x0000003c13030215 in raise () from /lib64/libc.so.6
#1  0x0000003c13031cc0 in abort () from /lib64/libc.so.6
#2  0x0000003c13029696 in __assert_fail () from /lib64/libc.so.6
#3  0x0000000000411cdc in message_handler_orf_token (instance=0x2aaaac523010, msg=<value optimized out>, msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at ../include/sq.h:171
#4  0x0000000000409c03 in rrp_deliver_fn (context=0x2aaab0000960, msg=0x137dece4, msg_len=70) at totemrrp.c:1308
#5  0x0000000000408096 in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>, revents=<value optimized out>, data=0x137de660) at totemnet.c:676
#6  0x0000000000405ade in poll_run (handle=0) at aispoll.c:409
#7  0x00000000004183d8 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:599
```

Trying to reproduce once more on a single machine (a single-node ais cluster); so far no success, so it seems the crash shows up only in a multi-node ais cluster. To be confirmed in the next few hours (test running now).

The test ran for more than 10x longer before it exited for an unrelated reason, so it is now 99.9% certain that this assertion crash happens only when multiple ais nodes are set up for the test. Let me know if you want another test case run.

Something is terribly wrong with the network configuration between these two nodes.
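The missing-core-file problem mentioned above can be avoided by enabling core dumps before reproducing. A minimal sketch, reusing the aisexec binary and core path from this report (the `core.10366` suffix is the crashing PID from this particular run and will differ on yours):

```shell
# Enable unlimited-size core dumps in the shell that will start aisexec.
ulimit -c unlimited
echo "core size limit now: $(ulimit -c)"

# After a crash, pull backtraces from all threads non-interactively.
# The path below is the core from this report; substitute your own PID.
CORE=/var/lib/openais/core.10366
if [ -f "$CORE" ]; then
    gdb /usr/sbin/aisexec "$CORE" -batch -ex 'thread apply all bt'
fi
```

Running gdb with `-batch -ex` prints the backtraces and exits, which is convenient for attaching output to a bug report from an unattended soak-test machine.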
Starting aisexec on two nodes with logging enabled shows that each node cannot see the other's UDP multicast traffic (verified with tcpdump). After about two hours of running, the switch apparently lets some multicast traffic through, which results in a validly formed configuration:

```
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] Members Joined:
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] CLM CONFIGURATION CHANGE
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] New Configuration:
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] 	r(0) ip(10.16.64.52)
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] 	r(0) ip(10.16.64.145)
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] Members Left:
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] Members Joined:
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [CLM  ] 	r(0) ip(10.16.64.145)
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [SYNC ] This node is within the primary component and will provide service.
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] entering OPERATIONAL state.
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] FAILED TO RECEIVE
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] entering GATHER state from 6.
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] FAILED TO RECEIVE
Mar 26 13:51:21 dell-pesc1425-01 openais[4742]: [TOTEM] entering GATHER state from 6.
Mar 26 13:51:21 dell-pesc1425-01 kernel: qpidd[29802]: segfault at 0000000002a53000 rip 0000003c1307b83b rsp 00007fffdb237758 error 4
```

Then the switch again blocks the multicast traffic, which results in the FAILED TO RECEIVE error state; that state indicates multicast traffic cannot be received during normal operation. The logic in the FAILED TO RECEIVE state has never really been exercised, because triggering it requires faulty hardware, and it is apparently broken. Note that at 13:51:21 openais segfaults right after the FAILED TO RECEIVE happens. I'll come up with a test case that models this real-world network failure.
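The tcpdump verification described above can be repeated on each node. A hedged sketch: the interface name is an assumption, and UDP port 5405 is the openais totem default (check `mcastport` in /etc/ais/openais.conf if it was changed). The snippet only prints the command; run it as root on both nodes simultaneously.

```shell
# Assumptions: cluster traffic on eth0, default totem mcastport 5405.
IFACE=eth0
MCAST_PORT=5405

# Packets sourced from the *other* node's IP must show up on both
# sides. If they appear for a while and then stop (as in this bug),
# suspect the switch, e.g. IGMP snooping pruning the multicast group.
echo "run as root: tcpdump -ni ${IFACE} udp port ${MCAST_PORT}"
```

A capture that shows only the local node's own multicast packets matches the "each node cannot see the other's UDP multicast traffic" symptom in this report.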
I recommend making sure your network is functional when running openais. Also, ibm-maple is spamming the following errors:

```
Mar 26 13:51:41 ibm-maple kernel: ide-cd: cmd 0x3 timed out
Mar 26 13:51:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:52:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:53:41 ibm-maple kernel: ide-cd: cmd 0x3 timed out
Mar 26 13:53:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:54:41 ibm-maple kernel: ide-cd: cmd 0x3 timed out
Mar 26 13:54:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:55:41 ibm-maple kernel: hda: lost interrupt
Mar 26 13:56:41 ibm-maple kernel: ide-cd: cmd 0x3 timed out
```

I think you will find this is extremely rare in the field and probably not worth the urgency applied to the bugzilla. Regards, -steve

Currently retesting on machines in a cleaner network environment. Results coming soon.

Retested in the clean network environment and the crash did not trigger, which confirms this issue is very rare and most probably caused by a multicast traffic problem in the RHTS lab.

*** Bug 543364 has been marked as a duplicate of this bug. ***

*** Bug 548798 has been marked as a duplicate of this bug. ***

This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release.

This bug is partly fixed in 0.80.6-19 by SVN 2145 (rhbz#598689), so closing as a duplicate of that bug. During testing I was able to reach rhbz#671575, but never the original bug.
*** This bug has been marked as a duplicate of bug 598689 ***

Removing flags as this is a dup.