Description Everett Bennett
2009-10-29 14:22:51 UTC
+++ This bug was initially created as a clone of Bug #525280 +++
Description of problem:
-----------------------
Occasionally a node within a 5-node RHEL 5.4 cluster is fenced for no apparent reason, resulting in a core dump of aisexec.
Version-Release number of selected component (if applicable):
root@med1:/var/lib/openais> uname -a
Linux med1 2.6.18-164.2.1.el5 #1 SMP Mon Sep 21 04:37:42 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
root@med1:/var/lib/openais> rpm -qa|grep openais
openais-0.80.6-8.el5_4.1
openais-debuginfo-0.80.6-8.el5_4.1
root@med1:/var/lib/openais> ls -ld core*
-rw------- 1 root root 29044736 Oct 22 16:25 core.14301
How reproducible:
Intermittent, but it has happened more than once. Two core traces are listed below.
Actual results:
root@med1:/var/lib/openais> alias ls=ls
root@med1:/var/lib/openais> rpm -ivh /root/openais-debuginfo-0.80.6-8.el5_4.1.x86_64.rpm
Preparing... ########################################### [100%]
1:openais-debuginfo ########################################### [100%]
root@med1:/var/lib/openais> gdb /usr/sbin/aisexec core.14301
GNU gdb Fedora (6.8-37.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
warning: Can't read pathname for load map: Input/output error.
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /usr/libexec/lcrso/objdb.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/objdb.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/objdb.lcrso
Reading symbols from /usr/libexec/lcrso/service_cman.lcrso...done.
Loaded symbols for /usr/libexec/lcrso/service_cman.lcrso
Reading symbols from /lib64/libnss_files.so.2...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /usr/libexec/lcrso/service_evs.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evs.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evs.lcrso
Reading symbols from /usr/libexec/lcrso/service_clm.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_clm.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_clm.lcrso
Reading symbols from /usr/libexec/lcrso/service_amf.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_amf.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_amf.lcrso
Reading symbols from /usr/libexec/lcrso/service_ckpt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_ckpt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_ckpt.lcrso
Reading symbols from /usr/libexec/lcrso/service_evt.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_evt.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_evt.lcrso
Reading symbols from /usr/libexec/lcrso/service_lck.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_lck.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_lck.lcrso
Reading symbols from /usr/libexec/lcrso/service_msg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_msg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_msg.lcrso
Reading symbols from /usr/libexec/lcrso/service_cfg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cfg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cfg.lcrso
Reading symbols from /usr/libexec/lcrso/service_cpg.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_cpg.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_cpg.lcrso
Reading symbols from /usr/libexec/lcrso/service_confdb.lcrso...Reading symbols from /usr/lib/debug/usr/libexec/lcrso/service_confdb.lcrso.debug...done.
done.
Loaded symbols for /usr/libexec/lcrso/service_confdb.lcrso
Reading symbols from /lib64/libgcc_s.so.1...done.
Loaded symbols for /lib64/libgcc_s.so.1
warning: Can't read pathname for load map: Input/output error.
warning: Can't read pathname for load map: Input/output error.
Core was generated by `aisexec'.
Program terminated with signal 11, Segmentation fault.
[New process 14301]
[New process 14362]
[New process 14337]
[New process 14333]
[New process 14314]
[New process 14302]
#0 0x00002aaaaaeb614b in unbind_con () from /usr/libexec/lcrso/service_cman.lcrso
(gdb) bt
#0 0x00002aaaaaeb614b in unbind_con () from /usr/libexec/lcrso/service_cman.lcrso
#1 0x00002aaaaaeb3728 in ?? () from /usr/libexec/lcrso/service_cman.lcrso
#2 0x00002aaaaaeb38db in ?? () from /usr/libexec/lcrso/service_cman.lcrso
#3 0x00002aaaaaeb3a61 in send_status_return () from /usr/libexec/lcrso/service_cman.lcrso
#4 0x00002aaaaaeb6f55 in send_to_userport () from /usr/libexec/lcrso/service_cman.lcrso
#5 0x00002aaaaaeb4c49 in ?? () from /usr/libexec/lcrso/service_cman.lcrso
#6 0x0000000000415165 in app_deliver_fn (nodeid=1, iovec=<value optimized out>, iov_len=1, endian_conversion_required=0)
at totempg.c:460
#7 0x00000000004155ec in totempg_deliver_fn (nodeid=1, iovec=<value optimized out>, iov_len=<value optimized out>,
endian_conversion_required=0) at totempg.c:604
#8 0x0000000000410418 in messages_deliver_to_app (instance=0x2aaaaaaae010, skip=0, end_point=<value optimized out>)
at totemsrp.c:3548
#9 0x000000000041203c in message_handler_orf_token (instance=0x2aaaaaaae010, msg=<value optimized out>,
msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3420
#10 0x0000000000409f43 in rrp_deliver_fn (context=0x1a537430, msg=0x1a550e28, msg_len=70) at totemrrp.c:1308
#11 0x00000000004084fb in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>,
revents=<value optimized out>, data=0x1a550780) at totemnet.c:695
#12 0x0000000000405d10 in poll_run (handle=0) at aispoll.c:402
#13 0x0000000000418834 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:620
(gdb)
#0 0x00002aaaaaeb614b in unbind_con () from /usr/libexec/lcrso/service_cman.lcrso
#1 0x00002aaaaaeb3728 in ?? () from /usr/libexec/lcrso/service_cman.lcrso
#2 0x00002aaaaaeb38db in ?? () from /usr/libexec/lcrso/service_cman.lcrso
#3 0x00002aaaaaeb3a61 in send_status_return () from /usr/libexec/lcrso/service_cman.lcrso
#4 0x00002aaaaaeb6f55 in send_to_userport () from /usr/libexec/lcrso/service_cman.lcrso
#5 0x00002aaaaaeb4c49 in ?? () from /usr/libexec/lcrso/service_cman.lcrso
#6 0x0000000000415165 in app_deliver_fn (nodeid=1, iovec=<value optimized out>, iov_len=1, endian_conversion_required=0)
at totempg.c:460
#7 0x00000000004155ec in totempg_deliver_fn (nodeid=1, iovec=<value optimized out>, iov_len=<value optimized out>,
endian_conversion_required=0) at totempg.c:604
#8 0x0000000000410418 in messages_deliver_to_app (instance=0x2aaaaaaae010, skip=0, end_point=<value optimized out>)
at totemsrp.c:3548
#9 0x000000000041203c in message_handler_orf_token (instance=0x2aaaaaaae010, msg=<value optimized out>,
msg_len=<value optimized out>, endian_conversion_needed=<value optimized out>) at totemsrp.c:3420
#10 0x0000000000409f43 in rrp_deliver_fn (context=0x1a537430, msg=0x1a550e28, msg_len=70) at totemrrp.c:1308
#11 0x00000000004084fb in net_deliver_fn (handle=<value optimized out>, fd=<value optimized out>,
revents=<value optimized out>, data=0x1a550780) at totemnet.c:695
#12 0x0000000000405d10 in poll_run (handle=0) at aispoll.c:402
#13 0x0000000000418834 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:620
(gdb) quit
I cloned the bug report that we saw at a RHEL 5.3 site for another site which was running RHEL 5.4. If you have a patch to test, please advise and note this report.
Regards
Everett
The core dump appears to be in cman. Reassigning to Chrissie for further investigation.
Comment 3 Christine Caulfield
2009-10-29 15:43:14 UTC
Is this a very busy system? It looks like it could possibly be a large number of queued messages from CMAN. However, it's a very large number, so I'm not really sure at the moment. It would account for the different core dumps from 5.3 and 5.4, though.
Is it possible to add this to cluster.conf, inside the <cluster> section?
<cman max_queued="1024"/>
thanks
Chrissie
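For illustration only, the placement suggested above might look roughly like the sketch below. The cluster name and config_version value are placeholders, not taken from this report, and the rest of the existing configuration is elided. Only the <cman max_queued="1024"/> line comes from the comment itself.

```xml
<?xml version="1.0"?>
<!-- Hypothetical cluster.conf skeleton: name and config_version are placeholders. -->
<cluster name="example_cluster" config_version="2">
  <!-- Element suggested in the comment above, placed directly inside <cluster>. -->
  <cman max_queued="1024"/>
  <!-- Existing clusternodes, fencedevices, rm, etc. sections stay unchanged. -->
</cluster>
```

When editing cluster.conf by hand, the config_version attribute normally has to be incremented so that the other nodes accept the updated file.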