Description of problem: There are observed repeated occurrences of qpid-cluster hangs when ran against cluster node and cluster is loaded. Let's assume following scenario: - 4 node cluster on RHEL 5.5 i686 / x86_64 (09, 10, 11, 12) - 09 node runs tsxtest on different randomly selected nodes, in loop - 10 node runs qpid-tools clients (qpid-cluster, qpid-stat -b), in loop - 11 node runs perftest with --base-name XYZ on randomly selected broker in the cluster, in loop - 12 node runs perftest with --base-name XYZ on node 12, in loop I can see that when I launch above lister clients at once (i.e. with gap of one second) all clients perftests are taking extremely long to perform first run and moreover qpid-cluster hangs following way (simple qpid-cluster performed): Quering local broker Exception in thread Thread-2: Traceback (most recent call last): File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap self.run() File "/usr/lib64/python2.4/threading.py", line 422, in run self.__target(*self.__args, **self.__kwargs) File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 168, in run if self.aborted(): File "/usr/lib/python2.4/site-packages/qmf/console.py", line 2056, in aborted raise Timeout("Waiting for connection to be established with broker") Timeout: Waiting for connection to be established with broker In such a situation I would expect to have timeout after 60 seconds or so, but reality is different it hangs forever (>30 minutes) Having run the detailed gdb-pstack on the qpid-cluster process is detailed in the Additional info chapter. Version-Release number of selected component (if applicable): python-qmf-0.7.946106-5.el5 python-qpid-0.7.946106-4.el5 qmf-0.7.946106-6.el5 qmf-devel-0.7.946106-6.el5 qpid-cpp-client-0.7.946106-6.el5 qpid-cpp-client-devel-0.7.946106-6.el5 qpid-cpp-client-devel-docs-0.7.946106-6.el5 qpid-cpp-client-ssl-0.7.946106-6.el5 qpid-cpp-mrg-debuginfo-0.7.946106-6.el5 qpid-cpp-server-0.7.946106-6.el5 qpid-cpp-server-cluster-0.7.946106-6.el5 qpid-cpp-server-devel-0.7.946106-6.el5 qpid-cpp-server-ssl-0.7.946106-6.el5 qpid-cpp-server-store-0.7.946106-6.el5 qpid-cpp-server-xml-0.7.946106-6.el5 qpid-java-client-0.7.946106-5.el5 qpid-java-common-0.7.946106-5.el5 qpid-tools-0.7.946106-6.el5 How reproducible: 90% Steps to Reproduce: 1. set up cluster (4 nodes) 2. run openais and qpidd 3. loop tsxtest on one node 4. loop perftest on other two nedes 5. loop qpid-stat -b and qpid-cluster on the last remaining node Actual results: qpid-cluster hangs. Expected results: qpid-cluster should not hang. Additional info: qpid-cluster process dump (pstack like): GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.1) Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see: <http://www.gnu.org/software/gdb/bugs/>. (gdb) Attaching to process 31091 Reading symbols from /usr/bin/python...(no debugging symbols found)...done. Reading symbols from /usr/lib64/libpython2.4.so.1.0...(no debugging symbols found)...done. Loaded symbols for /usr/lib64/libpython2.4.so.1.0 Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done. [Thread debugging using libthread_db enabled] [New Thread 0x40c1a940 (LWP 31094)] Loaded symbols for /lib64/libpthread.so.0 ... 0x00002af77b42dd01 in sem_wait () from /lib64/libpthread.so.0 (gdb) rax 0xfffffffffffffffc -4 rbx 0x4 4 rcx 0xffffffffffffffff -1 rdx 0x0 0 rsi 0x80 128 rdi 0x607f5b0 101184944 rbp 0x1 0x1 rsp 0x7fff9cda9b88 0x7fff9cda9b88 r8 0x0 0 r9 0x0 0 r10 0x0 0 r11 0x246 582 r12 0x607f5b0 101184944 r13 0x607f5b0 101184944 r14 0x5ffd800 100653056 r15 0x0 0 rip 0x2af77b42dd01 0x2af77b42dd01 <sem_wait+65> eflags 0x246 [ PF ZF IF ] cs 0x33 51 ss 0x2b 43 ds 0x0 0 es 0x0 0 fs 0x0 0 gs 0x0 0 fctrl 0x37f 895 fstat 0x0 0 ftag 0xffff 65535 fiseg 0x0 0 fioff 0x0 0 foseg 0x0 0 fooff 0x0 0 fop 0x0 0 mxcsr 0x1fa0 [ PE IM DM ZM OM UM PM ] (gdb) Using memory regions provided by the target. There are no memory regions defined. (gdb) 2 Thread 0x40c1a940 (LWP 31094) 0x00002af77bd93fc2 in select () from /lib64/libc.so.6 * 1 Thread 0x2af77c01f190 (LWP 31091) 0x00002af77b42dd01 in sem_wait () from /lib64/libpthread.so.0 (gdb) Thread 2 (Thread 0x40c1a940 (LWP 31094)): #0 0x00002af77bd93fc2 in select () from /lib64/libc.so.6 #1 0x00002af7819f34cf in ?? () from /usr/lib64/python2.4/lib-dynload/timemodule.so #2 0x00002af77b1840ba in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #3 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #4 0x00002af77b18373f in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #5 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #6 0x00002af77b18373f in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #7 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #8 0x00002af77b18373f in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #9 0x00002af77b183b66 in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #10 0x00002af77b183b66 in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #11 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #12 0x00002af77b13b367 in ?? () from /usr/lib64/libpython2.4.so.1.0 #13 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0 #14 0x00002af77b12b1ef in ?? () from /usr/lib64/libpython2.4.so.1.0 #15 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0 #16 0x00002af77b17ec3d in PyEval_CallObjectWithKeywords () from /usr/lib64/libpython2.4.so.1.0 #17 0x00002af77b1aabbd in ?? () from /usr/lib64/libpython2.4.so.1.0 #18 0x00002af77b42773d in start_thread () from /lib64/libpthread.so.0 #19 0x00002af77bd9ad1d in clone () from /lib64/libc.so.6 Thread 1 (Thread 0x2af77c01f190 (LWP 31091)): #0 0x00002af77b42dd01 in sem_wait () from /lib64/libpthread.so.0 #1 0x00002af77b1a7148 in PyThread_acquire_lock () from /usr/lib64/libpython2.4.so.1.0 #2 0x00002af77b1aa83a in ?? () from /usr/lib64/libpython2.4.so.1.0 #3 0x00002af77b1840ba in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #4 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #5 0x00002af77b18373f in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #6 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #7 0x00002af77b18373f in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #8 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #9 0x00002af77b13b45a in ?? () from /usr/lib64/libpython2.4.so.1.0 #10 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0 #11 0x00002af77b12b1ef in ?? () from /usr/lib64/libpython2.4.so.1.0 #12 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0 #13 0x00002af77b16290c in ?? () from /usr/lib64/libpython2.4.so.1.0 #14 0x00002af77b15e8e8 in ?? () from /usr/lib64/libpython2.4.so.1.0 #15 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0 #16 0x00002af77b1818ce in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #17 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #18 0x00002af77b18373f in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #19 0x00002af77b183b66 in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #20 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #21 0x00002af77b18373f in PyEval_EvalFrame () from /usr/lib64/libpython2.4.so.1.0 #22 0x00002af77b184fe5 in PyEval_EvalCodeEx () from /usr/lib64/libpython2.4.so.1.0 #23 0x00002af77b185032 in PyEval_EvalCode () from /usr/lib64/libpython2.4.so.1.0 #24 0x00002af77b1a1729 in ?? () from /usr/lib64/libpython2.4.so.1.0 #25 0x00002af77b1a2bd8 in PyRun_SimpleFileExFlags () from /usr/lib64/libpython2.4.so.1.0 #26 0x00002af77b1a900d in Py_Main () from /usr/lib64/libpython2.4.so.1.0 #27 0x00002af77bce4994 in __libc_start_main () from /lib64/libc.so.6 #28 0x0000000000400629 in _start () (gdb) Detaching from program: /usr/bin/python, process 31091 (gdb) quit
Created attachment 430283 [details] qpid-cluster and qpidd process dumps + reproducer The attachment contains process dumps for qpidd and qpid-cluster when hanged (dump_pid_*) and reproducer info_thread.sh and its log info_thread.sh.log
Let me share one more observation about the problem. I can see the above detailed issue only after cluster is up for few seconds (the really begin of the cluster life). Afterwards I can see that qpid-cluster is functional as expected.
The broker configuration for above described issue is: [root@mrg-qe-09 fcluster]# cat /etc/qpidd.conf # # Licensed to the Apache Software Foundation (ASF) under one ... # "qpidd --help" or "man qpidd" for more details. cluster-mechanism=ANONYMOUS auth=no log-to-file=/tmp/qpidd.log log-enable=info+ log-enable=debug+:cluster cluster-name=fcluster
I ran the following on a 4 node cluster over the weekend, 64 hours so far with no error. Top shows qpidd processes are stable at between 700M and 1024M node1: info_thread.sh node2: tsxtest_thread.sh node3&4: perftest_thread.sh No core files, grep 'ecode:[^0]' in logs found nothing, tail of logs looks normal. This was with trunk r985356 which corresponds to release tags mrg1.3-beta5 qpid-cpp-mrg-0.7.946106-12