Bug 612458 - qpid-cluster hangs on cluster under load
Summary: qpid-cluster hangs on cluster under load
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-qmf
Version: Development
Hardware: All
OS: Linux
high
high
Target Milestone: 1.3
: ---
Assignee: Alan Conway
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-07-08 09:39 UTC by Frantisek Reznicek
Modified: 2015-11-16 01:12 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-10-22 09:32:43 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
qpid-cluster and qpidd process dumps + reproducer (25.51 KB, application/x-tbz)
2010-07-08 09:44 UTC, Frantisek Reznicek
no flags Details

Description Frantisek Reznicek 2010-07-08 09:39:43 UTC
Description of problem:

There are observed repeated occurrences of qpid-cluster hangs when ran against cluster node and cluster is loaded.

Let's assume following scenario:
- 4 node cluster on RHEL 5.5 i686 / x86_64 (09, 10, 11, 12)
  - 09 node runs tsxtest on different randomly selected nodes, in loop
  - 10 node runs qpid-tools clients (qpid-cluster, qpid-stat -b), in loop
  - 11 node runs perftest with --base-name XYZ on randomly selected broker in the cluster, in loop
  - 12 node runs perftest with --base-name XYZ on node 12, in loop

I can see that when I launch above lister clients at once (i.e. with gap of one second) all clients perftests are taking extremely long to perform first run and moreover qpid-cluster hangs following way (simple qpid-cluster performed):

Quering local broker
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib64/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.4/site-packages/qpid/connection.py", line 168, in run
    if self.aborted():
  File "/usr/lib/python2.4/site-packages/qmf/console.py", line 2056, in aborted
    raise Timeout("Waiting for connection to be established with broker")
Timeout: Waiting for connection to be established with broker


In such a situation I would expect to have timeout after 60 seconds or so, but reality is different it hangs forever (>30 minutes)

Having run the detailed gdb-pstack on the qpid-cluster process is detailed in the Additional info chapter.



Version-Release number of selected component (if applicable):
python-qmf-0.7.946106-5.el5
python-qpid-0.7.946106-4.el5
qmf-0.7.946106-6.el5
qmf-devel-0.7.946106-6.el5
qpid-cpp-client-0.7.946106-6.el5
qpid-cpp-client-devel-0.7.946106-6.el5
qpid-cpp-client-devel-docs-0.7.946106-6.el5
qpid-cpp-client-ssl-0.7.946106-6.el5
qpid-cpp-mrg-debuginfo-0.7.946106-6.el5
qpid-cpp-server-0.7.946106-6.el5
qpid-cpp-server-cluster-0.7.946106-6.el5
qpid-cpp-server-devel-0.7.946106-6.el5
qpid-cpp-server-ssl-0.7.946106-6.el5
qpid-cpp-server-store-0.7.946106-6.el5
qpid-cpp-server-xml-0.7.946106-6.el5
qpid-java-client-0.7.946106-5.el5
qpid-java-common-0.7.946106-5.el5
qpid-tools-0.7.946106-6.el5


How reproducible:
90%

Steps to Reproduce:
1. set up cluster (4 nodes)
2. run openais and qpidd
3. loop tsxtest on one node
4. loop perftest on other two nedes
5. loop qpid-stat -b and qpid-cluster on the last remaining node


Actual results:
qpid-cluster hangs.

Expected results:
qpid-cluster should not hang.

Additional info:
qpid-cluster process dump (pstack like):
  GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5_5.1)
  Copyright (C) 2009 Free Software Foundation, Inc.
  License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
  This is free software: you are free to change and redistribute it.
  There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
  and "show warranty" for details.
  This GDB was configured as "x86_64-redhat-linux-gnu".
  For bug reporting instructions, please see:
  <http://www.gnu.org/software/gdb/bugs/>.
  (gdb) Attaching to process 31091
  Reading symbols from /usr/bin/python...(no debugging symbols found)...done.
  Reading symbols from /usr/lib64/libpython2.4.so.1.0...(no debugging symbols found)...done.
  Loaded symbols for /usr/lib64/libpython2.4.so.1.0
  Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
  [Thread debugging using libthread_db enabled]
  [New Thread 0x40c1a940 (LWP 31094)]
  Loaded symbols for /lib64/libpthread.so.0
  ...
  0x00002af77b42dd01 in sem_wait () from /lib64/libpthread.so.0
  (gdb) rax            0xfffffffffffffffc -4
  rbx            0x4      4
  rcx            0xffffffffffffffff       -1
  rdx            0x0      0
  rsi            0x80     128
  rdi            0x607f5b0        101184944
  rbp            0x1      0x1
  rsp            0x7fff9cda9b88   0x7fff9cda9b88
  r8             0x0      0
  r9             0x0      0
  r10            0x0      0
  r11            0x246    582
  r12            0x607f5b0        101184944
  r13            0x607f5b0        101184944
  r14            0x5ffd800        100653056
  r15            0x0      0
  rip            0x2af77b42dd01   0x2af77b42dd01 <sem_wait+65>
  eflags         0x246    [ PF ZF IF ]
  cs             0x33     51
  ss             0x2b     43
  ds             0x0      0
  es             0x0      0
  fs             0x0      0
  gs             0x0      0
  fctrl          0x37f    895
  fstat          0x0      0
  ftag           0xffff   65535
  fiseg          0x0      0
  fioff          0x0      0
  foseg          0x0      0
  fooff          0x0      0
  fop            0x0      0
  mxcsr          0x1fa0   [ PE IM DM ZM OM UM PM ]
  (gdb) Using memory regions provided by the target.
  There are no memory regions defined.
  (gdb)   2 Thread 0x40c1a940 (LWP 31094)  0x00002af77bd93fc2 in select ()
    from /lib64/libc.so.6
  * 1 Thread 0x2af77c01f190 (LWP 31091)  0x00002af77b42dd01 in sem_wait ()
    from /lib64/libpthread.so.0
  (gdb)
  Thread 2 (Thread 0x40c1a940 (LWP 31094)):
  #0  0x00002af77bd93fc2 in select () from /lib64/libc.so.6
  #1  0x00002af7819f34cf in ?? ()
    from /usr/lib64/python2.4/lib-dynload/timemodule.so
  #2  0x00002af77b1840ba in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #3  0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #4  0x00002af77b18373f in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #5  0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #6  0x00002af77b18373f in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #7  0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #8  0x00002af77b18373f in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #9  0x00002af77b183b66 in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #10 0x00002af77b183b66 in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #11 0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #12 0x00002af77b13b367 in ?? () from /usr/lib64/libpython2.4.so.1.0
  #13 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0
  #14 0x00002af77b12b1ef in ?? () from /usr/lib64/libpython2.4.so.1.0
  #15 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0
  #16 0x00002af77b17ec3d in PyEval_CallObjectWithKeywords ()
    from /usr/lib64/libpython2.4.so.1.0
  #17 0x00002af77b1aabbd in ?? () from /usr/lib64/libpython2.4.so.1.0
  #18 0x00002af77b42773d in start_thread () from /lib64/libpthread.so.0
  #19 0x00002af77bd9ad1d in clone () from /lib64/libc.so.6
  
  Thread 1 (Thread 0x2af77c01f190 (LWP 31091)):
  #0  0x00002af77b42dd01 in sem_wait () from /lib64/libpthread.so.0
  #1  0x00002af77b1a7148 in PyThread_acquire_lock ()
    from /usr/lib64/libpython2.4.so.1.0
  #2  0x00002af77b1aa83a in ?? () from /usr/lib64/libpython2.4.so.1.0
  #3  0x00002af77b1840ba in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #4  0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #5  0x00002af77b18373f in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #6  0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #7  0x00002af77b18373f in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #8  0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #9  0x00002af77b13b45a in ?? () from /usr/lib64/libpython2.4.so.1.0
  #10 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0
  #11 0x00002af77b12b1ef in ?? () from /usr/lib64/libpython2.4.so.1.0
  #12 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0
  #13 0x00002af77b16290c in ?? () from /usr/lib64/libpython2.4.so.1.0
  #14 0x00002af77b15e8e8 in ?? () from /usr/lib64/libpython2.4.so.1.0
  #15 0x00002af77b1250f0 in PyObject_Call () from /usr/lib64/libpython2.4.so.1.0
  #16 0x00002af77b1818ce in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #17 0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #18 0x00002af77b18373f in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #19 0x00002af77b183b66 in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #20 0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #21 0x00002af77b18373f in PyEval_EvalFrame ()
    from /usr/lib64/libpython2.4.so.1.0
  #22 0x00002af77b184fe5 in PyEval_EvalCodeEx ()
    from /usr/lib64/libpython2.4.so.1.0
  #23 0x00002af77b185032 in PyEval_EvalCode ()
    from /usr/lib64/libpython2.4.so.1.0
  #24 0x00002af77b1a1729 in ?? () from /usr/lib64/libpython2.4.so.1.0
  #25 0x00002af77b1a2bd8 in PyRun_SimpleFileExFlags ()
    from /usr/lib64/libpython2.4.so.1.0
  #26 0x00002af77b1a900d in Py_Main () from /usr/lib64/libpython2.4.so.1.0
  #27 0x00002af77bce4994 in __libc_start_main () from /lib64/libc.so.6
  #28 0x0000000000400629 in _start ()
  (gdb) Detaching from program: /usr/bin/python, process 31091
  (gdb) quit

Comment 1 Frantisek Reznicek 2010-07-08 09:44:44 UTC
Created attachment 430283 [details]
qpid-cluster and qpidd process dumps + reproducer

The attachment contains process dumps for qpidd and qpid-cluster when hanged (dump_pid_*) and reproducer info_thread.sh and its log info_thread.sh.log

Comment 2 Frantisek Reznicek 2010-07-08 14:52:57 UTC
Let me share one more observation about the problem.
I can see the above detailed issue only after cluster is up for few seconds (the really begin of the cluster life). Afterwards I can see that qpid-cluster is functional as expected.

Comment 3 Frantisek Reznicek 2010-07-12 06:14:10 UTC
The broker configuration for above described issue is:

[root@mrg-qe-09 fcluster]# cat /etc/qpidd.conf
#
# Licensed to the Apache Software Foundation (ASF) under one
...
# "qpidd --help" or "man qpidd" for more details.
cluster-mechanism=ANONYMOUS
auth=no
log-to-file=/tmp/qpidd.log
log-enable=info+
log-enable=debug+:cluster
cluster-name=fcluster

Comment 4 Alan Conway 2010-08-16 14:57:56 UTC
I ran the following on a 4 node cluster over the weekend, 64  hours so far with no error. Top shows qpidd processes are stable at between 700M and 1024M

node1: info_thread.sh
node2: tsxtest_thread.sh
node3&4: perftest_thread.sh

No core files, grep 'ecode:[^0]' in logs found nothing, tail of logs looks normal.

This was with trunk r985356 which corresponds to release tags
 mrg1.3-beta5
 qpid-cpp-mrg-0.7.946106-12


Note You need to log in before you can comment on or make changes to this bug.