Bug 577362 - Long failover_soak test hangs.
Summary: Long failover_soak test hangs.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: beta
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: 1.3
Target Release: ---
Assignee: Alan Conway
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks: 490457 527451
 
Reported: 2010-03-26 19:04 UTC by Alan Conway
Modified: 2015-11-16 01:12 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:


Attachments
log of hung run_failover_soak (267.36 KB, text/plain)
2010-03-26 19:36 UTC, Alan Conway
ptol timeout report showing stack traces (3.25 KB, text/gz)
2010-03-26 19:39 UTC, Alan Conway
ptol test log output (64 bytes, text/plain)
2010-03-26 19:39 UTC, Alan Conway
state of hung broker and clients (15.51 KB, application/x-tbz)
2010-06-01 09:21 UTC, Frantisek Reznicek
Fast reproducer (342 bytes, application/x-shellscript)
2010-06-04 17:46 UTC, Alan Conway

Description Alan Conway 2010-03-26 19:04:42 UTC
Description of problem:

The failover_soak test hung in ptol, and also when run in a loop for about 3 hours.

Version-Release number of selected component (if applicable):

SVN r927847

How reproducible:

Happened once in a long make check run by ptol, and once when running run_failover_soak in a loop; that run hung after about 3 hours.

Steps to Reproduce:

while ./run_failover_soak; do true; done
  
Actual results:

Hung after 3 hours.

Expected results:

Should not hang.

Additional info:

Comment 1 Alan Conway 2010-03-26 19:36:57 UTC
Created attachment 402924 [details]
log of hung run_failover_soak

Comment 2 Alan Conway 2010-03-26 19:39:12 UTC
Created attachment 402925 [details]
ptol timeout report showing stack traces

Comment 3 Alan Conway 2010-03-26 19:39:44 UTC
Created attachment 402926 [details]
ptol test log output

Comment 4 Alan Conway 2010-03-30 21:00:18 UTC
Fixed in r929277

Comment 5 Frantisek Reznicek 2010-06-01 09:17:29 UTC
While running the qpid-cpp-server-0.7.946106-2 broker, the client libraries, and the qpid-tree failover_soak test, I was able to trigger a failover hang that sat in qpid::tests::wait_for_newbie() for hours:

[root@mrg-qe-02 qpid_ptest_cluster_failover_soak]# pstack 15838
Thread 2 (Thread 0x426db940 (LWP 17645)):
#0  0x00000036a10d4108 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b5b2fc9f33f in qpid::sys::Poller::wait(qpid::sys::Duration) () from /usr/lib64/libqpidcommon.so.2
#2  0x00002b5b2fc9fd22 in qpid::sys::Poller::run() () from /usr/lib64/libqpidcommon.so.2
#3  0x00002b5b2fc968da in ?? () from /usr/lib64/libqpidcommon.so.2
#4  0x00000036a1c0673d in start_thread () from /lib64/libpthread.so.0
#5  0x00000036a10d3d1d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2b5b3000db60 (LWP 15838)):
#0  0x00000036a1c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b5b2f931091 in qpid::client::StateManager::waitFor(std::set<int, std::less<int>, std::allocator<int> >) () from /usr/lib64/libqpidclient.so.2
#2  0x00002b5b2f8e8411 in qpid::client::ConnectionHandler::waitForOpen() () from /usr/lib64/libqpidclient.so.2
#3  0x00002b5b2f8f5648 in qpid::client::ConnectionImpl::open() () from /usr/lib64/libqpidclient.so.2
#4  0x00002b5b2f8e7115 in qpid::client::Connection::open(qpid::client::ConnectionSettings const&) () from /usr/lib64/libqpidclient.so.2
#5  0x00002b5b2f8e763e in qpid::client::Connection::open(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, int, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned short) () from /usr/lib64/libqpidclient.so.2
#6  0x000000000040dd52 in qpid::tests::wait_for_newbie() ()
#7  0x000000000040e08a in qpid::tests::killFrontBroker(std::vector<qpid::tests::ForkedBroker*, std::allocator<qpid::tests::ForkedBroker*> >&, int) ()
#8  0x000000000040ecb7 in main ()

More precisely:

(gdb) info thre
  2 Thread 0x426db940 (LWP 17645)  0x00000036a10d4108 in epoll_wait () from /lib64/libc.so.6
* 1 Thread 0x2b5b3000db60 (LWP 15838)  0x00000036a1c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) threa apply all bt

Thread 2 (Thread 0x426db940 (LWP 17645)):
#0  0x00000036a10d4108 in epoll_wait () from /lib64/libc.so.6
#1  0x00002b5b2fc9f33f in qpid::sys::Poller::wait (this=0x1af882e0, timeout=<value optimized out>) at qpid/sys/epoll/EpollPoller.cpp:524
#2  0x00002b5b2fc9fd22 in qpid::sys::Poller::run (this=0x1af882e0) at qpid/sys/epoll/EpollPoller.cpp:479
#3  0x00002b5b2fc968da in qpid::sys::(anonymous namespace)::runRunnable (p=0x8) at qpid/sys/posix/Thread.cpp:35
#4  0x00000036a1c0673d in start_thread () from /lib64/libpthread.so.0
#5  0x00000036a10d3d1d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x2b5b3000db60 (LWP 15838)):
#0  0x00000036a1c0aee9 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b5b2f931091 in wait (this=0x2aaaac001fe0, desired=...) at ../include/qpid/sys/posix/Condition.h:63
#2  wait (this=0x2aaaac001fe0, desired=...) at ../include/qpid/sys/Monitor.h:41
#3  qpid::client::StateManager::waitFor (this=0x2aaaac001fe0, desired=...) at qpid/client/StateManager.cpp:51
#4  0x00002b5b2f8e8411 in qpid::client::ConnectionHandler::waitForOpen (this=0x2aaaac001f78) at qpid/client/ConnectionHandler.cpp:144
#5  0x00002b5b2f8f5648 in qpid::client::ConnectionImpl::open (this=0x2aaaac001e90) at qpid/client/ConnectionImpl.cpp:282
#6  0x00002b5b2f8e7115 in qpid::client::Connection::open (this=0x7fff5d025b20, settings=...) at qpid/client/Connection.cpp:125
#7  0x00002b5b2f8e763e in qpid::client::Connection::open (this=0x7fff5d025b20, host=..., port=53743, uid=..., pwd=..., vhost=...,
    maxFrameSize=65535) at qpid/client/Connection.cpp:103
#8  0x000000000040dd52 in qpid::tests::wait_for_newbie () at src/tests/failover_soak.cpp:314
#9  0x000000000040e08a in qpid::tests::killFrontBroker (brokers=..., verbosity=10) at src/tests/failover_soak.cpp:417
#10 0x000000000040ecb7 in main (argc=11, argv=0x7fff5d0261c8) at src/tests/failover_soak.cpp:797
(gdb) detach
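
The backtrace above corresponds to a blocking connect with no timeout: Connection::open() sits in ConnectionHandler::waitForOpen() / StateManager::waitFor() until the AMQP open negotiation completes, so if the newly started broker never answers, the caller waits forever. A minimal C++ sketch of that pattern (hypothetical, using the qpid::client API from the trace; this is not the actual wait_for_newbie() from src/tests/failover_soak.cpp):

// Hypothetical sketch, not the real src/tests/failover_soak.cpp source.
#include <qpid/client/Connection.h>
#include <exception>
#include <string>

void wait_for_newbie_sketch(const std::string& host, int port) {
    qpid::client::Connection connection;
    for (;;) {
        try {
            // Blocks in StateManager::waitFor() (see frames above); there is no
            // timeout, so a broker that accepts the TCP connection but never
            // completes the open handshake hangs the caller indefinitely.
            connection.open(host, port);
            connection.close();
            return;  // the new broker answered
        } catch (const std::exception&) {
            // broker not accepting connections yet; retry
        }
    }
}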

failover_soak,15838 /usr/lib64/qpid/daemon ../declare_queues ../replaying_sender ../resuming_receiver 1000000 20000 10 0 3 7
  ├─qpidd,8320 --cluster-name soakTestCluster_1e777f1a-114f-4c95-a56b-f85c26927629 --auth=no --mgmt-enable no --log-prefix soak-9 ...
  │   ├─{qpidd},8321
  │   ├─{qpidd},8322
  │   ├─{qpidd},8324
  │   ├─{qpidd},8341
  │   ├─{qpidd},8342
  │   ├─{qpidd},8343
  │   ├─{qpidd},8344
  │   ├─{qpidd},8345
  │   ├─{qpidd},8346
  │   ├─{qpidd},8347
  │   └─{qpidd},8348
  ├─qpidd,8362 --cluster-name soakTestCluster_1e777f1a-114f-4c95-a56b-f85c26927629 --auth=no --mgmt-enable no --log-prefix soak-10 ...
  │   ├─{qpidd},8364
  │   ├─{qpidd},8365
  │   ├─{qpidd},8367
  │   ├─{qpidd},8373
  │   ├─{qpidd},8374
  │   ├─{qpidd},8375
  │   ├─{qpidd},8376
  │   ├─{qpidd},8377
  │   ├─{qpidd},8378
  │   ├─{qpidd},8379
  │   └─{qpidd},8380
  ├─qpidd,8391 --cluster-name soakTestCluster_1e777f1a-114f-4c95-a56b-f85c26927629 --auth=no --mgmt-enable no --log-prefix soak-11 ...
  │   ├─{qpidd},8392
  │   ├─{qpidd},8393
  │   ├─{qpidd},8395
  │   ├─{qpidd},8398
  │   ├─{qpidd},8399
  │   ├─{qpidd},8400
  │   ├─{qpidd},8401
  │   ├─{qpidd},8402
  │   ├─{qpidd},8403
  │   ├─{qpidd},8404
  │   └─{qpidd},8405
  ├─qpidd,8444 --cluster-name soakTestCluster_1e777f1a-114f-4c95-a56b-f85c26927629 --auth=no --mgmt-enable no --log-prefix soak-12 ...
  │   ├─{qpidd},8445
  │   ├─{qpidd},8446
  │   ├─{qpidd},8448
  │   ├─{qpidd},8453
  │   ├─{qpidd},8454
  │   ├─{qpidd},8455
  │   ├─{qpidd},8456
  │   ├─{qpidd},8457
  │   ├─{qpidd},8458
  │   ├─{qpidd},8459
  │   └─{qpidd},8460
  ├─qpidd,8848 --cluster-name soakTestCluster_1e777f1a-114f-4c95-a56b-f85c26927629 --auth=no --mgmt-enable no --log-prefix soak-13 ...
  │   ├─{qpidd},8849
  │   ├─{qpidd},8850
  │   ├─{qpidd},8852
  │   ├─{qpidd},8932
  │   ├─{qpidd},8933
  │   ├─{qpidd},8934
  │   ├─{qpidd},8935
  │   ├─{qpidd},8936
  │   ├─{qpidd},8937
  │   ├─{qpidd},8938
  │   └─{qpidd},8939
  ├─qpidd,10168 --cluster-name soakTestCluster_1e777f1a-114f-4c95-a56b-f85c26927629 --auth=no --mgmt-enable no --log-prefix soak-14 ...
  │   ├─{qpidd},10169
  │   ├─{qpidd},10170
  │   ├─{qpidd},10172
  │   ├─{qpidd},10747
  │   ├─{qpidd},10748
  │   ├─{qpidd},10749
  │   ├─{qpidd},10750
  │   ├─{qpidd},10751
  │   ├─{qpidd},10752
  │   ├─{qpidd},10753
  │   └─{qpidd},10754
  ├─qpidd,28467 --cluster-name soakTestCluster_1e777f1a-114f-4c95-a56b-f85c26927629 --auth=no --mgmt-enable no --log-prefix soak-8 ...
  │   ├─{qpidd},28468
  │   ├─{qpidd},28522
  │   ├─{qpidd},28524
  │   ├─{qpidd},28932
  │   ├─{qpidd},28933
  │   ├─{qpidd},28934
  │   ├─{qpidd},28935
  │   ├─{qpidd},28936
  │   ├─{qpidd},28937
  │   ├─{qpidd},28938
  │   ├─{qpidd},28939
  │   ├─{qpidd},8407
  │   ├─{qpidd},8408
  │   └─{qpidd},8409
  ├─(replaying_sende,16573)
  ├─(replaying_sende,16575)
  ├─(replaying_sende,16577)
  ├─resuming_receiv,16572 127.0.0.1 55269 20000 1 failover_soak_15838_0
  │   ├─{resuming_receiv},16578
  │   ├─{resuming_receiv},18457
  │   └─{resuming_receiv},8410
  ├─resuming_receiv,16574 127.0.0.1 55269 20000 1 failover_soak_15838_1
  │   ├─{resuming_receiv},16582
  │   ├─{resuming_receiv},18461
  │   ├─{resuming_receiv},8363
  │   └─{resuming_receiv},8411
  ├─resuming_receiv,16576 127.0.0.1 55269 20000 1 failover_soak_15838_2
  │   ├─{resuming_receiv},16580
  │   ├─{resuming_receiv},18460
  │   └─{resuming_receiv},8412
  └─{failover_soak},17645



To make the line numbers match, there are three harmless patches applied to failover_soak:
http://cvs.devel.redhat.com/cgi-bin/cvsweb.cgi/~checkout~/tests/distribution/MRG/Messaging/qpid_ptest_cluster_failover_soak/fsoak.patch?rev=1.3;content-type=application%2Foctet-stream

The current hang seems to be different from the one described above, but since not all of the past hang data is available it is hard to confirm (comment 3; I am also not sure how to analyze comment 2's attachment).


I can see the current hang only on RHEL 5.5 x86_64, and in that hung state no broker is responsive.

Would you advise me to treat it as a different bug or not?

Comment 6 Frantisek Reznicek 2010-06-01 09:21:00 UTC
Created attachment 418551 [details]
state of hung broker and clients

The attachment stores the process-state backtraces of all clients and brokers involved in failover_soak.

Comment 7 Alan Conway 2010-06-01 14:26:19 UTC
I think this is a different hang, but we can just keep this BZ open for it.

Were the unresponsive brokers consuming CPU?
How easy is this to reproduce?

Comment 8 Frantisek Reznicek 2010-06-02 09:34:14 UTC
The test was run on an 8-core AMD Opteron; all cluster nodes were hung and the qpidd brokers were consuming CPU. I triggered the issue quite frequently on RHEL 5.5 x86_64; on i386 no such case was seen.

Let me know if you need me to reproduce this case. Unfortunately I had to stop the hung run because I needed the machine.

Comment 9 Alan Conway 2010-06-04 17:46:44 UTC
Created attachment 421308 [details]
Fast reproducer

Fast reproducer for this deadlock: start a broker with the configuration below and run the attached script; it deadlocks in under 30 seconds. (An illustrative, hypothetical client loop is also sketched after the configuration.)

---- qpidd.conf
auth=no
cluster-name=test-cluster
worker-threads=3
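
For reference, a minimal sketch of a connection-churn client that could be pointed at such a broker. This is hypothetical, not the attached 342-byte reproducer script (whose contents are not shown here), and it assumes the broker above listens on localhost:5672:

// Hypothetical connection-churn client; not the attached reproducer script.
#include <qpid/client/Connection.h>
#include <exception>
#include <iostream>

int main() {
    for (int i = 0; ; ++i) {
        qpid::client::Connection connection;
        try {
            connection.open("127.0.0.1", 5672);  // default AMQP port
            connection.close();
        } catch (const std::exception& e) {
            std::cerr << "iteration " << i << ": " << e.what() << std::endl;
        }
    }
}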

Comment 10 Alan Conway 2010-06-08 15:38:43 UTC
Fixed on trunk r952692 and in release repo: http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=c13cc742f961cb771f008cd19edf38cb43ee6aeb

Comment 11 Frantisek Reznicek 2010-09-08 14:47:27 UTC
The issue is proven to be fixed (no hangs over a few days of cycling the
test); retested on RHEL 5.5 i386 / x86_64 with packages:
python-qmf-0.7.946106-12.el5
python-qpid-0.7.946106-13.el5
qmf-0.7.946106-12.el5
qmf-devel-0.7.946106-12.el5
qpid-cpp-*-0.7.946106-12.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-*-0.7.946106-8.el5
qpid-tests-0.7.946106-1.el5
qpid-tools-0.7.946106-10.el5
ruby-qmf-0.7.946106-12.el5

-> VERIFIED

