Bug 505210 - perftest loop causes cluster crash
Summary: perftest loop causes cluster crash
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1.1
Hardware: All
OS: Linux
high
high
Target Milestone: 1.1.2
: ---
Assignee: Alan Conway
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-06-11 02:45 UTC by Alan Conway
Modified: 2015-11-16 00:07 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-06-28 19:34:00 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
issue reproducer (14.23 KB, application/x-tbz)
2009-06-11 14:45 UTC, Frantisek Reznicek
no flags Details
Patch to fix the issue (2.32 KB, patch)
2009-06-12 20:43 UTC, Alan Conway
no flags Details | Diff

Description Alan Conway 2009-06-11 02:45:27 UTC
Description of problem:

Running perftest in a loop as described below crashes all members of a cluster except the directly connected one with:

invalid-argument: anonymous.a96d061f-206e-42d5-9dfa-455a5fd904e0: confirmed < (1+0) but only sent < (0+0) 

Version-Release number of selected component (if applicable):

How reproducible: easy

Steps to Reproduce:
1. run a qpidd cluster
2. while true; do src/tests/perftest & sleep 2 ; done
Watch for the other cluster members to die
  
Actual results:

All but directly connected node exit with error.

Expected results:

no failure

Additional info:

The error is always confirmed 1 but sent 0. 

Some of the perftests will exit with
Controller exception: Bad report: Sequence error: expected  n==81970 but got 81985 
This is expected as we're running multiple perftests using the same queues.

I suspect that the broker error is related to the sub-done queue, and the fact that the controller is throwing an exception which shuts down the client as soon as it gets the bad sub-done message.

Comment 1 Frantisek Reznicek 2009-06-11 14:45:36 UTC
Created attachment 347417 [details]
issue reproducer

Comment 2 Frantisek Reznicek 2009-06-11 14:47:23 UTC
The issue is still present in -15 and also -16 as far as I can see (transcript of the attached reproducer from above).

[root@hp-dl360g5-01 bz505210]# ./run.sh
aisexec (pid 1900) is running...
Stopping OpenAIS daemon (aisexec): [  OK  ]
Starting OpenAIS daemon (aisexec): [  OK  ]
client (perftest_mod) compile, ecode:0
client[s] ready
Client[s] compile
.ready
starting brokers in the cluster:...done
tcp        0      0 0.0.0.0:5672                0.0.0.0:*                   LISTEN      2217/qpidd
tcp        0      0 0.0.0.0:10001               0.0.0.0:*                   LISTEN      2233/qpidd
tcp        0      0 0.0.0.0:10002               0.0.0.0:*                   LISTEN      2251/qpidd
broker[s] running (pids:2217 2233 2251 )
perftest_mod loop started:
1/2(4):LLLL|.abort-cluster
tcp        0      0 0.0.0.0:5672                0.0.0.0:*                   LISTEN      2217/qpidd
test broker[s]:2217 2233 2251 done
stopping brokers...
FFFFOK
ERROR: Client[s] failed! (ecodes:)
qpidd brokers failed with searched message, cnt:4
| qpidd_1.log:2009-jun-11 10:40:14 debug Exception constructed: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
| qpidd_1.log:2009-jun-11 10:40:14 error Execution exception: invalid-argument: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
| qpidd_2.log:2009-jun-11 10:40:14 debug Exception constructed: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
| qpidd_2.log:2009-jun-11 10:40:14 error Execution exception: invalid-argument: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
Broker status:
qpidd_0.ecode:0
qpidd_1.ecode:0
qpidd_2.ecode:0
qpidd_0.ecode: 2009-06-11 10:40:20.000000000 -0400
qpidd_1.ecode: 2009-06-11 10:40:14.000000000 -0400
qpidd_2.ecode: 2009-06-11 10:40:14.000000000 -0400
qpidd-0.5.752581-16.el5
qpidd-cluster-0.5.752581-16.el5
openais-0.80.3-22.el5_3.7
Test Summary: TEST FAILED !!! (4 failures) dur: 36 secs

Comment 3 Frantisek Reznicek 2009-06-11 14:53:47 UTC
The failing data stored on mrg3.lab.bos.redhat.com:/root/bz505210_fails_rhel53i_090611.tar.bz2

Comment 4 Gordon Sim 2009-06-12 08:55:40 UTC
This was a regression since qpidd-0.5.752581-10.el5 and is fixed in qpidd-0.5.752581-17.el5

Error was due to indeterminacy in handling multiple consumers on a queue. It was also observable by starting several receiver programs then running sender in a loop. E.g.

  qpid-config add queue test-queue
  ./src/tests/receiver  > /dev/null &
  ./src/tests/receiver  > /dev/null &  
  ./src/tests/receiver  > /dev/null &
  ./src/tests/receiver  > /dev/null &
  while ./src/tests/sender < inputs ; do true; done

This would eventually cause nodes other than 5672 to shut themselves down.

Comment 5 Frantisek Reznicek 2009-06-12 09:20:10 UTC
The issue has been fixed on RHEL 5.3 i386 / x86_64 on packages qpid*-0.5.752581-17.el5.

-> VERIFIED

Comment 6 Alan Conway 2009-06-12 20:43:05 UTC
Created attachment 347662 [details]
Patch to fix the issue

patch also comitted to qpid trunk r784263

Comment 7 Justin Ross 2011-06-28 19:34:00 UTC
Fixed and verfified; closing.


Note You need to log in before you can comment on or make changes to this bug.