Bug 505210 - perftest loop causes cluster crash
perftest loop causes cluster crash
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp (Show other bugs)
1.1.1
All Linux
high Severity high
: 1.1.2
: ---
Assigned To: Alan Conway
Frantisek Reznicek
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2009-06-10 22:45 EDT by Alan Conway
Modified: 2015-11-15 19:07 EST (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-06-28 15:34:00 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
issue reproducer (14.23 KB, application/x-tbz)
2009-06-11 10:45 EDT, Frantisek Reznicek
no flags Details
Patch to fix the issue (2.32 KB, patch)
2009-06-12 16:43 EDT, Alan Conway
no flags Details | Diff

  None (edit)
Description Alan Conway 2009-06-10 22:45:27 EDT
Description of problem:

Running perftest in a loop as described below crashes all members of a cluster except the directly connected one with:

invalid-argument: anonymous.a96d061f-206e-42d5-9dfa-455a5fd904e0: confirmed < (1+0) but only sent < (0+0) 

Version-Release number of selected component (if applicable):

How reproducible: easy

Steps to Reproduce:
1. run a qpidd cluster
2. while true; do src/tests/perftest & sleep 2 ; done
Watch for the other cluster members to die
  
Actual results:

All but directly connected node exit with error.

Expected results:

no failure

Additional info:

The error is always confirmed 1 but sent 0. 

Some of the perftests will exit with
Controller exception: Bad report: Sequence error: expected  n==81970 but got 81985 
This is expected as we're running multiple perftests using the same queues.

I suspect that the broker error is related to the sub-done queue, and the fact that the controller is throwing an exception which shuts down the client as soon as it gets the bad sub-done message.
Comment 1 Frantisek Reznicek 2009-06-11 10:45:36 EDT
Created attachment 347417 [details]
issue reproducer
Comment 2 Frantisek Reznicek 2009-06-11 10:47:23 EDT
The issue is still present in -15 and also -16 as far as I can see (transcript of the attached reproducer from above).

[root@hp-dl360g5-01 bz505210]# ./run.sh
aisexec (pid 1900) is running...
Stopping OpenAIS daemon (aisexec): [  OK  ]
Starting OpenAIS daemon (aisexec): [  OK  ]
client (perftest_mod) compile, ecode:0
client[s] ready
Client[s] compile
.ready
starting brokers in the cluster:...done
tcp        0      0 0.0.0.0:5672                0.0.0.0:*                   LISTEN      2217/qpidd
tcp        0      0 0.0.0.0:10001               0.0.0.0:*                   LISTEN      2233/qpidd
tcp        0      0 0.0.0.0:10002               0.0.0.0:*                   LISTEN      2251/qpidd
broker[s] running (pids:2217 2233 2251 )
perftest_mod loop started:
1/2(4):LLLL|.abort-cluster
tcp        0      0 0.0.0.0:5672                0.0.0.0:*                   LISTEN      2217/qpidd
test broker[s]:2217 2233 2251 done
stopping brokers...
FFFFOK
ERROR: Client[s] failed! (ecodes:)
qpidd brokers failed with searched message, cnt:4
| qpidd_1.log:2009-jun-11 10:40:14 debug Exception constructed: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
| qpidd_1.log:2009-jun-11 10:40:14 error Execution exception: invalid-argument: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
| qpidd_2.log:2009-jun-11 10:40:14 debug Exception constructed: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
| qpidd_2.log:2009-jun-11 10:40:14 error Execution exception: invalid-argument: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
Broker status:
qpidd_0.ecode:0
qpidd_1.ecode:0
qpidd_2.ecode:0
qpidd_0.ecode: 2009-06-11 10:40:20.000000000 -0400
qpidd_1.ecode: 2009-06-11 10:40:14.000000000 -0400
qpidd_2.ecode: 2009-06-11 10:40:14.000000000 -0400
qpidd-0.5.752581-16.el5
qpidd-cluster-0.5.752581-16.el5
openais-0.80.3-22.el5_3.7
Test Summary: TEST FAILED !!! (4 failures) dur: 36 secs
Comment 3 Frantisek Reznicek 2009-06-11 10:53:47 EDT
The failing data stored on mrg3.lab.bos.redhat.com:/root/bz505210_fails_rhel53i_090611.tar.bz2
Comment 4 Gordon Sim 2009-06-12 04:55:40 EDT
This was a regression since qpidd-0.5.752581-10.el5 and is fixed in qpidd-0.5.752581-17.el5

Error was due to indeterminacy in handling multiple consumers on a queue. It was also observable by starting several receiver programs then running sender in a loop. E.g.

  qpid-config add queue test-queue
  ./src/tests/receiver  > /dev/null &
  ./src/tests/receiver  > /dev/null &  
  ./src/tests/receiver  > /dev/null &
  ./src/tests/receiver  > /dev/null &
  while ./src/tests/sender < inputs ; do true; done

This would eventually cause nodes other than 5672 to shut themselves down.
Comment 5 Frantisek Reznicek 2009-06-12 05:20:10 EDT
The issue has been fixed on RHEL 5.3 i386 / x86_64 on packages qpid*-0.5.752581-17.el5.

-> VERIFIED
Comment 6 Alan Conway 2009-06-12 16:43:05 EDT
Created attachment 347662 [details]
Patch to fix the issue

patch also comitted to qpid trunk r784263
Comment 7 Justin Ross 2011-06-28 15:34:00 EDT
Fixed and verfified; closing.

Note You need to log in before you can comment on or make changes to this bug.