505210 – perftest loop causes cluster crash

Bug 505210 - perftest loop causes cluster crash

Summary: perftest loop causes cluster crash

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	1.1.1
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.1.2
Target Release:	---
Assignee:	Alan Conway
QA Contact:	Frantisek Reznicek
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-06-11 02:45 UTC by Alan Conway
Modified:	2015-11-16 00:07 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-06-28 19:34:00 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
issue reproducer (14.23 KB, application/x-tbz) 2009-06-11 14:45 UTC, Frantisek Reznicek	no flags	Details
Patch to fix the issue (2.32 KB, patch) 2009-06-12 20:43 UTC, Alan Conway	no flags	Details \| Diff
View All

Description Alan Conway 2009-06-11 02:45:27 UTC

Description of problem:

Running perftest in a loop as described below crashes all members of a cluster except the directly connected one with:

invalid-argument: anonymous.a96d061f-206e-42d5-9dfa-455a5fd904e0: confirmed < (1+0) but only sent < (0+0) 

Version-Release number of selected component (if applicable):

How reproducible: easy

Steps to Reproduce:
1. run a qpidd cluster
2. while true; do src/tests/perftest & sleep 2 ; done
Watch for the other cluster members to die
  
Actual results:

All but directly connected node exit with error.

Expected results:

no failure

Additional info:

The error is always confirmed 1 but sent 0. 

Some of the perftests will exit with
Controller exception: Bad report: Sequence error: expected  n==81970 but got 81985 
This is expected as we're running multiple perftests using the same queues.

I suspect that the broker error is related to the sub-done queue, and the fact that the controller is throwing an exception which shuts down the client as soon as it gets the bad sub-done message.

Comment 1 Frantisek Reznicek 2009-06-11 14:45:36 UTC

Created attachment 347417 [details]
issue reproducer

Comment 2 Frantisek Reznicek 2009-06-11 14:47:23 UTC

The issue is still present in -15 and also -16 as far as I can see (transcript of the attached reproducer from above).

[root@hp-dl360g5-01 bz505210]# ./run.sh
aisexec (pid 1900) is running...
Stopping OpenAIS daemon (aisexec): [  OK  ]
Starting OpenAIS daemon (aisexec): [  OK  ]
client (perftest_mod) compile, ecode:0
client[s] ready
Client[s] compile
.ready
starting brokers in the cluster:...done
tcp        0      0 0.0.0.0:5672                0.0.0.0:*                   LISTEN      2217/qpidd
tcp        0      0 0.0.0.0:10001               0.0.0.0:*                   LISTEN      2233/qpidd
tcp        0      0 0.0.0.0:10002               0.0.0.0:*                   LISTEN      2251/qpidd
broker[s] running (pids:2217 2233 2251 )
perftest_mod loop started:
1/2(4):LLLL|.abort-cluster
tcp        0      0 0.0.0.0:5672                0.0.0.0:*                   LISTEN      2217/qpidd
test broker[s]:2217 2233 2251 done
stopping brokers...
FFFFOK
ERROR: Client[s] failed! (ecodes:)
qpidd brokers failed with searched message, cnt:4
| qpidd_1.log:2009-jun-11 10:40:14 debug Exception constructed: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
| qpidd_1.log:2009-jun-11 10:40:14 error Execution exception: invalid-argument: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
| qpidd_2.log:2009-jun-11 10:40:14 debug Exception constructed: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
| qpidd_2.log:2009-jun-11 10:40:14 error Execution exception: invalid-argument: anonymous.8a43df1e-6590-4e85-bcc4-9b12f77eeb5d: confirmed < (1+0) but only sent < (0+0) (qpid/SessionState.cpp:163)
Broker status:
qpidd_0.ecode:0
qpidd_1.ecode:0
qpidd_2.ecode:0
qpidd_0.ecode: 2009-06-11 10:40:20.000000000 -0400
qpidd_1.ecode: 2009-06-11 10:40:14.000000000 -0400
qpidd_2.ecode: 2009-06-11 10:40:14.000000000 -0400
qpidd-0.5.752581-16.el5
qpidd-cluster-0.5.752581-16.el5
openais-0.80.3-22.el5_3.7
Test Summary: TEST FAILED !!! (4 failures) dur: 36 secs

Comment 3 Frantisek Reznicek 2009-06-11 14:53:47 UTC

The failing data stored on mrg3.lab.bos.redhat.com:/root/bz505210_fails_rhel53i_090611.tar.bz2

Comment 4 Gordon Sim 2009-06-12 08:55:40 UTC

This was a regression since qpidd-0.5.752581-10.el5 and is fixed in qpidd-0.5.752581-17.el5

Error was due to indeterminacy in handling multiple consumers on a queue. It was also observable by starting several receiver programs then running sender in a loop. E.g.

  qpid-config add queue test-queue
  ./src/tests/receiver  > /dev/null &
  ./src/tests/receiver  > /dev/null &  
  ./src/tests/receiver  > /dev/null &
  ./src/tests/receiver  > /dev/null &
  while ./src/tests/sender < inputs ; do true; done

This would eventually cause nodes other than 5672 to shut themselves down.

Comment 5 Frantisek Reznicek 2009-06-12 09:20:10 UTC

The issue has been fixed on RHEL 5.3 i386 / x86_64 on packages qpid*-0.5.752581-17.el5.

-> VERIFIED

Comment 6 Alan Conway 2009-06-12 20:43:05 UTC

Created attachment 347662 [details]
Patch to fix the issue

patch also comitted to qpid trunk r784263

Comment 7 Justin Ross 2011-06-28 19:34:00 UTC

Fixed and verfified; closing.

Note You need to log in before you can comment on or make changes to this bug.