Description of problem:
When a number of clients repeatedly try to subscribe to a non-existent queue and a 4-node cluster is then started, some of the nodes fail due to inconsistent errors.

Version-Release number of selected component (if applicable):
qpidd-0.5.752581-17.el5

How reproducible:
Reliably

Steps to Reproduce:
1. Have clients trying to connect and subscribe to a non-existent queue, e.g. using the attached test program:
   for q in `seq 1 50`; do ./subscribe --queue "queue-$q" --url amqp:tcp:127.0.0.1:5672,tcp:127.0.0.1:5673,tcp:127.0.0.1:5674,tcp:127.0.0.1:5675 & done
2. Start 4 cluster nodes for the clients to connect to:
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --auth no --cluster-name grs-mrg15-test-cluster --data-dir `pwd`/test-cluster-temp-data-$p --port $p --log-to-file test-cluster-qpidd.$p.log --log-to-syslog false --daemon; sleep 2; done
3. Check that each node is alive:
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --check --port $p; done
   tail test-cluster-qpidd.567*.log

Actual results:
2 or 3 nodes fail with:
2009-jun-18 11:21:49 error Execution exception: not-found: Queue not found: queue-37 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-18 11:21:49 critical 100.0.50.15:14363(CATCHUP/error) Error 4009 did not occur on 100.0.50.15:14545
2009-jun-18 11:21:49 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-jun-18 11:21:49 notice 100.0.50.15:14363(LEFT/error) leaving cluster grs-mrg15-test-cluster
2009-jun-18 11:21:49 notice Shut down

Expected results:
All nodes remain up and running.

Additional info:
I had to put a short sleep between starting the nodes; otherwise the surviving nodes appear to get stuck trying to join.
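For anyone unfamiliar with the "did not occur on all replicas" failure above: a clustered broker requires every replica to hit the same errors at the same points in the event stream, and a node whose local error is not confirmed by its peers must leave the cluster. The rule can be pictured with a minimal sketch like the following (hypothetical names; this is an illustration of the consistency rule, not qpid's actual cluster code):

```python
# Sketch of the replica error-consistency rule (hypothetical API, for
# illustration only). Each node records the error ids it saw locally;
# a node that saw an error its peers did not see is inconsistent with
# the cluster and must shut itself down, as in the log above.

def nodes_that_must_leave(errors_by_node):
    """errors_by_node maps node id -> set of error ids seen locally.

    Returns the set of nodes whose local errors did not occur on all
    replicas; those nodes abort the offending frames and leave."""
    leaving = set()
    for node, errors in errors_by_node.items():
        for other, other_errors in errors_by_node.items():
            if errors - other_errors:
                # This node saw an error that 'other' never saw.
                leaving.add(node)
    return leaving
```

In the log above, node 100.0.50.15:14363 hit error 4009 but peer 100.0.50.15:14545 did not, so 14363 logs "Aborted by local failure that did not occur on all replicas" and leaves the cluster.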
Created attachment 348500 [details] test program
In the attached program, --retry-interval 0 --url amqp:tcp:127.0.0.1 makes it easier to reproduce on some boxes. Also reproduced with qpidd-0.5.752581-18.el5.
Created attachment 348555 [details] test program that closes connections

The original test program runs out of file descriptors if you're not quick about starting the cluster.
Created attachment 348561 [details] patch to fix the issue

Fix also committed to trunk as r786294. This fix includes XML changes, so the generated code needs to be regenerated.
Fixed in qpidd-0.5.752581-20.
Actually, I still see problems on qpidd-0.5.752581-20. When adding nodes while there are many clients attached (some of which are continually generating errors), the cluster can become unresponsive for long periods (indefinitely?). I also saw a node exit on one occasion with:

2009-jun-23 05:37:50 error Execution exception: not-found: Queue not found: queue-50 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-23 05:37:50 error Execution exception: not-found: Unknown destination queue-4 (qpid/broker/SemanticState.cpp:458)
2009-jun-23 05:37:50 critical 100.0.50.15:20677(CATCHUP/error) Error 4723 did not occur on 100.0.50.15:20317
2009-jun-23 05:37:50 error Error delivering frames: Aborted by local failure that did not occur on all replicas
Update on the previous comment: the cluster is not indefinitely unresponsive; it became usable for me in my latest test after 30 minutes. If this is expected, it should be noted in the documentation/release notes.
Fixed in r790163. If an offer is delivered while an error is being resolved, the offer is retracted and the newcomer tries again.
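The offer/retract behaviour described above can be sketched as a simple retry loop (made-up names and a deliberately simplified model; this is not the actual qpid cluster implementation):

```python
# Illustrative sketch of the fix in r790163: an update offer made to a
# newcomer is retracted if it arrives while an error is being resolved,
# and the newcomer simply offers again. Hypothetical names throughout.

class Cluster:
    def __init__(self):
        self.resolving_error = False  # True while an error is being resolved

    def handle_offer(self, newcomer):
        """Accept an update offer unless error resolution is in progress,
        in which case the offer is retracted."""
        if self.resolving_error:
            return "retracted"
        return "accepted"

def join(cluster, newcomer, max_attempts=10):
    """Newcomer keeps re-offering until an offer is accepted.

    Returns the attempt number on which the offer was accepted."""
    for attempt in range(1, max_attempts + 1):
        if cluster.handle_offer(newcomer) == "accepted":
            return attempt
        # Offer retracted mid-error-resolution; in the real broker the
        # error eventually resolves, which we model by clearing the flag.
        cluster.resolving_error = False
    raise RuntimeError("newcomer could not join")
```

The point of the retraction is that an update taken while an error is outstanding could leave the newcomer inconsistent with the rest of the cluster, so it is cheaper and safer to have the newcomer simply retry once the error has been resolved.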
Created attachment 350203 [details] Patch that really fixes the issue

Patch based on qpidd-0.5.752581-20 plus the patch from bug 508917 (apply that one first).
Committed to trunk as r790397.
Created attachment 350309 [details] corrected patch

The previous patch was missing the new files.
Fixed in qpidd-0.5.752581-22.
I have reproduced a hung cluster on qpidd-0.5.752581-22 using the test for this BZ. The pstack output from all nodes shows nothing untoward; the last log entry for each node is "2009-jul-03 05:14:55 notice Broker running" (i.e. no notice of joining the cluster or of being the first member).
This hang may be https://bugzilla.redhat.com/show_bug.cgi?id=494393; I saw that several times while testing. If you can reproduce the hang with --log-enable=debug+:cluster, you will see all members in the JOINER state. It's quite rare and not an issue that would cause data loss in a running cluster, so I think it's reasonable to leave it for 1.2.
Believed fixed in qpidd-0.5.752581-22; the remaining issues seen are likely due to an over-aggressive test.
The fix has been validated on RHEL 5.3 i386 / x86_64 with the following packages:

[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-25.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5

-> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2009-1153.html