Bug 506758 - Error inconsistencies when starting cluster
Summary: Error inconsistencies when starting cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1.1
Hardware: All
OS: Linux
urgent
urgent
Target Milestone: 1.1.6
: ---
Assignee: Alan Conway
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2009-06-18 15:27 UTC by Gordon Sim
Modified: 2018-10-20 00:57 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-07-14 17:31:51 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
test program (2.91 KB, text/x-c++src)
2009-06-18 15:28 UTC, Gordon Sim
no flags Details
test program that closes connections (3.04 KB, text/x-c++src)
2009-06-18 21:05 UTC, Alan Conway
no flags Details
patch to fix the issue (8.26 KB, patch)
2009-06-18 21:25 UTC, Alan Conway
no flags Details | Diff
Patch that really fixes the issue (32.09 KB, patch)
2009-07-01 21:33 UTC, Alan Conway
no flags Details | Diff
corrected patch, previous patch was missing new files. (36.06 KB, patch)
2009-07-02 16:10 UTC, Alan Conway
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2009:1153 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging bug fixing update 2009-07-14 17:31:48 UTC

Description Gordon Sim 2009-06-18 15:27:46 UTC
Description of problem:

When a series of clients are trying to subscribe to a non-existent queue and a 4 node cluster is then started, some of the nodes fail due to inconsistent errors.

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-17.el5

How reproducible:

Reliably

Steps to Reproduce:
1. have clients trying to connect and subscribe to a non-existent queue
   (e.g. using the attached test prog I run:

    for q in `seq 1 50`; do ./subscribe --queue "queue-$q" --url amqp:tcp:127.0.0.1:5672,tcp:127.0.0.1:5673,tcp:127.0.0.1:5674,tcp:127.0.0.1:5675 & done
   )

2. start 4 cluster nodes for the clients to connect to
   (e.g. 
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --auth no --cluster-name grs-mrg15-test-cluster --data-dir `pwd`/test-cluster-temp-data-$p --port $p --log-to-file test-cluster-qpidd.$p.log --log-to-syslog false --daemon; sleep 2; done
    )
3. check each node is alive
   (e.g.
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --check --port $p; done
   tail test-cluster-qpidd.567*.log
   )
  
Actual results:

2 or 3 nodes fail with:

2009-jun-18 11:21:49 error Execution exception: not-found: Queue not found: queue-37 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-18 11:21:49 critical 100.0.50.15:14363(CATCHUP/error) Error 4009 did not occur on 100.0.50.15:14545
2009-jun-18 11:21:49 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-jun-18 11:21:49 notice 100.0.50.15:14363(LEFT/error) leaving cluster grs-mrg15-test-cluster
2009-jun-18 11:21:49 notice Shut down


Expected results:

All nodes remain up and running.

Additional info:

I had to put a short sleep between starting the nodes or the surviving nodes appear to get stuck trying to join.

Comment 1 Gordon Sim 2009-06-18 15:28:46 UTC
Created attachment 348500 [details]
test program

Comment 2 Gordon Sim 2009-06-18 16:11:00 UTC
In attached program, --retry-interval 0 --url amqp:tcp:127.0.0.1 makes it easier to reproduce on some boxes. Reproduced with qpidd-0.5.752581-18.el5 also.

Comment 3 Alan Conway 2009-06-18 21:05:12 UTC
Created attachment 348555 [details]
 test program  that closes connections

The original runs out of file descriptors if you're not sharp about starting the cluster.

Comment 4 Alan Conway 2009-06-18 21:25:42 UTC
Created attachment 348561 [details]
patch to fix the issue

Fix also commited to trun r786294

This fix includes XML changes so code needs to be re-generated.

Comment 5 Gordon Sim 2009-06-23 09:08:42 UTC
Fixed in qpidd-0.5.752581-20.

Comment 6 Gordon Sim 2009-06-23 10:05:47 UTC
Actually, I still see problems on qpidd-0.5.752581-20. When adding nodes while there are many clients attached (some of which are continually generating errors) the cluster can become unresponsive for long periods (indefinitely?). 


I also saw a node exit on one occasion with:

2009-jun-23 05:37:50 error Execution exception: not-found: Queue not found: queue-50 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-23 05:37:50 error Execution exception: not-found: Unknown destination queue-4 (qpid/broker/SemanticState.cpp:458)
2009-jun-23 05:37:50 critical 100.0.50.15:20677(CATCHUP/error) Error 4723 did not occur on 100.0.50.15:20317
2009-jun-23 05:37:50 error Error delivering frames: Aborted by local failure that did not occur on all replicas

Comment 7 Gordon Sim 2009-06-23 10:25:47 UTC
Update on previous comment - cluster is not indefinitely unresponsive, it became usable for me in my latest test after 30 mins. If this is expected it should be noted in the documentation/release notes.

Comment 8 Alan Conway 2009-07-01 13:47:55 UTC
Fixed in r790163

If an offer is delivered while an error is being resolved, the offer is retracted and the newcomer tries again.

Comment 9 Alan Conway 2009-07-01 21:33:59 UTC
Created attachment 350203 [details]
Patch that really fixes the issue

Patch based on qpidd-0.5.752581-20 + patch from bug 508917 (apply that first)

Comment 10 Alan Conway 2009-07-01 21:59:26 UTC
Comitted to trunk r790397

Comment 11 Alan Conway 2009-07-02 16:10:42 UTC
Created attachment 350309 [details]
corrected patch, previous patch was missing new files.

Comment 12 Gordon Sim 2009-07-03 07:28:33 UTC
Fixed in qpidd-0.5.752581-22.

Comment 13 Gordon Sim 2009-07-03 09:31:21 UTC
I have reproduced a hung cluster on qpidd-0.5.752581-22 using the test for this BZ. The pstack output from all nodes shows nothing untoward, last log entry for each node is "2009-jul-03 05:14:55 notice Broker running" (i.e. no notice of joining cluster or of being the first member).

Comment 14 Alan Conway 2009-07-07 13:16:28 UTC
This hang may be https://bugzilla.redhat.com/show_bug.cgi?id=494393, I saw that several times while testing. If you can reproduce the hang with --log-enable=debug+:cluster you will see all members in the JOINER state.

Its quite rare and not an issue that would cause data loss in a running cluster so I think its reasonable to leave it for 1.2

Comment 15 Gordon Sim 2009-07-07 17:39:35 UTC
Believed fixed in qpidd-0.5.752581-22; issues seen likely to be over-aggressive test.

Comment 16 Frantisek Reznicek 2009-07-09 14:44:06 UTC
The issue has been fixed validated on RHEL 5.3 i386 / x86_64 on packages:

[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-25.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5


-> VERIFIED

Comment 18 errata-xmlrpc 2009-07-14 17:31:51 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html


Note You need to log in before you can comment on or make changes to this bug.