Bug 506758 - Error inconsistencies when starting cluster
Error inconsistencies when starting cluster
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp (Show other bugs)
1.1.1
All Linux
urgent Severity urgent
: 1.1.6
: ---
Assigned To: Alan Conway
Frantisek Reznicek
:
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2009-06-18 11:27 EDT by Gordon Sim
Modified: 2015-11-15 19:07 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-07-14 13:31:51 EDT
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
test program (2.91 KB, text/x-c++src)
2009-06-18 11:28 EDT, Gordon Sim
no flags Details
test program that closes connections (3.04 KB, text/x-c++src)
2009-06-18 17:05 EDT, Alan Conway
no flags Details
patch to fix the issue (8.26 KB, patch)
2009-06-18 17:25 EDT, Alan Conway
no flags Details | Diff
Patch that really fixes the issue (32.09 KB, patch)
2009-07-01 17:33 EDT, Alan Conway
no flags Details | Diff
corrected patch, previous patch was missing new files. (36.06 KB, patch)
2009-07-02 12:10 EDT, Alan Conway
no flags Details | Diff

  None (edit)
Description Gordon Sim 2009-06-18 11:27:46 EDT
Description of problem:

When a series of clients are trying to subscribe to a non-existent queue and a 4 node cluster is then started, some of the nodes fail due to inconsistent errors.

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-17.el5

How reproducible:

Reliably

Steps to Reproduce:
1. have clients trying to connect and subscribe to a non-existent queue
   (e.g. using the attached test prog I run:

    for q in `seq 1 50`; do ./subscribe --queue "queue-$q" --url amqp:tcp:127.0.0.1:5672,tcp:127.0.0.1:5673,tcp:127.0.0.1:5674,tcp:127.0.0.1:5675 & done
   )

2. start 4 cluster nodes for the clients to connect to
   (e.g. 
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --auth no --cluster-name grs-mrg15-test-cluster --data-dir `pwd`/test-cluster-temp-data-$p --port $p --log-to-file test-cluster-qpidd.$p.log --log-to-syslog false --daemon; sleep 2; done
    )
3. check each node is alive
   (e.g.
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --check --port $p; done
   tail test-cluster-qpidd.567*.log
   )
  
Actual results:

2 or 3 nodes fail with:

2009-jun-18 11:21:49 error Execution exception: not-found: Queue not found: queue-37 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-18 11:21:49 critical 100.0.50.15:14363(CATCHUP/error) Error 4009 did not occur on 100.0.50.15:14545
2009-jun-18 11:21:49 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-jun-18 11:21:49 notice 100.0.50.15:14363(LEFT/error) leaving cluster grs-mrg15-test-cluster
2009-jun-18 11:21:49 notice Shut down


Expected results:

All nodes remain up and running.

Additional info:

I had to put a short sleep between starting the nodes or the surviving nodes appear to get stuck trying to join.
Comment 1 Gordon Sim 2009-06-18 11:28:46 EDT
Created attachment 348500 [details]
test program
Comment 2 Gordon Sim 2009-06-18 12:11:00 EDT
In attached program, --retry-interval 0 --url amqp:tcp:127.0.0.1 makes it easier to reproduce on some boxes. Reproduced with qpidd-0.5.752581-18.el5 also.
Comment 3 Alan Conway 2009-06-18 17:05:12 EDT
Created attachment 348555 [details]
 test program  that closes connections

The original runs out of file descriptors if you're not sharp about starting the cluster.
Comment 4 Alan Conway 2009-06-18 17:25:42 EDT
Created attachment 348561 [details]
patch to fix the issue

Fix also commited to trun r786294

This fix includes XML changes so code needs to be re-generated.
Comment 5 Gordon Sim 2009-06-23 05:08:42 EDT
Fixed in qpidd-0.5.752581-20.
Comment 6 Gordon Sim 2009-06-23 06:05:47 EDT
Actually, I still see problems on qpidd-0.5.752581-20. When adding nodes while there are many clients attached (some of which are continually generating errors) the cluster can become unresponsive for long periods (indefinitely?). 


I also saw a node exit on one occasion with:

2009-jun-23 05:37:50 error Execution exception: not-found: Queue not found: queue-50 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-23 05:37:50 error Execution exception: not-found: Unknown destination queue-4 (qpid/broker/SemanticState.cpp:458)
2009-jun-23 05:37:50 critical 100.0.50.15:20677(CATCHUP/error) Error 4723 did not occur on 100.0.50.15:20317
2009-jun-23 05:37:50 error Error delivering frames: Aborted by local failure that did not occur on all replicas
Comment 7 Gordon Sim 2009-06-23 06:25:47 EDT
Update on previous comment - cluster is not indefinitely unresponsive, it became usable for me in my latest test after 30 mins. If this is expected it should be noted in the documentation/release notes.
Comment 8 Alan Conway 2009-07-01 09:47:55 EDT
Fixed in r790163

If an offer is delivered while an error is being resolved, the offer is retracted and the newcomer tries again.
Comment 9 Alan Conway 2009-07-01 17:33:59 EDT
Created attachment 350203 [details]
Patch that really fixes the issue

Patch based on qpidd-0.5.752581-20 + patch from bug 508917 (apply that first)
Comment 10 Alan Conway 2009-07-01 17:59:26 EDT
Comitted to trunk r790397
Comment 11 Alan Conway 2009-07-02 12:10:42 EDT
Created attachment 350309 [details]
corrected patch, previous patch was missing new files.
Comment 12 Gordon Sim 2009-07-03 03:28:33 EDT
Fixed in qpidd-0.5.752581-22.
Comment 13 Gordon Sim 2009-07-03 05:31:21 EDT
I have reproduced a hung cluster on qpidd-0.5.752581-22 using the test for this BZ. The pstack output from all nodes shows nothing untoward, last log entry for each node is "2009-jul-03 05:14:55 notice Broker running" (i.e. no notice of joining cluster or of being the first member).
Comment 14 Alan Conway 2009-07-07 09:16:28 EDT
This hang may be https://bugzilla.redhat.com/show_bug.cgi?id=494393, I saw that several times while testing. If you can reproduce the hang with --log-enable=debug+:cluster you will see all members in the JOINER state.

Its quite rare and not an issue that would cause data loss in a running cluster so I think its reasonable to leave it for 1.2
Comment 15 Gordon Sim 2009-07-07 13:39:35 EDT
Believed fixed in qpidd-0.5.752581-22; issues seen likely to be over-aggressive test.
Comment 16 Frantisek Reznicek 2009-07-09 10:44:06 EDT
The issue has been fixed validated on RHEL 5.3 i386 / x86_64 on packages:

[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-25.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5


-> VERIFIED
Comment 18 errata-xmlrpc 2009-07-14 13:31:51 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html

Note You need to log in before you can comment on or make changes to this bug.