Bug 506758

Summary: Error inconsistencies when starting cluster
Product: Red Hat Enterprise MRG Reporter: Gordon Sim <gsim>
Component: qpid-cppAssignee: Alan Conway <aconway>
Status: CLOSED ERRATA QA Contact: Frantisek Reznicek <freznice>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 1.1.1CC: esammons, freznice, jsarenik, tao
Target Milestone: 1.1.6   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-07-14 17:31:51 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
test program
none
test program that closes connections
none
patch to fix the issue
none
Patch that really fixes the issue
none
corrected patch, previous patch was missing new files. none

Description Gordon Sim 2009-06-18 15:27:46 UTC
Description of problem:

When a series of clients are trying to subscribe to a non-existent queue and a 4 node cluster is then started, some of the nodes fail due to inconsistent errors.

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-17.el5

How reproducible:

Reliably

Steps to Reproduce:
1. have clients trying to connect and subscribe to a non-existent queue
   (e.g. using the attached test prog I run:

    for q in `seq 1 50`; do ./subscribe --queue "queue-$q" --url amqp:tcp:127.0.0.1:5672,tcp:127.0.0.1:5673,tcp:127.0.0.1:5674,tcp:127.0.0.1:5675 & done
   )

2. start 4 cluster nodes for the clients to connect to
   (e.g. 
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --auth no --cluster-name grs-mrg15-test-cluster --data-dir `pwd`/test-cluster-temp-data-$p --port $p --log-to-file test-cluster-qpidd.$p.log --log-to-syslog false --daemon; sleep 2; done
    )
3. check each node is alive
   (e.g.
   for p in 5672 5673 5674 5675; do /usr/sbin/qpidd --check --port $p; done
   tail test-cluster-qpidd.567*.log
   )
  
Actual results:

2 or 3 nodes fail with:

2009-jun-18 11:21:49 error Execution exception: not-found: Queue not found: queue-37 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-18 11:21:49 critical 100.0.50.15:14363(CATCHUP/error) Error 4009 did not occur on 100.0.50.15:14545
2009-jun-18 11:21:49 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-jun-18 11:21:49 notice 100.0.50.15:14363(LEFT/error) leaving cluster grs-mrg15-test-cluster
2009-jun-18 11:21:49 notice Shut down


Expected results:

All nodes remain up and running.

Additional info:

I had to put a short sleep between starting the nodes or the surviving nodes appear to get stuck trying to join.

Comment 1 Gordon Sim 2009-06-18 15:28:46 UTC
Created attachment 348500 [details]
test program

Comment 2 Gordon Sim 2009-06-18 16:11:00 UTC
In attached program, --retry-interval 0 --url amqp:tcp:127.0.0.1 makes it easier to reproduce on some boxes. Reproduced with qpidd-0.5.752581-18.el5 also.

Comment 3 Alan Conway 2009-06-18 21:05:12 UTC
Created attachment 348555 [details]
 test program  that closes connections

The original runs out of file descriptors if you're not sharp about starting the cluster.

Comment 4 Alan Conway 2009-06-18 21:25:42 UTC
Created attachment 348561 [details]
patch to fix the issue

Fix also commited to trun r786294

This fix includes XML changes so code needs to be re-generated.

Comment 5 Gordon Sim 2009-06-23 09:08:42 UTC
Fixed in qpidd-0.5.752581-20.

Comment 6 Gordon Sim 2009-06-23 10:05:47 UTC
Actually, I still see problems on qpidd-0.5.752581-20. When adding nodes while there are many clients attached (some of which are continually generating errors) the cluster can become unresponsive for long periods (indefinitely?). 


I also saw a node exit on one occasion with:

2009-jun-23 05:37:50 error Execution exception: not-found: Queue not found: queue-50 (qpid/broker/SessionAdapter.cpp:742)
2009-jun-23 05:37:50 error Execution exception: not-found: Unknown destination queue-4 (qpid/broker/SemanticState.cpp:458)
2009-jun-23 05:37:50 critical 100.0.50.15:20677(CATCHUP/error) Error 4723 did not occur on 100.0.50.15:20317
2009-jun-23 05:37:50 error Error delivering frames: Aborted by local failure that did not occur on all replicas

Comment 7 Gordon Sim 2009-06-23 10:25:47 UTC
Update on previous comment - cluster is not indefinitely unresponsive, it became usable for me in my latest test after 30 mins. If this is expected it should be noted in the documentation/release notes.

Comment 8 Alan Conway 2009-07-01 13:47:55 UTC
Fixed in r790163

If an offer is delivered while an error is being resolved, the offer is retracted and the newcomer tries again.

Comment 9 Alan Conway 2009-07-01 21:33:59 UTC
Created attachment 350203 [details]
Patch that really fixes the issue

Patch based on qpidd-0.5.752581-20 + patch from bug 508917 (apply that first)

Comment 10 Alan Conway 2009-07-01 21:59:26 UTC
Comitted to trunk r790397

Comment 11 Alan Conway 2009-07-02 16:10:42 UTC
Created attachment 350309 [details]
corrected patch, previous patch was missing new files.

Comment 12 Gordon Sim 2009-07-03 07:28:33 UTC
Fixed in qpidd-0.5.752581-22.

Comment 13 Gordon Sim 2009-07-03 09:31:21 UTC
I have reproduced a hung cluster on qpidd-0.5.752581-22 using the test for this BZ. The pstack output from all nodes shows nothing untoward, last log entry for each node is "2009-jul-03 05:14:55 notice Broker running" (i.e. no notice of joining cluster or of being the first member).

Comment 14 Alan Conway 2009-07-07 13:16:28 UTC
This hang may be https://bugzilla.redhat.com/show_bug.cgi?id=494393, I saw that several times while testing. If you can reproduce the hang with --log-enable=debug+:cluster you will see all members in the JOINER state.

Its quite rare and not an issue that would cause data loss in a running cluster so I think its reasonable to leave it for 1.2

Comment 15 Gordon Sim 2009-07-07 17:39:35 UTC
Believed fixed in qpidd-0.5.752581-22; issues seen likely to be over-aggressive test.

Comment 16 Frantisek Reznicek 2009-07-09 14:44:06 UTC
The issue has been fixed validated on RHEL 5.3 i386 / x86_64 on packages:

[root@mrg-qe-01 bz506758]# rpm -qa | grep -E '(qpid|rhm|openais)' | sort -u
openais-0.80.3-22.el5_3.8
openais-debuginfo-0.80.3-22.el5_3.8
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-25.el5
qpidc-debuginfo-0.5.752581-25.el5
qpidc-devel-0.5.752581-25.el5
qpidc-perftest-0.5.752581-25.el5
qpidc-rdma-0.5.752581-25.el5
qpidc-ssl-0.5.752581-25.el5
qpidd-0.5.752581-25.el5
qpidd-acl-0.5.752581-25.el5
qpidd-cluster-0.5.752581-25.el5
qpidd-devel-0.5.752581-25.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-25.el5
qpidd-ssl-0.5.752581-25.el5
qpidd-xml-0.5.752581-25.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-6.el5
rhm-docs-0.5.756148-1.el5


-> VERIFIED

Comment 18 errata-xmlrpc 2009-07-14 17:31:51 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html