Bug 681026 - Clustered broker exits with invalid-argument error
Summary: Clustered broker exits with invalid-argument error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.3
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: 2.0
: ---
Assignee: Alan Conway
QA Contact: Petr Matousek
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-02-28 20:53 UTC by Alan Conway
Modified: 2018-11-14 14:25 UTC (History)
5 users (show)

Fixed In Version: qpid-cpp-mrg-0.9.1079953-1
Doc Type: Bug Fix
Doc Text:
Cause: Running a cluster with persistent message store, durable messages and a management agent (e.g. qpid-tool, cumin) Consequence: occasionally a broker would shut down with an "invalid argument" message. Fix: Corrected issues in the store that caused inconsistencies in a cluster. Result: Brokers no longer shut down.
Clone Of:
Environment:
Last Closed: 2011-06-23 15:44:47 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Log files provided by the customer (8.10 MB, application/x-gzip)
2011-02-28 21:04 UTC, Alan Conway
no flags Details
Shell script to reproduce the issue. (861 bytes, application/x-sh)
2011-03-01 20:05 UTC, Alan Conway
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2011:0890 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging 2.0 Release 2011-06-23 15:42:41 UTC

Description Alan Conway 2011-02-28 20:53:19 UTC
Description of problem:

Customer is seeing a clustered broker exit with an invalid-argument error.

Version-Release number of selected component (if applicable):


How reproducible: unknown. Customer seems to see it reliably.


Steps to Reproduce: unknown
1.
2.
3.
  
Actual results: clustered broker exit with invalid-argument error


Expected results: no error.


Additional info:

Comment 1 Alan Conway 2011-02-28 21:04:11 UTC
Created attachment 481447 [details]
Log files provided by the customer

3 sets of logs:

mail1 only has normal logging so doesn't tell us much.

mail2 and mail3 have debug+:cluster and trace+:management. Run qpid/cpp/src/tests/cluster_test_logs.py in the directory with the logs and it will split them into corresponding sections, filter out expected differences and tell you where it finds unexpected differences.

mail2:  

Extra management events on qpidd-rh1.log, survivor

trace Changed V1 statistics com.redhat.rhm.store:journal:ASIC len=NN
debug SEND V1 Multicast ContentInd to=console.obj.1.0.com.redhat.rhm.store.journal props=0 stats=1 len=NN
trace Changed V1 statistics org.apache.qpid.broker:binding:org.apache.qpid.broker:exchange:qpid.management,org.apache.qpid.broker:queue:topic-frgpasic-vc1.ffx.jfh.com.au.12833.1,console.obj.*.*.com.redhat.rhm.store.# len=NN
trace Changed V1 statistics org.apache.qpid.broker:binding:org.apache.qpid.broker:exchange:qpid.management,org.apache.qpid.broker:queue:topic-frgpasic-vc1.ffx.jfh.com.au.12833.2,console.obj.*.*.com.redhat.rhm.store.# len=NN

These are due to changing the journal properties in a journal timer thread.

However they don't generate an extra SEND, so its not clear how this discrepancy causes the failure.

mail3:

1-3vs2-3:
- extra: debug SEND raiseEvent (v1) class=com.redhat.rhm.store.created
- no relation to update

2-2vs1-3:
- initial extra object: com.redhat.rhm.store:journal:TplStore
- extra: debug SEND raiseEvent (v1) class=com.redhat.rhm.store.created
- 2011-02-23 08:02:17 debug Suppressing management message to console.event.1.0.com.redhat.rhm.store.created

I can't see how we are getting the extra store.created event, as this is called from Queue::create which is always (I think) called in a cluster safe context.

I have tried to reproduce a crash like the customers on the -27 builds, no luck so far. I can easily reproduce the journal props discrepancy, but without triggering a failure.

Comment 2 Gordon Sim 2011-03-01 13:48:44 UTC
I think it is the lazy initialisation of the TPLStore that is the issue. I can reproduce the error fairly simply:

1. start on cluster node
2. run qpid-txtest against it
3. start a second cluster node
4. run qpid-printevents against the second node
5. re-run qpid-txtest
6. restart the second cluster node
7. repeat from 5. as necessary til you see the first node shutting down

There is no management object for the TplStore on the first broker as the management agent is not set when it is created (init() called before managementInit() in store plugin). However the second node gets a truncateInit() which happens after management is initialised and consequently that second node does have a TplStore management object (see via list journal on qpid-tool). The TplStore is initialised 'on demand', and it is on initialisation that the event is raised. This can therefore happen on one node and not the other.

Comment 3 Alan Conway 2011-03-01 20:05:40 UTC
Created attachment 481708 [details]
Shell script to reproduce the issue.

Attached script reproduces the problem. Run the script and note that output ends with "Broker 5555 down", and only one broker is running.

It should output "Brokers 5555 5556 alive" and both brokers should be running.

Comment 4 Alan Conway 2011-03-01 20:11:57 UTC
Fix for the above reproducer in the store repo, r4445.

Comment 5 Alan Conway 2011-03-01 22:03:30 UTC
The fix mentioned in comment 4 does address one of the issue in the customers logs, but there is a second issue so the fix is not complete.

Comment 6 Alan Conway 2011-03-04 19:02:36 UTC
Committed to store repo r4446. The complete fix is:

------------------------------------------------------------------------
r4446 | aconway | 2011-03-04 13:57:42 -0500 (Fri, 04 Mar 2011) | 6 lines

Disable management of store and journal in a cluster.

The store updates statistics and properties of store and journal
managed objects in timer threads, which can create inconsistencies in
a cluster.

------------------------------------------------------------------------
r4445 | aconway | 2011-03-01 15:08:48 -0500 (Tue, 01 Mar 2011) | 10 lines

Bug 681026 - Clustered broker exits with invalid-argument error.

In a normal start-up the TplStore had no management object becase it
was created before the management infrastructure. In a cluster
start-up when joining an existing cluster and truncating the store,
the TplStore was created after the management infrastructure and so
did have a management object. This could create a cluster
inconsistency. The fix is to never create a management object for the
TplStore.

Comment 7 Alan Conway 2011-03-11 22:02:32 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Running a cluster with persistent message store, durable messages and a management agent (e.g. qpid-tool, cumin)
Consequence: occasionally a broker would shut down with an "invalid argument" message.
Fix: Corrected issues in the store that caused inconsistencies in a cluster.
Result: Brokers no longer shut down.

Comment 11 Petr Matousek 2011-05-19 10:54:02 UTC
By testing this issue with the attached reproducer it seems to the error was fixed.

There is mentioned also second issue in Comment 5 and Comment 6. Is the issue covered by the reproducer as well? If not, please provide some description how to test it.

Comment 12 Alan Conway 2011-05-19 20:06:32 UTC
Both issues are covered by the same reproducer. The second issue was showing up less frequently than the first.

Comment 13 Petr Matousek 2011-05-20 09:32:31 UTC
This issue has been fixed.

Verified on RHEL5.6, RHEL6.1 architectures: i386, x86_64

Successfully tested with the attached reproducer in loop.
Test loops: 10000
Test duration: over 30 hours.
The invalid-argument error did not occur.

packages installed:
python-qpid-0.10-1.el5.noarch
python-qpid-qmf-0.10-6.el5.x86_64
qpid-cpp-client-0.10-6.el5.x86_64
qpid-cpp-client-devel-0.10-6.el5.x86_64
qpid-cpp-client-devel-docs-0.10-6.el5.x86_64
qpid-cpp-client-ssl-0.10-6.el5.x86_64
qpid-cpp-mrg-debuginfo-0.10-6.el5.x86_64
qpid-cpp-server-0.10-6.el5.x86_64
qpid-cpp-server-cluster-0.10-6.el5.x86_64
qpid-cpp-server-devel-0.10-6.el5.x86_64
qpid-cpp-server-ssl-0.10-6.el5.x86_64
qpid-cpp-server-store-0.10-6.el5.x86_64
qpid-cpp-server-xml-0.10-6.el5.x86_64
qpid-dotnet-0.4.738274-2.el5.x86_64
qpid-java-client-0.10-4.el5.noarch
qpid-java-common-0.10-4.el5.noarch
qpid-java-example-0.10-4.el5.noarch
qpid-qmf-0.10-6.el5.x86_64
qpid-qmf-debuginfo-0.10-6.el5.x86_64
qpid-qmf-devel-0.10-6.el5.x86_64
qpid-tools-0.10-4.el5.noarch

-> VERIFIED

Comment 16 errata-xmlrpc 2011-06-23 15:44:47 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2011-0890.html


Note You need to log in before you can comment on or make changes to this bug.