Bug 592999

Summary: clustered persistant broker leaves cluster with 'critical Unexpected error: Cluster-ID mismatch. Stores belong to different clusters.' when cluster nodes started from clean store
Product: Red Hat Enterprise MRG Reporter: Frantisek Reznicek <freznice>
Component: qpid-cppAssignee: Alan Conway <aconway>
Status: CLOSED ERRATA QA Contact: Frantisek Reznicek <freznice>
Severity: high Docs Contact:
Priority: high    
Version: DevelopmentCC: aconway, esammons, gsim
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2010-10-20 11:28:42 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 494393    

Description Frantisek Reznicek 2010-05-17 15:10:23 UTC
Description of problem:

There are multiple correlated observations which prove that
starting multiple brokers at once leads to issue:
  2010-05-17 10:35:49 debug cluster(10.34.45.2:6880 INIT) initial status map complete.
  2010-05-17 10:35:49 debug Exception constructed: Cluster-ID mismatch. Stores belong to different clusters.
  2010-05-17 10:35:49 critical Unexpected error: Cluster-ID mismatch. Stores belong to different clusters.

even when all brokers are initialy started on clean store (on non existent data-dir).

This behavior is most probably bug or at least ugliness which needs to be documented.


I triggered this while trying to probe bug 494393 is fixed, so this bug/behavior is blocking bug 494393 verification.



Version-Release number of selected component (if applicable):
openais-0.80.6-16.el5_5.1
openais-debuginfo-0.80.6-16.el5_5.1
openais-devel-0.80.6-16.el5_5.1
python-qmf-0.7.934605-1.el5
python-qpid-0.7.934605-1.el5
qmf-0.7.935473-1.el5
qmf-devel-0.7.935473-1.el5
qpid-cpp-client-0.7.935473-1.el5
qpid-cpp-client-devel-0.7.935473-1.el5
qpid-cpp-client-devel-docs-0.7.935473-1.el5
qpid-cpp-client-ssl-0.7.935473-1.el5
qpid-cpp-mrg-debuginfo-0.7.935473-1.el5
qpid-cpp-server-0.7.935473-1.el5
qpid-cpp-server-cluster-0.7.935473-1.el5
qpid-cpp-server-devel-0.7.935473-1.el5
qpid-cpp-server-ssl-0.7.935473-1.el5
qpid-cpp-server-store-0.7.935473-1.el5
qpid-cpp-server-xml-0.7.935473-1.el5
qpid-java-client-0.7.934605-1.el5
qpid-java-common-0.7.934605-1.el5
qpid-tests-0.7.934605-1.el5
qpid-tools-0.7.934605-2.el5
ruby-qpid-0.7.935473-1.el5


How reproducible:
100%

Steps to Reproduce:
0. install qpidd & openais
1. configure openais
2. run  ./looper.sh
3. wait for error
  
Actual results:
Cluster of persistant brokers does not always start up in expected cluster width.

Expected results:
Cluster of persistant brokers should always start up in expected cluster width.

Additional info:

RHEL 5.5 i386 case:
~~~~~~~~~~~~~~~~~~
[root@mrg-qe-01 bz494393]# ./looper.sh
8:0;4:0;8:0;8:2;
Last log:
aisexec (pid 9750) is running...
client[s] ready
starting brokers in the cluster:........done
A:ERROR:broker[s] not running! [exp. width:8] (ports:53526 39256 50745 ,#:3, pids:12968 12974 12964 ,#:3)
tcp        0      0 0.0.0.0:53526               0.0.0.0:*                   LISTEN      12968/qpidd
tcp        0      0 0.0.0.0:39256               0.0.0.0:*                   LISTEN      12974/qpidd
tcp        0      0 0.0.0.0:50745               0.0.0.0:*                   LISTEN      12964/qpidd
root     12964  0.5  0.1 143464 11232 pts/2    Sl+  10:44   0:00 qpidd -p 0 --auth no --cluster-name mrg-qe-01.lab.eng.brq.redhat.com_cluster_bz494393 --log-enable trace+ --data-dir data_0_4
root     12968  0.8  0.1 165944 12060 pts/2    Sl+  10:44   0:00 qpidd -p 0 --auth no --cluster-name mrg-qe-01.lab.eng.brq.redhat.com_cluster_bz494393 --log-enable trace+ --data-dir data_0_5
root     12974  0.6  0.1 142716 11708 pts/2    Sl+  10:44   0:00 qpidd -p 0 --auth no --cluster-name mrg-qe-01.lab.eng.brq.redhat.com_cluster_bz494393 --log-enable trace+ --data-dir data_0_7
test-core:1883.42       1510.96 2896.68 2.82879
  Cluster Name: mrg-qe-01.lab.eng.brq.redhat.com_cluster_bz494393
Cluster Status: ACTIVE
  Cluster Size: 3
       Members: ID=10.34.45.1:12964 URL=amqp:tcp:10.34.33.54:50745,tcp:10.34.45.1:50745
              : ID=10.34.45.1:12968 URL=amqp:tcp:10.34.33.54:53526,tcp:10.34.45.1:53526
              : ID=10.34.45.1:12974 URL=amqp:tcp:10.34.33.54:39256,tcp:10.34.45.1:39256
.ecodes:00
B:ERROR:broker[s] not running! [exp. width:8] (ports:53526 39256 50745 ,#:3, pids:12968 12974 12964 ,#:3)
stopping brokers...
Test Summary:
TEST FAILED !!! (2 failures)
err_cnt:1; dur:36 sec[s]

RHEL 5.5 x86_64 case:
~~~~~~~~~~~~~~~~~~~~
[root@mrg-qe-02 bz494393]# ./looper.sh
9:0;8:0;5:0;7:0;8:0;9:0;4:0;4:0;9:0;7:0;7:0;8:0;7:0;9:0;6:0;5:0;8:0;6:0;9:0;9:0;4:0;8:0;8:0;8:0;9:2;
Last log:
aisexec (pid 871) is running...
client[s] ready
starting brokers in the cluster:.........done
A:ERROR:broker[s] not running! (ports:34248 ,#:1, pids:6868 ,#:1)
tcp        0      0 0.0.0.0:34248               0.0.0.0:*                   LISTEN      6868/qpidd
root      6868  0.5  0.1 353908 12424 pts/2    Sl+  10:35   0:00 qpidd -p 0 --auth no --cluster-name mrg-qe-02.lab.eng.brq.redhat.com_cluster_bz494393 --log-enable trace+ --data-dir data_0_4
test-core:4241.08       2435.35 4978.36 4.86168
B:ERROR:broker[s] not running! (ports:34248 ,#:1, pids:6868 ,#:1)
stopping brokers...
Test Summary:
TEST FAILED !!! (2 failures)
err_cnt:1; dur:161 sec[s]

Comment 2 Frantisek Reznicek 2010-05-17 15:28:24 UTC
The issue is seen on both RHEL 5.5 i386 / x86_64. The test runs multiple
brokers on one machine with separate logging and data-dirs.

To be more clear on reproducibility, I was able to trigger the issue while
running attached reproducer within few iterations (5-25 loops).

Comment 3 Alan Conway 2010-05-25 18:16:16 UTC
Fixed on trunk r948143 and on 1.3 release branch:
http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commitdiff;h=87a548b4b5e525bd887fd486469e63ed9bd79c03

Comment 4 Frantisek Reznicek 2010-06-02 14:12:54 UTC
The issue has been fixed, verified in long test run (few hundreds of cluster restarts, qpidd min/max logging, various cluster widths) on RHEL 5.5 i386 / x86_64 using packages:
openais-0.80.6-16.el5_5.1
openais-debuginfo-0.80.6-16.el5_5.1
openais-devel-0.80.6-16.el5_5.1
python-qpid-0.7.946106-1.el5
qpid-cpp-client-0.7.946106-2.el5
qpid-cpp-client-devel-0.7.946106-2.el5
qpid-cpp-client-devel-docs-0.7.946106-2.el5
qpid-cpp-client-ssl-0.7.946106-2.el5
qpid-cpp-mrg-debuginfo-0.7.946106-2.el5
qpid-cpp-server-0.7.946106-2.el5
qpid-cpp-server-cluster-0.7.946106-2.el5
qpid-cpp-server-devel-0.7.946106-2.el5
qpid-cpp-server-ssl-0.7.946106-2.el5
qpid-cpp-server-store-0.7.946106-2.el5
qpid-cpp-server-xml-0.7.946106-2.el5
qpid-java-client-0.7.946106-3.el5
qpid-java-common-0.7.946106-3.el5
qpid-tests-0.7.946106-1.el5
qpid-tools-0.7.946106-4.el5
ruby-qpid-0.7.946106-1.el5

-> VERIFIED