Bug 520384

Summary: clustered qpidd+store: startup failure error Unexpected exception: Queue perf_16kkx0: create() failed: jexception 0x0400 fcntl::open_fh() threw JERR_FCNTL_OPENWR: Unable to open file for write.
Product: Red Hat Enterprise MRG Reporter: Frantisek Reznicek <freznice>
Component: qpid-cppAssignee: messaging-bugs <messaging-bugs>
Status: CLOSED NOTABUG QA Contact: MRG Quality Engineering <mrgqe-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 1.1.6CC: esammons
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-08-31 11:35:53 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Frantisek Reznicek 2009-08-31 09:23:08 UTC
Description of problem:

The clustered qpidd broker with all plugins enabled (incl. rhm) sometimes fail to start.

Used testing scenario:
Test launches concurrent perftests on the clustered qpidd. (Perftest is modified the way that queue can be specified, so perftests are tx/rx from different exchanges/queues with different parameters durable/non-durable different message counts and sizes)
There are always few batches of perftests. Every one running 2-7 perftests in parallel against randomly selected qpidd cluster nodes. After every batch of the perftest there is replication check which makes sure that all queues are allocated in every cluster node and have the same properties.

Moreover after few of batches there is change of cluster configuration done, for instance: 0->1->2->1->0 ...


The issue I can see is that during change of the cluster configuration:
0->4->6
     ^ I can see (only) one node (of 2 joining) refuses to join cluster with following exception:

2009-aug-27 16:23:24 notice Journal "perf_16kkx0": Created
2009-aug-27 16:23:24 error Unexpected exception: Queue perf_16kkx0: create() failed: jexception 0x0400 fcntl::open_fh() threw JERR_FCNTL_OPENWR:
Unable to open file for write. (pfid=4 lfid=4 file="/root/MRG/Messaging/qpid_ptest_cluster_perftest/rhts_qpidd/20090827_160032/qpidd.2480/rhm/jrn
l/0005/perf_16kkx0//JournalData.0004.jdat" errno=2 (No such file or directory)) (MessageStoreImpl.cpp:456)
2009-aug-27 16:23:24 error Connection 10.34.33.54:53676 closed by error: Queue perf_16kkx0: create() failed: jexception 0x0400 fcntl::open_fh() t
hrew JERR_FCNTL_OPENWR: Unable to open file for write. (pfid=4 lfid=4 file="/root/MRG/Messaging/qpid_ptest_cluster_perftest/rhts_qpidd/20090827_1
60032/qpidd.2480/rhm/jrnl/0005/perf_16kkx0//JournalData.0004.jdat" errno=2 (No such file or directory)) (MessageStoreImpl.cpp:456)(501)
2009-aug-27 16:23:24 error Channel exception: not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:40)
2009-aug-27 16:23:24 critical 10.34.33.54:2485(UPDATEE) catch-up connection closed prematurely 10.34.33.54:2485-1(local,catchup)
2009-aug-27 16:23:24 notice 10.34.33.54:2485(LEFT) leaving cluster cl_mrg-qe-01.lab.eng.brq.redhat.com_V.MRG.M.0043_1/3_w4


I can observe this regularly on RHEL 5.3 i386/x86_64.
I minimized chance hitting max. number of files by 'ulimit -n 32768'.
The qpidd parameters:
-p 8669 --auth no --log-enable info+ --cluster-name cl_
mrg-qe-01.lab.eng.brq.redhat.com_V.MRG.M.0043_1/3_w4 --num-jfiles=64 --tpl-num-jfiles=64 --data-dir /root/MRG/Messaging/qpid_ptest_cluster_perfte
st/rhts_qpidd/20090827_160032/qpidd.2719

So it might be connected with '--num-jfiles=64 --tpl-num-jfiles=64'.


Version-Release number of selected component (if applicable):
1.1.6 set:
openais-0.80.3-22.el5_3.9
openais-debuginfo-0.80.3-22.el5_3.9
openais-devel-0.80.3-22.el5_3.9
python-qpid-0.5.752581-3.el5
qpidc-0.5.752581-26.el5
qpidc-debuginfo-0.5.752581-26.el5
qpidc-devel-0.5.752581-26.el5
qpidc-perftest-0.5.752581-26.el5
qpidc-rdma-0.5.752581-26.el5
qpidc-ssl-0.5.752581-26.el5
qpidd-0.5.752581-26.el5
qpidd-acl-0.5.752581-26.el5
qpidd-cluster-0.5.752581-26.el5
qpidd-devel-0.5.752581-26.el5
qpid-dotnet-0.4.738274-2.el5
qpidd-rdma-0.5.752581-26.el5
qpidd-ssl-0.5.752581-26.el5
qpidd-xml-0.5.752581-26.el5
qpid-java-client-0.5.751061-8.el5
qpid-java-common-0.5.751061-8.el5
rhm-0.5.3206-9.el5
rhm-docs-0.5.756148-1.el5
libaio-0.3.106-3.2


How reproducible:
99.9%

Steps to Reproduce:
1. start 4 node cluster
2. run couple of modified perftests on different nodes
3. try to join 2 new nodes to the qpidd cluster
  
Actual results:
One of two joining qpidd cluster nodes exits with msgstore exception.

Expected results:
Both two joining nodes should join cluster well.

Additional info:

The data for analysis including msg store journals are stored here:
mrg-qe-02.lab.eng.brq.redhat.com:/root/MRG/Messaging/qpid_ptest_cluster_perftest_rhel53x_fail090831.tar.bz2
mrg-qe-01.lab.eng.brq.redhat.com:/root/MRG/Messaging/qpid_ptest_cluster_perftest_rhel53i_fail090831.tar.bz2

As the starting point unpack the archive and search for the exception in qpidd_*.log files (should be qpidd_1_4_ab1.log). Then you can check the journals attached and rought test log qpid_ptest_cluster_perftest.log


[root@mrg-qe-02 qpid_ptest_cluster_perftest]# tail -50 qpidd_1_4_ab1.log
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_K6nXRa36Ot6Zxu1TJGVsRa6w9S4hQFRhKQ9bn5aIjWOZc85JkK7HhuGKd0; exchange=; binding-key=perf_K6nXRa36Ot6Zxu1TJGVsRa6w9S4hQFRhKQ9bn5aIjWOZc85JkK7HhuGKd0; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=perf_K6nXRa36Ot6Zxu1TJGVsRa6w9S4hQFRhKQ9bn5aIjWOZc85JkK7HhuGKd0; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=perf_SzClXnBuA0hIliqwIYDoRZ7EsvxmeEc1zXYDBOlsjyBfx0; alternate-exchange=; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_SzClXnBuA0hIliqwIYDoRZ7EsvxmeEc1zXYDBOlsjyBfx0; exchange=qpid.cluster-update; binding-key=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_SzClXnBuA0hIliqwIYDoRZ7EsvxmeEc1zXYDBOlsjyBfx0; exchange=; binding-key=perf_SzClXnBuA0hIliqwIYDoRZ7EsvxmeEc1zXYDBOlsjyBfx0; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=perf_SzClXnBuA0hIliqwIYDoRZ7EsvxmeEc1zXYDBOlsjyBfx0; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=perf_hA4yDDeWHf83hNWf14wv82OiV4Ad2ix6aad8KUeg0; alternate-exchange=; durable=1; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_hA4yDDeWHf83hNWf14wv82OiV4Ad2ix6aad8KUeg0; exchange=qpid.cluster-update; binding-key=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_hA4yDDeWHf83hNWf14wv82OiV4Ad2ix6aad8KUeg0; exchange=; binding-key=perf_hA4yDDeWHf83hNWf14wv82OiV4Ad2ix6aad8KUeg0; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=perf_hA4yDDeWHf83hNWf14wv82OiV4Ad2ix6aad8KUeg0; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=perf_lsNAgUlSS7jSjWcqZfFYSSZGxuO90YhNnNXcTz90; alternate-exchange=; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_lsNAgUlSS7jSjWcqZfFYSSZGxuO90YhNnNXcTz90; exchange=qpid.cluster-update; binding-key=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_lsNAgUlSS7jSjWcqZfFYSSZGxuO90YhNnNXcTz90; exchange=; binding-key=perf_lsNAgUlSS7jSjWcqZfFYSSZGxuO90YhNnNXcTz90; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=perf_lsNAgUlSS7jSjWcqZfFYSSZGxuO90YhNnNXcTz90; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=perf_lsP3alKUICPY38gbYq0MonbA2NtyQQBus6qvo6EQo7bQ0; alternate-exchange=; durable=1; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_lsP3alKUICPY38gbYq0MonbA2NtyQQBus6qvo6EQo7bQ0; exchange=qpid.cluster-update; binding-key=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_lsP3alKUICPY38gbYq0MonbA2NtyQQBus6qvo6EQo7bQ0; exchange=; binding-key=perf_lsP3alKUICPY38gbYq0MonbA2NtyQQBus6qvo6EQo7bQ0; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=perf_lsP3alKUICPY38gbYq0MonbA2NtyQQBus6qvo6EQo7bQ0; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=perf_xsUpTicOOEWK507eN880A4pDlhtmFDwlBZ0av07HGOdPcTThyPmw0; alternate-exchange=; durable=1; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_xsUpTicOOEWK507eN880A4pDlhtmFDwlBZ0av07HGOdPcTThyPmw0; exchange=qpid.cluster-update; binding-key=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=perf_xsUpTicOOEWK507eN880A4pDlhtmFDwlBZ0av07HGOdPcTThyPmw0; exchange=; binding-key=perf_xsUpTicOOEWK507eN880A4pDlhtmFDwlBZ0av07HGOdPcTThyPmw0; arguments={qpid.max_count:F4:int32(0),qpid.max_size:F4:int32(0)}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=perf_xsUpTicOOEWK507eN880A4pDlhtmFDwlBZ0av07HGOdPcTThyPmw0; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=pub_done; alternate-exchange=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=pub_done; exchange=qpid.cluster-update; binding-key=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=pub_done; exchange=; binding-key=pub_done; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=pub_done; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=pub_start; alternate-exchange=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=pub_start; exchange=qpid.cluster-update; binding-key=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=pub_start; exchange=; binding-key=pub_start; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=pub_start; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=sub_done; alternate-exchange=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=sub_done; exchange=qpid.cluster-update; binding-key=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=sub_done; exchange=; binding-key=sub_done; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=sub_done; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=sub_ready; alternate-exchange=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=sub_ready; exchange=qpid.cluster-update; binding-key=; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeBindBody: queue=sub_ready; exchange=; binding-key=sub_ready; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExchangeUnbindBody: queue=sub_ready; exchange=qpid.cluster-update; binding-key=; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {QueueDeclareBody: queue=qpid.cluster-update; alternate-exchange=; auto-delete=1; arguments={}; }]
2009-aug-31 09:36:02 trace 10.34.33.55:7495(UPDATEE) RECV 10.34.33.55:7495-1(local,catchup): Frame[BEbe; channel=1; {ExecutionSyncBody: }]
2009-aug-31 09:36:02 trace SENT [10.34.33.55:57275]: Frame[BEbe; channel=1; {SessionTimeoutBody: timeout=0; }]
2009-aug-31 09:36:02 trace SENT [10.34.33.55:57275]: Frame[BEbe; channel=0; {ConnectionCloseBody: reply-code=501; reply-text=Queue perf_GpYAAcW5qNB2XbP3HaDJ4ykVNUjuevp3913LjuIrj7QLkWqBV9qjRmr7fpXlsuoz0: create() failed: jexception 0x0400 fcntl::open_fh() threw JERR_FCNTL_OPENWR: Unable to open file for write. (pfid=12 lfid=12 file="/root/MRG/Messaging/qpid_ptest_cluster_perftest/rhts_qpidd/20090831_091240/qpidd.7490/rhm/jrnl/000c/perf_GpYAAcW5qNB2XbP3HaDJ4ykVNUjuevp3913LjuIrj7QLkWqBV9qjRmr7fpXlsuoz0//JournalData.000c.jdat" errno=2 (No such file or directory)) (MessageStoreImpl.cpp:456); }]
2009-aug-31 09:36:02 critical 10.34.33.55:7495(UPDATEE) catch-up connection closed prematurely 10.34.33.55:7495-1(local,catchup)
2009-aug-31 09:36:02 notice 10.34.33.55:7495(LEFT) leaving cluster cl_mrg-qe-02.lab.eng.brq.redhat.com_V.MRG.M.0043_1/3_w4
2009-aug-31 09:36:02 debug 10.34.33.55:7495(LEFT) deleted connection: 10.34.33.55:7495-1(local,catchup)
2009-aug-31 09:36:02 debug Shutting down CPG
2009-aug-31 09:36:02 notice Shut down
2009-aug-31 09:36:02 debug Journal "perf_GpYAAcW5qNB2XbP3HaDJ4ykVNUjuevp3913LjuIrj7QLkWqBV9qjRmr7fpXlsuoz0": Destroyed
2009-aug-31 09:36:02 debug Journal "perf_BRgLEy0": Destroyed
2009-aug-31 09:36:02 debug Journal "TplStore": Destroyed

Comment 1 Frantisek Reznicek 2009-08-31 09:36:29 UTC
Still under investigation in QE, it might be bug in the test. Back with the result soon.

Comment 2 Frantisek Reznicek 2009-08-31 11:35:53 UTC
The bug found in the qpidd running bash function, so not real qpidd msgstore issue. 

-> CLOSED (NOTABUG)