I've been running the broker like this:

  ./src/qpidd --load-module ~/work/bdbstore/cpp/lib/.libs/libbdbstore.so --data-dir temp_store --jfile-size-pgs 50 --num-jfiles 16 --auth no --log-enable info+

then the perftest client like this:

  ./src/tests/perftest --queue-max-count 10 --durable yes --npubs 2 --nsubs 1 --count 10000 --size 10 --log-enable info+

The test fails because the broker closes the connection. Generally, stopping the broker then results in a core dump due to a bad free (see stack trace below). Once it closed cleanly but then errored on recovery:

  Exception: Queue perftest0: recoverMessages() failed: jexception 0x0900 rmgr::read() threw JERR_RMGR_UNKNOWNMAGIC: Found record with unknown magic. (Magic=0x00000000) (BdbMessageStore.cpp:706)

Stack trace:

  #0  0x45c0c7e4 in _int_free () from /lib/libc.so.6
  #1  0x45c103ef in free () from /lib/libc.so.6
  #2  0x007b7f41 in ~JournalImpl (this=0x9b6d668) at JournalImpl.cpp:108
  #3  0x00355bc9 in ~Queue (this=0x9b6cea0) at qpid/broker/PersistableQueue.h:59
  #4  0x0039a9d8 in boost::detail::sp_counted_impl_p<qpid::broker::Queue>::dispose (this=0x9b6d378) at /usr/include/boost/checked_delete.hpp:34
  #5  0x0039b564 in std::_Rb_tree<std::string, std::pair<std::string const, boost::shared_ptr<qpid::broker::Queue> >, std::_Select1st<std::pair<std::string const, boost::shared_ptr<qpid::broker::Queue> > >, std::less<std::string>, std::allocator<std::pair<std::string const, boost::shared_ptr<qpid::broker::Queue> > > >::_M_erase (this=0x9b22f60, __x=0x9b6d390) at /usr/include/boost/detail/sp_counted_base_gcc_x86.hpp:145
  #6  0x0039a1bf in ~QueueRegistry (this=0x9b22f60) at /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../include/c++/4.1.1/bits/stl_tree.h:578
  #7  0x003425a0 in ~Broker (this=0x9b22ed0) at qpid/broker/Broker.cpp:295
  #8  0x0804ddd9 in __tcf_1 () at /usr/include/boost/detail/sp_counted_base_gcc_x86.hpp:145
  #9  0x45bd398e in exit () from /lib/libc.so.6
  #10 0x45bbe72c in __libc_start_main () from /lib/libc.so.6
  #11 0x0804db41 in _start ()
Also tried using producer/consumer: added messages to the queue beyond the limit, then tried to consume them and got:

  Exception: Exception: Queue my-queue: loadContent() failed: jexception 0x0901 rmgr::read() threw JERR_RMGR_RIDMISMATCH: RID mismatch between current record and dtok RID (rid=0x0000000000000003; dtok_rid=0000000000000065; dtok_id=0x00000000) (BdbMessageStore.cpp:996)
The core dump was fixed in r2038. The other two issues will need further testing.
Another test case:

(1) Start the broker.
(2) Set up the queue, e.g. with the qpid-config tool in python/commands (and with PYTHONPATH set):
      ../python/commands/qpid-config -s ../specs/amqp.0-10.xml add queue my-queue --durable --max-queue-count 5
(3) Publish more messages than the maximum, e.g.:
      ./src/tests/publish --routing-key my-queue --count 6
(4) Consume all the messages, e.g.:
      ./src/tests/consume --count 6 --queue my-queue

With --store-sync no this works; with the async store the broker kills the connection as soon as it tries to deliver the first message that has been flushed to disk, e.g.:

  Received: msg1
  Received: msg2
  Received: msg3
  Received: msg4
  Received: msg5
  Exception: Connection closed by broker
  Closed:

(With the sync store we get:

  Received: msg1
  Received: msg2
  Received: msg3
  Received: msg4
  Received: msg5
  Received: msg6
)
This appears to be a result of:

  jexception 0x0901 rmgr::read() threw JERR_RMGR_RIDMISMATCH: RID mismatch between current record and dtok RID (rid=0x0000000000000001; dtok_rid=0000000000000006; dtok_id=0x00000000) (BdbMessageStore.cpp:1045)
One further update: this only affects durable messages (non-durable messages are not written to bdbstore).
Critical to place in release notes for B4.
The journal's read interface always returns the first non-dequeued (and unread) record in the journal. In addition, if the record id (rid) is set (!= 0) in the data token when the read is made, the journal assumes this is the expected rid and checks that the record it finds carries that rid; if not, the JERR_RMGR_RIDMISMATCH exception is thrown. If the rid is set to 0, the journal instead sets the rid in the data token to whatever rid it read.

In the scenario in comment #3 above, the publisher enqueues 6 records, the last of which must be retrieved from disk because of the queue's memory limit. However, the messages are not dequeued as they are read. So msg1 to msg5 are read from memory, but when the broker calls down to the journal to read, the first record the journal returns is msg1, not msg6 - hence the JERR_RMGR_RIDMISMATCH error. Running the test so that each message is dequeued on read works:

  ./src/tests/consume --count 6 --queue my-queue --ack-frequency 0

To fix this, the logic in JournalImpl was changed to set the rid to 0 on read and then compare the rid returned with the one being sought. If the rid read is lower, the read is repeated until the correct rid is found. This has the effect of "consuming" the enqueued but unwanted records from the journal before they are dequeued.
Fixed in r2059
qpidc-0.2.656926-1.el5, qpidd-0.2.656926-1.el5, and rhm-0.2.2058-1.el5 have been pushed to the staging repo for testing
Kim, not sure I fully understood comment #7. Is this expected behavior:

  # ./src/tests/perftest --queue-max-count 10 --durable yes --npubs 2 --nsubs 1 --count 10000 --size 10 --log-enable info+
  2008-jun-03 05:59:14 info Connecting to localhost:5672
  2008-jun-03 05:59:14 warning SASL: No Authentication Performed
  2008-jun-03 05:59:14 notice Journal "perftest0": Instantiation
  2008-jun-03 05:59:15 info Connecting to localhost:5672
  2008-jun-03 05:59:15 warning SASL: No Authentication Performed
  2008-jun-03 05:59:15 info Connecting to localhost:5672
  2008-jun-03 05:59:15 warning SASL: No Authentication Performed
  2008-jun-03 05:59:15 info Connecting to localhost:5672
  2008-jun-03 05:59:15 warning SASL: No Authentication Performed
  2008-jun-03 05:59:15 info Connecting to localhost:5672
  2008-jun-03 05:59:15 warning SASL: No Authentication Performed
  Processing 1 messages from sub_ready . done.
  Sending start 2 times to pub_start
  Processing 2 messages from pub_done .2008-jun-03 05:59:15 info Queue size exceeded policy for perftest0
  . done.
  Processing 1 messages from sub_done .2008-jun-03 05:59:16 warning Broker closed connection: 541, Exception: Queue perftest0: loadContent() failed: jexception 0x0000 JournalImpl::loadMsgContent() threw <Unknown error code> (read_data_record() was unable to find rid 230; last rid found was 295) (BdbMessageStore.cpp:1045)
  SubscribeThread exception: Closed:
This may be a more complex scenario of BZ448935, "Lazy-load with journal will fail for redelivered messages". Although messages are not redelivered in this case, the underlying issue is the same: the journal cannot retrace its read pointers and is essentially a read-once mechanism. Thus if two clients on two threads try to read at the same time, and one of them gets in first and reads later messages, the second client subsequently cannot read earlier ones (hence the "unable to find rid yyy; last rid found was xxx" message, where yyy < xxx).

BZ448935 fixes this by invalidating the read manager when this scenario arises, allowing it to resynchronize and reset its read pointers. This is a slow and inefficient way to solve the problem, but bear in mind that flow-to-disk scenarios are already "bad" situations in which the broker/queue memory limits have been exceeded and, as a stop-gap measure, messages are read from disk before they are dispatched - an overhead that will almost certainly hit performance hard. A broker that has been properly sized for its anticipated load should not encounter memory shortages or limits, but as always, the ability to handle them should they unexpectedly occur is required.

I will leave this in state NEEDINFO for now to give time to see whether it can be reproduced with the fix for BZ448935 in place.