Bug 446096
| Summary: | flow to disk not working with journal | | |
|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | Gordon Sim <gsim> |
| Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Kim van der Riet <kim.vdriet> |
| Severity: | high | Priority: | urgent |
| Version: | beta | Doc Type: | Bug Fix |
| Hardware: | All | OS: | Linux |
| Last Closed: | 2008-12-02 16:05:56 UTC | | |
Description

Gordon Sim 2008-05-12 18:25:17 UTC
Also tried using the producer/consumer tests: added messages to a queue beyond the limit, then tried to consume them, and got:

```
Exception: Exception: Queue my-queue: loadContent() failed: jexception 0x0901 rmgr::read() threw JERR_RMGR_RIDMISMATCH: RID mismatch between current record and dtok RID (rid=0x0000000000000003; dtok_rid=0000000000000065; dtok_id=0x00000000) (BdbMessageStore.cpp:996)
```

The core dump is solved with r2038. The other two issues will need further testing.

Another test case:

1. Start the broker.
2. Set up a queue, e.g. with the queue-config tool in python/commands (and PYTHONPATH set):
   `../python/commands/qpid-config -s ../specs/amqp.0-10.xml add queue my-queue --durable --max-queue-count 5`
3. Publish more messages than the maximum, e.g. `./src/tests/publish --routing-key my-queue --count 6`
4. Consume all the messages, e.g. `./src/tests/consume --count 6 --queue my-queue`

With --store-sync no this works; with the async store, the broker kills the connection as soon as it tries to deliver the first message that has been flushed to disk, e.g.:

```
Received: msg1
Received: msg2
Received: msg3
Received: msg4
Received: msg5
Exception: Connection closed by broker
Closed:
```

(With the sync store all six messages are received, msg1 through msg6.)

This appears to be a result of:

```
jexception 0x0901 rmgr::read() threw JERR_RMGR_RIDMISMATCH: RID mismatch between current record and dtok RID (rid=0x0000000000000001; dtok_rid=0000000000000006; dtok_id=0x00000000) (BdbMessageStore.cpp:1045)
```

One further update: this only affects durable messages (non-durable messages are written to bdbstore). Critical to place in release notes for B4.

The journal's read interface will always return the first non-dequeued (and unread) message in the journal. In addition, if the record id (rid) is set (!= 0) in the data token when the read is made, the journal assumes that this is the expected rid and checks that the record it finds has this rid.
If not, the JERR_RMGR_RIDMISMATCH exception is thrown. If the rid is set to 0, the journal sets the rid in the data token to whatever rid it reads.

In the scenario in comment #3 above, the publisher enqueues 6 records, the last of which must be retrieved from disk because of queue memory limitations. However, the messages are not dequeued as they are read. So msg1 to msg5 are read from memory, but when the broker calls down to the journal to read, the first message the journal returns is msg1, not msg6 - hence the JERR_RMGR_RIDMISMATCH error. Running the test so that each message is dequeued on read works:

```
./src/tests/consume --count 6 --queue my-queue --ack-frequency 0
```

To fix this, the logic in JournalImpl was changed to set the rid to 0 on read and then compare the rid returned with the one being sought. If the rid read is lower, the read is repeated until the correct rid is found. This has the effect of "consuming" the enqueued but unwanted records from the journal before they are dequeued. Fixed in r2059.

qpidc-0.2.656926-1.el5, qpidd-0.2.656926-1.el5, and rhm-0.2.2058-1.el5 have been pushed to the staging repo for testing.

Kim, I'm not sure I fully understood comment #7.
Is this expected behavior?

```
# ./src/tests/perftest --queue-max-count 10 --durable yes --npubs 2 --nsubs 1 --count 10000 --size 10 --log-enable info+
2008-jun-03 05:59:14 info Connecting to localhost:5672
2008-jun-03 05:59:14 warning SASL: No Authentication Performed
2008-jun-03 05:59:14 notice Journal "perftest0": Instantiation
2008-jun-03 05:59:15 info Connecting to localhost:5672
2008-jun-03 05:59:15 warning SASL: No Authentication Performed
2008-jun-03 05:59:15 info Connecting to localhost:5672
2008-jun-03 05:59:15 warning SASL: No Authentication Performed
2008-jun-03 05:59:15 info Connecting to localhost:5672
2008-jun-03 05:59:15 warning SASL: No Authentication Performed
2008-jun-03 05:59:15 info Connecting to localhost:5672
2008-jun-03 05:59:15 warning SASL: No Authentication Performed
Processing 1 messages from sub_ready . done.
Sending start 2 times to pub_start
Processing 2 messages from pub_done .2008-jun-03 05:59:15 info Queue size exceeded policy for perftest0
. done.
Processing 1 messages from sub_done .2008-jun-03 05:59:16 warning Broker closed connection: 541, Exception: Queue perftest0: loadContent() failed: jexception 0x0000 JournalImpl::loadMsgContent() threw <Unknown error code> (read_data_record() was unable to find rid 230; last rid found was 295) (BdbMessageStore.cpp:1045)
SubscribeThread exception: Closed:
```

This may be a more complex scenario of BZ448935, "Lazy-load with journal will fail for redelivered messages". Although messages are not redelivered in this case, the underlying issue is the same: the journal cannot retrace its read pointers and is essentially a read-once mechanism. Thus, if two clients on two threads try to read at the same time and one of them gets in first and reads later messages, the second client subsequently cannot read earlier ones (hence the "unable to find rid yyy; last rid found was xxx" message, where yyy < xxx).
BZ448935 provides a fix for this by invalidating the read manager when this scenario arises, allowing it to resynchronize itself and reset its read pointers. This is a slow and inefficient way to solve the problem, but it should be borne in mind that flow-to-disk scenarios are already "bad" situations in which memory limits in the broker/queue have been exceeded and, as a stop-gap measure, messages are being read from disk before they are dispatched - an overhead that will almost certainly hit performance hard. A broker that has been properly set up for its anticipated load should not encounter memory shortages or limits, but as always, the ability to handle them should they unexpectedly occur is required.

I will leave this in state NEEDINFO for now, to give time to see whether it can be reproduced with the fix for BZ448935 in place.