+++ This bug was initially created as a clone of Bug #458053 +++

Using store r2258 from release branch (and r680695 from qpid.0-10).

1. Start broker
2. run txtest (e.g. txtest --messages-per-tx 100 --tx-count 100000 --total-messages 10000 --size 64 --queues 4)
3. after some time, 'kill -9 <broker-pid>'
4. remove lock and restart broker
5. run check phase (e.g. txtest --messages-per-tx 100 --tx-count 100000 --total-messages 10000 --size 64 --queues 4 --check yes --init no --transfer no)

Expect all messages to be present. Sometimes messages are reported as missing, sometimes the following error occurs instead:

Queue tx-test-2: async_dequeue() failed: jexception 0x0b01 txn_map::get_tdata_list() threw JERR_MAP_NOTFOUND: Key not found in map. (xid=) (BdbMessageStore.cpp:1246)

--- Additional comment from kim.vdriet on 2008-08-12 11:35:15 EDT ---

Several problems were responsible for this error:

1. The 1PC transactions were not being handled atomically across multiple queues. This was fixed by keeping 1PC txns in the prepared list and modifying the txn recovery logic to handle 1PC txns.
2. Message recovery did not correctly predict the outcome of messages which needed to be rolled forward/back because of incomplete multi-queue commits/aborts. The message recovery logic was reworked to extract the information it needs to make this determination from the journal enqueue and transaction maps. Some new accessors were added to class jcntl to allow for these operations.

In addition, some bugs were found in the python journal file checker jfile_chk.py, which was used to analyze the journal files. These were fixed, and a new -a flag now performs transactional analysis on the journal and reports open transactions and locked records.

Fixed in r.2279.

--- Additional comment from kim.vdriet on 2008-08-12 13:58:01 EDT ---

After some testing, there are still occasional cases of lost messages when txtest is run in test mode. Reassigning.

--- Additional comment from kim.vdriet on 2008-08-13 15:29:09 EDT ---

Additional cases for prepared but not completed transactions found; also for non-prepared transactions which were not being correctly aborted at journal level. r.2297
RHTS test developed (MRG/qpid_txtest_fails_bz458053). Test results coming soon.
RHTS test MRG/qpid_txtest_fails_bz458053 proved that this issue is no longer present. (See RHTS jobs 28113 and 28114 for details.)
After a few more automated tests (MRG/qpid_txtest_fails_bz458053 and MRG/qpid_test_transaction_integrity) there is still less than percent of failing cases. The MRG/qpid_test_transaction_integrity test shows it at http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4116052. Please find the test case log attached (on the RHEL5 clone, i.e. bz458053). Moving VERIFIED to FAILS_QA. P.S. The latest test case code can be found here: http://cvs.devel.redhat.com/cgi-bin/cvsweb.cgi/tests/distribution/MRG_Messaging/
Fixed, see https://bugzilla.redhat.com/show_bug.cgi?id=458053
RHTS automated tests (MRG/qpid_txtest_fails_bz458053 and MRG/qpid_test_transaction_integrity) now prove that the issue is gone. (->VERIFIED)
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2008-0867.html