Using store r2258 from release branch (and r680695 from qpid.0-10).
1. Start the broker.
2. Run txtest (e.g. txtest --messages-per-tx 100 --tx-count 100000 --total-messages 10000 --size 64 --queues 4).
3. After some time, 'kill -9 <broker-pid>'.
4. Remove the lock file and restart the broker.
5. Run the check phase (e.g. txtest --messages-per-tx 100 --tx-count 100000 --total-messages 10000 --size 64 --queues 4 --check yes --init no --transfer no).
All messages are expected to be present. Sometimes messages are reported as missing; at other times the following error occurs instead:
Queue tx-test-2: async_dequeue() failed: jexception 0x0b01 txn_map::get_tdata_list() threw JERR_MAP_NOTFOUND: Key not found in map. (xid=) (BdbMessageStore.cpp:1246)
Several problems were responsible for this error:
1. The 1PC transactions were not being handled atomically across multiple queues. This was fixed by keeping 1PC txns in the prepared list and modifying the txn recovery logic to handle 1PC txns.
2. Message recovery did not correctly predict the outcome of messages which needed to be rolled forward/back because of incomplete multi-queue commits/aborts. The message recovery logic was reworked to extract the information it needs to make this determination from the journal enqueue and transaction maps. Some new accessors were added to class jcntl to allow for these operations.
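The atomicity rule in point 1 can be sketched with a small illustration (Python; the map layout and the name recover_txn_outcome are hypothetical, not the actual store recovery code): if any queue of a 1PC transaction reached its commit record before the crash, the whole transaction must be rolled forward on every queue it touched; otherwise it is rolled back.

```python
# Hypothetical sketch of the recovery decision for in-doubt 1PC txns;
# this is NOT the actual BdbMessageStore/journal code, only an
# illustration of the atomicity rule described above.

def recover_txn_outcome(xid, txn_map):
    """Decide the fate of a transaction found during recovery.

    txn_map maps an xid to the per-queue operations recorded in the
    journal; each operation notes whether its commit record was seen.
    """
    ops = txn_map.get(xid)
    if ops is None:
        # No open txn data for this xid: it completed before the crash.
        return "complete"
    if any(op["committed"] for op in ops):
        # A 1PC txn is atomic across queues: if any queue saw the
        # commit record, it is rolled forward on ALL queues it touched.
        return "commit"
    # No commit record on any queue: roll the whole txn back.
    return "abort"
```

The key point is the any() test: a commit record on a single queue is enough to roll the transaction forward everywhere, which is what the original per-queue handling failed to guarantee.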
In addition, some bugs were found in the python journal file checker jfile_chk.py which was used to analyze the journal files. These were fixed, and a new -a flag now performs transactional analysis on the journal and reports open transactions and locked records.
Fixed in r.2279.
After some testing, there are still occasional cases of lost messages when txtest is run in test mode. Reassigning.
Additional cases were found for prepared but not completed transactions, and also for non-prepared transactions which were not being correctly aborted at the journal level.
RHTS test developed (MRG/qpid_txtest_fails_bz458053).
Test results coming soon.
RHTS test MRG/qpid_txtest_fails_bz458053 proved that this issue is no longer present. (See RHTS jobs 28113 and 28114 for details.)
After a few more automated tests (MRG/qpid_txtest_fails_bz458053 and MRG/qpid_test_transaction_integrity), there is still less than one percent of cases failing.
The MRG/qpid_test_transaction_integrity test shows this at http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4116052.
Please find test case log attached.
Moving from VERIFIED to FAILS_QA.
P.S. Latest test case code can be found here:
Created attachment 315728 [details]
Failing MRG/qpid_test_transaction_integrity test log (also on http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4116052)
FYI: to get the journal data tarball use e.g.: 'uudecode -o fjournal.tar.bz2 4116052-test_log--distribution-MRG_Messaging-qpid_test_transaction_integrity.log'
(uudecode is in the sharutils package)
It seems that the attached failing journal is a good one. After re-running the tests I got these 3 failing cases:
RHEL 5.2 x86_64
RHEL 4.7 x86_64
RHEL 4.6 i386
After taking a close look at these failures, it was found that the errors are probabilistic, i.e. there is a small but finite chance of failure which does not depend on the database/journal itself. All of the attached failure journals pass the txtest check when rerun, but if this check is run repeatedly, a failure will eventually occur.
After a difficult investigation, it was found that a bug exists in the read pipeline which is triggered by a particular case of out-of-order returns from libaio to the read page cache. The sequence that triggers the bug is as follows:
1. Page n in the read cache is consumed and must be updated with new disk data. A read is sent through libaio to get the new data.
2. All the following pages (from n+1 to the last page and from 0 to n-1) are consumed and similarly refreshed.
3. All of the libaio read operations for the pages following page n (from n+1 to the last page and from 0 to n-1) return, but that of page n is still outstanding - even though it was the first to be sent.
When these conditions are met, then:
a) consumption of data from the page cache stops at the end of page n-1 because page n is still waiting for libaio to return with its data, and a RHM_IORES_PAGE_AIOWAIT is returned for all read attempts.
b) when the new data for page n-1 returns, the journal read logic incorrectly believes that the journal is fully read and returns an RHM_IORES_EMPTY code. This results in the recovery process stopping before all the available messages are recovered.
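The failure mode in (a)/(b) can be reproduced with a toy model of the circular read page cache (Python; the ReadCache class and all names are hypothetical, and it deliberately encodes the *faulty* end-of-journal check rather than the fixed logic):

```python
# Toy model of the journal read page cache described above. This is an
# illustration only, not the real journal code: buggy_empty_check()
# reproduces the incorrect logic that mistakes an outstanding libaio
# read for end-of-journal.

NUM_PAGES = 4

class ReadCache:
    def __init__(self):
        # Per-page state: 'ready' (data in cache) or 'empty'
        # (consumed, refresh read outstanding with libaio).
        self.state = ["ready"] * NUM_PAGES
        self.consume = 0  # index of the next page to consume

    def consume_page(self):
        if self.state[self.consume] != "ready":
            return "RHM_IORES_PAGE_AIOWAIT"
        # Page consumed; a libaio read is issued to refresh it.
        self.state[self.consume] = "empty"
        self.consume = (self.consume + 1) % NUM_PAGES
        return "OK"

    def aio_return(self, page):
        # A libaio completion delivers fresh data for this page.
        self.state[page] = "ready"

    def buggy_empty_check(self):
        # Faulty logic: if the page just before the consume point is
        # ready while the consume-point page is not, conclude the
        # journal is fully read -- ignoring that page n's libaio read
        # is merely still outstanding.
        prev = (self.consume - 1) % NUM_PAGES
        return self.state[prev] == "ready" and self.state[self.consume] != "ready"
```

With four pages, consuming all of them and then letting every libaio completion except page 0's arrive leaves the consumer blocked on page 0 with RHM_IORES_PAGE_AIOWAIT, while the buggy check wrongly reports the journal empty (RHM_IORES_EMPTY).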
This is a probabilistic read error, and is triggered purely by the libaio order of return. There is no fault in the journal files. This bug would affect all versions of RHEL, and any recovery. The fact that it was uncovered by txtest (a transactional integrity test) is coincidental, and has nothing to do with transactions per se. In addition, the probability of failure has much to do with the disk subsystem and drivers in use. Fast disks / controllers with efficient drivers will present a lower probability of failure than slower and inefficient systems.
1.0 branch: r.2496
RHTS automated tests (MRG/qpid_txtest_fails_bz458053 and
MRG/qpid_test_transaction_integrity) now prove that the issue is gone. (-> VERIFIED)
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.