Description of problem:
I triggered two occurrences of txtest message loss during transaction integrity tests in about 20 runs, after not having seen it for quite a long period. This might indicate that a bug was introduced between revisions rhm-0.4.2970 ... rhm-0.4.3030.

occurrence 1] qpid_txtest_fails_bz458053 test
RHTS, RHEL4.7 i386 (message loss after qpidd recovery and txtest --check yes)
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5662704
http://rhts.redhat.com/testlogs/40386/137892/1158776/TESTOUT.log.gz line 2719
http://rhts.redhat.com/testlogs/40386/137892/1158776/qpidd_txtest.transcript.log.gz line 13709
http://rhts.redhat.com/testlogs/40386/137892/1158776/qpidd_journal_a0086.tar.bz2

occurrence 2] qpid_test_transaction_integrity test
RHTS, RHEL4.7 i386 (message loss after qpidd recovery and txtest --check yes)
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5644362
http://rhts.redhat.com/testlogs/40384/137888/1158651/TESTOUT.log.gz line 2557
http://rhts.redhat.com/testlogs/40384/137888/1158651/qpidd_txtest.transcript.log.gz line 11178
http://rhts.redhat.com/testlogs/40384/137888/1158651/qpidd_journal_b0007.tar.bz2

Version-Release number of selected component (if applicable):
qpidd-0.4.728142-1.el4
rhm-0.4.3030-1.el4

How reproducible:
rarely (<5%)

Steps to Reproduce:
1. Run the RHTS qpid_test_transaction_integrity or qpid_txtest_fails_bz458053 test on multiple machines and CPU architectures.
2. Repeat step 1 until the message loss is detected.

Actual results:
After the broker recovers, 'txtest --check yes' identifies one missing message in the message journal.

Expected results:
No missing message is expected!

Additional info:
Observed on RHEL4.7 i386 only.
Analysis of the journal shows that this is not RHEL4-specific; rather, it is a specific set of conditions in the journal itself that causes the problem. For this error to show up:
1. the first message to be recovered must cross a file boundary (i.e. part in one file, the remainder in the next);
2. all previous enqueues must be dequeued, making this message the first to be read on recovery.

The read manager jumps to the first file containing an enqueued record when starting the read process (to increase the speed of recovery - there is no need to go through lots of files which we know have no enqueued records in them). However, when a record crosses a file boundary, the logic incorrectly placed the count for that record against the second file, not the first. This has no effect on recovery if there are prior records to be read in the first file, but if this record is the first to be read, the read manager starts in the file containing the tail end of that record. As a result, the first record is not read.
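The failure mode described above can be illustrated with a small model. This is only a sketch, not the journal's actual code: the names Record, enqueue_counts, first_file_to_read and recover are hypothetical, the journal is treated as a simple linear list of files (the real one may differ), and a record is assumed to be readable only if the reader starts at or before the file holding its beginning.

```python
from dataclasses import dataclass


@dataclass
class Record:
    rid: int
    start_file: int          # file holding the beginning of the record
    end_file: int            # file holding the tail (differs if it crosses a boundary)
    dequeued: bool = False


def enqueue_counts(records, num_files, count_against_start):
    """Per-file count of still-enqueued records.

    count_against_start=False models the faulty behaviour: a record that
    crosses a file boundary is counted against the file where it ends.
    """
    counts = [0] * num_files
    for r in records:
        if r.dequeued:
            continue
        fid = r.start_file if count_against_start else r.end_file
        counts[fid] += 1
    return counts


def first_file_to_read(counts):
    """Index of the first file with a non-zero enqueue count, or None."""
    for fid, c in enumerate(counts):
        if c > 0:
            return fid
    return None


def recover(records, num_files, count_against_start):
    """Return the ids of records the reader would see on recovery."""
    first = first_file_to_read(
        enqueue_counts(records, num_files, count_against_start))
    if first is None:
        return []
    # Assumption: a record whose beginning lies before the starting file
    # cannot be read, because the reader only sees its tail.
    return [r.rid for r in records
            if not r.dequeued and r.start_file >= first]


# Record 1 is dequeued; record 2 spans files 0 and 1 and is the first
# record to be recovered; record 3 lies entirely in file 1.
journal = [Record(1, 0, 0, dequeued=True), Record(2, 0, 1), Record(3, 1, 1)]

print(recover(journal, 3, count_against_start=False))  # [3]    record 2 is lost
print(recover(journal, 3, count_against_start=True))   # [2, 3] record 2 is read
```

If record 1 were still enqueued, both variants would start reading in file 0 and return all three records, which matches the observation that the miscount is harmless whenever a prior record remains in the first file.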
Fixed in r.3038.

QA: This will be difficult to reproduce. The specific conditions outlined above do not occur frequently (although using a small journal and large messages will increase the probability somewhat). If the diff for r.3038 is applied to r.3030 (the version of the journal that the test was done with), then recovering the error journal and running the check portion of txtest will work OK. Unfortunately, this journal cannot be recovered against a trunk store, because the hash algorithm for distributing the queues across the various directories in rhm/jrnl (i.e. across dirs 0000 - 00c0) has changed to give a more even distribution.
A batch of long-term journal testing (total test time of about 4 days) on various supported platforms (RHEL 5.2 / 4.7, i386 / x86_64) shows that the message loss problem has been fixed. Any future occurrence of message loss will move this bug back to ASSIGNED. ->VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0035.html