Red Hat Bugzilla – Bug 479714
qpidd+store message loss (detected by txtest --check yes after qpidd recovery)
Last modified: 2015-11-15 19:06:32 EST
Description of problem:
I triggered two occurrences of txtest message loss in about 20 runs of the transaction integrity tests, after not seeing it at all for quite a long period.
This might indicate that a bug was introduced between revisions rhm-0.4.2970 and rhm-0.4.3030.
qpid_txtest_fails_bz458053 test RHTS, RHEL4.7 i386 (message loss after qpidd recovery and txtest --check yes)
http://rhts.redhat.com/testlogs/40386/137892/1158776/TESTOUT.log.gz line 2719
http://rhts.redhat.com/testlogs/40386/137892/1158776/qpidd_txtest.transcript.log.gz line 13709
qpid_test_transaction_integrity test RHTS, RHEL4.7 i386 (message loss after qpidd recovery and txtest --check yes)
http://rhts.redhat.com/testlogs/40384/137888/1158651/TESTOUT.log.gz line 2557
http://rhts.redhat.com/testlogs/40384/137888/1158651/qpidd_txtest.transcript.log.gz line 11178
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Run the RHTS qpid_test_transaction_integrity or qpid_txtest_fails_bz458053 test
on multiple machines and CPU architectures.
2. Repeat step 1 until the message loss is detected.
After the broker recovers, 'txtest --check yes' identifies one missing message in the message journal.
No missing message is expected!
Observed on RHEL4.7 i386 only.
Analysis of the journal shows that this is not RHEL4-specific; rather, it is a specific set of conditions in the journal itself that causes the problem. For this error to show:
1. the first message to be recovered must cross a file boundary (i.e. part in one file, the remainder in the next);
2. all previous enqueues must be dequeued, making this message the first to be read on recovery.
The read manager will jump to the first file containing an enqueued record when starting the read process (to increase the speed of recovery: there is no need to go through lots of files which we know have no enqueued records in them). However, when a record crossed a file boundary, the logic incorrectly counted that record against the second file rather than the first. This has no effect on recovery if there are prior records to be read in the first file, but if this record is the first to be read, the read manager will start in the file containing the tail end of that record. As a result, the first record is not read.
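The failure mode described above can be illustrated with a small model. This is only a sketch under stated assumptions, not the actual journal code: the names Record, enqueued_counts and recover are hypothetical, the journal is treated as a simple linear list of files rather than a circular one, and a record is assumed to be readable only if the read starts at or before the file holding its head.

```python
from dataclasses import dataclass


@dataclass
class Record:
    rid: int
    first_file: int        # file holding the head of the record
    last_file: int         # file holding the tail (differs if it crosses a boundary)
    dequeued: bool = False


def enqueued_counts(records, num_files, count_against_last_file):
    """Per-file count of still-enqueued records (hypothetical model)."""
    counts = [0] * num_files
    for r in records:
        if r.dequeued:
            continue
        # Buggy behaviour: count a boundary-crossing record against its second file.
        fid = r.last_file if count_against_last_file else r.first_file
        counts[fid] += 1
    return counts


def recover(records, num_files, count_against_last_file):
    """Jump to the first file with a non-zero count and read from there."""
    counts = enqueued_counts(records, num_files, count_against_last_file)
    start = next((i for i, c in enumerate(counts) if c > 0), None)
    if start is None:
        return []
    # Assumption: a record whose head lies before the start file is never read.
    return [r.rid for r in records if not r.dequeued and r.first_file >= start]


# Record 1 is dequeued, record 2 crosses from file 0 into file 1,
# record 3 lies entirely in file 1.
journal = [
    Record(1, 0, 0, dequeued=True),
    Record(2, 0, 1),
    Record(3, 1, 1),
]

lost_case = recover(journal, 2, count_against_last_file=True)
fixed_case = recover(journal, 2, count_against_last_file=False)
print("buggy counting recovers:", lost_case)
print("fixed counting recovers:", fixed_case)
```

In this model, counting against the second file makes the read start in file 1, so record 2 is skipped; counting against the first file recovers both records. If record 1 were still enqueued, both variants would start in file 0 and recover everything, which matches the observation that the bug only shows when all earlier enqueues have been dequeued.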
Fixed in r.3038.
QA: This will be difficult to reproduce. The specific conditions outlined above do not occur frequently (although using a small journal and large messages will increase the probability somewhat). If the diff for r.3038 is applied to r.3030 (the version of the journal that the test was done with), then recovering the error journal and running the check portion of txtest will work ok.
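To give a rough feel for why a small journal and large messages raise the odds, the following sketch counts how many fixed-size records straddle a file boundary when packed back to back. It is a simplification with hypothetical names and sizes: it ignores file headers, record headers and padding, and it only addresses the first condition (a record crossing a file boundary), not the requirement that all earlier enqueues be dequeued.

```python
def count_straddling(file_size, record_size, num_records):
    """Count records that span two files when packed back to back.

    Hypothetical model: no file or record headers, no padding.
    """
    straddling = 0
    for i in range(num_records):
        start = i * record_size
        end = start + record_size - 1   # offset of the record's last byte
        if start // file_size != end // file_size:
            straddling += 1
    return straddling


# Same total volume (30000 bytes) written into files of 1000 bytes each.
small_msgs = count_straddling(file_size=1000, record_size=30, num_records=1000)
large_msgs = count_straddling(file_size=1000, record_size=300, num_records=100)
print("small messages straddling:", small_msgs, "of 1000")
print("large messages straddling:", large_msgs, "of 100")
```

For the same data volume, a much larger fraction of the large records cross a file boundary, so each individual record is more likely to meet the first condition.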
Unfortunately, this journal cannot be recovered against a trunk store, as the hash algorithm for distributing the queues across the various directories in rhm/jrnl (i.e. across dirs 0000 - 00c0) has changed to give a more even distribution.
Long-term journal testing (total test time of about 4 days) on various supported platforms (RHEL 5.2 / 4.7, i386 / x86_64) shows that the message loss problem has been fixed.
Any future occurrence of message loss will result in this bug being moved back to ASSIGNED.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.