Created attachment 436218 [details]
Description of problem:
Qpid broker with resized store (duplicated) with resize utility, fails to start again, with:
Daemon startup failed: Queue tx-test-1: recoverMessages() failed: jexception 0x0405 fcntl::add_rd_subm_cnt_dblks() threw JERR_FCNTL_RDOFFSOVFL: Attempted increase read offset past write offset. (pfid=1 lfid=1 rd_subm_cnt_dblks=4 incr=2560 wr_subm_cnt_dblks=4) (MessageStoreImpl.cpp:1018)
2010-08-02 04:42:16 critical Unexpected error: Removing stale lock file /root/.qpidd/qpidd.48667.pid
Version-Release number of selected component (if applicable):
Steps to Reproduce:
see attached shell script
Broker fails to start
Broker should start
Created attachment 436220 [details]
There are several issues uncovered from this test.
A. Bugs - two are present here, one uncovered by the other:
1. Resize failed when the message size > file size. It failed so as to
leave the store it was resizing partly written, and the only record
that had been written was being truncated.
2. The store, on recovery of this truncated record threw a
JERR_FCNTL_RDOFFSOVFL error. This is also a bug, and should not occur.
The store needs to be stable when recovering any partly written journal.
I'll open a separate bug for this when I isolate a reproducer.
B. Test script issues: Looking at the test script, there are several issues that need comment:
1. Current settings have the size of the records being stored > file size.
Given that only 4 files are in use and 10 records are being stored, this
represents a highly mismatched sizing scenario, and only one record will
fit into the journal before the enqueue threshold is hit while writing
the second record.
2. The current script does not change the size of the store, but resizes it
to the same size as it was. Clearly this scenario should also be tested,
but resizing to smaller or bigger sizes is desired here.
3. The return result from the resize operation is ignored. If resize returns
a non-zero result, then an error/exception occurred during resizing and
the test should skip to the end of the test cycle and move on to the next
4. The script kills the broker immediately after qpid-txtest ends. This does
not leave any time for the broker to flush before being killed, leaving
the end of the records that were written truncated. Given that only one
record fits (see point 1 above), this leaves the only record truncated,
and thus a store with no recoverable records.
5. If a test fails, save the store dir in question for analysis. The store
dir should have a bak direcory in it containing a copy of the original
store before it was resized.
C. Test program: qpid-txtest:
1. The program is now called qpid-txtest.
2. Note that qpid-txtest does not handle closed connections during
initialization phase of its test (this is what happens when the enqueue
threshold error is raised). ie while loading the initial messages.
If this happens, then txtest will report those messages that did not
get loaded as missing when the test is rerun (unless you adjust the
parameters to reflect only those messages that got into the store -
which is hard to do).
3. Given the above limitation, it might be better to run qpid-txtest in
three parts: first using only "--init yes" to initialize the test which
must succeed for the rest of the test to work. Then in a loop, run with
"--transfer yes" only, interrupting after a random time with a broker
kill. Then resize the store. Lastly, after recovery, run qpid-txtest
with "--check yes" to make sure no messages were lost.
4. Knowing the size and number of messages, one can get some idea of how
large the store needs to be to contain the messages. Remember, however,
that transactions take up extra space in the store for commit records
and filler records that result from flushing.
5. Resizing both up and down in size should be checked.
6. Attempting to resize to a size too small to contain the messages should
also be checked, making sure that it fails predictably.
Fixed in r.4196
Also included in this fix are new resize utility tests which run as part of make check.
QE - the comments above notwithstanding, it should be possible to rerun this scenario without a resize failure, and the broker should be able to recover the resulting journal.
VERIFIED on rhel 5.7 / 6.2 both i686 / x86_64 archs