+++ This bug was initially created as a clone of Bug #620676 +++ Created an attachment (id=436218) reproducer Description of problem: Qpid broker with resized store (duplicated) with resize utility, fails to start again, with: Daemon startup failed: Queue tx-test-1: recoverMessages() failed: jexception 0x0405 fcntl::add_rd_subm_cnt_dblks() threw JERR_FCNTL_RDOFFSOVFL: Attempted increase read offset past write offset. (pfid=1 lfid=1 rd_subm_cnt_dblks=4 incr=2560 wr_subm_cnt_dblks=4) (MessageStoreImpl.cpp:1018) 2010-08-02 04:42:16 critical Unexpected error: Removing stale lock file /root/.qpidd/qpidd.48667.pid Version-Release number of selected component (if applicable): python-qpid-0.7.946106-4.el5 qpid-cpp-client-0.7.946106-8.el5 qpid-cpp-client-devel-0.7.946106-8.el5 qpid-cpp-client-devel-docs-0.7.946106-8.el5 qpid-cpp-client-ssl-0.7.946106-8.el5 qpid-cpp-mrg-debuginfo-0.7.946106-8.el5 qpid-cpp-server-0.7.946106-8.el5 qpid-cpp-server-cluster-0.7.946106-8.el5 qpid-cpp-server-devel-0.7.946106-8.el5 qpid-cpp-server-ssl-0.7.946106-8.el5 qpid-cpp-server-store-0.7.946106-8.el5 qpid-cpp-server-xml-0.7.946106-8.el5 qpid-java-client-0.7.946106-5.el5 qpid-java-common-0.7.946106-5.el5 qpid-tests-0.7.946106-1.el5 qpid-tools-0.7.946106-6.el5 How reproducible: often Steps to Reproduce: see attached shell script Actual results: Broker fails to start Expected results: Broker should start Additional info: --- Additional comment from ppecka on 2010-08-03 05:08:10 EDT --- Created an attachment (id=436220) test "log" --- Additional comment from kim.vdriet on 2010-08-05 09:32:14 EDT --- There are several issues uncovered from this test. A. Bugs - two are present here, one uncovered by the other: 1. Resize failed when the message size > file size. It failed so as to leave the store it was resizing partly written, and the only record that had been written was being truncated. 2. The store, on recovery of this truncated record threw a JERR_FCNTL_RDOFFSOVFL error. This is also a bug, and should not occur. The store needs to be stable when recovering any partly written journal. I'll open a separate bug for this when I isolate a reproducer. B. Test script issues: Looking at the test script, there are several issues that need comment: 1. Current settings have the size of the records being stored > file size. Given that only 4 files are in use and 10 records are being stored, this represents a highly mismatched sizing scenario, and only one record will fit into the journal before the enqueue threshold is hit while writing the second record. 2. The current script does not change the size of the store, but resizes it to the same size as it was. Clearly this scenario should also be tested, but resizing to smaller or bigger sizes is desired here. 3. The return result from the resize operation is ignored. If resize returns a non-zero result, then an error/exception occurred during resizing and the test should skip to the end of the test cycle and move on to the next test. 4. The script kills the broker immediately after qpid-txtest ends. This does not leave any time for the broker to flush before being killed, leaving the end of the records that were written truncated. Given that only one record fits (see point 1 above), this leaves the only record truncated, and thus a store with no recoverable records. 5. If a test fails, save the store dir in question for analysis. The store dir should have a bak direcory in it containing a copy of the original store before it was resized. C. Test program: qpid-txtest: 1. The program is now called qpid-txtest. 2. Note that qpid-txtest does not handle closed connections during initialization phase of its test (this is what happens when the enqueue threshold error is raised). ie while loading the initial messages. If this happens, then txtest will report those messages that did not get loaded as missing when the test is rerun (unless you adjust the parameters to reflect only those messages that got into the store - which is hard to do). 3. Given the above limitation, it might be better to run qpid-txtest in three parts: first using only "--init yes" to initialize the test which must succeed for the rest of the test to work. Then in a loop, run with "--transfer yes" only, interrupting after a random time with a broker kill. Then resize the store. Lastly, after recovery, run qpid-txtest with "--check yes" to make sure no messages were lost. 4. Knowing the size and number of messages, one can get some idea of how large the store needs to be to contain the messages. Remember, however, that transactions take up extra space in the store for commit records and filler records that result from flushing. 5. Resizing both up and down in size should be checked. 6. Attempting to resize to a size too small to contain the messages should also be checked, making sure that it fails predictably.
This bug is to deal with the broker recovery failure (JERR_FCNTL_RDOFFSOVFL) resulting from attempting to restore the partly written journal in the scenario above.
Fixed in r.4199 This represents a corner case previously undiscovered for recovery of corrupted journals. In this case, a large record which spans more than one file was incompletely written in the second file. When the read pipeline is validated, it reads the file header from the first non-empty file (or last file written if there are no records to read), and sets the write pointer at the first available record in the file which is located at the position indicated in the fro (first record offset) field in the header. This places it after the point where the last record was corrupted. The write pointers, however, were (correctly) set at the start of the file. Since the read pointers are constrained by the write pointers, this operation caused a JERR_FCNTL_RDOFFSOVFL error. The error was fixed by placing an additional check in the file header read logic which processes the initial read pointer locations to ensure that the value set is constrained by the value of the write pointers. QE: You will need a saved part-written journal from Bug #620676 to test this. Make sure you save the complete store. Also, do not attempt to recover your copy directly, as the store will overwrite the store when it attempts to recover it. Use a fresh copy of the part-written store each time you test this. If you cannot get a store that fails correctly, contact me, I have a "working" copy that you can use.
VERIFIED on RHEL 5.6 / 6.2 - (i686/x86_64) rpm -qa | grep qpid | sort -u python-qpid-0.10-1.el6.noarch python-qpid-qmf-0.10-10.el6.x86_64 qpid-cpp-client-0.10-6.el6.x86_64 qpid-cpp-client-devel-0.10-6.el6.x86_64 qpid-cpp-client-devel-docs-0.10-6.el6.noarch qpid-cpp-client-rdma-0.10-6.el6.x86_64 qpid-cpp-client-ssl-0.10-6.el6.x86_64 qpid-cpp-server-0.10-6.el6.x86_64 qpid-cpp-server-cluster-0.10-6.el6.x86_64 qpid-cpp-server-devel-0.10-6.el6.x86_64 qpid-cpp-server-rdma-0.10-6.el6.x86_64 qpid-cpp-server-ssl-0.10-6.el6.x86_64 qpid-cpp-server-store-0.10-6.el6.x86_64 qpid-cpp-server-xml-0.10-6.el6.x86_64 qpid-java-client-0.10-6.el6.noarch qpid-java-common-0.10-6.el6.noarch qpid-java-example-0.10-6.el6.noarch qpid-java-jca-0.10-6.el6.noarch qpid-qmf-0.10-10.el6.x86_64 qpid-qmf-devel-0.10-10.el6.x86_64 qpid-tests-0.10-1.el6.noarch qpid-tools-0.10-4.el6.noarch rh-qpid-cpp-tests-0.10-6.el6.x86_64 --> VERIFIED