Bug 620676
Summary: | Store resize operation fails with large messages (greater than store file size) | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | ppecka <ppecka> | ||||||
Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet> | ||||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | ppecka <ppecka> | ||||||
Severity: | medium | Docs Contact: | |||||||
Priority: | low | ||||||||
Version: | Development | CC: | gsim, jross | ||||||
Target Milestone: | 1.3 | ||||||||
Target Release: | --- | ||||||||
Hardware: | All | ||||||||
OS: | Linux | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | |||||||||
: | 622889 (view as bug list) | Environment: | |||||||
Last Closed: | 2011-07-05 14:18:32 UTC | Type: | --- | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 622889 | ||||||||
Attachments: |
|
Created attachment 436220 [details]
test "log"
There are several issues uncovered from this test. A. Bugs - two are present here, one uncovered by the other: 1. Resize failed when the message size > file size. It failed so as to leave the store it was resizing partly written, and the only record that had been written was being truncated. 2. The store, on recovery of this truncated record threw a JERR_FCNTL_RDOFFSOVFL error. This is also a bug, and should not occur. The store needs to be stable when recovering any partly written journal. I'll open a separate bug for this when I isolate a reproducer. B. Test script issues: Looking at the test script, there are several issues that need comment: 1. Current settings have the size of the records being stored > file size. Given that only 4 files are in use and 10 records are being stored, this represents a highly mismatched sizing scenario, and only one record will fit into the journal before the enqueue threshold is hit while writing the second record. 2. The current script does not change the size of the store, but resizes it to the same size as it was. Clearly this scenario should also be tested, but resizing to smaller or bigger sizes is desired here. 3. The return result from the resize operation is ignored. If resize returns a non-zero result, then an error/exception occurred during resizing and the test should skip to the end of the test cycle and move on to the next test. 4. The script kills the broker immediately after qpid-txtest ends. This does not leave any time for the broker to flush before being killed, leaving the end of the records that were written truncated. Given that only one record fits (see point 1 above), this leaves the only record truncated, and thus a store with no recoverable records. 5. If a test fails, save the store dir in question for analysis. The store dir should have a bak direcory in it containing a copy of the original store before it was resized. C. Test program: qpid-txtest: 1. The program is now called qpid-txtest. 2. Note that qpid-txtest does not handle closed connections during initialization phase of its test (this is what happens when the enqueue threshold error is raised). ie while loading the initial messages. If this happens, then txtest will report those messages that did not get loaded as missing when the test is rerun (unless you adjust the parameters to reflect only those messages that got into the store - which is hard to do). 3. Given the above limitation, it might be better to run qpid-txtest in three parts: first using only "--init yes" to initialize the test which must succeed for the rest of the test to work. Then in a loop, run with "--transfer yes" only, interrupting after a random time with a broker kill. Then resize the store. Lastly, after recovery, run qpid-txtest with "--check yes" to make sure no messages were lost. 4. Knowing the size and number of messages, one can get some idea of how large the store needs to be to contain the messages. Remember, however, that transactions take up extra space in the store for commit records and filler records that result from flushing. 5. Resizing both up and down in size should be checked. 6. Attempting to resize to a size too small to contain the messages should also be checked, making sure that it fails predictably. Fixed in r.4196 Also included in this fix are new resize utility tests which run as part of make check. QE - the comments above notwithstanding, it should be possible to rerun this scenario without a resize failure, and the broker should be able to recover the resulting journal. VERIFIED on rhel 5.7 / 6.2 both i686 / x86_64 archs package set: python-qpid-0.10-1.el6.noarch python-qpid-qmf-0.10-10.el6.x86_64 qpid-cpp-client-0.10-6.el6.x86_64 qpid-cpp-client-devel-0.10-6.el6.x86_64 qpid-cpp-client-devel-docs-0.10-6.el6.noarch qpid-cpp-client-rdma-0.10-6.el6.x86_64 qpid-cpp-client-ssl-0.10-6.el6.x86_64 qpid-cpp-server-0.10-6.el6.x86_64 qpid-cpp-server-cluster-0.10-6.el6.x86_64 qpid-cpp-server-devel-0.10-6.el6.x86_64 qpid-cpp-server-rdma-0.10-6.el6.x86_64 qpid-cpp-server-ssl-0.10-6.el6.x86_64 qpid-cpp-server-store-0.10-6.el6.x86_64 qpid-cpp-server-xml-0.10-6.el6.x86_64 qpid-java-client-0.10-6.el6.noarch qpid-java-common-0.10-6.el6.noarch qpid-java-example-0.10-6.el6.noarch qpid-java-jca-0.10-6.el6.noarch qpid-qmf-0.10-10.el6.x86_64 qpid-qmf-devel-0.10-10.el6.x86_64 qpid-tests-0.10-1.el6.noarch qpid-tools-0.10-4.el6.noarch rh-qpid-cpp-tests-0.10-6.el6.x86_64 --> VERIFIED ruby-qpid-qmf-0.10-10.el6.x86_64 |
Created attachment 436218 [details] reproducer Description of problem: Qpid broker with resized store (duplicated) with resize utility, fails to start again, with: Daemon startup failed: Queue tx-test-1: recoverMessages() failed: jexception 0x0405 fcntl::add_rd_subm_cnt_dblks() threw JERR_FCNTL_RDOFFSOVFL: Attempted increase read offset past write offset. (pfid=1 lfid=1 rd_subm_cnt_dblks=4 incr=2560 wr_subm_cnt_dblks=4) (MessageStoreImpl.cpp:1018) 2010-08-02 04:42:16 critical Unexpected error: Removing stale lock file /root/.qpidd/qpidd.48667.pid Version-Release number of selected component (if applicable): python-qpid-0.7.946106-4.el5 qpid-cpp-client-0.7.946106-8.el5 qpid-cpp-client-devel-0.7.946106-8.el5 qpid-cpp-client-devel-docs-0.7.946106-8.el5 qpid-cpp-client-ssl-0.7.946106-8.el5 qpid-cpp-mrg-debuginfo-0.7.946106-8.el5 qpid-cpp-server-0.7.946106-8.el5 qpid-cpp-server-cluster-0.7.946106-8.el5 qpid-cpp-server-devel-0.7.946106-8.el5 qpid-cpp-server-ssl-0.7.946106-8.el5 qpid-cpp-server-store-0.7.946106-8.el5 qpid-cpp-server-xml-0.7.946106-8.el5 qpid-java-client-0.7.946106-5.el5 qpid-java-common-0.7.946106-5.el5 qpid-tests-0.7.946106-1.el5 qpid-tools-0.7.946106-6.el5 How reproducible: often Steps to Reproduce: see attached shell script Actual results: Broker fails to start Expected results: Broker should start Additional info: