Bug 620676

Summary: Store resize operation fails with large messages (greater than store file size)
Product: Red Hat Enterprise MRG Reporter: ppecka <ppecka>
Component: qpid-cppAssignee: Kim van der Riet <kim.vdriet>
Status: CLOSED CURRENTRELEASE QA Contact: ppecka <ppecka>
Severity: medium Docs Contact:
Priority: low    
Version: DevelopmentCC: gsim, jross
Target Milestone: 1.3   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 622889 (view as bug list) Environment:
Last Closed: 2011-07-05 14:18:32 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 622889    
Attachments:
Description Flags
reproducer
none
test "log" none

Description ppecka 2010-08-03 09:07:20 UTC
Created attachment 436218 [details]
reproducer

Description of problem:
Qpid broker with resized store (duplicated) with resize utility, fails to start again, with:
 
Daemon startup failed: Queue tx-test-1: recoverMessages() failed: jexception 0x0405 fcntl::add_rd_subm_cnt_dblks() threw JERR_FCNTL_RDOFFSOVFL: Attempted increase read offset past write offset. (pfid=1 lfid=1 rd_subm_cnt_dblks=4 incr=2560 wr_subm_cnt_dblks=4) (MessageStoreImpl.cpp:1018)
2010-08-02 04:42:16 critical Unexpected error: Removing stale lock file /root/.qpidd/qpidd.48667.pid


Version-Release number of selected component (if applicable):
python-qpid-0.7.946106-4.el5
qpid-cpp-client-0.7.946106-8.el5
qpid-cpp-client-devel-0.7.946106-8.el5
qpid-cpp-client-devel-docs-0.7.946106-8.el5
qpid-cpp-client-ssl-0.7.946106-8.el5
qpid-cpp-mrg-debuginfo-0.7.946106-8.el5
qpid-cpp-server-0.7.946106-8.el5
qpid-cpp-server-cluster-0.7.946106-8.el5
qpid-cpp-server-devel-0.7.946106-8.el5
qpid-cpp-server-ssl-0.7.946106-8.el5
qpid-cpp-server-store-0.7.946106-8.el5
qpid-cpp-server-xml-0.7.946106-8.el5
qpid-java-client-0.7.946106-5.el5
qpid-java-common-0.7.946106-5.el5
qpid-tests-0.7.946106-1.el5
qpid-tools-0.7.946106-6.el5


How reproducible:
often

Steps to Reproduce:
see attached shell script 
  
Actual results:
Broker fails to start

Expected results:
Broker should start

Additional info:

Comment 1 ppecka 2010-08-03 09:08:10 UTC
Created attachment 436220 [details]
test "log"

Comment 2 Kim van der Riet 2010-08-05 13:32:14 UTC
There are several issues uncovered from this test.

A. Bugs - two are present here, one uncovered by the other:

  1. Resize failed when the message size > file size. It failed so as to
     leave the store it was resizing partly written, and the only record
     that had been written was being truncated.
  2. The store, on recovery of this truncated record threw a
     JERR_FCNTL_RDOFFSOVFL error. This is also a bug, and should not occur.
     The store needs to be stable when recovering any partly written journal.
     I'll open a separate bug for this when I isolate a reproducer.

B. Test script issues: Looking at the test script, there are several issues that need comment:

  1. Current settings have the size of the records being stored > file size.
     Given that only 4 files are in use and 10 records are being stored, this
     represents a highly mismatched sizing scenario, and only one record will
     fit into the journal before the enqueue threshold is hit while writing
     the second record.
  2. The current script does not change the size of the store, but resizes it
     to the same size as it was. Clearly this scenario should also be tested,
     but resizing to smaller or bigger sizes is desired here.
  3. The return result from the resize operation is ignored. If resize returns
     a non-zero result, then an error/exception occurred during resizing and
     the test should skip to the end of the test cycle and move on to the next
     test.
  4. The script kills the broker immediately after qpid-txtest ends. This does
     not leave any time for the broker to flush before being killed, leaving
     the end of the records that were written truncated. Given that only one
     record fits (see point 1 above), this leaves the only record truncated,
     and thus a store with no recoverable records.
  5. If a test fails, save the store dir in question for analysis. The store
     dir should have a bak direcory in it containing a copy of the original
     store before it was resized.

C. Test program: qpid-txtest:

  1. The program is now called qpid-txtest.
  2. Note that qpid-txtest does not handle closed connections during
     initialization phase of its test (this is what happens when the enqueue
     threshold error is raised). ie while loading the initial messages.
     If this happens, then txtest will report those messages that did not
     get loaded as missing when the test is rerun (unless you adjust the
     parameters to reflect only those messages that got into the store -
     which is hard to do).
  3. Given the above limitation, it might be better to run qpid-txtest in
     three parts: first using only "--init yes" to initialize the test which
     must succeed for the rest of the test to work. Then in a loop, run with
     "--transfer yes" only, interrupting after a random time with a broker
     kill. Then resize the store. Lastly, after recovery, run qpid-txtest
     with "--check yes" to make sure no messages were lost.
  4. Knowing the size and number of messages, one can get some idea of how
     large the store needs to be to contain the messages. Remember, however,
     that transactions take up extra space in the store for commit records
     and filler records that result from flushing.     
  5. Resizing both up and down in size should be checked.
  6. Attempting to resize to a size too small to contain the messages should
     also be checked, making sure that it fails predictably.

Comment 3 Kim van der Riet 2010-08-10 17:35:17 UTC
Fixed in r.4196

Also included in this fix are new resize utility tests which run as part of make check.

QE - the comments above notwithstanding, it should be possible to rerun this scenario without a resize failure, and the broker should be able to recover the resulting journal.

Comment 5 ppecka 2011-06-30 15:22:11 UTC
VERIFIED on rhel 5.7 / 6.2 both i686 / x86_64 archs

package set:
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-10.el6.x86_64
qpid-cpp-client-0.10-6.el6.x86_64
qpid-cpp-client-devel-0.10-6.el6.x86_64
qpid-cpp-client-devel-docs-0.10-6.el6.noarch
qpid-cpp-client-rdma-0.10-6.el6.x86_64
qpid-cpp-client-ssl-0.10-6.el6.x86_64
qpid-cpp-server-0.10-6.el6.x86_64
qpid-cpp-server-cluster-0.10-6.el6.x86_64
qpid-cpp-server-devel-0.10-6.el6.x86_64
qpid-cpp-server-rdma-0.10-6.el6.x86_64
qpid-cpp-server-ssl-0.10-6.el6.x86_64
qpid-cpp-server-store-0.10-6.el6.x86_64
qpid-cpp-server-xml-0.10-6.el6.x86_64
qpid-java-client-0.10-6.el6.noarch
qpid-java-common-0.10-6.el6.noarch
qpid-java-example-0.10-6.el6.noarch
qpid-java-jca-0.10-6.el6.noarch
qpid-qmf-0.10-10.el6.x86_64
qpid-qmf-devel-0.10-10.el6.x86_64
qpid-tests-0.10-1.el6.noarch
qpid-tools-0.10-4.el6.noarch
rh-qpid-cpp-tests-0.10-6.el6.x86_64

--> VERIFIED
ruby-qpid-qmf-0.10-10.el6.x86_64