622889 – Store resize operation causes Qpid broker recovery to fail with JERR_FCNTL_RDOFFSOVFL

Bug 622889 - Store resize operation causes Qpid broker recovery to fail with JERR_FCNTL_RDOFFSOVFL

Summary: Store resize operation causes Qpid broker recovery to fail with JERR_FCNTL_RD...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	Development
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.3
Target Release:	---
Assignee:	Kim van der Riet
QA Contact:	MRG Quality Engineering
Docs Contact:
URL:
Whiteboard:
Depends On:	620676
Blocks:
TreeView+	depends on / blocked

Reported:	2010-08-10 17:28 UTC by Kim van der Riet
Modified:	2012-12-07 17:44 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	620676
Environment:
Last Closed:	2012-12-07 17:41:55 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Description Kim van der Riet 2010-08-10 17:28:28 UTC

+++ This bug was initially created as a clone of Bug #620676 +++

Created an attachment (id=436218)
reproducer

Description of problem:
Qpid broker with resized store (duplicated) with resize utility, fails to start again, with:
 
Daemon startup failed: Queue tx-test-1: recoverMessages() failed: jexception 0x0405 fcntl::add_rd_subm_cnt_dblks() threw JERR_FCNTL_RDOFFSOVFL: Attempted increase read offset past write offset. (pfid=1 lfid=1 rd_subm_cnt_dblks=4 incr=2560 wr_subm_cnt_dblks=4) (MessageStoreImpl.cpp:1018)
2010-08-02 04:42:16 critical Unexpected error: Removing stale lock file /root/.qpidd/qpidd.48667.pid


Version-Release number of selected component (if applicable):
python-qpid-0.7.946106-4.el5
qpid-cpp-client-0.7.946106-8.el5
qpid-cpp-client-devel-0.7.946106-8.el5
qpid-cpp-client-devel-docs-0.7.946106-8.el5
qpid-cpp-client-ssl-0.7.946106-8.el5
qpid-cpp-mrg-debuginfo-0.7.946106-8.el5
qpid-cpp-server-0.7.946106-8.el5
qpid-cpp-server-cluster-0.7.946106-8.el5
qpid-cpp-server-devel-0.7.946106-8.el5
qpid-cpp-server-ssl-0.7.946106-8.el5
qpid-cpp-server-store-0.7.946106-8.el5
qpid-cpp-server-xml-0.7.946106-8.el5
qpid-java-client-0.7.946106-5.el5
qpid-java-common-0.7.946106-5.el5
qpid-tests-0.7.946106-1.el5
qpid-tools-0.7.946106-6.el5


How reproducible:
often

Steps to Reproduce:
see attached shell script 
  
Actual results:
Broker fails to start

Expected results:
Broker should start

Additional info:

--- Additional comment from ppecka on 2010-08-03 05:08:10 EDT ---

Created an attachment (id=436220)
test "log"

--- Additional comment from kim.vdriet on 2010-08-05 09:32:14 EDT ---

There are several issues uncovered from this test.

A. Bugs - two are present here, one uncovered by the other:

  1. Resize failed when the message size > file size. It failed so as to
     leave the store it was resizing partly written, and the only record
     that had been written was being truncated.
  2. The store, on recovery of this truncated record threw a
     JERR_FCNTL_RDOFFSOVFL error. This is also a bug, and should not occur.
     The store needs to be stable when recovering any partly written journal.
     I'll open a separate bug for this when I isolate a reproducer.

B. Test script issues: Looking at the test script, there are several issues that need comment:

  1. Current settings have the size of the records being stored > file size.
     Given that only 4 files are in use and 10 records are being stored, this
     represents a highly mismatched sizing scenario, and only one record will
     fit into the journal before the enqueue threshold is hit while writing
     the second record.
  2. The current script does not change the size of the store, but resizes it
     to the same size as it was. Clearly this scenario should also be tested,
     but resizing to smaller or bigger sizes is desired here.
  3. The return result from the resize operation is ignored. If resize returns
     a non-zero result, then an error/exception occurred during resizing and
     the test should skip to the end of the test cycle and move on to the next
     test.
  4. The script kills the broker immediately after qpid-txtest ends. This does
     not leave any time for the broker to flush before being killed, leaving
     the end of the records that were written truncated. Given that only one
     record fits (see point 1 above), this leaves the only record truncated,
     and thus a store with no recoverable records.
  5. If a test fails, save the store dir in question for analysis. The store
     dir should have a bak direcory in it containing a copy of the original
     store before it was resized.

C. Test program: qpid-txtest:

  1. The program is now called qpid-txtest.
  2. Note that qpid-txtest does not handle closed connections during
     initialization phase of its test (this is what happens when the enqueue
     threshold error is raised). ie while loading the initial messages.
     If this happens, then txtest will report those messages that did not
     get loaded as missing when the test is rerun (unless you adjust the
     parameters to reflect only those messages that got into the store -
     which is hard to do).
  3. Given the above limitation, it might be better to run qpid-txtest in
     three parts: first using only "--init yes" to initialize the test which
     must succeed for the rest of the test to work. Then in a loop, run with
     "--transfer yes" only, interrupting after a random time with a broker
     kill. Then resize the store. Lastly, after recovery, run qpid-txtest
     with "--check yes" to make sure no messages were lost.
  4. Knowing the size and number of messages, one can get some idea of how
     large the store needs to be to contain the messages. Remember, however,
     that transactions take up extra space in the store for commit records
     and filler records that result from flushing.     
  5. Resizing both up and down in size should be checked.
  6. Attempting to resize to a size too small to contain the messages should
     also be checked, making sure that it fails predictably.

Comment 1 Kim van der Riet 2010-08-10 17:29:55 UTC

This bug is to deal with the broker recovery failure (JERR_FCNTL_RDOFFSOVFL) resulting from attempting to restore the partly written journal in the scenario above.

Comment 2 Kim van der Riet 2010-08-11 17:19:49 UTC

Fixed in r.4199

This represents a corner case previously undiscovered for recovery of corrupted journals. In this case, a large record which spans more than one file was incompletely written in the second file.

When the read pipeline is validated, it reads the file header from the first non-empty file (or last file written if there are no records to read), and sets the write pointer at the first available record in the file which is located at the position indicated in the fro (first record offset) field in the header. This places it after the point where the last record was corrupted. The write pointers, however, were (correctly) set at the start of the file. Since the read pointers are constrained by the write pointers, this operation caused a JERR_FCNTL_RDOFFSOVFL error.

The error was fixed by placing an additional check in the file header read logic which processes the initial read pointer locations to ensure that the value set is constrained by the value of the write pointers.

QE: You will need a saved part-written journal from Bug #620676 to test this. Make sure you save the complete store. Also, do not attempt to recover your copy directly, as the store will overwrite the store when it attempts to recover it. Use a fresh copy of the part-written store each time you test this.

If you cannot get a store that fails correctly, contact me, I have a "working" copy that you can use.

Comment 3 ppecka 2011-07-12 08:42:03 UTC

VERIFIED on RHEL 5.6 / 6.2 - (i686/x86_64)

rpm -qa | grep qpid | sort -u
python-qpid-0.10-1.el6.noarch
python-qpid-qmf-0.10-10.el6.x86_64
qpid-cpp-client-0.10-6.el6.x86_64
qpid-cpp-client-devel-0.10-6.el6.x86_64
qpid-cpp-client-devel-docs-0.10-6.el6.noarch
qpid-cpp-client-rdma-0.10-6.el6.x86_64
qpid-cpp-client-ssl-0.10-6.el6.x86_64
qpid-cpp-server-0.10-6.el6.x86_64
qpid-cpp-server-cluster-0.10-6.el6.x86_64
qpid-cpp-server-devel-0.10-6.el6.x86_64
qpid-cpp-server-rdma-0.10-6.el6.x86_64
qpid-cpp-server-ssl-0.10-6.el6.x86_64
qpid-cpp-server-store-0.10-6.el6.x86_64
qpid-cpp-server-xml-0.10-6.el6.x86_64
qpid-java-client-0.10-6.el6.noarch
qpid-java-common-0.10-6.el6.noarch
qpid-java-example-0.10-6.el6.noarch
qpid-java-jca-0.10-6.el6.noarch
qpid-qmf-0.10-10.el6.x86_64
qpid-qmf-devel-0.10-10.el6.x86_64
qpid-tests-0.10-1.el6.noarch
qpid-tools-0.10-4.el6.noarch
rh-qpid-cpp-tests-0.10-6.el6.x86_64

--> VERIFIED

Note You need to log in before you can comment on or make changes to this bug.