Bug 474371 - qpidd+store exits on journal recovery because of 'Timeout waiting for AIO in MessageStoreImpl::recoverMessages()'
Summary: qpidd+store exits on journal recovery because of 'Timeout waiting for AIO in ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1
Hardware: All
OS: Linux
high
high
Target Milestone: 1.1
: ---
Assignee: Kim van der Riet
QA Contact: Frantisek Reznicek
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2008-12-03 15:19 UTC by Frantisek Reznicek
Modified: 2015-11-16 00:06 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-02-04 15:36:40 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2009:0035 0 normal SHIPPED_LIVE Red Hat Enterprise MRG Messaging 1.1 Release 2009-02-04 15:33:44 UTC

Description Frantisek Reznicek 2008-12-03 15:19:46 UTC
qpidd with store plugin loaded fails i.e. exits during the journal recovery
with message (exit code=1):

 'critical Broker start-up failed: Timeout waiting for AIO in MessageStoreImpl::recoverMessages() (MessageStoreImpl.cpp:930)'

Data stored in:
mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_105fails_rhel5.2_x86_64_081203.tar.bz2

  qpid_test_transaction_integrity/qpid_test_transaction_integrity.log:2700
  qpid_test_transaction_integrity/qpidd_txtest.transcript.log:13333
  qpid_test_transaction_integrity/qpidd_journal_a0088.tar.bz2


Version-Release number of selected component (if applicable):
qpidd rev. 722122 / rhm rev. 2898

How reproducible:
sporadic

Steps to Reproduce:
1. run qpid_test_transaction_integrity
2. check qpidd_txtest.transcript.log for message
3. go-to 1. if not found otherwise analyze stored journals


  
Actual results:
2008-dec-02 08:36:32 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2008-dec-02 08:36:32 info Loaded Module: /usr/lib64/qpid/client/rdmaconnector.so
2008-dec-02 08:36:32 warning SSL connector not enabled, you must set QPID_SSL_CERT_DB to enable it.
2008-dec-02 08:36:32 info Loaded Module: /usr/lib64/qpid/client/sslconnector.so
2008-dec-02 08:36:32 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
2008-dec-02 08:36:32 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2008-dec-02 08:36:32 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2008-dec-02 08:36:32 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2008-dec-02 08:36:32 info Loaded Module: /usr/lib64/qpid/daemon/rdma.so
2008-dec-02 08:36:32 info Management enabled
2008-dec-02 08:36:32 notice Journal "TplStore": Created
2008-dec-02 08:36:32 notice Store module initialized; dir=/root/qpid_test_transaction_integrity/rhts_qpidd/20081202_060832/broker.3213
2008-dec-02 08:36:32 info > Default files per journal: 8
2008-dec-02 08:36:32 info > Default journal file size: 24 (wpgs)
2008-dec-02 08:36:32 info > Default write cache page size: 32 (Kib)
2008-dec-02 08:36:32 info > Default number of write cache pages: 32
2008-dec-02 08:36:32 info > TPL files per journal: 8
2008-dec-02 08:36:32 info > TPL journal file size: 24 (wpgs)
2008-dec-02 08:36:32 info > TPL write cache page size: 4 (Kib)
2008-dec-02 08:36:32 info > TPL number of write cache pages: 64
2008-dec-02 08:36:33 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-1": Created
2008-dec-02 08:36:33 info Recovered queue "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-1": 10661 messages recovered; 0 messages
in-doubt.
2008-dec-02 08:36:33 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-2": Created
2008-dec-02 08:36:33 info Recovered queue "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-2": 612 messages recovered; 0 messages in
-doubt.
2008-dec-02 08:36:33 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-3": Created
2008-dec-02 08:36:33 info Recovered queue "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-3": 204 messages recovered; 0 messages in
-doubt.
2008-dec-02 08:36:33 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-4": Created
2008-dec-02 08:36:34 info Recovered queue "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-4": 102 messages recovered; 0 messages in
-doubt.
2008-dec-02 08:36:34 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-5": Created
2008-dec-02 08:36:34 info Recovered queue "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-5": 204 messages recovered; 0 messages in
-doubt.
2008-dec-02 08:36:34 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-6": Created
2008-dec-02 08:36:34 info Recovered queue "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-6": 306 messages recovered; 0 messages in
-doubt.
2008-dec-02 08:36:34 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-7": Created
2008-dec-02 08:36:34 info Recovered queue "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-7": 408 messages recovered; 0 messages in
-doubt.
2008-dec-02 08:36:34 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-8": Created
2008-dec-02 08:36:34 info Recovered queue "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-8": 204 messages recovered; 0 messages in
-doubt.
2008-dec-02 08:36:34 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-9": Created
2008-dec-02 08:36:35 info Recovered queue "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-9": 306 messages recovered; 0 messages in
-doubt.
2008-dec-02 08:36:35 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-10": Created
2008-dec-02 08:36:37 critical Broker start-up failed: Timeout waiting for AIO in MessageStoreImpl::recoverMessages() (MessageStoreImpl.cpp:930)


Expected results:
Successful journal recovery


Additional info:

  Data stored in:
  mrg3.lab.bos.redhat.com:
  /root/qpid_test_transaction_integrity_105fails_rhel5.2_x86_64_081203.tar.bz2

  another failure occurrences are stored in 

  mrg3.lab.bos.redhat.com:
  /root/qpid_test_transaction_integrity_56fails_rhel4.7_x86_64_081203.tar.bz2

Comment 1 Frantisek Reznicek 2008-12-04 09:07:42 UTC
One more observation (RHEL5.2, x86_64) including trace qpidd logging stored in
mrg3.lab.bos.redhat.com:
/root/qpid_test_transaction_integrity_3fails_rhel5.2_x86_64_081204.tar.bz2
see
  qpid_test_transaction_integrity/qpid_test_transaction_integrity.log
  qpid_test_transaction_integrity/qpidd_txtest.transcript.log.a370
  qpid_test_transaction_integrity/qpidd_journal_a0370.tar.bz2

Comment 2 Kim van der Riet 2008-12-05 20:50:23 UTC
Fixed in r.2939

This error was highlighted indirectly by the fix for Bug 472215. The rounding down of the read size to the nearest sblk boundary prevented the recover from reading the first dblk of a new page which was the last dblk containing data. However the deeper issue is that the initial recovery did not correct the sblk offset as it should by writing additional filler records until the dblk boundary was reached. This has been added, and solves this issue.

QA - test as follows:

1. By extracting the journal from queue 1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-10 and recovering from it - the recovery should time out without the fix and be normal with the fix in place. In addition, with the fix and running with --log-enable info+, you should see the following messages:

2008-dec-05 15:48:40 warning Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-10": Bad record alignment found at fid=0x2 offs=0x158280 (likely journal overwrite boundary); 3 filler record(s) required.
2008-dec-05 15:48:40 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-10": Recover phase write: Wrote filler record at offs=0x158280
2008-dec-05 15:48:40 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-10": Recover phase write: Wrote filler record at offs=0x158300
2008-dec-05 15:48:40 notice Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-10": Recover phase write: Wrote filler record at offs=0x158380
2008-dec-05 15:48:40 info Journal "1e2214bf2e87d528f104c6fea6b0dc5401a6d6df987dff5a63b1cf3338bfe3a5-10": Bad record alignment fixed.


2. Check for regressions and make sure that a) no further occurrences of this bug; and b) nothing else was broken.

Comment 4 Frantisek Reznicek 2008-12-15 10:47:50 UTC
Manually and automatically (via qpid_test_transaction_integrity test) validated that issue has been fixed on RHEL5.2 / 4.7 i386 / x86_64 on packages:
qpidd-0.4.725652-2.el5, rhm-0.4.2970-1.el5

->VERIFIED

Comment 6 errata-xmlrpc 2009-02-04 15:36:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0035.html


Note You need to log in before you can comment on or make changes to this bug.