Bug 472215

Summary: qpidd rmgr::get_events() threw JERR__AIO: AIO error
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: 1.1
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Target Milestone: 1.1
Target Release: ---
Reporter: Frantisek Reznicek <freznice>
Assignee: Kim van der Riet <kimp>
QA Contact: Kim van der Riet <kim.vdriet>
CC: esammons
Doc Type: Bug Fix
Last Closed: 2009-02-04 15:35:42 UTC

Description Frantisek Reznicek 2008-11-19 12:56:25 UTC
Description of problem:

RHTS qpid_txtest_fails_bz458053 test triggered JERR__AIO: AIO error. (AIO read operation failed: Invalid argument (-22) [pg=2 buf=0x2a97288200 rsize=0x80 offset=0x120200 fh=78]) (MessageStoreImpl.cpp:938)

RHTS run http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=36588
recipe 27623 test /distribution/MRG_Messaging/qpid_txtest_fails_bz458053 failed.

The main transcript is here:
http://rhts.redhat.com/testlogs/36588/127623/1080066/TESTOUT.log

The first failure, {run(A) 70/200}, is on line 2434.

Digging into details:
http://rhts.redhat.com/testlogs/36588/127623/1080066/qpidd_txtest.transcript.log line 8812 shows jexception 0x0103 rmgr::get_events() threw JERR__AIO: AIO error...

qpidd_txtest.transcript.log line 8812
'Queue cfb445aa5177ede791995-6: recoverMessages() failed: jexception 0x0103 rmgr::get_events() threw JERR__AIO: AIO error. (AIO read operation failed: Invalid argument (-22) [pg=2 buf=0x2a97288200 rsize=0x80 offset=0x120200 fh=78]) (MessageStoreImpl.cpp:938)'

Corresponding journal can be found in:
http://rhts.redhat.com/testlogs/36588/127623/1080066/qpidd_journal_a0070.tar.bz2




Version-Release number of selected component (if applicable):
qpidd-0.3.714058-4.el4, rhm-0.3.2804-1.el4, libaio-0.3.105-2


How reproducible:
unknown

Steps to Reproduce:
1a. Schedule RHTS test /distribution/MRG_Messaging/qpid_txtest_fails_bz458053 on an RHEL4.7 x86_64 machine (hp-xw9400-02.rhts.bos.redhat.com)

1b. Analyze qpidd journal store here:
  http://rhts.redhat.com/testlogs/36588/127623/1080066/qpidd_journal_a0070.tar.bz2

Actual results:
  run(A) 70/200 failed

Expected results:
  no failure

Additional info:
RHTS qpid_txtest_fails_bz458053 test results from
  (an RHEL4.7 x86_64 hp-xw9400-02.rhts.bos.redhat.com 2.6.9-78.ELsmp):

There is one more failure, 'recoverMessages() failed: jexception 0x0900 rmgr::read() threw JERR_RMGR_UNKNOWNMAGIC' (qpidd_txtest.transcript.log:11256), but I don't have a journal for that one.

Comment 1 Kim van der Riet 2008-11-20 13:54:13 UTC
The read pipeline tries to read complete read pages, but when there is insufficient material to read, it will read whatever is available. However, as we are using O_DIRECT, we are constrained by disk softblock (sblk) boundaries of 512 bytes. Looking at the above, rsize=0x80 clearly violates this condition. Looking at the source, this condition may arise when the read pipeline catches up with the write pointer.
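
To make the violated condition concrete, here is a minimal sketch that replays the parameters from the logged failure against the 512-byte sblk constraint. The names SBLK_SIZE, is_sblk_aligned and check_logged_read are hypothetical and are not taken from the journal source; the sketch assumes that buffer address, file offset and read size must each be a multiple of the sblk size.

```cpp
#include <cassert>
#include <cstdint>

// Soft-block (sblk) size quoted in this comment: 512 bytes.
// Hypothetical name, not the journal's own constant.
const std::uint64_t SBLK_SIZE = 512;

// True when value is a whole multiple of the sblk size.
inline bool is_sblk_aligned(std::uint64_t value) {
    return value % SBLK_SIZE == 0;
}

// Assumption: a read on an O_DIRECT file handle needs the buffer address,
// the file offset and the read size all aligned to the sblk size.
inline bool is_valid_direct_read(std::uint64_t buf,
                                 std::uint64_t offset,
                                 std::uint64_t rsize) {
    return is_sblk_aligned(buf) && is_sblk_aligned(offset) &&
           is_sblk_aligned(rsize);
}

// Replays the values from the error message:
// buf=0x2a97288200 rsize=0x80 offset=0x120200
inline void check_logged_read() {
    assert(is_sblk_aligned(0x2a97288200ULL));  // buffer address is aligned
    assert(is_sblk_aligned(0x120200ULL));      // file offset is aligned
    assert(!is_sblk_aligned(0x80ULL));         // read size (128 bytes) is not
    assert(!is_valid_direct_read(0x2a97288200ULL, 0x120200ULL, 0x80ULL));
}
```

Under these assumptions only rsize fails the check, which matches the "Invalid argument (-22)" reported for the AIO read.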

This is a logic error; the fix is to ensure that the read size is floored to the nearest sblk boundary.
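
The flooring could look roughly like the sketch below. It is illustrative only and is not the change made in the journal code; SBLK_BYTES, floor_to_sblk and next_read_size are hypothetical names, and the idea of capping the request at one read page is an assumption based on the description above.

```cpp
#include <cassert>
#include <cstddef>

// Soft-block size in bytes (hypothetical name for the 512-byte sblk).
const std::size_t SBLK_BYTES = 512;

// Round a byte count down to the nearest sblk boundary.
inline std::size_t floor_to_sblk(std::size_t nbytes) {
    return nbytes - (nbytes % SBLK_BYTES);
}

// Hypothetical helper: pick the size of the next read.
// 'available' is the number of bytes between the read position and the
// write pointer; 'page_bytes' is the size of one read page.
inline std::size_t next_read_size(std::size_t available,
                                  std::size_t page_bytes) {
    assert(page_bytes % SBLK_BYTES == 0);  // pages are whole sblks
    std::size_t want = available < page_bytes ? available : page_bytes;
    return floor_to_sblk(want);            // never request a partial sblk
}
```

With the logged value, floor_to_sblk(0x80) yields 0, so a caller written this way would presumably skip the read and wait until the write pipeline has produced at least one full sblk.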

This may be a difficult condition to reproduce as it arises based on dynamic asynchronous events in the read and write pipelines.

Comment 2 Kim van der Riet 2008-11-24 19:42:44 UTC
Fixed in r.2875

QA: This was found by inspection of the code; no known reproducer exists (other than blind chance at very small odds). It should be sufficient to check that no regressions occur as a result of this checkin.

Comment 4 Frantisek Reznicek 2008-12-03 14:28:44 UTC
Six long-running qpid_test_transaction_integrity test instances on RHEL 5.2 / 4.7, i386 / x86_64 show that the issue has been fixed.
Validated on packages: rhm-0.3.2898-1.el5 qpidd-0.3.722122-2.el5
->VERIFIED

Comment 6 errata-xmlrpc 2009-02-04 15:35:42 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0035.html