Bug 458053 - txtest failures when broker killed during transfer phase
Summary: txtest failures when broker killed during transfer phase
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.0
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: 1.0.1
Assignee: Kim van der Riet
QA Contact: Kim van der Riet
URL:
Whiteboard:
Depends On:
Blocks: 460116
 
Reported: 2008-08-06 11:55 UTC by Gordon Sim
Modified: 2011-08-12 16:21 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-10-06 19:08:26 UTC
Target Upstream Version:


Attachments
Failing MRG/qpid_test_transaction_integrity test log (also on http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4116052) (1.44 MB, application/x-bzip2)
2008-09-04 09:05 UTC, Frantisek Reznicek


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2008:0640 0 normal SHIPPED_LIVE Red Hat Enterprise MRG bug fix and enhancement update 2008-10-06 19:08:07 UTC

Description Gordon Sim 2008-08-06 11:55:33 UTC
Using store r2258 from release branch (and r680695 from qpid.0-10). 

1. Start broker
2. run txtest (e.g. txtest --messages-per-tx 100 --tx-count 100000 --total-messages 10000 --size 64 --queues 4)
3. after some time, 'kill -9 <broker-pid>'
4. remove lock and restart broker
5. run check phase (e.g. txtest --messages-per-tx 100 --tx-count 100000 --total-messages 10000 --size 64 --queues 4 --check yes --init no --transfer no)

Expect all messages to be present. Sometimes messages are reported as missing, sometimes the following error occurs instead:

Queue tx-test-2: async_dequeue() failed: jexception 0x0b01 txn_map::get_tdata_list() threw JERR_MAP_NOTFOUND: Key not found in map. (xid=) (BdbMessageStore.cpp:1246)

Comment 1 Kim van der Riet 2008-08-12 15:35:15 UTC
Several problems were responsible for this error:

1. The 1PC transactions were not being handled atomically across multiple queues. This was fixed by keeping 1PC txns in the prepared list and modifying the txn recovery logic to handle 1PC txns.

2. Message recovery did not correctly predict the outcome of messages which needed to be rolled forward/back because of incomplete multi-queue commits/aborts. The message recovery logic was reworked to extract the information it needs to make this determination from the journal enqueue and transaction maps. Some new accessors were added to class jcntl to allow for these operations.

In addition, some bugs were found in the python journal file checker jfile_chk.py which was used to analyze the journal files. These were fixed, and a new -a flag now performs transactional analysis on the journal and reports open transactions and locked records.
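The atomic handling of 1PC transactions described in point 1 can be sketched as follows. This is a hypothetical Python model with invented names (`recover_txns`, `prepared_list`), not the actual C++ store code: a 1PC transaction stays on the prepared list until recovery can decide its outcome, so a multi-queue transaction is rolled forward or back as a whole rather than per queue.

```python
# Hypothetical sketch of the 1PC recovery fix; names are invented and do
# not correspond to the real qpid-cpp store API. A 1PC txn is kept on the
# prepared list until recovery decides its fate atomically across every
# queue it touched.

def recover_txns(prepared_list, committed_xids):
    """Decide, per prepared transaction, whether to roll forward or back.

    prepared_list: {xid: [(queue, message), ...]} -- incomplete txns
    committed_xids: xids whose commit record reached the journal
    Returns (queue contents after recovery, set of aborted xids).
    """
    queues = {}
    aborted = set()
    for xid, ops in prepared_list.items():
        if xid in committed_xids:
            # Commit record present: roll forward on *all* queues, so the
            # outcome is atomic even for multi-queue 1PC transactions.
            for queue, message in ops:
                queues.setdefault(queue, []).append(message)
        else:
            # No commit record: roll the whole transaction back.
            aborted.add(xid)
    return queues, aborted
```

The key property is that the commit decision is made once per xid, never once per queue, which removes the partial-commit states seen in the failing runs.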

Fixed in r.2279.

Comment 2 Kim van der Riet 2008-08-12 17:58:01 UTC
After some testing, there are still occasional cases of lost messages when txtest is run in test mode. Reassigning.

Comment 3 Kim van der Riet 2008-08-13 19:29:09 UTC
Additional cases were found for prepared-but-not-completed transactions, and also for non-prepared transactions that were not being correctly aborted at the journal level.

r.2297

Comment 5 Frantisek Reznicek 2008-08-29 07:58:15 UTC
RHTS test developed (MRG/qpid_txtest_fails_bz458053).
Test results coming soon.

Comment 6 Frantisek Reznicek 2008-08-29 14:08:16 UTC
RHTS test MRG/qpid_txtest_fails_bz458053 proved that this issue is no longer present. (See RHTS jobs 28113 and 28114 for details.)

Comment 7 Frantisek Reznicek 2008-09-04 09:03:09 UTC
After a few more automated test runs (MRG/qpid_txtest_fails_bz458053 and MRG/qpid_test_transaction_integrity), fewer than one percent of cases still fail.

MRG/qpid_test_transaction_integrity test shows it on http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4116052.

Please find test case log attached.


Moving VERIFIED to FAILS_QA.

P.S. Latest test case code can be found here:
http://cvs.devel.redhat.com/cgi-bin/cvsweb.cgi/tests/distribution/MRG_Messaging/

Comment 8 Frantisek Reznicek 2008-09-04 09:05:35 UTC
Created attachment 315728 [details]
Failing MRG/qpid_test_transaction_integrity test log (also on http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=4116052)

Comment 9 Gordon Sim 2008-09-04 09:25:57 UTC
FYI: to get the journal data tarball use e.g.: 'uudecode -o fjournal.tar.bz2 4116052-test_log--distribution-MRG_Messaging-qpid_test_transaction_integrity.log'

(uudecode is in the sharutils package)
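For anyone without sharutils handy, a rough Python stand-in for that uudecode step might look like the sketch below (function name and behaviour are my own; it decodes the first uuencoded payload found in a log using the standard-library `binascii.a2b_uu`):

```python
import binascii

def uudecode_from_log(log_text):
    """Extract and decode the first uuencoded payload embedded in a log.

    A rough stand-in for uudecode(1): scan for the 'begin <mode> <name>'
    header, decode each body line with binascii.a2b_uu, and stop at the
    'end' trailer. Returns the decoded bytes.
    """
    out, in_body = bytearray(), False
    for line in log_text.splitlines():
        if not in_body:
            if line.startswith("begin "):
                in_body = True
            continue
        if line.strip() == "end":
            break
        if line:
            out += binascii.a2b_uu(line)
    return bytes(out)
```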

Comment 10 Frantisek Reznicek 2008-09-17 07:32:10 UTC
It seems that the attached failing journal is in fact a good one. After re-running the tests I got these three failing cases:
  RHEL 5.2 x86_64
  mrg4.lab.bos.redhat.com:/root/20080917qpid_test_transaction_integrity_3_fails_incl_db.tar.bz2
  RHEL 4.7 x86_64
  http://rhts.redhat.com/testlogs/29291/106517/909531/4318706-test_log--distribution-MRG_Messaging-qpid_txtest_fails_bz458053.log
  RHEL 4.6 i386
  http://rhts.redhat.com/testlogs/29294/106520/909599/4324650-test_log--distribution-MRG_Messaging-qpid_test_transaction_integrity.log

Comment 11 Kim van der Riet 2008-09-18 17:46:09 UTC
After taking a close look at these failures, it was found that the errors are probabilistic, i.e. there is a small but finite chance of failure which does not depend on the database/journal itself. All of the attached failure journals pass the txtest check when rerun, but if the check is run repeatedly, a failure will eventually occur.

After a difficult investigation, it was found that a bug exists in the read pipeline which is triggered by a particular case of out-of-order returns from libaio to the read page cache. The sequence that triggers the bug is as follows:

1. Page n in the read cache is consumed and must be updated with new disk data. A read is sent through libaio to get the new data.

2. All the following pages (from n+1 to the last page and from 0 to n-1) are consumed and similarly refreshed.

3. All of the libaio read operations for the pages following page n (from n+1 to the last page and from 0 to n-1) return, but that of page n is still outstanding - even though it was the first to be sent.

When these conditions are met, then:
a) consumption of data from the page cache stops at the end of page n-1 because page n is still waiting for libaio to return with its data, and an RHM_IORES_PAGE_AIOWAIT is returned for all read attempts.

b) when the new data for page n-1 returns, the journal read logic incorrectly concludes that the journal has been fully read and returns an RHM_IORES_EMPTY code. This causes the recovery process to stop before all the available messages have been recovered.

This is a probabilistic read error, and is triggered purely by the libaio order of return. There is no fault in the journal files. This bug would affect all versions of RHEL, and any recovery. The fact that it was uncovered by txtest (a transactional integrity test) is coincidental, and has nothing to do with transactions per se. In addition, the probability of failure has much to do with the disk subsystem and drivers in use. Fast disks / controllers with efficient drivers will present a lower probability of failure than slower and inefficient systems.
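The failure sequence above can be illustrated with a toy Python model (invented names; the real logic lives in the C++ journal). Reads are submitted in page order but may complete out of order; the buggy end-of-journal test fires when the last-submitted read completes, even though the first-submitted page is still outstanding, so its messages never reach recovery:

```python
# Toy model of the read page-cache bug. Pages are refreshed via async
# reads; libaio may complete them out of order. The buggy "journal empty"
# test fired when the last-submitted read completed, dropping the data of
# any earlier page whose read was still outstanding.

def drain(pages, completion_order, buggy):
    """Consume page data in submission order as completions arrive.

    pages: list of message lists, one per submitted read (submission order)
    completion_order: order in which the reads actually complete
    buggy: if True, stop as soon as the last-submitted read completes
    Returns the messages actually delivered to recovery.
    """
    completed = set()
    delivered = []
    next_page = 0  # pages must be consumed in submission order
    for page in completion_order:
        completed.add(page)
        # Consume every contiguously-completed page from the front.
        while next_page < len(pages) and next_page in completed:
            delivered.extend(pages[next_page])
            next_page += 1
        if buggy and page == len(pages) - 1:
            # Buggy check: last-submitted read is back => assume the
            # journal is fully read, even if an earlier read is pending.
            return delivered
    return delivered
```

With in-order completions the buggy and fixed paths agree, which is why fast disk subsystems rarely hit the bug; only an out-of-order completion of the first-submitted read exposes it.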

1.0 branch: r.2496
trunk: r.2497

Comment 13 Frantisek Reznicek 2008-09-24 15:02:35 UTC
RHTS automated tests (MRG/qpid_txtest_fails_bz458053 and
MRG/qpid_test_transaction_integrity) now prove that issue is gone. (->VERIFIED)

Comment 15 errata-xmlrpc 2008-10-06 19:08:26 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0640.html

