Bug 460116 - txtest failures when broker killed during transfer phase (RHEL 4)
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Hardware: All Linux
Priority: urgent Severity: high
Target Milestone: 1.0.1
Assigned To: Kim van der Riet
Depends On: 458053
Reported: 2008-08-26 05:02 EDT by Gordon Sim
Modified: 2011-08-12 12:22 EDT (History)
Doc Type: Bug Fix
Last Closed: 2008-10-06 14:58:58 EDT
Description Gordon Sim 2008-08-26 05:02:00 EDT
+++ This bug was initially created as a clone of Bug #458053 +++

Using store r2258 from release branch (and r680695 from qpid.0-10). 

1. Start broker
2. run txtest (e.g. txtest --messages-per-tx 100 --tx-count 100000 --total-messages 10000 --size 64 --queues 4)
3. after some time, 'kill -9 <broker-pid>'
4. remove lock and restart broker
5. run check phase (e.g. txtest --messages-per-tx 100 --tx-count 100000 --total-messages 10000 --size 64 --queues 4 --check yes --init no --transfer no)

Expect all messages to be present. Sometimes messages are reported as missing, sometimes the following error occurs instead:

Queue tx-test-2: async_dequeue() failed: jexception 0x0b01 txn_map::get_tdata_list() threw JERR_MAP_NOTFOUND: Key not found in map. (xid=) (BdbMessageStore.cpp:1246)
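
The reproduction steps above can be sketched as a small Python driver. This is only an illustration, not part of the original test setup: the txtest flags are copied from the report, while the broker command, the lock file path, the sleep intervals and the helper names (txtest_command, reproduce) are assumptions the caller would have to adapt.

```python
import os
import signal
import subprocess
import time

# txtest options exactly as quoted in the report.
COMMON_OPTIONS = {
    "messages-per-tx": 100,
    "tx-count": 100000,
    "total-messages": 10000,
    "size": 64,
    "queues": 4,
}


def txtest_command(check_only=False, binary="txtest"):
    """Build the txtest argument list for the transfer run or the check run."""
    args = [binary]
    for flag, value in COMMON_OPTIONS.items():
        args += ["--" + flag, str(value)]
    if check_only:
        # Step 5: check phase only, no init and no transfer.
        args += ["--check", "yes", "--init", "no", "--transfer", "no"]
    return args


def reproduce(broker_cmd, lock_path, kill_after_seconds=30):
    """Run steps 1-5. broker_cmd and lock_path are site-specific assumptions."""
    broker = subprocess.Popen(broker_cmd)            # step 1: start broker
    time.sleep(2)                                    # crude wait for startup
    client = subprocess.Popen(txtest_command())      # step 2: run txtest
    time.sleep(kill_after_seconds)
    os.kill(broker.pid, signal.SIGKILL)              # step 3: kill -9
    broker.wait()
    client.wait()
    if os.path.exists(lock_path):                    # step 4: remove lock
        os.remove(lock_path)
    broker = subprocess.Popen(broker_cmd)            # step 4: restart broker
    time.sleep(2)
    try:
        # Step 5: return whatever exit status the check run produces.
        return subprocess.call(txtest_command(check_only=True))
    finally:
        broker.terminate()
        broker.wait()
```

Calling reproduce() in a loop with varying kill_after_seconds would be one way to hit the intermittent failure, since the result depends on where in the transfer phase the broker dies.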

--- Additional comment from kim.vdriet@redhat.com on 2008-08-12 11:35:15 EDT ---

Several problems were responsible for this error:

1. The 1PC transactions were not being handled atomically across multiple queues. This was fixed by keeping 1PC txns in the prepared list and modifying the txn recovery logic to handle 1PC txns.

2. Message recovery did not correctly predict the outcome of messages which needed to be rolled forward/back because of incomplete multi-queue commits/aborts. The message recovery logic was reworked to extract the information it needs to make this determination from the journal enqueue and transaction maps. Some new accessors were added to class jcntl to allow for these operations.
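
To illustrate what an incomplete multi-queue commit looks like at recovery time, here is a toy model in Python. It is not the store's code and does not use the jcntl accessors. The record layout, the function name resolve_transactions and the decision rule (a commit record in any one queue's journal means roll forward everywhere, otherwise roll back) are all assumptions made for the sketch.

```python
from collections import defaultdict


def resolve_transactions(journals):
    """Pick one outcome per xid across all queue journals.

    journals maps a queue name to a list of (xid, event) records, where
    event is one of "enqueue", "prepare", "commit" or "abort".
    Returns a dict mapping each xid to "commit" or "abort".
    """
    events_by_xid = defaultdict(set)
    for records in journals.values():
        for xid, event in records:
            events_by_xid[xid].add(event)

    outcome = {}
    for xid, events in events_by_xid.items():
        # Assumed rule: if any journal already holds the commit record,
        # the remaining queues are rolled forward; otherwise the
        # transaction is rolled back on every queue.
        outcome[xid] = "commit" if "commit" in events else "abort"
    return outcome


# The broker died after committing t1 on tx-test-1 but before tx-test-2.
journals = {
    "tx-test-1": [("t1", "enqueue"), ("t1", "prepare"), ("t1", "commit")],
    "tx-test-2": [("t1", "enqueue"), ("t1", "prepare"), ("t2", "enqueue")],
}
decisions = resolve_transactions(journals)
```

In this example t1 is resolved to "commit" for both queues and t2, which was never prepared, is resolved to "abort". The point is that the decision is made once per transaction from all journals together, rather than per queue.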

In addition, some bugs were found in the python journal file checker jfile_chk.py which was used to analyze the journal files. These were fixed, and a new -a flag now performs transactional analysis on the journal and reports open transactions and locked records.

Fixed in r2279.

--- Additional comment from kim.vdriet@redhat.com on 2008-08-12 13:58:01 EDT ---

After some testing, there are still occasional cases of lost messages when txtest is run in test mode. Reassigning.

--- Additional comment from kim.vdriet@redhat.com on 2008-08-13 15:29:09 EDT ---

Additional cases were found for prepared but not completed transactions, and also for non-prepared transactions which were not being correctly aborted at the journal level.

Comment 2 Frantisek Reznicek 2008-08-29 03:58:43 EDT
RHTS test developed (MRG/qpid_txtest_fails_bz458053).
Test results coming soon.
Comment 3 Frantisek Reznicek 2008-08-29 10:11:51 EDT
RHTS test MRG/qpid_txtest_fails_bz458053 showed that this issue is no longer
present. (See RHTS jobs 28113 and 28114 for details.)
Comment 4 Frantisek Reznicek 2008-09-04 05:07:34 EDT
After a few more automated tests (MRG/qpid_txtest_fails_bz458053 and
MRG/qpid_test_transaction_integrity), less than a percent of cases still
fail.

MRG/qpid_test_transaction_integrity test shows it on

Please find test case log attached (on RHEL5 clone i.e. bz458053).


P.S. Latest test case code can be found here:
Comment 5 Kim van der Riet 2008-09-18 14:43:00 EDT
Fixed, see https://bugzilla.redhat.com/show_bug.cgi?id=458053
Comment 6 Frantisek Reznicek 2008-09-24 11:01:56 EDT
RHTS automated tests (MRG/qpid_txtest_fails_bz458053 and
MRG/qpid_test_transaction_integrity) now show that the issue is gone. (->VERIFIED)
Comment 8 errata-xmlrpc 2008-10-06 14:58:58 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

