Bug 472937

Summary: TPL recoverTplStore() failed: jexception 0x0b01 txn_map::get_tdata_list() threw JERR_MAP_NOTFOUND: Key not found in map. (xid=rhm-tid0x2aac2c9b0ee0) (MessageStoreImpl.cpp:1079)
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Version: 1.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Target Milestone: 1.1.1
Target Release: ---
Reporter: Gordon Sim <gsim>
Assignee: Kim van der Riet <kim.vdriet>
QA Contact: Kim van der Riet <kim.vdriet>
CC: davids, freznice
Doc Type: Bug Fix
Last Closed: 2009-04-21 16:16:48 UTC

Attachments:
test script I was using (flags: none)

Description Gordon Sim 2008-11-25 18:03:01 UTC
Created attachment 324639: test script I was using

ERROR: test_recover (tests_0-10.dtx.DtxTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/gordon/qpid/python/tests_0-10/dtx.py", line 655, in test_recover
    xids = session.dtx_recover().in_doubt
  File "/home/gordon/qpid/python/qpid/invoker.py", line 27, in <lambda>
    method = lambda *args, **kwargs: self.invoke(resolved, args, kwargs)
  File "/home/gordon/qpid/python/qpid/session.py", line 158, in invoke
    return self.do_invoke(type, args, kwargs)
  File "/home/gordon/qpid/python/qpid/session.py", line 213, in do_invoke
    return result.get(self.timeout)
  File "/home/gordon/qpid/python/qpid/datatypes.py", line 257, in get
    raise self.exception(self._error)
SessionException: (501, u'TPL recoverTplStore() failed: jexception 0x0b01 txn_map::get_tdata_list() threw JERR_MAP_NOTFOUND: Key not found in map. (xid=rhm-tid0x2aac2c9b0ee0) (MessageStoreImpl.cpp:1079)')

I ran a couple of txtests concurrently in a loop with the python tests and got the above error back after a couple of iterations on the python side.

Comment 1 Kim van der Riet 2008-12-09 17:59:57 UTC
Fixed in BZ 2954.

The load test in this script uncovered a race condition.

QA: This error is easy to reproduce using the above script, particularly if the python test is modified to just run dtx.DtxTests.test_recover. (This can be done by editing qpid/cpp/src/tests/python_tests or setting $PYTHON_TESTS appropriately.)

Comment 2 Kim van der Riet 2008-12-09 18:02:31 UTC
er... the above should have read:

Fixed in svn r.2954

Comment 4 David Sommerseth 2009-01-12 15:11:25 UTC
Ran the test on ibm-mongoose.rhts.bos.redhat.com using these packages:

python-qpid-0.4.733051-1.el5
rhm-0.4.3036-2.el5
qpidd-0.4.732838-1.el5
qpidc-perftest-0.4.732838-1.el5

Modified the test script attached in this bz by adding 

    export PYTHON_TESTS="tests_0-10.dtx.DtxTests.test_recover"

at the beginning of the script and changing only the paths to the perftest and txtest binaries.  I also checked out cpp/src/tests, python and specs from SVN to get the test files needed to run this test.  The txtest and perftest binaries are from the corresponding qpidc-perftest package.


======================================================================
ERROR: test_recover (tests_0-10.dtx.DtxTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/qpid/python/tests_0-10/dtx.py", line 655, in test_recover
    xids = session.dtx_recover().in_doubt
  File "/root/qpid/python/qpid/generator.py", line 25, in <lambda>
    method = lambda self, *args, **kwargs: self.invoke(inst, args, kwargs)
  File "/root/qpid/python/qpid/session.py", line 143, in invoke
    return self.do_invoke(type, args, kwargs)
  File "/root/qpid/python/qpid/session.py", line 198, in do_invoke
    return result.get(self.timeout)
  File "/root/qpid/python/qpid/datatypes.py", line 257, in get
    raise self.exception(self._error)
SessionException: (501, u'TPL recoverTplStore() failed: jexception 0x0b01 txn_map::get_tdata_list_nolock() threw JERR_MAP_NOTFOUND: Key not found in map. (xid=rhm-tid0x2aaab0018940) (MessageStoreImpl.cpp:1079)')
======================================================================

The test was performed on MRG/M-1.0.1 and RC packages of MRG/M-1.1.

Comment 5 Kim van der Riet 2009-01-14 13:01:16 UTC
A further race condition was found in MessageStoreImpl::readTplStore(): an XID read in the while loop could be removed by another thread before execution reached the tmap.get_tdata_list(xid) call within the same loop iteration.

A pragmatic fix of catching and ignoring the error was chosen in this case, rather than attempting the complex, error-prone and possibly performance-degrading route of locking the transaction map against other threads while the entire map is read for this operation. This call is made infrequently and is not part of the regular message handling code path.
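
For illustration only, the pattern described above can be sketched in Python. This is not the store's C++ code: the TxnMap class, read_prepared() and the between_snapshot_and_lookup hook are hypothetical stand-ins, and KeyError plays the role of JERR_MAP_NOTFOUND. The sketch assumes each map operation is locked individually and that the scan does not hold the lock across the whole loop, so an XID can disappear between the key snapshot and the lookup.

```python
import threading


class TxnMap:
    """Hypothetical stand-in for the journal's transaction map (not the real txn_map API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def insert(self, xid, records):
        with self._lock:
            self._data[xid] = list(records)

    def remove(self, xid):
        with self._lock:
            self._data.pop(xid, None)

    def xid_list(self):
        # Snapshot of the keys; the lock is released as soon as this returns.
        with self._lock:
            return list(self._data)

    def get_tdata_list(self, xid):
        # Raises KeyError if the xid was removed after the snapshot was taken.
        with self._lock:
            return list(self._data[xid])


def read_prepared(tmap, between_snapshot_and_lookup=None):
    """Collect the records for each xid, skipping xids that vanish mid-scan."""
    recovered = {}
    for xid in tmap.xid_list():
        if between_snapshot_and_lookup is not None:
            # Hook used only to simulate another thread acting at this point.
            between_snapshot_and_lookup(xid)
        try:
            recovered[xid] = tmap.get_tdata_list(xid)
        except KeyError:
            # The transaction was completed elsewhere; there is nothing to recover.
            continue
    return recovered
```

Calling read_prepared() with a hook that removes a later XID while an earlier one is being processed returns only the surviving entries instead of raising. The trade-off is the one stated above: the scan may report a slightly stale view, but no lock is held across the whole map.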

Fixed in r.3039

QA - a long test as described above without error should prove this bug is fixed.

Comment 6 Frantisek Reznicek 2009-01-30 15:49:57 UTC
The fix has been verified on RHEL 4.7/5.3 i386/x86_64 with packages
qpidd-0.4.738274-1 and rhm-0.4.3075-3.
tests_0-10.dtx.DtxTests.test_recover passed during a long-running test.

->VERIFIED

Comment 8 errata-xmlrpc 2009-04-21 16:16:48 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html