479714 – qpidd+store message loss (detected by txtest --check yes after qpidd recovery)


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	1.1
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.1
Target Release:	---
Assignee:	Kim van der Riet
QA Contact:	Frantisek Reznicek
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-01-12 16:45 UTC by Frantisek Reznicek
Modified:	2015-11-16 00:06 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-02-04 15:36:43 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2009:0035	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Messaging 1.1 Release	2009-02-04 15:33:44 UTC

Description Frantisek Reznicek 2009-01-12 16:45:58 UTC

Description of problem:

I triggered two occurrences of txtest message loss in about 20 runs of the transaction integrity tests, after not seeing it for quite a long period.
This might indicate that a bug was introduced between revisions rhm-0.4.2970 ... rhm-0.4.3030.

[occurrence 1]
qpid_txtest_fails_bz458053 test RHTS, RHEL4.7 i386 (message loss after qpidd recovery and txtest --check yes)

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5662704
http://rhts.redhat.com/testlogs/40386/137892/1158776/TESTOUT.log.gz line 2719
http://rhts.redhat.com/testlogs/40386/137892/1158776/qpidd_txtest.transcript.log.gz line 13709
http://rhts.redhat.com/testlogs/40386/137892/1158776/qpidd_journal_a0086.tar.bz2

[occurrence 2]
qpid_test_transaction_integrity test RHTS, RHEL4.7 i386 (message loss after qpidd recovery and txtest --check yes)

http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=5644362
http://rhts.redhat.com/testlogs/40384/137888/1158651/TESTOUT.log.gz line 2557
http://rhts.redhat.com/testlogs/40384/137888/1158651/qpidd_txtest.transcript.log.gz line 11178
http://rhts.redhat.com/testlogs/40384/137888/1158651/qpidd_journal_b0007.tar.bz2


Version-Release number of selected component (if applicable):
qpidd-0.4.728142-1.el4
rhm-0.4.3030-1.el4

How reproducible:
rarely (<5%)

Steps to Reproduce:
1. run the RHTS qpid_test_transaction_integrity or qpid_txtest_fails_bz458053 test
   on multiple machines and cpu architectures
2. repeat step 1 until the message loss is detected
  
Actual results:
After the broker recovers, 'txtest --check yes' identifies one missing message in the message journal.

Expected results:
No missing message is expected!

Additional info:
Observed on RHEL4.7 i386 only.

Comment 1 Kim van der Riet 2009-01-12 16:56:41 UTC

Analysis of the journal shows that this is not RHEL4-specific; rather, a specific set of conditions in the journal itself causes the problem. For this error to show:

1. the first message to be recovered must cross a file boundary (ie part in one file, the remainder in the next);
2. all previous enqueues must be dequeued, making this message the first to be read on recovery.

The read manager will jump to the first file containing an enqueued record when starting the read process (to increase the speed of recovery - there is no need to go through lots of files which we know have no enqueued records on them). However, when a record crosses a file boundary, the logic incorrectly placed the count for that record against the second, not first file. This has no effect on recovery if there are prior records to be read in the first file, but if this record is the first to be read, the read manager will start in the file containing the tail-end of the first record. This causes the first record not to be read.
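The mechanism described above can be illustrated with a small model. This is only a sketch, not the journal's actual code: the file size, the Record type, and the functions start_file, end_file, first_read_file and recover are all hypothetical names invented for the illustration. It assumes fixed-size journal files, a per-file count of enqueued records, and a reader that starts at the first file whose count is non-zero.

```python
from dataclasses import dataclass

FILE_SIZE = 100  # hypothetical journal file size, in arbitrary units


@dataclass
class Record:
    rid: int
    offset: int            # absolute start offset within the journal
    size: int
    dequeued: bool = False


def start_file(rec):
    """File index in which the record begins."""
    return rec.offset // FILE_SIZE


def end_file(rec):
    """File index in which the record ends."""
    return (rec.offset + rec.size - 1) // FILE_SIZE


def first_read_file(records, count_against):
    """Lowest file index with a non-zero enqueued-record count, or None."""
    counts = {}
    for rec in records:
        if not rec.dequeued:
            fid = count_against(rec)
            counts[fid] = counts.get(fid, 0) + 1
    return min(counts) if counts else None


def recover(records, count_against):
    """Ids of enqueued records whose start the reader actually sees."""
    first = first_read_file(records, count_against)
    if first is None:
        return []
    read_from = first * FILE_SIZE
    return [r.rid for r in records if not r.dequeued and r.offset >= read_from]


# Records 1 and 2 are dequeued; record 3 starts in file 0 and ends in file 1.
records = [
    Record(1, 0, 60, dequeued=True),
    Record(2, 60, 30, dequeued=True),
    Record(3, 90, 30),
    Record(4, 120, 30),
]

print(recover(records, end_file))    # count against the second file: [4]
print(recover(records, start_file))  # count against the first file: [3, 4]
```

If record 2 is left enqueued in this model, both counting rules start the reader in file 0 and record 3 is recovered either way, which matches the observation that the miscount is harmless when there are prior records to read in the first file.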

Comment 2 Kim van der Riet 2009-01-12 17:05:48 UTC

Fixed in r.3038.

QA: This will be difficult to reproduce. The specific conditions outlined above do not occur frequently (although using a small journal and large messages will increase the probability somewhat). If the diff for r.3038 is applied to r.3030 (the version of the journal that the test was done with), then recovering the error journal and running the check portion of txtest will work ok.
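As a rough illustration of why a small journal and large messages raise the odds, the sketch below counts how many back-to-back records span two files for a given file size and record size. It is a simplified model under stated assumptions: equally sized records written end to end, no file headers or padding, and a hypothetical helper named straddle_fraction. It covers only the first condition (a record crossing a file boundary), not the requirement that all earlier enqueues be dequeued.

```python
def straddle_fraction(file_size, record_size, n_records):
    """Fraction of back-to-back records that begin in one file and end in the next."""
    straddling = 0
    for i in range(n_records):
        start = i * record_size
        end = start + record_size - 1
        if start // file_size != end // file_size:
            straddling += 1
    return straddling / n_records


# Same record size, two hypothetical file sizes.
small_journal = straddle_fraction(file_size=1000, record_size=300, n_records=1000)
large_journal = straddle_fraction(file_size=10000, record_size=300, n_records=1000)
print(small_journal, large_journal)
```

In this model the fraction is higher for the smaller file size, and it drops to zero when the record size divides the file size exactly, since no record then crosses a boundary.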

Unfortunately, this journal cannot be recovered against a trunk store as the hash algorithm for distributing the queues across the various directories in rhm/jrnl (ie across dirs 0000 - 00c0) has changed to give a more even distribution.

Comment 4 Frantisek Reznicek 2009-01-15 09:40:45 UTC

A batch of long-term journal testing (total test time of about 4 days) on various supported platforms (RHEL 5.2 / 4.7, i386 / x86_64) shows that the message loss problem has been fixed.

Any future occurrence of message loss will move this bug back to ASSIGNED.

->VERIFIED

Comment 6 errata-xmlrpc 2009-02-04 15:36:43 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0035.html
