Bug 479714
| Summary: | qpidd+store message loss (detected by txtest --check yes after qpidd recovery) | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise MRG | Reporter: | Frantisek Reznicek <freznice> |
| Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet> |
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.1 | CC: | esammons |
| Target Milestone: | 1.1 | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-02-04 15:36:43 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Frantisek Reznicek 2009-01-12 16:45:58 UTC

Analysis of the journal shows that this is not RHEL4-specific; rather, specific conditions in the journal itself cause the problem. Two conditions must hold to trigger this error:

1. the first message to be recovered must cross a file boundary (i.e. part in one file, the remainder in the next);
2. all previous enqueues must be dequeued, making this message the first to be read on recovery.

The read manager jumps to the first file containing an enqueued record when starting the read process (to speed up recovery, since there is no need to scan files known to contain no enqueued records). However, when a record crossed a file boundary, the logic incorrectly placed the count for that record against the second file instead of the first. This has no effect on recovery if there are prior records to be read in the first file, but if this record is the first to be read, the read manager starts in the file containing the tail end of the record, and the record is never read. Fixed in r.3038.

QA: This will be difficult to reproduce. The specific conditions outlined above do not occur frequently, although using a small journal and large messages increases the probability somewhat. If the diff for r.3038 is applied to r.3030 (the version of the journal that the test was done with), then recovering the error journal and running the check portion of txtest works correctly. Unfortunately, this journal cannot be recovered against a trunk store, as the hash algorithm for distributing the queues across the various directories in rhm/jrnl (i.e. across dirs 0000 - 00c0) has changed to give a more even distribution.

A long run of journal testing (total test time of about 4 days) on various supported platforms (RHEL 5.2 / 4.7, i386 / x86_64) confirms that the message-loss problem has been fixed. Any future occurrence of message loss will move this bug back to ASSIGNED.
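The failure mode described above can be sketched in a few lines. This is a minimal illustration, not the actual store code: the function name and the per-file enqueue-count list are hypothetical, but they show why crediting a boundary-crossing record to the wrong file makes the read manager start one file too late.

```python
# Illustrative sketch of the bug: each journal file keeps a count of live
# enqueue records, and recovery starts at the first file whose count is
# non-zero. A record that crosses a file boundary must be counted against
# the file where it BEGINS; counting it against the file holding its tail
# makes the read manager skip the file containing the record's head.

def first_file_to_read(enqueue_counts):
    """Return the index of the first journal file with a live enqueue record."""
    for i, count in enumerate(enqueue_counts):
        if count > 0:
            return i
    return None  # empty journal

# Scenario from the report: the only live record starts in file 1 and
# spills into file 2; all earlier enqueues were dequeued (counts of 0).

buggy_counts = [0, 0, 1]   # count placed against the tail file (file 2)
fixed_counts = [0, 1, 0]   # count placed against the head file (file 1)

print(first_file_to_read(buggy_counts))  # 2: recovery starts past the record head
print(first_file_to_read(fixed_counts))  # 1: recovery starts at the record head
```

With the buggy accounting, recovery begins in the file holding only the tail of the record, so the record header is never seen and the message is silently dropped; the fix moves the count to the file where the record starts.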
-> VERIFIED

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0035.html