Bug 479714
| Summary: | qpidd+store message loss (detected by txtest --check yes after qpidd recovery) | | |
| --- | --- | --- | --- |
| Product: | Red Hat Enterprise MRG | Reporter: | Frantisek Reznicek <freznice> |
| Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet> |
| Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 1.1 | CC: | esammons |
| Target Milestone: | 1.1 | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-02-04 15:36:43 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Frantisek Reznicek 2009-01-12 16:45:58 UTC

Analysis of the journal shows that this is not RHEL4-specific; rather, specific conditions in the journal itself cause the problem. Two conditions must hold to trigger this error:

1. the first message to be recovered must cross a file boundary (i.e. part in one file, the remainder in the next);
2. all previous enqueues must be dequeued, making this message the first to be read on recovery.

The read manager jumps to the first file containing an enqueued record when starting the read process (to speed up recovery, since there is no need to scan files known to contain no enqueued records). However, when a record crossed a file boundary, the logic incorrectly placed the count for that record against the second file instead of the first. This has no effect on recovery if there are prior records to be read in the first file, but if this record is the first to be read, the read manager starts in the file containing the tail end of the record, and the record is never read. Fixed in r.3038.

QA: This will be difficult to reproduce. The specific conditions outlined above do not occur frequently, although using a small journal and large messages increases the probability somewhat. If the diff for r.3038 is applied to r.3030 (the version of the journal that the test was done with), then recovering the error journal and running the check portion of txtest works correctly. Unfortunately, this journal cannot be recovered against a trunk store, as the hash algorithm for distributing the queues across the various directories in rhm/jrnl (i.e. across dirs 0000 - 00c0) has changed to give a more even distribution.

A long run of journal testing (total test time of about 4 days) on various supported platforms (RHEL 5.2 / 4.7, i386 / x86_64) confirms that the message-loss problem has been fixed. Any future occurrence of message loss will move this bug back to ASSIGNED.
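The failure mode described above can be sketched in a few lines. This is a minimal illustration, not the actual store code: the function name and the per-file enqueue-count list are hypothetical, but they show why crediting a boundary-crossing record to the wrong file makes the read manager start one file too late.

```python
# Illustrative sketch of the bug: each journal file keeps a count of live
# enqueue records, and recovery starts at the first file whose count is
# non-zero. A record that crosses a file boundary must be counted against
# the file where it BEGINS; counting it against the file holding its tail
# makes the read manager skip the file containing the record's head.

def first_file_to_read(enqueue_counts):
    """Return the index of the first journal file with a live enqueue record."""
    for i, count in enumerate(enqueue_counts):
        if count > 0:
            return i
    return None  # empty journal

# Scenario from the report: the only live record starts in file 1 and
# spills into file 2; all earlier enqueues were dequeued (counts of 0).

buggy_counts = [0, 0, 1]   # count placed against the tail file (file 2)
fixed_counts = [0, 1, 0]   # count placed against the head file (file 1)

print(first_file_to_read(buggy_counts))  # 2: recovery starts past the record head
print(first_file_to_read(fixed_counts))  # 1: recovery starts at the record head
```

With the buggy accounting, recovery begins in the file holding only the tail of the record, so the record header is never seen and the message is silently dropped; the fix moves the count to the file where the record starts.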
-> VERIFIED

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0035.html