Bug 731770
Summary: | JERR_JINF_CVALIDFAIL after a reboot
---|---
Product: | Red Hat Enterprise MRG
Component: | qpid-cpp
Version: | 2.0
Status: | CLOSED NOTABUG
Severity: | medium
Priority: | medium
Reporter: | gautric <gregoire>
Assignee: | Kim van der Riet <kim.vdriet>
QA Contact: | MRG Quality Engineering <mrgqe-bugs>
CC: | jross, kim.vdriet
Hardware: | x86_64
OS: | Linux
Target Milestone: | ---
Target Release: | ---
Doc Type: | Bug Fix
Last Closed: | 2013-02-27 10:41:03 UTC
Description
gautric
2011-08-18 15:15:39 UTC
I will provide all of the qpid filesystem database files as an attachment ASAP.

"RHM_JDAT_VERSION mismatch" implies that the store is trying to recover an old journal whose version does not match the current store version. The version of the recovered store is written into the header of each journal file and into the .jinf file (a sort of journal metadata XML file). The old store is located at /var/lib/qpidd/rhm/jrnl/0001/pmu_replyTo128/. Looking at the .jinf file in this directory will provide other details, such as when the store was created. It is also possible that this file is damaged or incomplete in some way.

The value of the version field should be 0x01, and that has been the case since at least mid-2007. There have been no recent changes to this part of the code that should cause a problem.

A compressed tar/zip of the files in /var/lib/qpidd/rhm/jrnl/0001/pmu_replyTo128/ would be useful in analyzing what is going on. If it is determined that this store is old or incomplete, then simply deleting it will solve the problem. If this is a recent store, and this condition has arisen from the use of the current version, then further investigation will be required.

Since this store is located on /var, and you have resized this partition, I can't help wondering if there is some connection; perhaps the files in this directory were damaged in some way. In particular, was the broker running while the resizing operation was in progress?

One other observation (from looking at the source code): the message above is incomplete/truncated.

    void jinf::validate()
    {
        bool err = false;
        std::ostringstream oss;
        // Compare the version read from the .jinf file with the version this store requires.
        if (_jver != RHM_JDAT_VERSION)
        {
            oss << "File \"" << _filename << "\": ";
            oss << "RHM_JDAT_VERSION mismatch: " << _jver;
            oss << "; required=" << RHM_JDAT_VERSION << std::endl;
            err = true;
        }
        ... other validity checks ...
        // Report all accumulated validation failures in a single exception.
        if (err)
            throw jexception(jerrno::JERR_JINF_CVALIDFAIL, oss.str(), "jinf", "validate");
    }

The full message _should_ have ended with something like:

    "... RHM_JDAT_VERSION mismatch: 0; required=1"

Created attachment 518985 [details]
/var/lib/qpid database snapshot
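One way the truncated message discussed above could arise is sketched below. This is only an illustration under a stated assumption, not a diagnosis of the store code: it assumes the version is held in an 8-bit integer type, which the thread does not show. In C++, streaming such a value writes it as a character, so a version of 0 would put a NUL byte into the text, and anything that later treats the message as a C string would stop there. The names `kRequiredVersion`, `mismatch_message_raw` and `mismatch_message_numeric` are hypothetical stand-ins, not part of the store.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Stand-in for the store's version constant; the thread gives its value as 0x01.
constexpr int kRequiredVersion = 0x01;

// Builds the mismatch text the same way the quoted validate() does.
// ASSUMPTION: the version is an 8-bit integer; the thread does not show its type.
std::string mismatch_message_raw(const std::string& filename, std::uint8_t jver) {
    std::ostringstream oss;
    oss << "File \"" << filename << "\": ";
    oss << "RHM_JDAT_VERSION mismatch: " << jver;  // streamed as a character, not a number
    oss << "; required=" << kRequiredVersion;
    return oss.str();
}

// Same text, but the version is widened first so it is printed as a number.
std::string mismatch_message_numeric(const std::string& filename, std::uint8_t jver) {
    std::ostringstream oss;
    oss << "File \"" << filename << "\": ";
    oss << "RHM_JDAT_VERSION mismatch: " << static_cast<unsigned>(jver);
    oss << "; required=" << kRequiredVersion;
    return oss.str();
}
```

Under that assumption, the numeric variant yields the ending quoted in the thread ("RHM_JDAT_VERSION mismatch: 0; required=1"), while the raw variant contains an embedded NUL right after "mismatch: ".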
The file /var/lib/qpidd/rhm/jrnl/0001/pmu_replyTo128/JournalData.jinf is empty (0 bytes). There are several scenarios that might have caused this, including:

1. Insufficient disk space on /var when this journal was being created;
2. File loss during resizing of the partition (unlikely unless the broker was running and formatting journals while the resizing was in progress);
3. Broker was interrupted (killed) while journal formatting was in progress.

It would be helpful to inspect the broker logs from the period when this journal was being created to see if any error messages can shed light on a possible problem.

To resolve the problem, you can:

1. Delete all queues by deleting /var/lib/qpidd/rhm. You will need to recreate the queues.
2. Delete just this queue by deleting /var/lib/qpidd/rhm/jrnl/0001/pmu_replyTo128 and then recreating it.
3. With a little trouble, create a valid .jinf file by hand, which would allow recovery to proceed.

I have not examined any of the other numerous queues present, so there is a chance that other queues may also have issues, particularly if the partition ran out of space at some point during queue creation.

I need to look at the error reporting for this condition; it seems that a 0-length .jinf file should be intercepted prior to validation. I also need to find out how the error message can be truncated in this condition.

I forgot to mention that /var had insufficient disk space before the resizing. We could close this issue, but create a FAQ item for it.

I agree, let's close it. However, I do need to improve the error messages for this condition, as noted previously. It would also help to shed light on the real issue if this condition should arise again. I'll create a separate BZ for this.

New BZ for the error message issue: Bug 732004

I just created a FAQ section at https://cwiki.apache.org/confluence/display/qpid/Starting+a+cluster
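The idea raised in this thread of intercepting a 0-length .jinf file before validation, and of checking whether other queues are affected, could be sketched roughly as follows. This is a standalone C++17 illustration, not the store's actual code: `JinfState`, `check_jinf` and `find_empty_jinf` are hypothetical names, and the only thing taken from the thread is that journal metadata lives in files ending in .jinf beneath the store directory.

```cpp
#include <filesystem>
#include <fstream>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical classification of a .jinf file before any content validation.
enum class JinfState { Missing, Empty, NonEmpty };

// Classifies one file: Missing if it is not a regular file or cannot be opened,
// Empty if it holds no bytes, NonEmpty otherwise.
JinfState check_jinf(const fs::path& file) {
    std::error_code ec;
    if (!fs::is_regular_file(file, ec)) return JinfState::Missing;
    std::ifstream in(file, std::ios::binary);
    if (!in) return JinfState::Missing;
    if (in.peek() == std::ifstream::traits_type::eof()) return JinfState::Empty;
    return JinfState::NonEmpty;
}

// Walks the store directory and collects every zero-length .jinf file.
std::vector<fs::path> find_empty_jinf(const fs::path& store_root) {
    std::vector<fs::path> hits;
    std::error_code ec;
    if (!fs::is_directory(store_root, ec)) return hits;
    for (const auto& entry : fs::recursive_directory_iterator(
             store_root, fs::directory_options::skip_permission_denied)) {
        if (entry.path().extension() == ".jinf" &&
            check_jinf(entry.path()) == JinfState::Empty) {
            hits.push_back(entry.path());
        }
    }
    return hits;
}
```

Calling `find_empty_jinf("/var/lib/qpidd/rhm")` with the broker stopped would list the journal directories that match the condition found here, which may help decide between deleting a single queue and deleting the whole store.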