731770 – JERR_JINF_CVALIDFAIL after a reboot

Summary: JERR_JINF_CVALIDFAIL after a reboot

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	2.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Kim van der Riet
QA Contact:	MRG Quality Engineering
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2011-08-18 15:15 UTC by gautric
Modified:	2013-02-27 10:42 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-02-27 10:41:03 UTC
Target Upstream Version:
Embargoed:

Attachments
/var/lib/qpid database snapshot (6.09 MB, application/x-gzip) 2011-08-19 08:06 UTC, gautric

Description gautric 2011-08-18 15:15:39 UTC

Description of problem:

qpidd fails to start with a JERR_JINF_CVALIDFAIL error after a reboot.


Version-Release number of selected component (if applicable):
2.0 

How reproducible:

Not yet tested.

Steps to Reproduce:
1.
2.
3.
  
Actual results:


***********************************************
[root@tet-l ~]# service qpidd start
Starting Qpid AMQP daemon: Daemon startup failed: Queue pmu_replyTo128: recoverQueues() failed: jexception 0x0c00 jinf::validate() threw JERR_JINF_CVALIDFAIL: Journal compatibility validation failure. (File "/var/lib/qpidd/rhm/jrnl/0001/pmu_replyTo128//JournalData.jinf": RHM_JDAT_VERSION mismatch:  (MessageStoreImpl.cpp:820)
***********************************************

Expected results:
The qpidd process starts and runs.


Additional info:

We raised the nofile limit for the qpidd user to 40960 (hard and soft).

We also created 451 queues.

We resized the /var partition with LVM, up to 10 GB.

Comment 1 gautric 2011-08-18 15:16:44 UTC

I will attach the whole qpid filesystem database as soon as possible.

Comment 2 Kim van der Riet 2011-08-18 18:18:45 UTC

"RHM_JDAT_VERSION mismatch" implies that the store is trying to recover an old journal whose store version does not match the current store version. The version is written into the header of each journal file and into the .jinf file (a sort of journal meta-data XML file). The old store is located at /var/lib/qpidd/rhm/jrnl/0001/pmu_replyTo128/.

Looking at the .jinf file in this dir will provide other details, such as when the store was created. It is also possible that this file is damaged or incomplete in some way.

The value of the version field should be 0x01, and that has been the case since at least mid-2007. There is no change to this part of the code in recent times that should cause a problem.

A compressed tar/zip of the files in /var/lib/qpidd/rhm/jrnl/0001/pmu_replyTo128/ would be useful in analyzing what is going on; if it is determined that this store is old or incomplete, then simply deleting it will solve the problem.

If this is a recent store, and this condition has arisen from the use of the current version, then further investigation would be required.

Since this store is located on /var, and you have resized this partition, I can't help wondering if there is some connection... perhaps the files in this dir were damaged in some way. In particular, was the broker running while the resizing operation was in progress?

Comment 3 Kim van der Riet 2011-08-18 18:39:17 UTC

One other observation (from looking at the source code): The message above is incomplete/truncated.

void jinf::validate()
{
    bool err = false;
    std::ostringstream oss;
    if (_jver != RHM_JDAT_VERSION)
    {
        oss << "File \"" << _filename << "\": ";
        oss << "RHM_JDAT_VERSION mismatch: " << _jver;
        oss << "; required=" << RHM_JDAT_VERSION << std::endl;
        err = true;
    }
    
... other validity checks ...

    if (err)
        throw jexception(jerrno::JERR_JINF_CVALIDFAIL, oss.str(), "jinf", "validate");
}

The full message _should_ have ended with something like:

"... RHM_JDAT_VERSION mismatch: 0; required=1"
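
One possible explanation for the truncation, offered as a hypothesis and not a confirmed diagnosis: if _jver and RHM_JDAT_VERSION have an 8-bit character-like type (an assumption; their declarations are not shown in this report), then streaming a value of 0 writes a NUL character, and anything that later reads the message as a C string stops at that point. That would be consistent with the reported message ending right after "mismatch: ". The sketch below only illustrates the effect; all function names in it are hypothetical and are not part of the store code.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: streams the version as an 8-bit value, as validate()
// would if _jver were an unsigned char type (assumption, not confirmed here).
std::string mismatch_message_raw(std::uint8_t jver, std::uint8_t required)
{
    std::ostringstream oss;
    // On common platforms std::uint8_t is unsigned char, so operator<<
    // writes it as a character: the value 0 becomes an embedded '\0'.
    oss << "RHM_JDAT_VERSION mismatch: " << jver << "; required=" << required;
    return oss.str();
}

// Hypothetical alternative: widen to unsigned so the value prints as a number.
std::string mismatch_message_numeric(std::uint8_t jver, std::uint8_t required)
{
    std::ostringstream oss;
    oss << "RHM_JDAT_VERSION mismatch: " << static_cast<unsigned>(jver)
        << "; required=" << static_cast<unsigned>(required);
    return oss.str();
}

// Models a consumer that reads the message through a C-string interface,
// which stops at the first NUL character.
std::string as_c_string_view(const std::string& msg)
{
    return std::string(msg.c_str());
}
```

If this hypothesis holds, widening the value before streaming it, as in mismatch_message_numeric(), would keep the message intact.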

Comment 4 gautric 2011-08-19 08:06:28 UTC

Created attachment 518985
/var/lib/qpid database snapshot

Comment 5 Kim van der Riet 2011-08-19 12:04:58 UTC

The file /var/lib/qpidd/rhm/jrnl/0001/pmu_replyTo128/JournalData.jinf is empty (0 bytes).

There are several scenarios that might have caused this, including:

1. Insufficient disk space on /var when this journal was being created;
2. File loss during resizing of the partition (unlikely unless the broker was running and formatting journals while the resizing was in progress);
3. Broker was interrupted (killed) while journal formatting was in progress.

It would be helpful to inspect the broker logs from the period when this journal was being created to see if any error messages can shed light on a possible problem.

To resolve the problem you can:
1. Delete all queues by deleting /var/lib/qpidd/rhm. You will need to recreate the queues.
2. Delete just this queue by deleting /var/lib/qpidd/rhm/jrnl/0001/pmu_replyTo128 and then recreating it.
3. With a little trouble, a valid .jinf file could be created by hand which would allow recovery to proceed.

I have not examined any of the other numerous queues present, so there is a chance that other queues may also have issues, particularly if the partition ran out of space at some point during queue creation.

I need to look at the error reporting for this condition: a 0-length .jinf file should be intercepted prior to validation. I also need to find out how the error message came to be truncated in this condition.
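
Intercepting a 0-length .jinf file before validation could be sketched as a simple size probe run before the file is parsed. This is a minimal illustration under stated assumptions, not the store's actual code; JinfState and probe_jinf are hypothetical names introduced here.

```cpp
#include <fstream>
#include <string>

// Hypothetical classification of a .jinf file before it is parsed.
enum class JinfState { Missing, Empty, HasContent };

// Opens the file positioned at its end and uses that offset as its size.
// An empty file can then be reported explicitly instead of surfacing
// later as a version mismatch.
JinfState probe_jinf(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::ate);
    if (!in.is_open())
        return JinfState::Missing;
    const std::streamoff size = static_cast<std::streamoff>(in.tellg());
    return size == 0 ? JinfState::Empty : JinfState::HasContent;
}
```

A caller could run this probe on each queue's JournalData.jinf during recovery and raise a dedicated "empty .jinf file" error before the version check is reached.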

Comment 6 gautric 2011-08-19 12:21:58 UTC

I forgot to mention that /var had insufficient disk space before the resizing.

We could close this issue, but create an FAQ item for it.

Comment 7 Kim van der Riet 2011-08-19 12:33:36 UTC

I agree, let's close it.

However, I do need to improve the error messages for this condition, as noted previously. Better messages would also help shed light on the real issue if this condition should arise again. I'll create a separate BZ for this.

Comment 8 Kim van der Riet 2011-08-19 12:42:51 UTC

New BZ error message issue: Bug 732004

Comment 9 gautric 2011-08-19 13:01:17 UTC

I just created a FAQ section at
https://cwiki.apache.org/confluence/display/qpid/Starting+a+cluster
