Bug 474366

Summary: qpidd+store jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH on journal recovery
Product: Red Hat Enterprise MRG Reporter: Frantisek Reznicek <freznice>
Component: qpid-cppAssignee: Kim van der Riet <kim.vdriet>
Status: CLOSED ERRATA QA Contact: Frantisek Reznicek <freznice>
Severity: high Docs Contact:
Priority: urgent    
Version: 1.1CC: esammons, gsim
Target Milestone: 1.1.1   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-04-21 16:16:51 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Frantisek Reznicek 2008-12-03 15:07:38 UTC
Description of problem:

qpidd with store plugin loaded fails i.e. exits during the journal recovery with message (exit code=1):

 'critical Broker start-up failed: Queue 474bc746d3f20a45947b49ecf6c2a088cf0dfba8bbb5e83c5cb1b582ba7e-3: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0001 rid=0x00e15fc0 foffs=0x00000300 expected_fid=0x0000) (MessageStoreImpl.cpp:721)'


Detected on revs: 719671 / 2804, not yet detected on latest rpm revs (722122 / 2898)

Data stored in:
mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_fails_081202.tar.bz2

  qpid_test_transaction_integrity/qpid_test_transaction_integrity.log:15864
  qpid_test_transaction_integrity/qpidd_txtest.transcript.log:78592
  qpidd_journal_b0035.tar.bz2
  qpidd_journal_b0036.tar.bz2
  ...
  qpidd_journal_b0044.tar.bz2


Version-Release number of selected component (if applicable):
qpidd rev. 719671 / rhm rev. 2804

How reproducible:
sporadic

Steps to Reproduce:
1. run qpid_test_transaction_integrity
2. check qpidd_txtest.transcript.log for message
3. go-to 1. if not found otherwise analyze stored journals
  
Actual results:
2008-nov-29 04:49:11 info Loaded Module: /usr/lib64/qpid/daemon/xml.so
2008-nov-29 04:49:11 info Loaded Module: /usr/lib64/qpid/daemon/acl.so
2008-nov-29 04:49:11 info Loaded Module: /usr/lib64/qpid/daemon/rdma.so
2008-nov-29 04:49:11 info Loaded Module: /usr/lib64/qpid/daemon/ssl.so
2008-nov-29 04:49:11 info Loaded Module: /usr/lib64/qpid/client/sslconnector.so
2008-nov-29 04:49:11 info Loaded Module: /usr/lib64/qpid/client/rdmaconnector.so
2008-nov-29 04:49:11 info Loaded Module: /usr/lib64/qpid/daemon/cluster.so
2008-nov-29 04:49:11 info Loaded Module: /usr/lib64/qpid/daemon/msgstore.so
2008-nov-29 04:49:11 info Management enabled
2008-nov-29 04:49:11 notice Journal "TplStore": Created
2008-nov-29 04:49:11 notice Store module initialized; dir=/home/freznice/prjs/rhts_tests/tests/distribution/MRG_Messaging/qpid_test_transaction_integ
rity/rhts_qpidd/20081128_155024/broker.9980
2008-nov-29 04:49:11 info > Default files per journal: 8
2008-nov-29 04:49:11 info > Default journal file size: 24 (wpgs)
2008-nov-29 04:49:11 info > Default write cache page size: 32 (Kib)
2008-nov-29 04:49:11 info > Default number of write cache pages: 32
2008-nov-29 04:49:11 info > TPL files per journal: 8
2008-nov-29 04:49:11 info > TPL journal file size: 24 (wpgs)
2008-nov-29 04:49:11 info > TPL write cache page size: 4 (Kib)
2008-nov-29 04:49:11 info > TPL number of write cache pages: 64
2008-nov-29 04:49:11 notice Journal "474bc746d3f20a45947b49ecf6c2a088cf0dfba8bbb5e83c5cb1b582ba7e-1": Created
2008-nov-29 04:49:11 info Recovered queue "474bc746d3f20a45947b49ecf6c2a088cf0dfba8bbb5e83c5cb1b582ba7e-1": 6207 messages recovered; 0 messages in-do
ubt.
2008-nov-29 04:49:11 notice Journal "474bc746d3f20a45947b49ecf6c2a088cf0dfba8bbb5e83c5cb1b582ba7e-2": Created
2008-nov-29 04:49:12 info Recovered queue "474bc746d3f20a45947b49ecf6c2a088cf0dfba8bbb5e83c5cb1b582ba7e-2": 1166 messages recovered; 0 messages in-do
ubt.
2008-nov-29 04:49:12 notice Journal "474bc746d3f20a45947b49ecf6c2a088cf0dfba8bbb5e83c5cb1b582ba7e-3": Created
2008-nov-29 04:49:12 critical Broker start-up failed: Queue 474bc746d3f20a45947b49ecf6c2a088cf0dfba8bbb5e83c5cb1b582ba7e-3: recoverQueues() failed: j
exception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0001 rid=0x00e15fc0 foffs=0x00000300 expected_fid=0x0000) (MessageStoreImpl.cpp:721)



Expected results:
Successful journal recovery

Additional info:

  Data stored in:
  mrg3.lab.bos.redhat.com:
  /root/qpid_test_transaction_integrity_fails_081202.tar.bz2

Comment 1 Frantisek Reznicek 2008-12-04 10:28:02 UTC
Still not detected on latest packages -> lowering priority (might be solved after rev. 2804).

Comment 2 Kim van der Riet 2008-12-05 15:39:25 UTC
This needs to be carefully analysed, the journal artifacts show odd write behavior, but there is no time to get to the bottom of this before the 1.1 release deadline. Only one known occurrence. Pushing to 1.1.1.

Comment 3 Frantisek Reznicek 2008-12-15 13:30:35 UTC
The bug observed again on an RHEL 5.2 x86_64 using MRG 1.1 RC packages
(qpidd-0.4.725652-2.el5, rhm-0.4.2970-1.el5)

I performed very extreme almost 2-days run of qpid_test_transaction_integrity
where this issue is observed:

qpidd+store refused to recover journal with JERR_JCNTL_OWIMISMATCH message:
  2008-dec-14 01:49:35 critical Broker start-up failed: Queue 686f3364cdce3160f2ddfde76d79-7: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0003 rid=0x1387cc58 foffs=0x00000200 expected_fid=0x0004) (MessageStoreImpl.cpp:721)

Observed at run(B) 699/700 and 700/700

Once this failure is encountered then qpidd cannot start with that journal!

Data stored in:
mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_fails_rhel5.2_x86_64_081215.tar.bz2

Packages (details in qpid_test_transaction_integrity/qpid_test_transaction_integrity.log:31-54):
qpidd-0.4.725652-2.el5, rhm-0.4.2970-1.el5

look at:
qpid_test_transaction_integrity/qpid_test_transaction_integrity.log:37138-37193
qpid_test_transaction_integrity/qpidd_txtest.transcript.log:222781,222819,222855,222891,222927

corresponding journals: 
  qpid_test_transaction_integrity/qpidd_journal_b0699.tar.bz2
  qpid_test_transaction_integrity/qpidd_journal_b0700.tar.bz2

Comment 4 Frantisek Reznicek 2009-01-05 15:09:05 UTC
Another two occurences of JERR_JCNTL_OWIMISMATCH issue triggered by journal tests. In both cases happen during transaction integrity test - part B when qpidd journal is not cleared in between runs.
Once this issue happens then qpidd will not properly start up. (unless journal is deleted)

=-= [1st occurence] =-= RHEL 4.7 x86_64

2008-dec-22 16:30:48 critical Broker start-up failed: Queue 4b44b0e0bebe65e1193c80c03ff5881ae2419-7: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0002 rid=0x01459743 foffs=0x00000300 expected_fid=0x0000) (MessageStoreImpl.cpp:717)


package mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_108fails_rhel4.7_x86_64_090105.tar.bz2
  qpid_test_transaction_integrity/qpid_test_transaction_integrity.log line 6629
  qpid_test_transaction_integrity/qpidd_txtest.transcript.log line 34217
  qpid_test_transaction_integrity/qpidd_journal_b0086.tar.bz2 ... qpidd_journal_b0095.tar.bz2

=-= [2nd occurence] =-= RHEL 5.2 i386

2008-dec-22 17:38:01 critical Broker start-up failed: Queue ca543c04e5e2ae6dbda5f-3: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0002 rid=0x00f8b5ef foffs=0x00000200 expected_fid=0x0005) (MessageStoreImpl.cpp:721)

package mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_108fails_rhel5.2_i386_090105.tar.bz2
  qpid_test_transaction_integrity/qpid_test_transaction_integrity.log line 6750
  qpid_test_transaction_integrity/qpidd_txtest.transcript.log line 37876
  qpid_test_transaction_integrity/qpidd_journal_b0092.tar.bz2 ... qpidd_journal_b0101.tar.bz2

Comment 5 Frantisek Reznicek 2009-02-02 09:18:15 UTC
One more occurence of JERR_JCNTL_OWIMISMATCH on RHEL4.7 i386 using packages 
rhm-0.4.3075-3.el4 qpidd-0.4.738274-1.el4:

Package uploaded to:
mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_0004_fails_rhel47_090202.tar.bz2

What happened? At run (B) 20/75 qpidd detected bad alighment, then txtest --check failed.
In every next run qpidd does not start because of JERR_JCNTL_OWIMISMATCH.
  run (B) 20/75 1st failure of txtest --check
  run (B) 21/75 2st failure of qpidd - JERR_JCNTL_OWIMISMATCH

Rough log in
qpid_test_transaction_integrity.0004/qpid_test_transaction_integrity.0004.log:2816,2847

Detailed log here:
Bad alighment detected before txtest --check (run(B) 20/75 @step7])
  qpid_test_transaction_integrity.0004/qpidd_txtest.transcript.log:13082
Then the txtest --check fails (run(B) 20/75 @step9]) with missing and extra IDs in journal
  qpid_test_transaction_integrity.0004/qpidd_txtest.transcript.log:13128, 13331
Any next qpidd start-up fails with JERR_JCNTL_OWIMISMATCH
  qpid_test_transaction_integrity.0004/qpidd_txtest.transcript.log:14279, 14309, ...

  Detailed failure ID:
  critical Broker start-up failed: Queue 73d192bf1bd9771f78b6f42f77746e142094cbf61a8c2cac1-4: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0002 rid=0x0023c1e1 foffs=0x00000400 expected_fid=0x0007) (MessageStoreImpl.cpp:717)

Corresponding journals included:
qpidd_journal_b0017.tar.bz2
qpidd_journal_b0018.tar.bz2
qpidd_journal_b0019.tar.bz2 <- last passing one
qpidd_journal_b0020.tar.bz2 <- 1st failing txtest
qpidd_journal_b0021.tar.bz2 <- 2nd failing qpidd 'JERR_JCNTL_OWIMISMATCH'
...

Comment 6 Kim van der Riet 2009-02-04 21:21:29 UTC
This looks like another outcome for the bug fixed in Bug 466533. Every case of this error is preceded by the prior test correcting a misalignment in the same journal but one file prior to that where the JERR_JCNTL_OWIMISMATCH error occurred. In all cases, the misalignment was in the last sblock in the file, so that the filler(s) included offset 0x180180 (the last dblock in the file).

In the case of the log from comment #5 above, there were additional failures after these fillers were first written. Here is a summary:

       081202: part B 698/700: Fillers written to 0x180100, 0x180180 in fid 2
               part B 699/700: JERR_JCNTL_OWIMISMATCH in fid 3

       081215: part B  34/500: Filler written to 0x180180 in fid 0
               part B  35/500: JERR_JCNTL_OWIMISMATCH in fid 1

x86_64.090105: part B  85/150: Filler written to 0x180180 in fid 1
               part B  86/150: JERR_JCNTL_OWIMISMATCH in fid 2

  i386.090105: part B  91/150: Filler written to 0x180180 in fid 1
               part B  92/150: JERR_JCNTL_OWIMISMATCH in fid 2

       090202: part B  19/75: Fillers written to 0x180100, 0x180180 in fid 1
               part B  20/75: txtest fails with both missing and extra ids
               part B  21/75: JERR_JCNTL_OWIMISMATCH in fid 2

Bug 466533 exhibited itself as JERR_RMGR_UNKNOWNMAGIC 0x00000000 for all part A tests, and for part B tests as JERR_RMGR_UNKNOWNMAGIC with a non-zero magic where the first record in the following file was a continuation of a previous record. However, if the first record in the following file happens to be a known magic (ie the previous record happens to align correctly), then I suspect this bug is the outcome.

Since bug 466533 was fixed in r.3088, and the last occurrence of this bug was in r.3075, I suggest that we retest for this. If no further occurrence is found, this should be closed as a duplicate of bug 466533.

Added NEEDINFO flag.

Comment 7 Frantisek Reznicek 2009-02-09 14:42:24 UTC
Unfortunately two new occurrences of JERR_JCNTL_OWIMISMATCH were detected on last packages (qpidd-0.4.741135-1.el5, rhm-0.4.3108-1.el5):

It is interesting indication that this issue is triggered more frequently on i386/i686 architecture.

The data were uploaded to:
mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity.rhel4753_JERR_JCNTL_OWIMISMATCHes.090209.tar.bz2


case a] nec-em8.rhts.bos.redhat.com RHEL 4.7 i386
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
directory JERR_JCNTL_OWIMISMATCH/qpid_test_transaction_integrity.0002/*
rough log     .../qpid_test_transaction_integrity.0002.log
detailed log  .../qpidd_txtest.transcript.log
corresponding journals
              .../qpidd_journal_b0015.tar.bz2
              .../qpidd_journal_b0016.tar.bz2
              .../qpidd_journal_b0017.tar.bz2 <- last passing
              .../qpidd_journal_b0018.tar.bz2 <- 1st failing
              .../qpidd_journal_b0019.tar.bz2 <- 2nd failing
              .../qpidd_journal_b0020.tar.bz2
              .../qpidd_journal_b0021.tar.bz2

The JERR_JCNTL_OWIMISMATCH occurred in part B of thest where journal is growing and prevented correct qpidd journal recovery start-up (step 8)

Detailed failure ID:
critical Broker start-up failed: Queue 859474c0bfb76b80bec3ba2b2-5: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x654d4852 fid=0x0003 rid=0x0022d1ee foffs=0x00000280 expected_fid=0x0006) (MessageStoreImpl.cpp:728)


case b] dell-pe2850-01.rhts.bos.redhat.com RHEL 5.3 i386
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
directory JERR_JCNTL_OWIMISMATCH/qpid_test_transaction_integrity.0006/*
rough log     .../qpid_test_transaction_integrity.0006.log
detailed log  .../qpidd_txtest.transcript.log
corresponding journals
              .../qpidd_journal_b0060.tar.bz2
              .../qpidd_journal_b0061.tar.bz2
              .../qpidd_journal_b0062.tar.bz2 <- last passing
              .../qpidd_journal_b0063.tar.bz2 <- 1st failing
              .../qpidd_journal_b0064.tar.bz2 <- 2nd failing
              .../qpidd_journal_b0065.tar.bz2

The JERR_JCNTL_OWIMISMATCH occurred in part B of thest where journal is growing and prevented correct qpidd journal recovery start-up (step 8)

Detailed failure ID:
critical Broker start-up failed: Queue b6ef2d177b5c552c9769bc6c04075f8094a-1: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0005 rid=0x0117c2ac foffs=0x00000200 expected_fid=0x0003) (MessageStoreImpl.cpp:732)

There was txtest crash detected! see backtrace at the bottom of .../qpid_test_transaction_integrity.0006.log .

Comment 8 Kim van der Riet 2009-02-10 16:01:47 UTC
Closer examination reveals an undetected logic error in recovery from the test _prior_ to that where the JERR_JCNTL_OWIMISMATCH occurred. In the cases where a record overflows onto the next file, the following file's header was written but the continuation portion of the record did not, then this bug is triggered.

Recover was correctly fixing the corrupted record by writing filler records, but it was not correcting the file header analysis for the first and last file ids, and was starting the new write activity in the file _following_ the one where the failed continuation record should have been.

Fixed in r.3114.

QA: Check as before using txtest soak test.

Comment 9 Kim van der Riet 2009-02-10 16:24:26 UTC
Small logic correction made to this fix in r.3115.

Comment 10 Frantisek Reznicek 2009-02-16 12:44:11 UTC
Re-tested long-term on RHEL 4.7 / 5.3 i386 / x86_64 on packages:
 qpidd-0.4.743861-1.el5,  rhm-0.4.3116-1.el5 and proved that issue has been fixed.

->VERIFIED

Comment 11 Frantisek Reznicek 2009-02-25 10:16:41 UTC
Unfortunately new occurence of JERR_JCNTL_OWIMISMATCH detected on packages
qpidd-0.4.744917-1.el4,rhm-0.4.3116-3.el4 on RHEL 4.7.

The test detail are attached to RHTS test qpid_test_transaction_integrity:
  https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=7027532

After run(B) 48/75 qpidd recovery, txtest highlights missing and extra messages in journal.
Then in next run(B) 49/75 txtest and qpidd start up well but after kill, qpidd doesn't start up/recover from journal anymore.

rough log
  https://rhts.redhat.com/testlogs/47853/161762/1352905/TESTOUT.log
fine log
  https://rhts.redhat.com/testlogs/47853/161762/1352905/qpidd_txtest.transcript.log
corresponding journals
  https://rhts.redhat.com/testlogs/47853/161762/1352905/qpidd_journal_b0045-0048.tar.bz2

->ASSIGNED

Comment 12 Kim van der Riet 2009-02-25 14:36:31 UTC
This bug is caused by a failure to check journal alignment when the decode in the last record in the last file fails. In reality, the previous test (test 47, which passed) failed to correct the partly written record, and this caused later tests which did not overwrite this record to fail. The extra/missing records of test 48 occurred because the journal found the misaligned record left from 47 and corrected it, but in so doing ended journal recovery. The OWIMISMATCH follows from this error.

Fixed in r.3134

QA (sigh) More soak tests...

Comment 14 Frantisek Reznicek 2009-03-06 08:39:12 UTC
The issue has been fixed, validated using long term journal tests on RHEL 4.7 / 5.2 / 5.3 i386 / x86_64 on packages: qpidd-0.4.750054-1.el5 ,rhm-0.4.3138-2.el5.

->VERIFIED

Comment 16 errata-xmlrpc 2009-04-21 16:16:51 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html