Bug 474366
Summary: | qpidd+store jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH on journal recovery | ||
---|---|---|---|
Product: | Red Hat Enterprise MRG | Reporter: | Frantisek Reznicek <freznice> |
Component: | qpid-cpp | Assignee: | Kim van der Riet <kim.vdriet> |
Status: | CLOSED ERRATA | QA Contact: | Frantisek Reznicek <freznice> |
Severity: | high | Docs Contact: | |
Priority: | urgent | ||
Version: | 1.1 | CC: | esammons, gsim |
Target Milestone: | 1.1.1 | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-04-21 16:16:51 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Frantisek Reznicek
2008-12-03 15:07:38 UTC
Still not detected on latest packages -> lowering priority (might be solved after rev. 2804). This needs to be carefully analysed, the journal artifacts show odd write behavior, but there is no time to get to the bottom of this before the 1.1 release deadline. Only one known occurrence. Pushing to 1.1.1. The bug observed again on an RHEL 5.2 x86_64 using MRG 1.1 RC packages (qpidd-0.4.725652-2.el5, rhm-0.4.2970-1.el5) I performed very extreme almost 2-days run of qpid_test_transaction_integrity where this issue is observed: qpidd+store refused to recover journal with JERR_JCNTL_OWIMISMATCH message: 2008-dec-14 01:49:35 critical Broker start-up failed: Queue 686f3364cdce3160f2ddfde76d79-7: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0003 rid=0x1387cc58 foffs=0x00000200 expected_fid=0x0004) (MessageStoreImpl.cpp:721) Observed at run(B) 699/700 and 700/700 Once this failure is encountered then qpidd cannot start with that journal! Data stored in: mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_fails_rhel5.2_x86_64_081215.tar.bz2 Packages (details in qpid_test_transaction_integrity/qpid_test_transaction_integrity.log:31-54): qpidd-0.4.725652-2.el5, rhm-0.4.2970-1.el5 look at: qpid_test_transaction_integrity/qpid_test_transaction_integrity.log:37138-37193 qpid_test_transaction_integrity/qpidd_txtest.transcript.log:222781,222819,222855,222891,222927 corresponding journals: qpid_test_transaction_integrity/qpidd_journal_b0699.tar.bz2 qpid_test_transaction_integrity/qpidd_journal_b0700.tar.bz2 Another two occurences of JERR_JCNTL_OWIMISMATCH issue triggered by journal tests. In both cases happen during transaction integrity test - part B when qpidd journal is not cleared in between runs. Once this issue happens then qpidd will not properly start up. (unless journal is deleted) =-= [1st occurence] =-= RHEL 4.7 x86_64 2008-dec-22 16:30:48 critical Broker start-up failed: Queue 4b44b0e0bebe65e1193c80c03ff5881ae2419-7: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0002 rid=0x01459743 foffs=0x00000300 expected_fid=0x0000) (MessageStoreImpl.cpp:717) package mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_108fails_rhel4.7_x86_64_090105.tar.bz2 qpid_test_transaction_integrity/qpid_test_transaction_integrity.log line 6629 qpid_test_transaction_integrity/qpidd_txtest.transcript.log line 34217 qpid_test_transaction_integrity/qpidd_journal_b0086.tar.bz2 ... qpidd_journal_b0095.tar.bz2 =-= [2nd occurence] =-= RHEL 5.2 i386 2008-dec-22 17:38:01 critical Broker start-up failed: Queue ca543c04e5e2ae6dbda5f-3: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0002 rid=0x00f8b5ef foffs=0x00000200 expected_fid=0x0005) (MessageStoreImpl.cpp:721) package mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_108fails_rhel5.2_i386_090105.tar.bz2 qpid_test_transaction_integrity/qpid_test_transaction_integrity.log line 6750 qpid_test_transaction_integrity/qpidd_txtest.transcript.log line 37876 qpid_test_transaction_integrity/qpidd_journal_b0092.tar.bz2 ... qpidd_journal_b0101.tar.bz2 One more occurence of JERR_JCNTL_OWIMISMATCH on RHEL4.7 i386 using packages rhm-0.4.3075-3.el4 qpidd-0.4.738274-1.el4: Package uploaded to: mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity_0004_fails_rhel47_090202.tar.bz2 What happened? At run (B) 20/75 qpidd detected bad alighment, then txtest --check failed. In every next run qpidd does not start because of JERR_JCNTL_OWIMISMATCH. run (B) 20/75 1st failure of txtest --check run (B) 21/75 2st failure of qpidd - JERR_JCNTL_OWIMISMATCH Rough log in qpid_test_transaction_integrity.0004/qpid_test_transaction_integrity.0004.log:2816,2847 Detailed log here: Bad alighment detected before txtest --check (run(B) 20/75 @step7]) qpid_test_transaction_integrity.0004/qpidd_txtest.transcript.log:13082 Then the txtest --check fails (run(B) 20/75 @step9]) with missing and extra IDs in journal qpid_test_transaction_integrity.0004/qpidd_txtest.transcript.log:13128, 13331 Any next qpidd start-up fails with JERR_JCNTL_OWIMISMATCH qpid_test_transaction_integrity.0004/qpidd_txtest.transcript.log:14279, 14309, ... Detailed failure ID: critical Broker start-up failed: Queue 73d192bf1bd9771f78b6f42f77746e142094cbf61a8c2cac1-4: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0002 rid=0x0023c1e1 foffs=0x00000400 expected_fid=0x0007) (MessageStoreImpl.cpp:717) Corresponding journals included: qpidd_journal_b0017.tar.bz2 qpidd_journal_b0018.tar.bz2 qpidd_journal_b0019.tar.bz2 <- last passing one qpidd_journal_b0020.tar.bz2 <- 1st failing txtest qpidd_journal_b0021.tar.bz2 <- 2nd failing qpidd 'JERR_JCNTL_OWIMISMATCH' ... This looks like another outcome for the bug fixed in Bug 466533. Every case of this error is preceded by the prior test correcting a misalignment in the same journal but one file prior to that where the JERR_JCNTL_OWIMISMATCH error occurred. In all cases, the misalignment was in the last sblock in the file, so that the filler(s) included offset 0x180180 (the last dblock in the file). In the case of the log from comment #5 above, there were additional failures after these fillers were first written. Here is a summary: 081202: part B 698/700: Fillers written to 0x180100, 0x180180 in fid 2 part B 699/700: JERR_JCNTL_OWIMISMATCH in fid 3 081215: part B 34/500: Filler written to 0x180180 in fid 0 part B 35/500: JERR_JCNTL_OWIMISMATCH in fid 1 x86_64.090105: part B 85/150: Filler written to 0x180180 in fid 1 part B 86/150: JERR_JCNTL_OWIMISMATCH in fid 2 i386.090105: part B 91/150: Filler written to 0x180180 in fid 1 part B 92/150: JERR_JCNTL_OWIMISMATCH in fid 2 090202: part B 19/75: Fillers written to 0x180100, 0x180180 in fid 1 part B 20/75: txtest fails with both missing and extra ids part B 21/75: JERR_JCNTL_OWIMISMATCH in fid 2 Bug 466533 exhibited itself as JERR_RMGR_UNKNOWNMAGIC 0x00000000 for all part A tests, and for part B tests as JERR_RMGR_UNKNOWNMAGIC with a non-zero magic where the first record in the following file was a continuation of a previous record. However, if the first record in the following file happens to be a known magic (ie the previous record happens to align correctly), then I suspect this bug is the outcome. Since bug 466533 was fixed in r.3088, and the last occurrence of this bug was in r.3075, I suggest that we retest for this. If no further occurrence is found, this should be closed as a duplicate of bug 466533. Added NEEDINFO flag. Unfortunately two new occurrences of JERR_JCNTL_OWIMISMATCH were detected on last packages (qpidd-0.4.741135-1.el5, rhm-0.4.3108-1.el5): It is interesting indication that this issue is triggered more frequently on i386/i686 architecture. The data were uploaded to: mrg3.lab.bos.redhat.com:/root/qpid_test_transaction_integrity.rhel4753_JERR_JCNTL_OWIMISMATCHes.090209.tar.bz2 case a] nec-em8.rhts.bos.redhat.com RHEL 4.7 i386 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ directory JERR_JCNTL_OWIMISMATCH/qpid_test_transaction_integrity.0002/* rough log .../qpid_test_transaction_integrity.0002.log detailed log .../qpidd_txtest.transcript.log corresponding journals .../qpidd_journal_b0015.tar.bz2 .../qpidd_journal_b0016.tar.bz2 .../qpidd_journal_b0017.tar.bz2 <- last passing .../qpidd_journal_b0018.tar.bz2 <- 1st failing .../qpidd_journal_b0019.tar.bz2 <- 2nd failing .../qpidd_journal_b0020.tar.bz2 .../qpidd_journal_b0021.tar.bz2 The JERR_JCNTL_OWIMISMATCH occurred in part B of thest where journal is growing and prevented correct qpidd journal recovery start-up (step 8) Detailed failure ID: critical Broker start-up failed: Queue 859474c0bfb76b80bec3ba2b2-5: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x654d4852 fid=0x0003 rid=0x0022d1ee foffs=0x00000280 expected_fid=0x0006) (MessageStoreImpl.cpp:728) case b] dell-pe2850-01.rhts.bos.redhat.com RHEL 5.3 i386 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ directory JERR_JCNTL_OWIMISMATCH/qpid_test_transaction_integrity.0006/* rough log .../qpid_test_transaction_integrity.0006.log detailed log .../qpidd_txtest.transcript.log corresponding journals .../qpidd_journal_b0060.tar.bz2 .../qpidd_journal_b0061.tar.bz2 .../qpidd_journal_b0062.tar.bz2 <- last passing .../qpidd_journal_b0063.tar.bz2 <- 1st failing .../qpidd_journal_b0064.tar.bz2 <- 2nd failing .../qpidd_journal_b0065.tar.bz2 The JERR_JCNTL_OWIMISMATCH occurred in part B of thest where journal is growing and prevented correct qpidd journal recovery start-up (step 8) Detailed failure ID: critical Broker start-up failed: Queue b6ef2d177b5c552c9769bc6c04075f8094a-1: recoverQueues() failed: jexception 0x0206 jcntl::check_owi() threw JERR_JCNTL_OWIMISMATCH: Overwrite Indecator (OWI) change found in unexpected location. (Magic=0x644d4852 fid=0x0005 rid=0x0117c2ac foffs=0x00000200 expected_fid=0x0003) (MessageStoreImpl.cpp:732) There was txtest crash detected! see backtrace at the bottom of .../qpid_test_transaction_integrity.0006.log . Closer examination reveals an undetected logic error in recovery from the test _prior_ to that where the JERR_JCNTL_OWIMISMATCH occurred. In the cases where a record overflows onto the next file, the following file's header was written but the continuation portion of the record did not, then this bug is triggered. Recover was correctly fixing the corrupted record by writing filler records, but it was not correcting the file header analysis for the first and last file ids, and was starting the new write activity in the file _following_ the one where the failed continuation record should have been. Fixed in r.3114. QA: Check as before using txtest soak test. Small logic correction made to this fix in r.3115. Re-tested long-term on RHEL 4.7 / 5.3 i386 / x86_64 on packages: qpidd-0.4.743861-1.el5, rhm-0.4.3116-1.el5 and proved that issue has been fixed. ->VERIFIED Unfortunately new occurence of JERR_JCNTL_OWIMISMATCH detected on packages qpidd-0.4.744917-1.el4,rhm-0.4.3116-3.el4 on RHEL 4.7. The test detail are attached to RHTS test qpid_test_transaction_integrity: https://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=7027532 After run(B) 48/75 qpidd recovery, txtest highlights missing and extra messages in journal. Then in next run(B) 49/75 txtest and qpidd start up well but after kill, qpidd doesn't start up/recover from journal anymore. rough log https://rhts.redhat.com/testlogs/47853/161762/1352905/TESTOUT.log fine log https://rhts.redhat.com/testlogs/47853/161762/1352905/qpidd_txtest.transcript.log corresponding journals https://rhts.redhat.com/testlogs/47853/161762/1352905/qpidd_journal_b0045-0048.tar.bz2 ->ASSIGNED This bug is caused by a failure to check journal alignment when the decode in the last record in the last file fails. In reality, the previous test (test 47, which passed) failed to correct the partly written record, and this caused later tests which did not overwrite this record to fail. The extra/missing records of test 48 occurred because the journal found the misaligned record left from 47 and corrected it, but in so doing ended journal recovery. The OWIMISMATCH follows from this error. Fixed in r.3134 QA (sigh) More soak tests... The issue has been fixed, validated using long term journal tests on RHEL 4.7 / 5.2 / 5.3 i386 / x86_64 on packages: qpidd-0.4.750054-1.el5 ,rhm-0.4.3138-2.el5. ->VERIFIED An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0434.html |