Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1063700

Summary: [linearstore] broker restart fails under stress test
Product: Red Hat Enterprise MRG Reporter: Pavel Moravec <pmoravec>
Component: qpid-cppAssignee: Kim van der Riet <kim.vdriet>
Status: CLOSED CURRENTRELEASE QA Contact: Zdenek Kraus <zkraus>
Severity: high Docs Contact:
Priority: high    
Version: 3.0CC: esammons, iboverma, jross, zkraus
Target Milestone: 3.0Keywords: Reproducer
Target Release: ---   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: qpid-cpp-0.22-36 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-01-21 12:55:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 709325    
Attachments:
Description Flags
reproducer script
none
/var/lib/qpidd from failing broker start
none
/var/lib/qpidd_bckp before starting the broker and /var/lib/qpidd after the broker startup failure none

Description Pavel Moravec 2014-02-11 09:56:38 UTC
Description of problem:
Running a stress test where many messages are purged/consumed from a queue at the same time (just to try returning many EPL together) while new messages are still coming, broker restart sometime fails with JERR_JREC_BADRECTAIL.


Version-Release number of selected component (if applicable):
qpid-cpp-server-linearstore-0.22-35.el6.x86_64


How reproducible:
100% within 15 minutes, but in non-deterministic scenario


Steps to Reproduce:
Run attached script


Actual results:
Script terminates after several broker restart attempts (usually within 5, sometimes much more), with error log:

2014-02-11 09:19:45 [Broker] critical Unexpected error: Daemon startup failed: Queue testQueue: recoverMessages() failed: jexception 0x0701 RecoveryManager::readNextRemainingRecord() threw JERR_JREC_BADRECTAIL: Invalid record tail. (Bad record tail:
  Magic: expected 0x9aacb3ae; found 0xae
  Serial: expected 0xdfd9ef00f95ec0e8; found 0x3bd597e6c9
  Record Id: expected 0x200422; found 0x0
  Checksum: expected 0xfa45392f; found 0x10) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004)


Expected results:
Script never terminates


Additional info:
Attaching also one /var/lib/qpidd directory where the broker cant restart

Comment 1 Pavel Moravec 2014-02-11 09:58:43 UTC
Created attachment 861729 [details]
reproducer script

Comment 2 Pavel Moravec 2014-02-11 10:16:34 UTC
Created attachment 861741 [details]
/var/lib/qpidd from failing broker start

Broker logs:

2014-02-11 10:13:46 [Broker] critical Unexpected error: Queue testQueue: recoverMessages() failed: jexception 0x0701 RecoveryManager::readNextRemainingRecord() threw JERR_JREC_BADRECTAIL: Invalid record tail. (Bad record tail:
  Serial: expected 0xb0b6a3e94bf87441; found 0x3bd597e6c9
  Record Id: expected 0x1978c; found 0x0
  Checksum: expected 0x42a237f9; found 0x10) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004)

Comment 3 Kim van der Riet 2014-02-11 18:12:42 UTC
If you have the store files from the failure, this would be most valuable. Presumably your script preserved the store before the recovery was attempted; replacing the qls store directory with the saved store and restarting the broker should make the error 100% reproducible.

If you have the store, please send it to me or attach it to this bug.

Comment 4 Pavel Moravec 2014-02-12 08:14:24 UTC
Created attachment 862147 [details]
/var/lib/qpidd_bckp before starting the broker and /var/lib/qpidd after the broker startup failure

See /var/lib/qpidd_bckp as backed up directory before invoking "service qpiddd start".

Broker logged meantime:

2014-02-12 08:06:07 [Broker] critical Unexpected error: Daemon startup failed: Queue testQueue: recoverMessages() failed: jexception 0x0701 RecoveryManager::readNextRemainingRecord() threw JERR_JREC_BADRECTAIL: Invalid record tail. (Bad record tail:
  Magic: expected 0x9aacb3ae; found 0x6
  Serial: expected 0x3e8b1a565b177139; found 0x3bd597e6c9
  Record Id: expected 0x1978b; found 0x0
  Checksum: expected 0x7f893667; found 0x10) (/builddir/build/BUILD/qpid-0.22/cpp/src/qpid/linearstore/MessageStoreImpl.cpp:1004)

Comment 5 Kim van der Riet 2014-03-05 13:56:45 UTC
Upstream bug at https://issues.apache.org/jira/browse/QPID-5603

Comment 6 Kim van der Riet 2014-03-05 14:00:15 UTC
The attached store journal contains a record in which the record tail is exactly divided by a file boundary from the rest of the record. This meets the conditions of the upstream bug in comment #5.

Comment 7 Kim van der Riet 2014-03-05 14:53:19 UTC
Fixed in r.1574513. See upstream bug for further comments.

Comment 8 Zdenek Kraus 2014-03-18 08:21:35 UTC
this issue was tested on RHEL 6.5 i686 & x86_64 with following packages:

perl-qpid-0.22-11.el6
python-qpid-0.22-12.el6
python-qpid-qmf-0.22-28.el6
qpid-cpp-client-0.22-36.el6
qpid-cpp-client-devel-0.22-36.el6
qpid-cpp-client-devel-docs-0.22-36.el6
qpid-cpp-debuginfo-0.22-36.el6
qpid-cpp-server-0.22-36.el6
qpid-cpp-server-devel-0.22-36.el6
qpid-cpp-server-ha-0.22-36.el6
qpid-cpp-server-linearstore-0.22-36.el6
qpid-cpp-server-xml-0.22-36.el6
qpid-java-client-0.22-6.el6
qpid-java-common-0.22-6.el6
qpid-java-example-0.22-6.el6
qpid-jca-0.22-2.el6
qpid-jca-xarecovery-0.22-2.el6
qpid-proton-c-0.6-1.el6
qpid-proton-c-devel-0.6-1.el6
qpid-proton-debuginfo-0.6-1.el6
qpid-qmf-0.22-28.el6
qpid-qmf-debuginfo-0.22-28.el6
qpid-snmpd-1.0.0-16.el6
qpid-snmpd-debuginfo-1.0.0-16.el6
qpid-tools-0.22-9.el6
ruby-qpid-qmf-0.22-28.el6


-> VERIFIED