Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 639994

Summary: Clustered qpidd broker crashing during shutdown in ObjectId(...) -> memcpy ()
Product: Red Hat Enterprise MRG Reporter: Tomas Rusnak <trusnak>
Component: qpid-cpp    Assignee: mick <mgoulish>
Status: CLOSED ERRATA QA Contact: Frantisek Reznicek <freznice>
Severity: high Docs Contact:
Priority: urgent    
Version: Development    CC: esammons, freznice, gsim, iboverma, jneedle, tross
Target Milestone: 1.3.2-RC2   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: 1.3.0-GA Doc Type: Bug Fix
Doc Text:
There was a race condition in the cluster code that allowed a cluster connection to be deleted before it was fully initialized (before its codec was constructed). This could occasionally cause a clustered broker to crash if it was shut down soon after being added to its cluster. By adding a new broker to a cluster under load and then shutting it down, the crash could be observed about 2% of the time. Alan prevented the race by delaying the addition of the connection to the Cluster's map until the connection's codec is constructed. I was then able to run a loop that added and then shut down a new broker in a cluster under load 1000 times with no crash.
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-02-15 12:11:05 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 501305, 509796    
Attachments:
Description                                  Flags
core dump from qpidd                         none
reproducing script for 2nd node restart      none

Description Tomas Rusnak 2010-10-04 14:03:22 UTC
Created attachment 451398 [details]
core dump from qpidd

Description of problem:
Clustered qpidd broker crashes during shutdown in ObjectId(...) -> memcpy (). See the core dump in the attachment.

Version-Release number of selected component (if applicable):
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-mrg-debuginfo-0.7.946106-17.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-java-common-0.7.946106-10.el5
qpid-cpp-client-devel-0.7.946106-17.el5

How reproducible:
low (2%)

Steps to Reproduce:
1. Set up a cluster with one node.
2. Start perftest.
3. Add another node to the cluster and keep restarting it in a loop (see attachment).
  
Actual results:
qpidd crashing

Expected results:
qpidd not crashing

Comment 1 Tomas Rusnak 2010-10-04 14:04:52 UTC
Created attachment 451400 [details]
reproducing script for 2nd node restart

Comment 2 Tomas Rusnak 2010-10-04 14:12:03 UTC
# stat core.12762
  File: `core.12762'
  Size: 344858624  Blocks: 14840      IO Block: 4096   regular file
Device: fd00h/64768d Inode: 3112981     Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-10-04 10:06:21.000000000 -0400
Modify: 2010-10-04 10:06:21.000000000 -0400
Change: 2010-10-04 10:06:21.000000000 -0400

qpidd2.log:
2010-10-04 10:06:18 info cluster(10.16.64.195:12762 CATCHUP) member update:
10.16.64.195:11320(member) 10.16.64.195:12762(member) frameSeq=25849970
2010-10-04 10:06:18 notice cluster(10.16.64.195:12762 READY) caught up.
2010-10-04 10:06:21 notice Shut down
2010-10-04 10:06:26 info Loaded Module: /usr/lib/qpid/daemon/xml.so
2010-10-04 10:06:26 info Loaded Module: /usr/lib/qpid/daemon/msgstore.so

The segfault happens at the exact time of the shutdown notice in the log file. It is
reproducible on RHEL5/i386 only.

Comment 4 mick 2011-01-11 21:25:45 UTC
I think this one is gone.

On latest trunk -- rev 1057786 -- I have seen no failure after 300 repetitions of the second broker being added and then shut down (while perftest was running, pointed at the first broker). The same original broker is still running -- no core files produced.

If the bug was being seen about 2% of the time, then there is roughly 1 chance in 430 (0.98^300 ≈ 0.0023) of this clean run happening accidentally.

Comment 5 Gordon Sim 2011-01-12 10:06:48 UTC
Possibly fixed by http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commit;h=af12fa7cae14821a12b0a3199ab59ca146cac7a7; marking modified for verification.

Comment 6 mick 2011-01-22 15:04:26 UTC
It looks like this bug is absent from the current build.

Re-tested on an i386 machine running RHEL5 (the architecture mentioned above).

Using latest dash-17 build, installed RPMs.

Broker configured with following modules:

  cluster
  msgstore
  xml

While running a perftest (sending 2 billion messages) against the original cluster broker, I used the QE script attached above to start and stop another broker in the same cluster 1000 times.

Result: the secondary broker always started and shut down normally, and the original broker was still running at the end of the test. No core files were produced.

If the original odds of failure were about 2%, then there is basically no chance that this result happened accidentally (about 1.7 chances in 1 billion, i.e. 0.98^1000).


Comment 7 mick 2011-01-24 19:28:27 UTC
In the above comment, when I said "dash-17" I meant "dash-27". Off-by-ten error.

Comment 8 mick 2011-01-25 10:43:43 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
There was a race condition in the cluster code that allowed
a cluster connection to be deleted before it was fully initialized (before its codec was constructed).

This could occasionally cause a clustered broker to crash if it was shut down soon after being added to its cluster.

By adding a new broker to a cluster under load and then shutting it down,
the crash could be observed about 2% of the time.

Alan prevented the race by delaying the addition of the connection to the Cluster's map until the connection's codec is constructed.

I was then able to run a loop that added and then shut down a new broker in a cluster under load 1000 times with no crash.
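As a rough illustration of the fix described in this note, the idea is to keep a connection out of the Cluster's map until its codec exists, so a concurrent shutdown that walks the map can never see a half-initialized connection. This is a hypothetical sketch with illustrative names (Cluster, addConnection, Codec), not the actual qpid source:

```cpp
// Hypothetical sketch of the fix, not the real qpid-cpp code:
// publish a connection into the cluster's map only once its codec
// has been constructed, so shutdown never deletes a half-built one.
#include <map>
#include <memory>
#include <string>

struct Codec { std::string name; };

struct Connection {
    std::unique_ptr<Codec> codec;        // built during initialization
    bool ready() const { return codec != nullptr; }
    void initialize() { codec = std::make_unique<Codec>(Codec{"amqp"}); }
};

struct Cluster {
    std::map<int, std::shared_ptr<Connection>> connections;

    // Before the fix: the connection was inserted immediately, so a
    // concurrent shutdown could erase (and delete) it mid-initialization.
    // After the fix: insert only once the codec is constructed.
    bool addConnection(int id, const std::shared_ptr<Connection>& c) {
        if (!c->ready()) return false;   // not yet safe to publish
        connections[id] = c;
        return true;
    }
};
```

Under this scheme, the shutdown path that iterates the map can only ever encounter fully initialized connections, which is the invariant the real fix restores.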

Comment 9 Frantisek Reznicek 2011-01-26 09:30:04 UTC
The defect is under heavy test; so far all tests indicate that the issue is gone.

Unfortunately the scenario is the same as in bug 509796, which does not seem to be fixed; this makes me wonder whether I am testing thoroughly enough.


Update soon.

Comment 10 Frantisek Reznicek 2011-01-27 08:35:41 UTC
The issue has been fixed, as extremely long runs on multiple machines no longer trigger it. Tested on RHEL 5.6 i386 / x86_64 with packages:
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-rdma-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-rdma-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5


-> VERIFIED

Comment 11 errata-xmlrpc 2011-02-15 12:11:05 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html