Created attachment 451398 [details]
core dump from qpidd

Description of problem:
Clustered qpidd broker crashes during shutdown in ObjectId(...) -> memcpy(). See the core dump in the attachment.

Version-Release number of selected component (if applicable):
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-mrg-debuginfo-0.7.946106-17.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
qpid-cpp-client-0.7.946106-17.el5
python-qpid-0.7.946106-14.el5
qpid-tools-0.7.946106-11.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-java-common-0.7.946106-10.el5
qpid-cpp-client-devel-0.7.946106-17.el5

How reproducible:
low (about 2%)

Steps to Reproduce:
1. Set up a cluster with one node.
2. Start perftest against it.
3. Add a second node to the cluster and keep restarting it in a loop (see attached script).

Actual results:
qpidd crashes.

Expected results:
qpidd does not crash.
Created attachment 451400 [details]
reproducing script for 2nd node restart
# stat core.12762
  File: `core.12762'
  Size: 344858624   Blocks: 14840   IO Block: 4096   regular file
Device: fd00h/64768d   Inode: 3112981   Links: 1
Access: (0600/-rw-------)   Uid: ( 0/ root)   Gid: ( 0/ root)
Access: 2010-10-04 10:06:21.000000000 -0400
Modify: 2010-10-04 10:06:21.000000000 -0400
Change: 2010-10-04 10:06:21.000000000 -0400

qpidd2.log:
2010-10-04 10:06:18 info cluster(10.16.64.195:12762 CATCHUP) member update: 10.16.64.195:11320(member) 10.16.64.195:12762(member) frameSeq=25849970
2010-10-04 10:06:18 notice cluster(10.16.64.195:12762 READY) caught up.
2010-10-04 10:06:21 notice Shut down
2010-10-04 10:06:26 info Loaded Module: /usr/lib/qpid/daemon/xml.so
2010-10-04 10:06:26 info Loaded Module: /usr/lib/qpid/daemon/msgstore.so

The segfault happens at the exact time of the "Shut down" notice in the log file. It is reproducible on RHEL5/i386 only.
I think this one is gone. On latest trunk -- rev 1057786 -- I have seen no failure after 300 repetitions of the second broker being added and then shut down (while perftest was running, pointed at the first broker). The same original broker is still running, and no core files were produced. If the bug was being seen about 2% of the time, then there is roughly 1 chance in 500 of this no-show happening accidentally.
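As a quick sanity check of that estimate (assuming an independent ~2% failure probability per iteration):

  P(no crash in 300 runs) = (1 - 0.02)^300 ≈ e^(-6.06) ≈ 2.3 x 10^-3

i.e. roughly 1 in 430, the same order of magnitude as the "1 chance in 500" quoted above.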
Possibly fixed by http://mrg1.lab.bos.redhat.com/git/?p=qpid.git;a=commit;h=af12fa7cae14821a12b0a3199ab59ca146cac7a7; marking modified for verification.
It looks like this bug is absent from the current build.

Re-tested on an i386 machine running RHEL5 (the arch mentioned above), using the latest dash-17 build, installed from RPMs. The broker was configured with the following modules: cluster, msgstore, xml.

While running a perftest (sending 2 billion messages) against the original cluster broker, I used the QE script attached above to start and stop another broker in the same cluster 1000 times.

Result: the secondary broker always started and shut down normally, and the original broker was still running at the end of the test. No core files were produced. If the original odds of failure were about 2%, then there is essentially no chance that this result happened accidentally (about 1.7 chances in 1 billion).
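The same back-of-the-envelope calculation for 1000 iterations, again assuming an independent ~2% failure probability per run:

  P(no crash in 1000 runs) = (1 - 0.02)^1000 ≈ e^(-20.2) ≈ 1.7 x 10^-9

which matches the "1.7 chances in 1 billion" figure above.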
In the above comment, when I said "dash-17" I meant "dash-27". Off-by-ten error.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
There was a race condition in the cluster code that allowed a cluster connection to be deleted before it was fully initialized (before its codec was constructed). This could occasionally cause a clustered broker to crash if it was shut down soon after being added to its cluster. By adding a new broker to a cluster under load and then shutting it down, the crash could be observed about 2% of the time. Alan prevented the race by delaying the addition of the connection to the Cluster's map until the connection's codec is constructed. I was then able to run a loop that added a new broker to a cluster under load and then shut it down 1000 times with no crash.
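As an illustration only, here is a minimal C++ sketch of the general publish-before-initialization race described above and of the ordering fix. The names (Cluster, Connection, Codec, addConnection*) are hypothetical placeholders and do not reflect the actual qpid cluster source.

// Hypothetical illustration of the race described above; these class
// names are placeholders, not the real qpid classes.
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Codec {};                              // frame encode/decode state

struct Connection {
    std::unique_ptr<Codec> codec;             // built after the Connection itself
    void initCodec() { codec.reset(new Codec); }
};

class Cluster {
    std::mutex lock;
    std::map<std::string, std::shared_ptr<Connection> > connections;
public:
    // Racy ordering: the connection is visible to other threads (e.g. a
    // shutdown path that erases and destroys map entries) before its
    // codec exists.
    void addConnectionRacy(const std::string& id,
                           const std::shared_ptr<Connection>& c) {
        { std::lock_guard<std::mutex> g(lock); connections[id] = c; }
        c->initCodec();                       // may run after shutdown has torn c down
    }

    // Ordering matching the fix described above: finish initialization,
    // then publish the connection in the map.
    void addConnectionSafe(const std::string& id,
                           const std::shared_ptr<Connection>& c) {
        c->initCodec();                       // codec exists first
        std::lock_guard<std::mutex> g(lock);
        connections[id] = c;                  // only now visible to shutdown
    }
};

The point of the fix is simply the ordering: nothing else in the system can see (and therefore delete) the connection until it is fully constructed.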
The defect is under heavy test; so far all tests indicate the issue is gone. Unfortunately the scenario is the same as bug 509796, which does not appear to be fixed, and that makes me wonder whether I am testing enough. Update soon.
The issue has been fixed: extremely long runs on multiple machines no longer trigger it.

Tested on RHEL 5.6 i386 / x86_64 with the following packages:
python-qpid-0.7.946106-15.el5
qpid-cpp-client-0.7.946106-27.el5
qpid-cpp-client-devel-0.7.946106-27.el5
qpid-cpp-client-devel-docs-0.7.946106-27.el5
qpid-cpp-client-rdma-0.7.946106-27.el5
qpid-cpp-client-ssl-0.7.946106-27.el5
qpid-cpp-mrg-debuginfo-0.7.946106-27.el5
qpid-cpp-server-0.7.946106-27.el5
qpid-cpp-server-cluster-0.7.946106-27.el5
qpid-cpp-server-devel-0.7.946106-27.el5
qpid-cpp-server-rdma-0.7.946106-27.el5
qpid-cpp-server-ssl-0.7.946106-27.el5
qpid-cpp-server-store-0.7.946106-27.el5
qpid-cpp-server-xml-0.7.946106-27.el5
qpid-dotnet-0.4.738274-2.el5
qpid-java-client-0.7.946106-14.el5
qpid-java-common-0.7.946106-14.el5
qpid-java-example-0.7.946106-14.el5
qpid-tools-0.7.946106-12.el5
rh-qpid-cpp-tests-0.7.946106-27.el5

-> VERIFIED
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0217.html