800028 – clustered qpidd aborts in qpid::broker::ConnectionHandler::handle() -> operator () -> __cxa_pure_virtual () -> terminate () ...

Bug 800028 - clustered qpidd aborts in qpid::broker::ConnectionHandler::handle() -> operator () -> __cxa_pure_virtual () -> terminate () ...

Summary: clustered qpidd aborts in qpid::broker::ConnectionHandler::handle() -> operat...

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	Development
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	urgent
Target Milestone:	2.1.2
Target Release:	---
Assignee:	Ken Giusti
QA Contact:	Petr Matousek
Docs Contact:
URL:
Whiteboard:
Depends On:	804001
Blocks:	806279
TreeView+	depends on / blocked

Reported:	2012-03-05 16:07 UTC by Frantisek Reznicek
Modified:	2015-11-16 01:14 UTC (History)
CC List:	4 users (show)
Fixed In Version:	qpid-cpp-mrg-0.14-14
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-12-07 17:43:22 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
The full backtrace and unit test code (10.63 KB, application/x-tbz) 2012-03-05 16:13 UTC, Frantisek Reznicek	no flags	Details
clustered brokers logs (42.82 KB, application/x-gnome-theme-package) 2012-03-08 14:47 UTC, Petr Matousek	no flags	Details
clustered brokers logs (23.56 KB, application/x-gnome-theme-package) 2012-03-16 17:04 UTC, Petr Matousek	no flags	Details
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Apache JIRA	QPID-3896	0	None	None	None	Never

Description Frantisek Reznicek 2012-03-05 16:07:38 UTC

Description of problem:

Custom unit test custom_py_units.queue.LLAPITests. \
                 test_queue_nonexclusive_timed_autodelete_w_msgs_on_sclose
triggers clustered broker abort[s] in qpid::broker::ConnectionHandler::handle() -> operator () -> __cxa_pure_virtual () -> terminate () ...

  Thread 1 (Thread 0x43b28940 (LWP 16777)):
  #0  0x000000312b230265 in raise () from /lib64/libc.so.6
  #1  0x000000312b231d10 in abort () from /lib64/libc.so.6
  #2  0x000000312fabed14 in __gnu_cxx::__verbose_terminate_handler() ()
     from /usr/lib64/libstdc++.so.6
  #3  0x000000312fabce16 in ?? () from /usr/lib64/libstdc++.so.6
  #4  0x000000312fabce43 in std::terminate() () from /usr/lib64/libstdc++.so.6
  #5  0x000000312fabd34f in __cxa_pure_virtual () from /usr/lib64/libstdc++.so.6
  #6  0x000000312e35cee1 in operator() (this=0x6bfb1e0, frame=...)
      at qpid/framing/Handler.h:42
  #7  qpid::broker::ConnectionHandler::handle (this=0x6bfb1e0, frame=...)
      at qpid/broker/ConnectionHandler.cpp:87
  #8  0x000000312e3520a8 in qpid::broker::Connection::received (this=0x6bfb000, 
      frame=...) at qpid/broker/Connection.cpp:159
  #9  0x00002b0543e56941 in qpid::cluster::Connection::deliveredFrame (
      this=0x6bf7850, f=...) at qpid/cluster/Connection.cpp:248
  #10 0x00002b0543e24ad9 in qpid::cluster::Cluster::processFrame (
      this=0x2aed1b0, e=..., l=<value optimized out>)
      at qpid/cluster/Cluster.cpp:548
  ...

Unfortunately there is no log file available atm from above aborts.

I believe broker ends with "pure virtual method called" indicated as:
  #4  0x000000312fabce43 in std::terminate() () from /usr/lib64/libstdc++.so.6
  #5  0x000000312fabd34f in __cxa_pure_virtual () 

This abort is fatal condition and should be fixed.
See unit test code attached below.

Version-Release number of selected component (if applicable):
  python-qpid-0.14-4.el5
  python-qpid-qmf-0.14-3.el5
  qpid-cpp-client-0.14-7.el5
  qpid-cpp-client-devel-0.14-7.el5
  qpid-cpp-client-devel-docs-0.14-7.el5
  qpid-cpp-client-rdma-0.14-7.el5
  qpid-cpp-client-ssl-0.14-7.el5
  qpid-cpp-mrg-debuginfo-0.14-7.el5
  qpid-cpp-server-0.14-7.el5
  qpid-cpp-server-cluster-0.14-7.el5
  qpid-cpp-server-devel-0.14-7.el5
  qpid-cpp-server-rdma-0.14-7.el5
  qpid-cpp-server-ssl-0.14-7.el5
  qpid-cpp-server-store-0.14-7.el5
  qpid-cpp-server-xml-0.14-7.el5
  qpid-java-client-0.14-3.el5
  qpid-java-common-0.14-3.el5
  qpid-java-example-0.14-3.el5
  qpid-qmf-0.14-3.el5
  qpid-qmf-debuginfo-0.14-3.el5
  qpid-qmf-devel-0.14-3.el5
  qpid-tests-0.14-1.el5
  qpid-tools-0.14-1.el5
  rh-qpid-cpp-tests-0.14-7.el5
  ruby-qpid-qmf-0.14-3.el5


How reproducible:
80%

Steps to Reproduce:
1. set-up clustered broker
2. run test_queue_nonexclusive_timed_autodelete_w_msgs_on_sclose unit
  

Actual results:
  Clustered broker[s] abort.

Expected results:
  Clustered broker[s] should survive the test.


Additional info:

  Aborts were detected on both RHEL5.7 and 5.8

Comment 1 Frantisek Reznicek 2012-03-05 16:13:49 UTC

Created attachment 567681 [details]
The full backtrace and unit test code

Attachment consist of:
- unit test code
  (test_queue_nonexclusive_timed_autodelete_w_msgs_on_sclose.py)
- full backtrace of two detected aborts (bz800028.txt)

Comment 5 Petr Matousek 2012-03-08 14:47:22 UTC

Created attachment 568653 [details]
clustered brokers logs

Reproduced on RHEL5.8 x86_64 on the latest 0.14-10 packages, attaching broker logs.

broker logs indicates that both the clustered brokers are aborted with following exception at the same time:
2012-03-08 09:31:21 error Unexpected exception: Task already exists with name DelayedAutoDeletion (qpid/cluster/ClusterTimer.cpp:71)

# stat -c %x%y%z core.3671 core.3672
2012-03-08 09:31:21.000000000 -05002012-03-08 09:31:21.000000000 -05002012-03-08 09:31:21.000000000 -0500
2012-03-08 09:31:21.000000000 -05002012-03-08 09:31:21.000000000 -05002012-03-08 09:31:21.000000000 -0500

Comment 6 Ken Giusti 2012-03-13 15:38:45 UTC

Root cause for this crash appears to be in ClusterTimer::add(), which throws an exception should two timers be created with the same name.

Upstream JIRA created, with a workaround patch that "fixes" the auto delete queue crash:

https://issues.apache.org/jira/browse/QPID-3896

Comment 8 Petr Matousek 2012-03-16 17:04:37 UTC

Created attachment 570656 [details]
clustered brokers logs

retested on RHEL6 i686/x86_64, broker crash is fixed, but running the test causes all cluster nodes except one (that one remain standing) to exit the cluster with the following cause:

critical Error delivering frames: Cluster timer wakeup non-existent task DelayedAutoDeletion:642922c3-6cb2-4c6b-b332-d2929ab95c34 (qpid/cluster/ClusterTimer.cpp:112)
  
Waiting for qpid-qmf rebuild for RHEL5 retest.

Comment 9 Petr Matousek 2012-03-16 17:29:12 UTC

The above mentioned issue is probably the root cause of this new crash, moving back to assigned.

-> ASSIGNED

Comment 10 Ken Giusti 2012-03-19 19:06:08 UTC

Updated the upstream with a new patch that should actually work in a cluster.

See https://issues.apache.org/jira/secure/attachment/12518923/qpid-3896.patch

Comment 11 Petr Matousek 2012-03-20 16:47:33 UTC

Retested on RHEL6 i686 on qpid-0.14-12 with the patch from Comment 10 applied, the issue seems to be  fixed. Test from Comment 1 is passing with clustered brokers, no broker failure detected.

Waiting for packages for overall retest.

Comment 12 Ken Giusti 2012-03-20 17:48:12 UTC

Fix posted upstream trunk:

http://svn.apache.org/viewvc?view=rev&rev=1303035

Comment 14 Petr Matousek 2012-03-26 09:16:04 UTC

This issue has been fixed for rhel5, verified on rhel5.8 i386 & x86_64 - qpid-cpp-mrg-0.14-14.el5.

Waiting for rhel6 packages for retest.

Comment 16 Petr Matousek 2012-04-03 09:00:21 UTC

Issue fixed, tested on rhel5.8 / 6.2 i/x on packages:
qpid-cpp-mrg-0.14-14.el5
qpid-cpp-0.14-14.el6_2

-> VERIFIED

Note You need to log in before you can comment on or make changes to this bug.