506298 – Crash in Poller::wait()


Summary: Crash in Poller::wait()

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	1.1.1
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	1.1.6
Target Release:	---
Assignee:	Andrew Stitcher
QA Contact:	Jan Sarenik
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	506102 506498
Depends On:
Blocks:

Reported:	2009-06-16 16:14 UTC by Gordon Sim
Modified:	2009-07-14 17:31 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-07-14 17:31:54 UTC
Target Upstream Version:
Embargoed:

Attachments
Combination of fixes that solve this issue (8.25 KB, patch) 2009-06-23 02:54 UTC, Andrew Stitcher
reproducer lines (266 bytes, application/x-sh) 2009-07-03 13:18 UTC, Jan Sarenik

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2009:1153	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Messaging bug fixing update	2009-07-14 17:31:48 UTC

Description Gordon Sim 2009-06-16 16:14:48 UTC

Description of problem:

Core was generated by `/usr/sbin/qpidd --daemon --port=5672 --data-dir=/app/qgw_tt2/common/data/1 --pi'.
Program terminated with signal 11, Segmentation fault.
[New process 29927]
[New process 29928]
[New process 29926]
[New process 29924]
[New process 29923]
[New process 29922]
[New process 29921]
#0  0x0000003c1a008293 in pthread_mutex_lock () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003c1a008293 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00000036ad6c3920 in qpid::sys::Mutex::lock () from /usr/lib64/libqpidbroker.so.0
#2  0x00000036ad1768e2 in qpid::sys::Poller::wait () from /usr/lib64/libqpidcommon.so.0
#3  0x00000036ad177667 in qpid::sys::Poller::run () from /usr/lib64/libqpidcommon.so.0
#4  0x00000036ad16e64a in ?? () from /usr/lib64/libqpidcommon.so.0
#5  0x0000003c1a006367 in start_thread () from /lib64/libpthread.so.0
#6  0x00000030364d30ad in clone () from /lib64/libc.so.6
(gdb) info threads
  7 process 29921  0x0000003c1a00b5b5 in pthread_sigmask () from /lib64/libpthread.so.0
  6 process 29922  0x0000003c1a00ab00 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  5 process 29923  0x0000003c1a00ab00 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4 process 29924  0x0000003c1a00ab00 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  3 process 29926  0x00000036ad18f4ab in qpid::SessionId::operator< () from /usr/lib64/libqpidcommon.so.0
  2 process 29928  0x00002aef4c7d2850 in qpid::cluster::EventHeader::EventHeader () from /usr/lib64/qpid/daemon/cluster.so
* 1 process 29927  0x0000003c1a008293 in pthread_mutex_lock () from /lib64/libpthread.so.0



Version-Release number of selected component (if applicable):

1.1.2 (qpidd-0.5.752581-17.el5)

How reproducible:

Seems quite frequent.

Steps to Reproduce:

As yet unknown

Comment 1 Andrew Stitcher 2009-06-17 19:24:09 UTC

This appears to be the broker-side analog of BZ 505231.

There is a race in the broker that can crash it if the client heartbeat timeout timer fires at the same time as the client disconnects.
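
To illustrate the kind of race described here, the following is a minimal sketch of one general way to keep a late timer callback from touching a connection that a disconnect has already destroyed. It is not the fix attached to this bug and not qpid code: `Connection`, `makeTimeoutTask` and `example` are hypothetical names, and the use of `std::shared_ptr`/`std::weak_ptr` is an assumption made only for the illustration.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical stand-in for a broker-side connection; not a qpid class.
struct Connection {
    std::mutex lock;
    bool aborted = false;

    // Called when the heartbeat timeout expires.
    void abort() {
        std::lock_guard<std::mutex> guard(lock);
        aborted = true;
    }
};

// Build a heartbeat-timeout callback that holds only a weak reference.
// The callback returns true if the connection was still alive and was
// aborted, false if a disconnect had already destroyed it.
std::function<bool()> makeTimeoutTask(const std::shared_ptr<Connection>& conn) {
    std::weak_ptr<Connection> weak = conn;
    return [weak]() {
        if (std::shared_ptr<Connection> alive = weak.lock()) {
            alive->abort();  // 'alive' keeps the object and its mutex valid
            return true;
        }
        return false;        // connection already gone; touch nothing
    };
}

// Usage sketch: the timer fires once while connected, once after disconnect.
void example() {
    auto conn = std::make_shared<Connection>();
    auto task = makeTimeoutTask(conn);
    assert(task());    // connection alive: abort() runs under its mutex
    conn.reset();      // the disconnect destroys the connection
    assert(!task());   // the late timer finds nothing to lock and returns
}
```

The point of the sketch is only that the timer callback must not lock a mutex belonging to an object whose lifetime the disconnect path controls, unless it first takes ownership of that object.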

Reproducer:

Run the broker as follows in one terminal:

src/qpidd --auth no --port 21022 --no-data-dir --worker-threads 1 & while true; do sleep 1; kill -STOP %%; sleep 2; kill -CONT %%; done

[This runs a broker and repeatedly stops and continues it. Stopping the broker makes clients disconnect due to heartbeat loss, and continuing it starts the broker again while the disconnect is in progress.]

On another terminal run:

while true; do src/tests/perftest --port 21022 --heartbeat 1 & sleep 2 ; kill -STOP %% ; sleep 4 ; kill -CONT %%; done

[This will simultaneously run perftest with heartbeats in a loop, stopping it long enough that the broker will disconnect it due to heartbeat failures]

This test failed within a few minutes in my testing

Comment 2 Andrew Stitcher 2009-06-17 19:25:23 UTC

*** Bug 506102 has been marked as a duplicate of this bug. ***

Comment 3 Andrew Stitcher 2009-06-17 19:25:56 UTC

*** Bug 506498 has been marked as a duplicate of this bug. ***

Comment 4 Gordon Sim 2009-06-22 15:17:41 UTC

Fixed in -19

Comment 5 Andrew Stitcher 2009-06-23 02:54:28 UTC

Created attachment 349030
Combination of fixes that solve this issue

Comment 6 Jan Sarenik 2009-07-03 13:01:33 UTC

Reproduced on RHEL5, qpid build -17. It took no more than
a minute using the 'while' loops to get a segfault.

No signs of segfault even after 20 minutes of the same
loops running on RHEL5, 0.5.752581-22, i386 and x86_64;
and RHEL4, 0.5.752581-21, i386 and x86_64.

Comment 7 Jan Sarenik 2009-07-03 13:18:26 UTC

Created attachment 350431
reproducer lines

nothing more than what is already written above

Comment 9 errata-xmlrpc 2009-07-14 17:31:54 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1153.html
