502193 – Asynch replication using acknowledged federation links with cluster as source cause invalid-argument errors on shadow sessions

Bug 502193 - Asynch replication using acknowledged federation links with cluster as source cause invalid-argument errors on shadow sessions

Summary: Asynch replication using acknowledged federation links with cluster as source...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	1.1.1
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	1.1.2
Target Release:	---
Assignee:	Gordon Sim
QA Contact:	Frantisek Reznicek
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-05-22 13:52 UTC by Gordon Sim
Modified:	2015-11-16 00:07 UTC (History)
CC List:	2 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-06-12 17:39:09 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Fix (option a) (690 bytes, patch) 2009-05-26 10:22 UTC, Gordon Sim	no flags	Details \| Diff
Fix (option b) (4.39 KB, patch) 2009-05-26 15:18 UTC, Gordon Sim	no flags	Details \| Diff
Fix (option b), created against r752581 (4.94 KB, patch) 2009-05-26 15:42 UTC, Gordon Sim	no flags	Details \| Diff
View All

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2009:1097	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Messaging bug fixing update	2009-06-12 17:38:48 UTC

Description Gordon Sim 2009-05-22 13:52:31 UTC

Description of problem:

If the federation link used to transfer replication events has a clustered node as the source and is configured to use acknowledgements, then the other nodes on the source cluster detect what they perceive to be inconsistent session state.

E.g. Execution exception: invalid-argument: guest: confirmed < (202+0) but only sent < (197+0) (qpid/SessionState.cpp:163)

If the cluster error checking is on, this causes those other nodes to exit.

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-7.el5 and earlier

(Note that for qpidd-0.5.752581-7.el5 with --cluster-check-errors false the nodes don't exit and merely report the error).

How reproducible:

Fairly easily though it is timing dependent so multi-processor boxes with reasonable load show it more easily.

Steps to Reproduce:
1. start n node cluster as primary with replication listener configured
e.g. add --replication-queue REPLICATION_QUEUE --create-replication-queue
2. start another broker (or cluster) as backup
3. enable replication between the two
e.g.
qpid-config -a $DR1 add exchange replication REPLICATION_EXCHANGE
qpid-route --ack 1 queue add $DR1 $PRIMARY1 REPLICATION_EXCHANGE REPLICATION_QUEUE
4. create a queue on primary cluster with event generation enabled
e.g. qpid-config -a PRIMARY2 add queue test-queue --generate-queue-events 1
5. create queue with same name on backup
e.g. qpid-config -a DR2 add queue test-queue
6. send messages to that queue on primary
e.g. for i in `seq 1 10000`; do echo Message $i; done | ./qpid/cpp/src/tests/sender -p PRIMARY_PORT4

Actual results:

Some number (< n) of the primary clusters nodes exit (or if using qpidd-0.5.752581-7.el5 and --cluster-check-errors is false, they merely log the occurence of errors on shadow sessions).

Expected results:

No crashes, no errors.

Additional Info:

Believed to be caused by non-determinism in the ordering of cluster operations and processing of events.

Comment 1 Gordon Sim 2009-05-26 10:22:50 UTC

Created attachment 345436 [details]
Fix (option a)

This fix would also require the 1.1.2 fix from BZ499872; it suppresses the exception caused by temporary drift in session state across the primary nodes for the replicating bridges session.

Comment 2 Frantisek Reznicek 2009-05-26 10:50:37 UTC

FYI
Reproduced fine on RHEL 5.3 x86_64 and qpidd-0.5.752581-5.el5 with following backtrace:

GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /usr/lib64/libqpidbroker.so.0...Reading symbols from /usr/lib/debug/usr/lib64/libqpidbroker.so.0.1.0.debug...done.
done.
...
Reading symbols from /usr/lib64/libaio.so.1...done.
Loaded symbols for /usr/lib64/libaio.so.1
Core was generated by `qpidd -p 10002 --auth no --log-enable info+ --cluster-name dhcp-lab-200.englab.'.
Program terminated with signal 6, Aborted.
[New process 12046]
[New process 12052]
[New process 12048]
#0  0x0000003564030215 in raise () from /lib64/libc.so.6
(gdb) 
Thread 3 (process 12048):
#0  0x00000035640dee6e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x0000003564076944 in _L_lock_15349 () from /lib64/libc.so.6
#2  0x0000003564075901 in free () from /lib64/libc.so.6
#3  0x00000036787a32dc in std::_Rb_tree<qpid::management::ObjectId, std::pair<qpid::management::ObjectId const, qpid::management::ManagementObject*>, std::_Select1st<std::pair<qpid::management::ObjectId const, qpid::management::ManagementObject*> >, std::less<qpid::management::ObjectId>, std::allocator<std::pair<qpid::management::ObjectId const, qpid::management::ManagementObject*> > >::_M_erase (this=<value optimized out>, __x=<value optimized out>) at /usr/include/c++/4.1.2/ext/new_allocator.h:94
#4  0x00000036787973af in qpid::management::ManagementBroker::moveNewObjectsLH (this=<value optimized out>)
    at /usr/include/c++/4.1.2/bits/stl_tree.h:692
#5  0x00000036787a1a87 in qpid::management::ManagementBroker::periodicProcessing (this=<value optimized out>)
    at qpid/management/ManagementBroker.cpp:352
#6  0x00000036787a2608 in qpid::management::ManagementBroker::Periodic::fire (this=<value optimized out>)
    at qpid/management/ManagementBroker.cpp:252
#7  0x000000367878797a in qpid::broker::Timer::run (this=<value optimized out>) at qpid/broker/Timer.cpp:67
#8  0x000000367816c76a in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#9  0x0000003564c06367 in start_thread () from /lib64/libpthread.so.0
#10 0x00000035640d30ad in clone () from /lib64/libc.so.6

Thread 2 (process 12052):
#0  0x0000003564c0ab00 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000003678787cbf in qpid::broker::Timer::run (this=<value optimized out>) at qpid/sys/posix/Condition.h:69
#2  0x000000367816c76a in runRunnable (p=<value optimized out>) at qpid/sys/posix/Thread.cpp:35
#3  0x0000003564c06367 in start_thread () from /lib64/libpthread.so.0
#4  0x00000035640d30ad in clone () from /lib64/libc.so.6

Thread 1 (process 12046):
#0  0x0000003564030215 in raise () from /lib64/libc.so.6
#1  0x0000003564031cc0 in abort () from /lib64/libc.so.6
#2  0x000000356406a7fb in __libc_message () from /lib64/libc.so.6
#3  0x00000035640700b3 in malloc_consolidate () from /lib64/libc.so.6
#4  0x00000035640723fd in _int_malloc () from /lib64/libc.so.6
#5  0x000000356407402a in malloc () from /lib64/libc.so.6
#6  0x00002b581b75f16c in __os_malloc () from /usr/lib64/libdb_cxx-4.3.so
#7  0x00002b581b75e05a in __memp_sync_int () from /usr/lib64/libdb_cxx-4.3.so
#8  0x00002b581b717d2b in __db_sync () from /usr/lib64/libdb_cxx-4.3.so
#9  0x00002b581b7171af in __db_refresh () from /usr/lib64/libdb_cxx-4.3.so
#10 0x00002b581b71730e in __db_close () from /usr/lib64/libdb_cxx-4.3.so
#11 0x00002b581b726440 in __db_close_pp () from /usr/lib64/libdb_cxx-4.3.so
#12 0x00002b581b6b547c in Db::close () from /usr/lib64/libdb_cxx-4.3.so
#13 0x00002b581b3d7050 in mrg::msgstore::MessageStoreImpl::~MessageStoreImpl () from /usr/lib64/qpid/daemon/msgstore.so
#14 0x00000036787421db in ~MessageStoreModule (this=<value optimized out>) at qpid/broker/MessageStoreModule.cpp:39
#15 0x00000036786c9424 in ~Broker (this=<value optimized out>) at /usr/include/c++/4.1.2/memory:259
#16 0x000000356403363e in __cxa_finalize () from /lib64/libc.so.6
#17 0x0000003678688506 in __do_global_dtors_aux () from /usr/lib64/libqpidbroker.so.0
#18 0x0000000000000000 in ?? ()
(gdb) quit

Comment 3 Gordon Sim 2009-05-26 15:18:25 UTC

Created attachment 345483 [details]
Fix (option b)

This is an alternative (or additional fix). It adds an option that controls whether queue events are processed by listeners on the thread that generates the event (made the default option by this patch) or whether they are handed off via a PollableQueue to another thread (the current approach).

Comment 4 Gordon Sim 2009-05-26 15:42:56 UTC

Created attachment 345489 [details]
Fix (option b), created against r752581

This is just a slightly altered version of option (b) that will apply cleanly to r752581 (previous attachment for option (b) will apply against trunk).

Comment 5 Gordon Sim 2009-05-28 10:32:20 UTC

Fixed in qpidd-0.5.752581-8

Comment 6 Frantisek Reznicek 2009-05-29 15:22:45 UTC

The issue has been fixed, validated on RHEL 5.3 i386/x86_64 on both 
qpidd-0.5.752581-9.el5 and qpidd-0.5.752581-10.el5.

->VERIFIED

Comment 8 errata-xmlrpc 2009-06-12 17:39:09 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1097.html

Note You need to log in before you can comment on or make changes to this bug.