Bug 504977

Summary: No knowledge of sessions on channel 0 is transferred to new cluster nodes
Product: Red Hat Enterprise MRG    Reporter: Gordon Sim <gsim>
Component: qpid-cpp    Assignee: messaging-bugs <messaging-bugs>
Status: CLOSED ERRATA    QA Contact: Jan Sarenik <jsarenik>
Severity: high
Priority: high
Version: 1.0    CC: jsarenik
Target Milestone: 1.1.2   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-06-12 17:39:31 UTC
Attachments:
    Fix for this issue.

Description Gordon Sim 2009-06-10 09:40:23 UTC
Description of problem:

Knowledge of client sessions that are attached on channel 0 of their respective connections is not propagated to new nodes that join a cluster. This can result in errors when further commands are issued over those sessions after a new node has joined.
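
For context, the affected sessions are ordinary client sessions that happen to be carried on channel 0 of their connection. A minimal sketch of such a client, assuming the qpid 0.5-era Python API used by the pubsub examples (qpid.util.connect, qpid.connection.Connection); exact names and signatures may differ:

--------------------------------------------------------------------------
# Sketch only: a client whose first session is attached on channel 0 of its
# connection, which is the case a newly joined cluster node knows nothing about.
from qpid.util import connect
from qpid.connection import Connection
from qpid.datatypes import uuid4

sock = connect("localhost", 5672)
connection = Connection(sock=sock)
connection.start()

# With the unpatched client, the first session defaults to channel 0.
session = connection.session(str(uuid4()))
--------------------------------------------------------------------------

A node that joins the cluster after this session is attached receives no record of it, so the next command issued on the session (for example the session.completed seen in the log below) triggers the not-attached channel exception on the new node only.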

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-14.el5

How reproducible:

100%

Steps to Reproduce:
1. start one cluster node
2. start python pubsub example's topic_subscriber.py
3. start second node for cluster
4. run python pubsub example's topic_publisher.py
  
Actual results:

New node exits with:

2009-jun-10 05:31:35 error Channel exception: not-attached: session.completed: channel 0 is not attached (qpid/amqp_0_10/SessionHandler.cpp:232)
2009-jun-10 05:31:35 critical 10.16.44.221:15217(READY/error) Error 312 did not occur on 10.16.44.221:15149
2009-jun-10 05:31:35 error Error delivering frames: Aborted by local failure that did not occur on all replicas


Expected results:

No node crashes.

Additional info:

You need to start step 3 quickly after step 2, or the subscriber will time out. Editing the timeout in the example makes that a little easier.
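
(For reference, the subscriber waits on its local queue with a fixed timeout; the pattern is roughly the following, assuming the 0.5-era incoming-queue API. This is a paraphrase for orientation, not the exact example source.)

--------------------------------------------------------------------------
# Rough shape of the wait in topic_subscriber.py (paraphrased; assumed API).
queue = session.incoming("local_queue")   # local queue fed by the subscription
queue.start()                             # start message flow
content = queue.get(timeout=10)           # raising this timeout makes step 3 easier to hit
--------------------------------------------------------------------------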

Comment 1 Gordon Sim 2009-06-10 09:47:50 UTC
The same error can be observed using the JMS pubsub example as well.

Comment 2 Gordon Sim 2009-06-10 10:25:06 UTC
Another reproducer: start first node, connect with qpid-tool, add second node, invoke echo command on broker through the qpid-tool session.
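
The same reproducer can also be scripted rather than run interactively; a sketch assuming the 0.5-era QMF console API (qpid.console), where method names may differ slightly:

--------------------------------------------------------------------------
# Scripted version of the qpid-tool reproducer (sketch; assumed QMF console API).
from qpid.console import Session

qmf = Session()
broker = qmf.addBroker("localhost:5672")

# Equivalent of "call <id> echo 1 ahoj" in qpid-tool: invoke the broker
# management object's echo method over the console session.
brokers = qmf.getObjects(_class="broker", _package="org.apache.qpid.broker")
result = brokers[0].echo(1, "ahoj")       # result carries the echoed body and sequence

qmf.delBroker(broker)
--------------------------------------------------------------------------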

Comment 3 Jan Sarenik 2009-06-10 13:03:43 UTC
Reproduced on RHEL5 qpidd-0.5.752581-14.el5
using qpid-tool.

--------------------------------------------------------------------------
qpid: list broker
Objects of type org.apache.qpid.broker:broker
    ID   Created   Destroyed  Index
    ====================================
    102  12:55:43  -          101.5672
qpid: call 102 echo 1 ahoj
qpid: Call Result: echo  0 (OK) {'body': u'ahoj', 'sequence': 1L}
--------------------------------------------------------------------------

The second qpidd node then leaves the cluster and shuts down
with these messages:

-------------------------------------------------------------------------------
2009-jun-10 14:59:26 info Passivating links
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY) DLVR: Event[192.168.254.1:2575-1 data 237 bytes]
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY) DLVR 168:  Frame[Bbe; channel=0; {MessageTransferBody: destination=qpid.management; }] data 192.168.254.1:2575-1
2009-jun-10 14:59:30 debug Exception constructed: receiving Frame[Bbe; channel=0; {MessageTransferBody: destination=qpid.management; }]: channel 0 is not attached (qpid/amqp_0_10/SessionHandler.cpp:79)
2009-jun-10 14:59:30 error Channel exception: not-attached: receiving Frame[Bbe; channel=0; {MessageTransferBody: destination=qpid.management; }]: channel 0 is not attached (qpid/amqp_0_10/SessionHandler.cpp:79)
2009-jun-10 14:59:30 debug 192.168.254.254:17492(READY/error) Session error 168 unresolved: 192.168.254.1:2575 192.168.254.254:17492 
2009-jun-10 14:59:30 trace MCAST 192.168.254.254:17492-0: {ClusterErrorCheckBody: type=1; frame-seq=168; }
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY/error) DLVR: Event[192.168.254.254:17492-0 control 27 bytes]
2009-jun-10 14:59:30 debug 192.168.254.254:17492(READY/error) Error 168 outcome agrees with 192.168.254.254:17492
2009-jun-10 14:59:30 debug 192.168.254.254:17492(READY/error) Error 168 still unresolved: 192.168.254.1:2575 
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY/error) DLVR: Event[192.168.254.1:2575-1 control 22 bytes]
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY/error) DLVR: Event[192.168.254.1:2575-0 control 27 bytes]
2009-jun-10 14:59:30 critical 192.168.254.254:17492(READY/error) Error 168 did not occur on 192.168.254.1:2575
2009-jun-10 14:59:30 debug Exception constructed: Aborted by local failure that did not occur on all replicas
2009-jun-10 14:59:30 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-jun-10 14:59:30 notice 192.168.254.254:17492(LEFT/error) leaving cluster jasan
2009-jun-10 14:59:30 debug 192.168.254.254:17492(LEFT/error) deleted connection: 192.168.254.1:2575-1(shadow)
2009-jun-10 14:59:30 debug Message 0x90326f8 enqueued on mgmt-rhel5.virtual.jsn.2587[0x9075738]
2009-jun-10 14:59:30 debug Shutting down CPG
2009-jun-10 14:59:30 notice Shut down
2009-jun-10 14:59:30 debug Journal "TplStore": Destroyed
-------------------------------------------------------------------------------

Comment 4 Alan Conway 2009-06-10 13:08:35 UTC
Created attachment 347216 [details]
Fix for this issue.

The patch fixes both sides: the broker handles channel 0 correctly and the python client no longer uses channel 0 by default.
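
As a hypothetical illustration of the client-side half only (this is not the content of attachment 347216): the idea is simply to start channel allocation at 1 so a default session never lands on channel 0.

--------------------------------------------------------------------------
# Hypothetical sketch -- not the actual patch.  Illustrates the client-side
# change described above: hand out channel numbers starting at 1 instead of 0,
# so the default session is no longer attached on channel 0.
class ChannelAllocator(object):
    def __init__(self, first=1):          # previously the first channel was 0
        self._next = first

    def allocate(self):
        channel = self._next
        self._next += 1
        return channel
--------------------------------------------------------------------------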

Comment 5 Jan Sarenik 2009-06-11 07:13:47 UTC
Reproduced on RHEL5-i386 <=> RHEL5-i386

Verified on the qpidc-15 build in these scenarios:

  First in cluster | Second
   RHEL5-i386      |  RHEL5-i386
   RHEL5-x86_64    |  RHEL5-i386
   RHEL5-i386      |  RHEL5-x86_64
   RHEL5-x86_64    |  RHEL5-x86_64

Comment 7 errata-xmlrpc 2009-06-12 17:39:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1097.html