Bug 504977

Summary: No knowledge of sessions on channel 0 is transferred to new cluster nodes
Product: Red Hat Enterprise MRG    Reporter: Gordon Sim <gsim>
Component: qpid-cpp    Assignee: messaging-bugs <messaging-bugs>
Status: CLOSED ERRATA    QA Contact: Jan Sarenik <jsarenik>
Severity: high
Priority: high
Version: 1.0    CC: jsarenik
Target Milestone: 1.1.2   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-06-12 17:39:31 UTC
Attachments:
    Fix for this issue.

Description Gordon Sim 2009-06-10 09:40:23 UTC
Description of problem:

Knowledge of client sessions that are attached on channel 0 of their respective connections is not propagated to new nodes that join a cluster. This can result in errors when further commands are issued over those sessions after a new node has joined.
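
For context, the affected sessions are ordinary client sessions that happen to be carried on channel 0 of their connection. A minimal sketch of such a client, assuming the qpid 0.5-era Python API used by the pubsub examples (qpid.util.connect, qpid.connection.Connection); exact names and signatures may differ:

--------------------------------------------------------------------------
# Sketch only: a client whose first session is attached on channel 0 of its
# connection, which is the case a newly joined cluster node knows nothing about.
from qpid.util import connect
from qpid.connection import Connection
from qpid.datatypes import uuid4

sock = connect("localhost", 5672)
connection = Connection(sock=sock)
connection.start()

# With the unpatched client, the first session defaults to channel 0.
session = connection.session(str(uuid4()))
--------------------------------------------------------------------------

A node that joins the cluster after this session is attached receives no record of it, so the next command issued on the session (for example the session.completed seen in the log below) triggers the not-attached channel exception on the new node only.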

Version-Release number of selected component (if applicable):

qpidd-0.5.752581-14.el5

How reproducible:

100%

Steps to Reproduce:
1. start one cluster node
2. start python pubsub example's topic_subscriber.py
3. start second node for cluster
4. run python pubsub example's topic_publisher.py
  
Actual results:

New node exits with:

2009-jun-10 05:31:35 error Channel exception: not-attached: session.completed: channel 0 is not attached (qpid/amqp_0_10/SessionHandler.cpp:232)
2009-jun-10 05:31:35 critical 10.16.44.221:15217(READY/error) Error 312 did not occur on 10.16.44.221:15149
2009-jun-10 05:31:35 error Error delivering frames: Aborted by local failure that did not occur on all replicas


Expected results:

No node crashes.

Additional info:

You need to start step 3 quickly after step 2, or the subscriber will time out. Editing the timeout in the example makes that a little easier.
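
(For reference, the subscriber waits on its local queue with a fixed timeout; the pattern is roughly the following, assuming the 0.5-era incoming-queue API. This is a paraphrase for orientation, not the exact example source.)

--------------------------------------------------------------------------
# Rough shape of the wait in topic_subscriber.py (paraphrased; assumed API).
queue = session.incoming("local_queue")   # local queue fed by the subscription
queue.start()                             # start message flow
content = queue.get(timeout=10)           # raising this timeout makes step 3 easier to hit
--------------------------------------------------------------------------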

Comment 1 Gordon Sim 2009-06-10 09:47:50 UTC
The same error can be observed using the JMS pubsub example as well.

Comment 2 Gordon Sim 2009-06-10 10:25:06 UTC
Another reproducer: start first node, connect with qpid-tool, add second node, invoke echo command on broker through the qpid-tool session.
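
The same reproducer can also be scripted rather than run interactively; a sketch assuming the 0.5-era QMF console API (qpid.console), where method names may differ slightly:

--------------------------------------------------------------------------
# Scripted version of the qpid-tool reproducer (sketch; assumed QMF console API).
from qpid.console import Session

qmf = Session()
broker = qmf.addBroker("localhost:5672")

# Equivalent of "call <id> echo 1 ahoj" in qpid-tool: invoke the broker
# management object's echo method over the console session.
brokers = qmf.getObjects(_class="broker", _package="org.apache.qpid.broker")
result = brokers[0].echo(1, "ahoj")       # result carries the echoed body and sequence

qmf.delBroker(broker)
--------------------------------------------------------------------------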

Comment 3 Jan Sarenik 2009-06-10 13:03:43 UTC
Reproduced on RHEL5 qpidd-0.5.752581-14.el5
using qpid-tool.

--------------------------------------------------------------------------
qpid: list broker
Objects of type org.apache.qpid.broker:broker
    ID   Created   Destroyed  Index
    ====================================
    102  12:55:43  -          101.5672
qpid: call 102 echo 1 ahoj
qpid: Call Result: echo  0 (OK) {'body': u'ahoj', 'sequence': 1L}
--------------------------------------------------------------------------

The second qpidd node then leaves the cluster and shuts down
with these messages:

-------------------------------------------------------------------------------
2009-jun-10 14:59:26 info Passivating links
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY) DLVR: Event[192.168.254.1:2575-1 data 237 bytes]
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY) DLVR 168:  Frame[Bbe; channel=0; {MessageTransferBody: destination=qpid.management; }] data 192.168.254.1:2575-1
2009-jun-10 14:59:30 debug Exception constructed: receiving Frame[Bbe; channel=0; {MessageTransferBody: destination=qpid.management; }]: channel 0 is not attached (qpid/amqp_0_10/SessionHandler.cpp:79)
2009-jun-10 14:59:30 error Channel exception: not-attached: receiving Frame[Bbe; channel=0; {MessageTransferBody: destination=qpid.management; }]: channel 0 is not attached (qpid/amqp_0_10/SessionHandler.cpp:79)
2009-jun-10 14:59:30 debug 192.168.254.254:17492(READY/error) Session error 168 unresolved: 192.168.254.1:2575 192.168.254.254:17492 
2009-jun-10 14:59:30 trace MCAST 192.168.254.254:17492-0: {ClusterErrorCheckBody: type=1; frame-seq=168; }
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY/error) DLVR: Event[192.168.254.254:17492-0 control 27 bytes]
2009-jun-10 14:59:30 debug 192.168.254.254:17492(READY/error) Error 168 outcome agrees with 192.168.254.254:17492
2009-jun-10 14:59:30 debug 192.168.254.254:17492(READY/error) Error 168 still unresolved: 192.168.254.1:2575 
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY/error) DLVR: Event[192.168.254.1:2575-1 control 22 bytes]
2009-jun-10 14:59:30 trace 192.168.254.254:17492(READY/error) DLVR: Event[192.168.254.1:2575-0 control 27 bytes]
2009-jun-10 14:59:30 critical 192.168.254.254:17492(READY/error) Error 168 did not occur on 192.168.254.1:2575
2009-jun-10 14:59:30 debug Exception constructed: Aborted by local failure that did not occur on all replicas
2009-jun-10 14:59:30 error Error delivering frames: Aborted by local failure that did not occur on all replicas
2009-jun-10 14:59:30 notice 192.168.254.254:17492(LEFT/error) leaving cluster jasan
2009-jun-10 14:59:30 debug 192.168.254.254:17492(LEFT/error) deleted connection: 192.168.254.1:2575-1(shadow)
2009-jun-10 14:59:30 debug Message 0x90326f8 enqueued on mgmt-rhel5.virtual.jsn.2587[0x9075738]
2009-jun-10 14:59:30 debug Shutting down CPG
2009-jun-10 14:59:30 notice Shut down
2009-jun-10 14:59:30 debug Journal "TplStore": Destroyed
-------------------------------------------------------------------------------

Comment 4 Alan Conway 2009-06-10 13:08:35 UTC
Created attachment 347216 [details]
Fix for this issue.

The patch fixes both sides: the broker handles channel 0 correctly and the python client no longer uses channel 0 by default.
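
As a hypothetical illustration of the client-side half only (this is not the content of attachment 347216): the idea is simply to start channel allocation at 1 so a default session never lands on channel 0.

--------------------------------------------------------------------------
# Hypothetical sketch -- not the actual patch.  Illustrates the client-side
# change described above: hand out channel numbers starting at 1 instead of 0,
# so the default session is no longer attached on channel 0.
class ChannelAllocator(object):
    def __init__(self, first=1):          # previously the first channel was 0
        self._next = first

    def allocate(self):
        channel = self._next
        self._next += 1
        return channel
--------------------------------------------------------------------------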

Comment 5 Jan Sarenik 2009-06-11 07:13:47 UTC
Reproduced on RHEL5-i386 <=> RHEL5-i386

Verified on the qpidc-15 build in these scenarios:

  First in cluster | Second
   RHEL5-i386      |  RHEL5-i386
   RHEL5-x86_64    |  RHEL5-i386
   RHEL5-i386      |  RHEL5-x86_64
   RHEL5-x86_64    |  RHEL5-x86_64

Comment 7 errata-xmlrpc 2009-06-12 17:39:31 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1097.html