
Bug 633969

Summary: Resource locked exception thrown during failover of a JMS durable subscriber
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Reporter: Rajith Attapattu <rattapat+nobody>
Assignee: Rajith Attapattu <rattapat+nobody>
QA Contact: Jeff Needle <jneedle>
CC: gsim, rmusil
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: Development
Target Milestone: 1.3
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-10-20 11:30:33 UTC
Attachments:
Reproducer (flags: none)
Reproducer2 (flags: none)
Proposed fix as a patch against the 1.3.x release branch (flags: none)

Description Rajith Attapattu 2010-09-14 19:24:55 UTC
Description of problem:
A Resource locked exception is thrown during failover of a JMS durable subscriber. The following are important points worth keeping in mind when looking at the problem.

1. The exception only happens when the client fails over for the second time. It is not thrown during the first failover.

2. The resource locked exception is thrown when attempting an exclusive subscription on any queue that had a valid exclusive subscription before the failover occurred.
   If you run the attached reproducer, you can see that sometimes this exception is thrown on the queue used by the durable subscription, while other times it is thrown on the queue used by the failover exchange method.

3. When failing over the second time, the issue is reproducible whether the node it connects to is a restarted node or an original node within the cluster.

4. If you immediately try to connect to the queue in question using drain, it succeeds.

Version-Release number of selected component (if applicable):
Reproducible on the current set of packages as well as on trunk (as of rev 997048).

How reproducible:
Always

Steps to Reproduce:
1. Start an n-node cluster (n >= 2).
2. Run the attached reproducer.
3. Stop the broker the client is connected to.
4. Restart the stopped broker.
5. Now stop the second broker (the broker that the client connected to after the first failover).
6. Observe that the client session is terminated with the resource locked exception.

Comment 1 Rajith Attapattu 2010-09-14 21:30:25 UTC
Created attachment 447328 [details]
Reproducer

(1) The attached test class points to localhost:7672 as the initial broker.
Please modify it to suit your test environment.

(2) In order to reproduce this issue you need to sidestep the issue described in Bug 633942. Therefore use -Dqpid.dest_syntax=BURL (JVM argument) when running the test client.

Comment 2 Rajith Attapattu 2010-09-16 01:20:46 UTC
Created attachment 447601 [details]
Reproducer2

The same issue can be reproduced more easily with the attached reproducer2.
Extract the tar file and run the test-action.sh script to observe the error.

1. The same issue is observed with the current set of packages as well as trunk (as of rev 997048).

2. However, the test case (Reproducer2) passes on some machines. Perhaps there is some timing issue?
   Another important point to note is that on the same machine Reproducer1 (the durable subscriber test) fails.

3. When running Reproducer2, in addition to the resource locked exception you can also see a series of "channel not attached" exceptions.

2010-09-15 18:47:48 error Channel exception: not-attached: Channel 12 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
2010-09-15 18:47:48 error Channel exception: not-attached: Channel 12 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)

However, you also see the following, which is essentially the same problem described above.

IoReceiver - /192.168.1.103:5672 2010-09-15 18:47:49,188 ERROR [apache.qpid.client.AMQConnectionDelegate_0_10] previous exception
org.apache.qpid.transport.ConnectionException: too many exceptions: ch=4 id=0 ExecutionException(errorCode=RESOURCE_LOCKED, commandId=24, classCode=4, commandCode=7, fieldIndex=0, description=resource-locked: Queue _ has an exclusive consumer. No more consumers allowed. (qpid/broker/Queue.cpp:459), errorInfo={}), ch=12 id=0 ExecutionException(errorCode=RESOURCE_LOCKED, commandId=24, classCode=4, commandCode=7, fieldIndex=0, description=resource-locked: Queue _ has an exclusive consumer. No more consumers allowed. (qpid/broker/Queue.cpp:459), errorInfo={})

Comment 4 Rajith Attapattu 2010-09-21 02:27:36 UTC
The resource locked exception is due to duplicate subscriptions created on the same queue by the Java client during failover.

These duplicate subscriptions happen at different layers in the client.
The JMS layer recreates subscriptions after failover.
However, the AMQP commands stored in the lower layer (for replay) could also contain message subscriptions. If they do, they are replayed after failover, and when the JMS layer then tries to re-create the subscription, it fails if the exclusive flag is set on the subscription.

The JMS layer sets the sync flag after creating a subscription, so the broker should have sent the completion and the MessageSubscription command should have been removed from the queue. Therefore there might be an additional issue where the Java client does not remove commands from the internal command array once it receives the completions.

However if we modify the Java client to not store any AMQP commands other than message transfers we could easily prevent it from causing this issue and Bug 634794.
Since the 0-10 client is not implementing full session resume, there is no advantage in replaying anything other than message transfers.
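
To make the failure mode above concrete, here is a minimal, self-contained sketch. It is a toy model only: ToyBroker, the "subscribe:" command strings, and the queue name q1 are all hypothetical and are not the Qpid client or broker code. It assumes the broker rejects a second exclusive consumer on a queue, which matches the error text in Comment 2.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DuplicateSubscribeSketch {

    /** Toy stand-in for a broker that allows one exclusive consumer per queue. */
    static class ToyBroker {
        private final Map<String, String> exclusiveOwner = new HashMap<>();

        void subscribeExclusive(String queue, String consumerTag) {
            if (exclusiveOwner.containsKey(queue)) {
                throw new IllegalStateException("resource-locked: Queue " + queue
                        + " has an exclusive consumer. No more consumers allowed.");
            }
            exclusiveOwner.put(queue, consumerTag);
        }
    }

    /**
     * Simulates one failover to a new broker.
     * Returns the error seen by the upper (JMS-like) layer, or null on success.
     */
    static String failover(boolean replayBufferHoldsSubscribe) {
        ToyBroker newBroker = new ToyBroker();

        // Hypothetical replay buffer kept by the lower layer.
        List<String> replayBuffer = new ArrayList<>();
        if (replayBufferHoldsSubscribe) {
            replayBuffer.add("subscribe:q1");
        }

        // Step 1: the lower layer replays whatever it stored.
        for (String cmd : replayBuffer) {
            if (cmd.startsWith("subscribe:")) {
                newBroker.subscribeExclusive(cmd.substring("subscribe:".length()), "replayed");
            }
        }

        // Step 2: the upper layer re-creates its own subscription.
        try {
            newBroker.subscribeExclusive("q1", "jms");
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("stale subscribe in buffer: " + failover(true));
        System.out.println("no subscribe in buffer:    " + failover(false));
    }
}
```

In this model the second subscribe fails only when the replay buffer still holds a subscribe command, which is the situation the proposed change is meant to rule out.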

Comment 5 Rajith Attapattu 2010-09-21 02:29:17 UTC
This issue is tracked upstream via QPID-2876.

Following is the proposed patch for this issue.

--- qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Session.java (original)
+++ qpid/trunk/qpid/java/common/src/main/java/org/apache/qpid/transport/Session.java Tue Sep 21 02:19:15 2010
@@ -645,7 +645,7 @@ public class Session extends SessionInvo
                {
                    sessionCommandPoint(0, 0);
                }
-                if ((!closing && !m.isUnreliable()) || m.hasCompletionListener())
+                if ((!closing && m instanceof MessageTransfer) || m.hasCompletionListener())
                {
                    commands[mod(next, commands.length)] = m;
                    commandBytes += m.getBodySize();


The patch ensures that we only store MessageTransfers for replay (commands with a completion listener are still stored, as before).
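
The effect of the changed condition can be sketched in isolation as follows. This is a simplified illustration, not the real Session.java: the Command class, its fields, the sample() list, and the "UnreliableCommand" label are hypothetical stand-ins, and only the two boolean expressions mirror the diff above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReplayFilterSketch {

    /** Hypothetical, simplified command; not the real org.apache.qpid.transport type. */
    static class Command {
        final String name;
        final boolean unreliable;
        final boolean hasCompletionListener;

        Command(String name, boolean unreliable, boolean hasCompletionListener) {
            this.name = name;
            this.unreliable = unreliable;
            this.hasCompletionListener = hasCompletionListener;
        }
    }

    /** Hypothetical stand-in for a message transfer command. */
    static class MessageTransfer extends Command {
        MessageTransfer() {
            super("MessageTransfer", false, false);
        }
    }

    // Condition before the patch: store every reliable command.
    static boolean storeBeforePatch(Command m, boolean closing) {
        return (!closing && !m.unreliable) || m.hasCompletionListener;
    }

    // Condition after the patch: store only message transfers.
    static boolean storeAfterPatch(Command m, boolean closing) {
        return (!closing && m instanceof MessageTransfer) || m.hasCompletionListener;
    }

    /** Illustrative sequence of commands sent on a session. */
    static List<Command> sample() {
        return Arrays.asList(
                new Command("MessageSubscribe", false, false),
                new MessageTransfer(),
                new Command("UnreliableCommand", true, false));
    }

    /** Names of the commands that would be kept for replay (session not closing). */
    static List<String> stored(List<Command> sent, boolean patched) {
        List<String> kept = new ArrayList<>();
        for (Command m : sent) {
            boolean keep = patched ? storeAfterPatch(m, false) : storeBeforePatch(m, false);
            if (keep) {
                kept.add(m.name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println("before patch: " + stored(sample(), false));
        System.out.println("after patch:  " + stored(sample(), true));
    }
}
```

With the sample list, the pre-patch condition keeps both the subscribe and the transfer, while the patched condition keeps only the transfer, so a subscribe can no longer be replayed from the buffer after failover.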

Comment 6 Rajith Attapattu 2010-09-21 18:16:22 UTC
Created attachment 448766 [details]
Proposed fix as a patch against the 1.3.x release branch

The proposed patch contains the following changes

Session.java
=============
Changes: 

Instead of storing any command that is marked reliable, we now only store message transfers.

The initial cherry-pick from trunk contained an additional boolean called isClosing within the if condition; it is related to a different commit.

Therefore the subsequent patch removes the isClosing variable introduced by the initial commit.

Risk : 

Since we do not implement full session resume, there is no added advantage in replaying anything other than the message transfers. Therefore this change is low risk.

Comment 7 Rajith Attapattu 2010-09-23 13:18:32 UTC
The patch was committed to the 1.3.x branch in the internal git repo.

http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=98e26823a4054972f6f5cd3f0db51f144a9b3015

http://mrg1.lab.bos.redhat.com/cgit/qpid.git/commit/?id=361a50de9bb50c30bd0dc6ca41dc167666f170f0

These changes are included in the 0.7.946106-10 package set.

Comment 8 Jiri Kolar 2010-09-29 15:04:19 UTC
fixed in qpid-cpp-server-0.7.946106-17

validated on RHEL5.5  i386 / x86_64  

packages:
# rpm -qa | grep -E '(qpid|openais|rhm)' | sort -u

openais-0.80.6-16.el5_5.7
openais-devel-0.80.6-16.el5_5.7
python-qpid-0.7.946106-14.el5
qpid-cpp-client-0.7.946106-17.el5
qpid-cpp-client-devel-0.7.946106-17.el5
qpid-cpp-client-devel-docs-0.7.946106-17.el5
qpid-cpp-client-ssl-0.7.946106-17.el5
qpid-cpp-mrg-debuginfo-0.7.946106-14.el5
qpid-cpp-server-0.7.946106-17.el5
qpid-cpp-server-cluster-0.7.946106-17.el5
qpid-cpp-server-devel-0.7.946106-17.el5
qpid-cpp-server-ssl-0.7.946106-17.el5
qpid-cpp-server-store-0.7.946106-17.el5
qpid-cpp-server-xml-0.7.946106-17.el5
qpid-java-client-0.7.946106-10.el5
qpid-java-common-0.7.946106-10.el5
qpid-tools-0.7.946106-11.el5
rhm-docs-0.7.946106-5.el5
rh-tests-distribution-MRG-Messaging-qpid_common-1.6-53


->VERIFIED