601236 – Persistent cluster problems after reboot -f

Summary: Persistent cluster problems after reboot -f

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	1.1.6
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.2.2
Target Release:	---
Assignee:	Alan Conway
QA Contact:	Jeff Needle
Docs Contact:
URL:
Whiteboard:
Depends On:	541927
Blocks:

Reported:	2010-06-07 14:18 UTC by Mike Cressman
Modified:	2010-10-08 01:49 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: Some information about consumers on a queue was not being replicated to new members joining a cluster. Consequence: After a new member was added to a cluster with active consumers, a cluster member could occasionally fail with "error xxx did not occur on all members". Fix: Replicate the missing information to new members joining a cluster. Consequence: The error no longer occurs.
Clone Of:	541927
Environment:
Last Closed:	2010-10-08 01:49:57 UTC
Target Upstream Version:
Embargoed:

Attachments

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0756	0	normal	SHIPPED_LIVE	Moderate: Red Hat Enterprise MRG Messaging security and bug fix update 1.2.2	2010-10-08 01:49:47 UTC

Description Mike Cressman 2010-06-07 14:18:44 UTC

+++ This bug was initially created as a clone of Bug #541927 +++

Created an attachment (id=374245)
broker 1 log

See the attached message files. When the "max-queue-count" limit was reached, qpidd on node 1 aborted and left the cluster. Node 2 reported a "Channel exception: not-attached: Channel 2 is not attached" error. The consumer client blocked and could not consume any messages. (The issue is easy to reproduce.)

In my opinion, when the "max-queue-count" limit is reached, a "resource-limit-exceeded" exception can be delivered to the sender client so that it can pause for a while and then resend its messages, and the consumer client should always be able to consume messages. It seems the "resource-limit-exceeded" exception was also raised on the qpidd server and caused something to go wrong, so that no messages are put onto the channel used by the consumer client. Is this a bug? qpidd on the server is 0.5.752581-26.el5; the qpidc in use is 0.5.752581-34.el5.

--- Additional comment from aconway on 2009-11-27 11:06:23 EST ---

Created an attachment (id=374246)
broker 2 log

--- Additional comment from frzhang on 2009-11-29 20:48:32 EST ---

Created an attachment (id=374658)
Test source code

The code is somewhat complicated, but there is a README.TXT inside explaining what it does.

--- Additional comment from aconway on 2009-11-30 09:53:30 EST ---

Please add a detailed step-by-step description of how to reproduce the bug including:
 - options to qpidd, how many qpidd running.
 - sequence to run clients, with exact command line parameters

--- Additional comment from frzhang on 2009-11-30 20:20:01 EST ---

Created an attachment (id=374939)
scripts for creating exchanges and queues

--- Additional comment from frzhang on 2009-11-30 20:21:17 EST ---

Created an attachment (id=374940)
scripts for cleaning up queues and exchanges

--- Additional comment from frzhang on 2009-11-30 20:24:35 EST ---

Created an attachment (id=374941)
/etc/sysconfig/qpidd on one node

The /etc/sysconfig/qpidd on the other node is almost the same except it has

    --cluster-url=amqp:tcp:192.168.100.152:5672

--- Additional comment from frzhang on 2009-11-30 20:38:07 EST ---

Created an attachment (id=374942)
steps to run the test

Please also refer to the README.TXT in the source code for a description of what the example code does.

--- Additional comment from aconway on 2009-12-01 12:28:45 EST ---

I am not able to reproduce the problem. I see broker logs similar to your broker 2 log on all brokers: resource-limit-exceeded followed by channel-not-attached.

Note the channel-not-attached error is correct. An exception automatically closes the session, you need to create a new session to continue sending. The channel-not-attached error is the result of trying to use the old session which has been closed.
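A minimal sketch of that behaviour, using a toy in-memory model and not the real qpid client API. `ToySession`, `send_all`, and the `pause` callback are hypothetical names introduced only for illustration; the point is that after a resource-limit-exceeded exception the sender has to open a new session before retrying, because the old one is no longer attached.

```python
class ResourceLimitExceeded(Exception):
    """Toy stand-in for the broker's resource-limit-exceeded exception."""


class NotAttached(Exception):
    """Toy stand-in for the not-attached channel exception."""


class ToySession:
    """Hypothetical session: any exception detaches it permanently."""

    def __init__(self, queue, max_queue_count):
        self.queue = queue
        self.max_queue_count = max_queue_count
        self.attached = True

    def send(self, msg):
        if not self.attached:
            # Using a session that was closed by an earlier exception.
            raise NotAttached("session is not attached")
        if len(self.queue) >= self.max_queue_count:
            # The exception closes the session as a side effect.
            self.attached = False
            raise ResourceLimitExceeded("queue is full")
        self.queue.append(msg)


def send_all(open_session, messages, pause):
    """Send every message, opening a fresh session after each exception.

    Returns the number of sessions that had to be opened.
    """
    session = open_session()
    opened = 1
    for msg in messages:
        while True:
            try:
                session.send(msg)
                break
            except ResourceLimitExceeded:
                pause()                   # back off so consumers can drain the queue
                session = open_session()  # the old session cannot be reused
                opened += 1
    return opened
```

In this model, retrying on the old session would raise `NotAttached` every time, which corresponds to the channel-not-attached lines in the broker log.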

You mention you are mixing versions 0.5.752581-26.el5 and 0.5.752581-34.el5. Can you update everything to -34, just to eliminate that as a possible problem?

Can you try adding these qpidd options: 
   --log-enable=info+ --log-enable=debug+:cluster
and attach the full logs from each of the brokers?

--- Additional comment from aconway on 2009-12-02 08:47:46 EST ---

From frzhang:
    Thank you very much!
      1)  I could not reproduce the following failure, probably because I changed my code too quickly and could not
          roll back to the point that led to the issue.
         "Nov 26 13:20:46 dellpc2 qpidd[2337]: 2009-nov-26 13:20:46 critical 192.168.99.11:2337(READY/error) error 72594429 did not occur on 192.168.99.12:2505
          Nov 26 13:20:46 dellpc2 qpidd[2337]: 2009-nov-26 13:20:46 error Error delivering frames: Aborted by local failure that did not occur on all replicas"

--- Additional comment from frzhang on 2009-12-09 21:27:53 EST ---

Created an attachment (id=377362)
Source code and other documents

Attached is the latest source code to reproduce the issue. The .tgz includes all the source code, an /etc/qpidd.conf, a README.txt describing the code, a run_test.txt showing how to run the test, and some shell scripts to create exchanges and queues. To reproduce the issue:
      1) run the client applications as described in run_test.txt
      2) #> reboot -f      ;forcibly reboot one node and wait for it to start up
      3) #> qpid-cluster   ;check that the status of the cluster is normal
      4) #> reboot -f      ;forcibly reboot the other node; repeat all the steps, and the issue will occur

--- Additional comment from frzhang on 2009-12-09 21:43:31 EST ---

As described by the run_test.txt,  the testing environment (all client and server nodes) is using  MRG 1.2 on RHEL 5.4

--- Additional comment from frzhang on 2009-12-09 22:35:38 EST ---

Created an attachment (id=377366)
complete error logs on both nodes

/var/log/messages from node1 and node2. This time the issue occurred on node2; node1 is still running normally. After several "reboot -f" runs on node1 and node2, node2 left the cluster shortly after starting up. The time of the failure is "Dec 10 11:14:57".

--- Additional comment from aconway on 2009-12-17 09:52:34 EST ---

I can reproduce this problem, I'm working on a solution now.

--- Additional comment from gsim on 2009-12-21 14:21:20 EST ---

Created an attachment (id=379684)
alternate (simpler) reproducer

1. start first node
2. start attached test consumer
3. send 7 messages to test-queue: (e.g. for m in `seq 1 7`; do echo msg_$m; done | ./sender)
4. start second node for cluster
5. send further 3 messages to test-queue: (e.g. for m in `seq 8 10`; do echo msg_$m; done | ./sender)

At this point the nodes are no longer consistent w.r.t the contents of test-queue. The first node has no messages on the queue, the second has three messages on the queue. This inconsistency could be used to engineer a failure such as that reported above.

The issue appears to be that the messages on the queue at the time the second node joins (i.e. all sent messages in the above example, as they are not acquired until the tenth message is received) are delivered to the subscriber on the second node on the next doOutput, even though they were already delivered and the deliveries are recorded. Thus extra delivery records are created on the second node and the delivery ids get out of sync. This in turn means that the acquire fails to acquire all the messages, creating an inconsistency on the queue.
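The mechanism described above can be sketched with a toy model. This is a simplification under stated assumptions, not the broker's actual code: `Node`, `join`, `run`, and in particular the `cursor` field (standing in for whatever per-consumer state was not being replicated) are hypothetical names. Messages are plain integers 1 to 10, and the consumer acquires delivery ids 1 to 10 after the tenth message.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy view of one broker: a queue with a single non-acquiring consumer."""
    log: list = field(default_factory=list)      # every message enqueued, in order
    on_queue: set = field(default_factory=set)   # messages not yet acquired
    records: dict = field(default_factory=dict)  # delivery id -> message
    cursor: int = 0                              # how far into log the consumer has been served
    next_id: int = 1                             # next delivery id to assign

    def enqueue(self, msg):
        self.log.append(msg)
        self.on_queue.add(msg)

    def do_output(self):
        """Deliver everything past the cursor, recording each delivery."""
        while self.cursor < len(self.log):
            self.records[self.next_id] = self.log[self.cursor]
            self.next_id += 1
            self.cursor += 1

    def acquire(self, first_id, last_id):
        """Take the messages named by a range of delivery ids off the queue."""
        for delivery_id in range(first_id, last_id + 1):
            self.on_queue.discard(self.records.get(delivery_id))


def join(existing, replicate_cursor):
    """Build a new member from an existing one, optionally omitting the cursor."""
    new = Node(log=list(existing.log), on_queue=set(existing.on_queue),
               records=dict(existing.records), next_id=existing.next_id)
    if replicate_cursor:
        new.cursor = existing.cursor
    return new


def run(replicate_cursor):
    node1 = Node()
    for msg in range(1, 8):          # 7 messages before the second node joins
        node1.enqueue(msg)
        node1.do_output()
    node2 = join(node1, replicate_cursor)
    for msg in range(8, 11):         # 3 more messages, seen by both nodes
        for node in (node1, node2):
            node.enqueue(msg)
            node.do_output()
    for node in (node1, node2):      # consumer acquires deliveries 1..10
        node.acquire(1, 10)
    return sorted(node1.on_queue), sorted(node2.on_queue)
```

With `replicate_cursor=False` the second node re-delivers messages 1 to 7 under new delivery ids, so ids 8 to 10 point at already-acquired messages there and messages 8 to 10 stay on its queue, while the first node's queue is empty. With `replicate_cursor=True` both queues end up empty.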

--- Additional comment from gsim on 2009-12-21 15:14:33 EST ---

Created an attachment (id=379703)
suggested fix?

--- Additional comment from aconway on 2009-12-22 22:00:32 EST ---

The suggested fix works, committed r893175

--- Additional comment from aconway on 2009-12-23 12:44:04 EST ---

The fix improves the situation but does not eliminate the problem.

The fix solves the test case from #14, and it makes it much harder to reproduce the problem in the test from #10, but eventually I see the same problem (without the fix I see the problem in < 1 minute; with the fix it takes up to 10 minutes to reproduce).

--- Additional comment from aconway on 2009-12-23 16:05:52 EST ---

Adding another patch resolves the problem fully, it has been committed on the mrg_1.1.x branch:

    Backport from r888874, required to complete the fix for 541927
    
    QPID-2253 -  Cluster node shuts down with inconsistent error.
    
    Add a missing memberUpdate on the transition to CATCHUP mode.
    The inconsistent error was caused because the newly updated member
    did not have its membership updated and so was missing a failover
    update message that the existing members sent to a new client.

Comment 2 Alan Conway 2010-09-24 21:08:20 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Cause: Some information about consumers on a queue was not being replicated to new members joining a cluster.

Consequence: After a new member was added to a cluster with active consumers, a cluster member could occasionally fail with "error xxx did not occur on all members".

Fix: Replicate the missing information to new members joining a cluster.

Consequence: The error no longer occurs.

Comment 4 errata-xmlrpc 2010-10-08 01:49:57 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0756.html
