Red Hat Bugzilla – Bug 601236
Persistent cluster problems after reboot -f
Last modified: 2010-10-07 21:49:57 EDT
+++ This bug was initially created as a clone of Bug #541927 +++
Created an attachment (id=374245)
broker 1 log
See the attached message files. When the "max-queue-count" limit was reached, qpidd on node 1 aborted and left the cluster. Node 2 reported a "Channel exception: not-attached: Channel 2 is not attached" error. The consumer client blocked and could not consume any messages. (The issue is easy to reproduce.)
In my opinion, when the "max-queue-count" limit is reached, a "resource-limit-exceeded" exception can be delivered to the sender client so that it can pause for a while and then re-send messages, and the consumer client should always be able to consume messages. It seems the "resource-limit-exceeded" exception was also raised on the qpidd server and caused something to go wrong, so that no message is put onto the channel used by the consumer client. Is this a bug? qpidd on the server is 0.5.752581-26.el5; qpidc used is 0.5.752581-34.el5.
--- Additional comment from email@example.com on 2009-11-27 11:06:23 EST ---
Created an attachment (id=374246)
broker 2 log
--- Additional comment from firstname.lastname@example.org on 2009-11-29 20:48:32 EST ---
Created an attachment (id=374658)
Testing source codes
The code is somewhat complicated, but there is a README.TXT inside explaining what it does.
--- Additional comment from email@example.com on 2009-11-30 09:53:30 EST ---
Please add a detailed step-by-step description of how to reproduce the bug including:
- options to qpidd, how many qpidd running.
- sequence to run clients, with exact command line parameters
--- Additional comment from firstname.lastname@example.org on 2009-11-30 20:20:01 EST ---
Created an attachment (id=374939)
scripts for creating exchanges and queues
--- Additional comment from email@example.com on 2009-11-30 20:21:17 EST ---
Created an attachment (id=374940)
scripts for cleaning up queues and exchanges
--- Additional comment from firstname.lastname@example.org on 2009-11-30 20:24:35 EST ---
Created an attachment (id=374941)
/etc/sysconfig/qpidd on one node
The /etc/sysconfig/qpidd on the other node is almost the same except it has
--- Additional comment from email@example.com on 2009-11-30 20:38:07 EST ---
Created an attachment (id=374942)
steps to run the test
Also, please refer to the README.TXT in the source code for a description of what the example code does.
--- Additional comment from firstname.lastname@example.org on 2009-12-01 12:28:45 EST ---
I am not able to reproduce the problem. I see broker logs similar to your broker 2 log on all brokers: resource-limit-exceeded followed by channel-not-attached.
Note the channel-not-attached error is correct. An exception automatically closes the session, you need to create a new session to continue sending. The channel-not-attached error is the result of trying to use the old session which has been closed.
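The session behaviour described above can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: `Broker`, `Session`, `send_all` and the two exception classes are hypothetical names invented for the illustration and are not the qpid client API. It shows that once an exception has closed a session, further use of that session fails with a not-attached error, and that the sender continues by opening a new session.

```python
class ResourceLimitExceeded(Exception):
    """Hypothetical stand-in for the broker's resource-limit-exceeded error."""


class NotAttached(Exception):
    """Hypothetical stand-in for the not-attached channel error."""


class Broker:
    """Toy broker with a single queue and a maximum queue depth."""

    def __init__(self, limit):
        self.limit = limit
        self.queue = []

    def session(self):
        return Session(self)


class Session:
    """Toy session: any exception detaches it permanently."""

    def __init__(self, broker):
        self.broker = broker
        self.attached = True

    def send(self, msg):
        if not self.attached:
            raise NotAttached("session was closed by an earlier exception")
        if len(self.broker.queue) >= self.broker.limit:
            self.attached = False  # the exception closes the session
            raise ResourceLimitExceeded("queue is full")
        self.broker.queue.append(msg)


def send_all(broker, messages, on_limit):
    """Send every message, opening a new session after each limit error.

    on_limit is a callback that must free space on the queue (for example
    by letting a consumer drain it); otherwise the retry loop never ends.
    Returns the number of sessions that were needed.
    """
    session = broker.session()
    sessions_used = 1
    for msg in messages:
        while True:
            try:
                session.send(msg)
                break
            except ResourceLimitExceeded:
                on_limit()                   # pause / wait for the consumer
                session = broker.session()   # the old session is unusable
                sessions_used += 1
    return sessions_used
```

In this model, retrying on the old session after the limit error would raise `NotAttached`, which mirrors the sequence seen in the broker logs: resource-limit-exceeded followed by channel-not-attached.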
You mention you are mixing versions 0.5.752581-26.el5 and 0.5.752581-34.el5. Can you update everything to -34, just to eliminate that as a possible problem?
Can you try adding these qpidd options:
and attach the full logs from each of the brokers?
--- Additional comment from email@example.com on 2009-12-02 08:47:46 EST ---
Thank you very much! 1) I have not been able to reproduce the following failure again, probably because I changed my code too quickly and could not roll back to the point that led to the issue.
"Nov 26 13:20:46 dellpc2 qpidd: 2009-nov-26 13:20:46 critical 192.168.99.11:2337(READY/error) error 72594429 did not occur on 192.168.99.12:2505
Nov 26 13:20:46 dellpc2 qpidd: 2009-nov-26 13:20:46 error Error delivering frames: Aborted by local failure that did not occur on all replicas"
--- Additional comment from firstname.lastname@example.org on 2009-12-09 21:27:53 EST ---
Created an attachment (id=377362)
Source codes and other documents
Attached is the latest source code to reproduce the issue. The .tgz includes all the source code, an /etc/qpidd.conf, a README.txt describing the code, a run_test.txt showing how to run the test, and some shell scripts to create exchanges and queues. To reproduce the issue:
1) run the client applications according to the description in run_test.txt
2) #> reboot -f ;forcibly reboot one node and wait for it to start up
3) #> qpid-cluster ;check that the status of the cluster is normal
4) #> reboot -f ;forcibly reboot the other node, then repeat the whole sequence until the issue occurs
--- Additional comment from email@example.com on 2009-12-09 21:43:31 EST ---
As described in run_test.txt, the test environment (all client and server nodes) uses MRG 1.2 on RHEL 5.4.
--- Additional comment from firstname.lastname@example.org on 2009-12-09 22:35:38 EST ---
Created an attachment (id=377366)
complete error logs on both nodes
/var/log/messages from node1 and node2. This time the issue occurred on node2; node1 is still running normally. After several "reboot -f" cycles of node1 and node2, node2 left the cluster shortly after starting up. The time of the failure is "Dec 10 11:14:57".
--- Additional comment from email@example.com on 2009-12-17 09:52:34 EST ---
I can reproduce this problem, I'm working on a solution now.
--- Additional comment from firstname.lastname@example.org on 2009-12-21 14:21:20 EST ---
Created an attachment (id=379684)
alternate (simpler) reproducer
1. start first node
2. start attached test consumer
3. send 7 messages to test-queue: (e.g. for m in `seq 1 7`; do echo msg_$m; done | ./sender)
4. start second node for cluster
5. send further 3 messages to test-queue: (e.g. for m in `seq 8 10`; do echo msg_$m; done | ./sender)
At this point the nodes are no longer consistent w.r.t the contents of test-queue. The first node has no messages on the queue, the second has three messages on the queue. This inconsistency could be used to engineer a failure such as that reported above.
The issue appears to be that the messages that are on the queue at the time the second node joins (i.e. all sent messages in the above example, as they are not acquired until the tenth message is received) are delivered to the subscriber on the second node on the next doOutput, even though they were already delivered and recorded. Thus extra delivery records are created on that second node and the delivery ids get out of sync. This means that the acquire fails to acquire all messages, creating an inconsistency on the queue.
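The divergence described above can be reproduced in a toy model. This is a minimal sketch, not broker code: the `Node` class, its fields, and the `replicate_consumer_state` flag are hypothetical names chosen for the illustration, and the model assumes the subscriber receives messages without acquiring them and then acquires delivery ids 1 to 10 in one step.

```python
class Node:
    """Toy cluster member: one queue, one subscriber that does not acquire."""

    def __init__(self):
        self.queue = []         # messages still on the queue (not acquired)
        self.delivered = set()  # messages already sent to the subscriber
        self.records = {}       # delivery id -> message
        self.next_id = 1

    def enqueue(self, msg):
        self.queue.append(msg)

    def do_output(self):
        """Deliver every queued message the subscriber has not seen yet."""
        for msg in self.queue:
            if msg not in self.delivered:
                self.records[self.next_id] = msg
                self.next_id += 1
                self.delivered.add(msg)

    def acquire(self, ids):
        """Remove from the queue the messages named by these delivery ids."""
        for delivery_id in ids:
            msg = self.records.get(delivery_id)
            if msg in self.queue:
                self.queue.remove(msg)

    def snapshot(self, replicate_consumer_state):
        """State handed to a joining member.

        With replicate_consumer_state=False the joiner gets the queue and the
        delivery records but not which messages the subscriber already saw.
        """
        new = Node()
        new.queue = list(self.queue)
        new.records = dict(self.records)
        new.next_id = self.next_id
        if replicate_consumer_state:
            new.delivered = set(self.delivered)
        return new


def simulate(replicate_consumer_state):
    n1 = Node()
    for m in range(1, 8):            # 7 messages sent before the join
        n1.enqueue(f"msg_{m}")
    n1.do_output()                   # delivery ids 1..7 on node 1
    n2 = n1.snapshot(replicate_consumer_state)
    for node in (n1, n2):            # both members see the same events
        for m in range(8, 11):       # 3 further messages
            node.enqueue(f"msg_{m}")
        node.do_output()
        node.acquire(range(1, 11))   # subscriber acquires ids 1..10
    return n1.queue, n2.queue
```

With `simulate(False)` the second node re-delivers msg_1 to msg_7 under ids 8 to 14, so ids 8 to 10 point at messages that are already gone and msg_8 to msg_10 stay on its queue, while the first node's queue is empty. With `simulate(True)` both queues end up empty.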
--- Additional comment from email@example.com on 2009-12-21 15:14:33 EST ---
Created an attachment (id=379703)
--- Additional comment from firstname.lastname@example.org on 2009-12-22 22:00:32 EST ---
The suggested fix works; committed as r893175.
--- Additional comment from email@example.com on 2009-12-23 12:44:04 EST ---
The fix improves the situation but does not eliminate the problem.
The fix solves the test case from #14, and it makes it much harder to reproduce the problem in the test from #10, but eventually I see the same problem (without the fix I see the problem in under 1 minute; with the fix it takes up to 10 minutes to reproduce).
--- Additional comment from firstname.lastname@example.org on 2009-12-23 16:05:52 EST ---
Adding another patch resolves the problem fully. It has been committed on the mrg_1.1.x branch:
Backport from r888874, required to complete the fix for 541927
QPID-2253 - Cluster node shuts down with inconsistent error.
Add a missing memberUpdate on the transition to CATCHUP mode.
The inconsistent error was caused because the newly updated member
did not have its membership updated and so was missing a failover
update message that the existing members sent to a new client.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Cause: Some information about consumers on a queue was not being replicated to new members joining a cluster.
Consequence: After a new member was added to a cluster with active consumers, a cluster member could occasionally fail with "error xxx did not occur on all members".
Fix: Replicate the missing information to new members joining a cluster.
Result: The error no longer occurs.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.