Created attachment 374245 [details] broker 1 log

See the attached message files. When the "max-queue-count" limit is reached, qpidd on node 1 aborts and leaves the cluster. Node 2 reports a "Channel exception: not-attached: Channel 2 is not attached" error. The consumer client then blocks and cannot consume any messages. (The issue is easy to reproduce.)

In my opinion, when the "max-queue-count" limit is reached, a "resource-limit-exceeded" exception should be delivered to the sender client so that it can pause for a while and re-send its messages; the consumer client should always be able to consume messages. It seems the "resource-limit-exceeded" was also raised inside qpidd itself and caused something to go wrong, so that no messages are put onto the channel used by the consumer client. Is this a bug?

qpidd on the server is 0.5.752581-26.el5; qpidc used is 0.5.752581-34.el5.
Created attachment 374246 [details] broker 2 log
Created attachment 374658 [details] Testing source code

The code is somewhat complicated, but there is a README.TXT inside explaining what it does.
Please add a detailed step-by-step description of how to reproduce the bug, including:
- options to qpidd, and how many qpidd instances are running
- the sequence in which to run the clients, with exact command line parameters
Created attachment 374939 [details] scripts for creating exchanges and queues
Created attachment 374940 [details] scripts for cleaning up queues and exchanges
Created attachment 374941 [details] /etc/sysconfig/qpidd on one node

The /etc/sysconfig/qpidd on the other node is almost the same, except that it has --cluster-url=amqp:tcp:192.168.100.152:5672
Created attachment 374942 [details] steps to run the test Also please refer to the README.TXT in the source codes about what the example codes do
I am not able to reproduce the problem. I see broker logs similar to your broker 2 log on all brokers: resource-limit-exceeded followed by channel-not-attached.

Note that the channel-not-attached error is correct. An exception automatically closes the session; you need to create a new session to continue sending. The channel-not-attached error is the result of trying to use the old session, which has already been closed.

You mention you are mixing versions 0.5.752581-26.el5 and 0.5.752581-34.el5. Can you update everything to -34, just to eliminate that as a possible problem?

Can you also try adding these qpidd options:

--log-enable=info+ --log-enable=debug+:cluster

and attach the full logs from each of the brokers?
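The recover-and-resend pattern described above (pause, open a new session, retry) can be sketched library-agnostically. This is a minimal simulation, not the qpid client API: `FakeBroker`, `FakeSession`, `ResourceLimitExceeded`, and the retry parameters are all illustrative stand-ins.

```python
import time

class ResourceLimitExceeded(Exception):
    """Stand-in for the broker's resource-limit-exceeded exception."""

class FakeSession:
    """Illustrative session: rejects sends once the queue limit is hit,
    and, like a real session, is dead after any exception."""
    def __init__(self, broker):
        self.broker, self.closed = broker, False
    def send(self, msg):
        if self.closed:
            raise RuntimeError("not-attached: session already closed")
        if len(self.broker.queue) >= self.broker.max_queue_count:
            self.closed = True              # an exception closes the session
            raise ResourceLimitExceeded()
        self.broker.queue.append(msg)

class FakeBroker:
    def __init__(self, max_queue_count):
        self.queue, self.max_queue_count = [], max_queue_count
    def new_session(self):
        return FakeSession(self)

def send_with_retry(broker, msgs, delay=0.0, retries=3):
    """On resource-limit-exceeded: back off, open a NEW session, resend."""
    session = broker.new_session()
    for msg in msgs:
        for attempt in range(retries):
            try:
                session.send(msg)
                break
            except ResourceLimitExceeded:
                time.sleep(delay * (2 ** attempt))  # exponential backoff
                broker.queue.pop(0)                 # pretend a consumer drained one
                session = broker.new_session()      # old session is dead; replace it
        else:
            raise RuntimeError("gave up after %d retries" % retries)
    return broker.queue
```

With the real C++ or Python client the shape is the same: catch the session exception, obtain a fresh session from the still-open connection, and resend; reusing the old session is what produces the channel-not-attached error.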
From frzhang: Thank you very much!

1) The following failure could not be reproduced by me, probably because I changed my code too quickly and could not roll back to the point that led to the issue.

"Nov 26 13:20:46 dellpc2 qpidd[2337]: 2009-nov-26 13:20:46 critical 192.168.99.11:2337(READY/error) error 72594429 did not occur on 192.168.99.12:2505
Nov 26 13:20:46 dellpc2 qpidd[2337]: 2009-nov-26 13:20:46 error Error delivering frames: Aborted by local failure that did not occur on all replicas"
Created attachment 377362 [details] Source code and other documents

Attached is the latest source code to reproduce the issue. The .tgz includes all the source code, an /etc/qpidd.conf, a README.txt describing the code, a run_test.txt showing how to run the test, and some shell scripts to create the exchanges and queues.

To reproduce the issue:
1) Run the client applications according to the description in run_test.txt.
2) #> reboot -f    ; forcibly reboot one node and wait for it to start up.
3) #> qpid-cluster ; check that the status of the cluster is normal.
4) #> reboot -f    ; forcibly reboot the other node. Repeat the whole sequence and the issue will occur.
As described by the run_test.txt, the testing environment (all client and server nodes) is using MRG 1.2 on RHEL 5.4
Created attachment 377366 [details] complete error logs on both nodes

/var/log/messages from node1 and node2. This time the issue occurred on node2; node1 is still running normally. After several "reboot -f" cycles of node1 and node2, node2 left the cluster shortly after starting up. The time of the failure is "Dec 10 11:14:57".
I can reproduce this problem, I'm working on a solution now.
Created attachment 379684 [details] alternate (simpler) reproducer

1. Start the first node.
2. Start the attached test consumer.
3. Send 7 messages to test-queue (e.g. for m in `seq 1 7`; do echo msg_$m; done | ./sender).
4. Start the second node of the cluster.
5. Send a further 3 messages to test-queue (e.g. for m in `seq 8 10`; do echo msg_$m; done | ./sender).

At this point the nodes are no longer consistent w.r.t. the contents of test-queue: the first node has no messages on the queue, the second has three. This inconsistency could be used to engineer a failure such as that reported above.

The issue appears to be that the messages on the queue at the time the second node joins (i.e. all sent messages in the above example, as they are not acquired until the tenth message is received) are delivered to the subscriber on the second node on the next doOutput, even though they were already delivered and recorded. Thus extra delivery records are created on that second node and the delivery ids get out of sync. The acquire then fails to acquire all the messages, creating an inconsistency on the queue.
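The delivery-id skew described above can be modeled with a toy bookkeeping sketch. This is an illustration of the mechanism only; `Node`, `deliver`, and `acquire` are hypothetical names, not broker internals.

```python
class Node:
    """Toy model of one cluster node's delivery-record bookkeeping."""
    def __init__(self):
        self.next_id = 1
        self.records = {}            # delivery id -> message
        self.queue = []

    def deliver(self, msg):
        """Create a delivery record and hand the message to the subscriber."""
        self.records[self.next_id] = msg
        self.next_id += 1

    def acquire(self, delivery_id):
        """Acquire (remove from the queue) the message behind a delivery id."""
        msg = self.records[delivery_id]
        if msg in self.queue:
            self.queue.remove(msg)
            return True
        return False                 # id points at an already-gone message

def simulate():
    first, second = Node(), Node()
    # Messages 1-7 arrive and are delivered on the first node before the
    # second node joins; the update transfers queue contents and records.
    for m in range(1, 8):
        first.queue.append(m)
        first.deliver(m)
    second.queue = list(first.queue)
    second.records = dict(first.records)
    second.next_id = first.next_id
    # BUG (pre-fix): on its next doOutput the second node delivers the
    # seven queued messages AGAIN, creating extra delivery records 8-14.
    for m in list(second.queue):
        second.deliver(m)
    # Messages 8-10 arrive; both nodes deliver them, under different ids.
    for m in range(8, 11):
        for node in (first, second):
            node.queue.append(m)
            node.deliver(m)
    # The consumer acquires up to the first node's last id (1..10): the
    # first node drains its queue, the second does not -> inconsistency.
    for did in range(1, first.next_id):
        first.acquire(did)
        second.acquire(did)
    return first.queue, second.queue
```

Running the simulation reproduces the reported divergence: the first node's queue ends up empty while the second node is left holding three messages, because ids 8-10 on the second node point at the re-delivered copies of messages 1-3 rather than at messages 8-10.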
Created attachment 379703 [details] suggested fix?
The suggested fix works; committed r893175.
The fix improves the situation but does not eliminate the problem. It solves the test case from comment #14, and it makes it much harder to reproduce the problem in the test from comment #10, but eventually I see the same problem (without the fix I see the problem in under 1 minute; with the fix it takes up to 10 minutes to reproduce).
Adding another patch resolves the problem fully; it has been committed on the mrg_1.1.x branch:

Backport from r888874, required to complete the fix for 541927 QPID-2253 - Cluster node shuts down with inconsistent error.

Add a missing memberUpdate on the transition to CATCHUP mode. The inconsistent error was caused because the newly updated member did not have its membership updated, and so was missing a failover update message that the existing members sent to a new client.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents: Newly added members now update their membership properly. As a result, new clients no longer miss failover update messages, and no longer cause the cluster to shut down because of inconsistencies between the nodes.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0773.html