Bug 541927 - Persistent cluster problems after reboot -f
Summary: Persistent cluster problems after reboot -f
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.1.6
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: 1.3
Assignee: Alan Conway
QA Contact: Jeff Needle
URL:
Whiteboard:
Depends On:
Blocks: 601236
 
Reported: 2009-11-27 16:05 UTC by Alan Conway
Modified: 2018-10-27 12:50 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Newly added members now update their membership properly. As a result, new clients no longer miss failover update messages, and no longer cause the cluster to shut down because of the inconsistencies between the nodes.
Clone Of:
: 601236 (view as bug list)
Environment:
Last Closed: 2010-10-14 16:02:17 UTC
Target Upstream Version:
Embargoed:


Attachments
broker 1 log (1.03 KB, text/plain)
2009-11-27 16:05 UTC, Alan Conway
no flags Details
broker 2 log (1.08 KB, text/plain)
2009-11-27 16:06 UTC, Alan Conway
no flags Details
Testing source code (7.36 KB, application/x-gzip)
2009-11-30 01:48 UTC, Qianfeng Zhang
no flags Details
scripts for creating exchanges and queues (878 bytes, application/x-shellscript)
2009-12-01 01:20 UTC, Qianfeng Zhang
no flags Details
scripts for cleaning up queues and exchanges (679 bytes, application/x-shellscript)
2009-12-01 01:21 UTC, Qianfeng Zhang
no flags Details
/etc/sysconfig/qpidd on one node (193 bytes, application/octet-stream)
2009-12-01 01:24 UTC, Qianfeng Zhang
no flags Details
steps to run the test (525 bytes, text/plain)
2009-12-01 01:38 UTC, Qianfeng Zhang
no flags Details
Source code and other documents (9.42 KB, application/x-gzip)
2009-12-10 02:27 UTC, Qianfeng Zhang
no flags Details
complete error logs on both nodes (313.07 KB, application/x-gzip)
2009-12-10 03:35 UTC, Qianfeng Zhang
no flags Details
alternate (simpler) reproducer (1.58 KB, text/x-c++src)
2009-12-21 19:21 UTC, Gordon Sim
no flags Details
suggested fix? (2.23 KB, patch)
2009-12-21 20:14 UTC, Gordon Sim
no flags Details | Diff


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2010:0773 0 normal SHIPPED_LIVE Moderate: Red Hat Enterprise MRG Messaging and Grid Version 1.3 2010-10-14 15:56:44 UTC

Description Alan Conway 2009-11-27 16:05:30 UTC
Created attachment 374245 [details]
broker 1 log

See the attached log files. When the "max-queue-count" limit is reached, qpidd on node 1 aborted and left the cluster. Node 2 reported a "Channel exception: not-attached: Channel 2 is not attached" error. The consumer client blocked and could not consume any messages. (The issue is easy to reproduce.)

    In my opinion, when the "max-queue-count" limit is reached, a "resource-limit-exceeded" exception should be delivered to the sending client so that it can pause for a while and then resend its messages; the consumer client should always remain able to consume messages. It seems the "resource-limit-exceeded" was also delivered on the qpidd server and caused something to go wrong there, so that no messages were put onto the channel used by the consumer client. Is this a bug? qpidd on the server is 0.5.752581-26.el5; qpidc used is 0.5.752581-34.el5.

Comment 1 Alan Conway 2009-11-27 16:06:23 UTC
Created attachment 374246 [details]
broker 2 log

Comment 2 Qianfeng Zhang 2009-11-30 01:48:32 UTC
Created attachment 374658 [details]
Testing source code

The code is somewhat complicated, but there is a README.TXT inside explaining what it does.

Comment 3 Alan Conway 2009-11-30 14:53:30 UTC
Please add a detailed step-by-step description of how to reproduce the bug including:
 - options to qpidd, how many qpidd running.
 - sequence to run clients, with exact command line parameters

Comment 4 Qianfeng Zhang 2009-12-01 01:20:01 UTC
Created attachment 374939 [details]
scripts for creating exchanges and queues

scripts for creating exchanges and queues

Comment 5 Qianfeng Zhang 2009-12-01 01:21:17 UTC
Created attachment 374940 [details]
scripts for cleaning up queues and exchanges

scripts for cleaning up queues and exchanges

Comment 6 Qianfeng Zhang 2009-12-01 01:24:35 UTC
Created attachment 374941 [details]
/etc/sysconfig/qpidd on one node

The /etc/sysconfig/qpidd on the other node is almost the same except it has

    --cluster-url=amqp:tcp:192.168.100.152:5672

Comment 7 Qianfeng Zhang 2009-12-01 01:38:07 UTC
Created attachment 374942 [details]
steps to run the test

Also please refer to the README.TXT in the source codes about what the example codes do

Comment 8 Alan Conway 2009-12-01 17:28:45 UTC
I am not able to reproduce the problem. I see broker logs similar to your broker 2 log on all brokers: resource-limit-exceeded followed by channel-not-attached.

Note the channel-not-attached error is correct. An exception automatically closes the session; you need to create a new session to continue sending. The channel-not-attached error is the result of trying to use the old session, which has been closed.

You mention you are mixing versions 0.5.752581-26.el5 and 0.5.752581-34.el5. Can you update everything to -34, just to eliminate that as a possible problem?

Can you try adding these qpidd options: 
   --log-enable=info+ --log-enable=debug+:cluster
and attach the full logs from each of the brokers?
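The advice above (an execution exception detaches the session, so the old channel must be abandoned and a fresh session opened before resending) can be sketched with a small toy model. This is only an illustration of the retry pattern, not the real qpid client API: `MockSession`, `ResourceLimitExceeded`, and `sendAll` are hypothetical stand-ins.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the broker-side limit error; not the qpid API.
struct ResourceLimitExceeded : std::runtime_error {
    ResourceLimitExceeded() : std::runtime_error("resource-limit-exceeded") {}
};

// Toy session: an execution exception detaches it, after which any use
// fails with "not-attached" (the error node 2 reported).
struct MockSession {
    std::vector<std::string>* queue;
    size_t maxCount;
    bool attached;
    MockSession(std::vector<std::string>* q, size_t max)
        : queue(q), maxCount(max), attached(true) {}
    void send(const std::string& msg) {
        if (!attached) throw std::runtime_error("not-attached");
        if (queue->size() >= maxCount) {
            attached = false;  // the exception closes the session
            throw ResourceLimitExceeded();
        }
        queue->push_back(msg);
    }
};

// Sender following the advice in comment 8: on resource-limit-exceeded,
// discard the dead session, open a new one, and retry the message instead
// of reusing the closed channel.
size_t sendAll(std::vector<std::string>& queue, size_t maxCount,
               const std::vector<std::string>& msgs,
               std::vector<std::string>& consumed) {
    MockSession session(&queue, maxCount);
    size_t retries = 0;
    for (size_t i = 0; i < msgs.size();) {
        try {
            session.send(msgs[i]);
            ++i;
        } catch (const ResourceLimitExceeded&) {
            ++retries;
            consumed.push_back(queue.front());        // consumer drains one
            queue.erase(queue.begin());
            session = MockSession(&queue, maxCount);  // fresh session
        }
    }
    return retries;
}
```

With a queue limit of 2 and five messages, the sender hits the limit three times, recreates the session each time, and every message still gets through.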

Comment 9 Alan Conway 2009-12-02 13:47:46 UTC
From frzhang:
    Thank you very much!

    1) The following failure just could not be reproduced by me, probably because I changed my code too quickly and could not roll back to the point that led to the issue:

         "Nov 26 13:20:46 dellpc2 qpidd[2337]: 2009-nov-26 13:20:46 critical 192.168.99.11:2337(READY/error) error 72594429 did not occur on 192.168.99.12:2505
          Nov 26 13:20:46 dellpc2 qpidd[2337]: 2009-nov-26 13:20:46 error Error delivering frames: Aborted by local failure that did not occur on all replicas"

Comment 10 Qianfeng Zhang 2009-12-10 02:27:53 UTC
Created attachment 377362 [details]
Source code and other documents

Attached the latest source code to reproduce the issue. The .tgz includes all the source code, an /etc/qpidd.conf, a README.txt describing the code, a run_test.txt showing how to run the test, and some shell scripts to create exchanges and queues. To reproduce the issue:
      1) run the client applications according to the description in run_test.txt
      2) #> reboot -f      ;forcibly reboot one node and wait for it to start up
      3) #> qpid-cluster   ;check whether the status of the cluster is normal
      4) #> reboot -f      ;forcibly reboot the other node; repeat the whole sequence and the issue will occur

Comment 11 Qianfeng Zhang 2009-12-10 02:43:31 UTC
As described in run_test.txt, the testing environment (all client and server nodes) is using MRG 1.2 on RHEL 5.4.

Comment 12 Qianfeng Zhang 2009-12-10 03:35:38 UTC
Created attachment 377366 [details]
complete error logs on both nodes

/var/log/messages from node1 and node2. This time the issue occurred on node2; node1 is still running normally. After several "reboot -f" cycles of node1 and node2, node2 left the cluster shortly after starting up. The time of the failure is "Dec 10 11:14:57".

Comment 13 Alan Conway 2009-12-17 14:52:34 UTC
I can reproduce this problem; I'm working on a solution now.

Comment 14 Gordon Sim 2009-12-21 19:21:20 UTC
Created attachment 379684 [details]
alternate (simpler) reproducer

1. start first node
2. start attached test consumer
3. send 7 messages to test-queue: (e.g. for m in `seq 1 7`; do echo msg_$m; done | ./sender)
4. start second node for cluster
5. send further 3 messages to test-queue: (e.g. for m in `seq 8 10`; do echo msg_$m; done | ./sender)

At this point the nodes are no longer consistent w.r.t the contents of test-queue. The first node has no messages on the queue, the second has three messages on the queue. This inconsistency could be used to engineer a failure such as that reported above.

The issue appears to be that the messages on the queue at the time the second node joins (i.e. all sent messages in the above example, as they are not acquired until the tenth message is received) are delivered to the subscriber on the second node on the next doOutput, even though they were already delivered and recorded. Thus extra delivery records are created on that second node and the delivery ids get out of sync. This then means that the acquire fails to acquire all messages, creating an inconsistency on the queue.
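The delivery-id drift in this analysis can be modelled with a toy simulation (not qpid code: `Node`, `joinFrom`, and `runScenario` are invented for illustration). A joining node that re-delivers messages already recorded before it joined ends up assigning different delivery ids to the same later messages, so an acquire addressed by id no longer matches on both nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy per-node bookkeeping: each node assigns consecutive delivery ids
// as it hands messages to its local subscriber, recording each delivery.
struct Node {
    size_t nextId = 0;
    std::vector<size_t> records;  // one delivery record per delivery
    void deliver(size_t count) {
        for (size_t i = 0; i < count; ++i) records.push_back(nextId++);
    }
    void joinFrom(const Node& existing) {  // state update when joining
        nextId = existing.nextId;
        records = existing.records;
    }
};

// Comment 14's scenario: 7 messages while node 1 is alone, node 2 joins,
// then 3 more messages delivered on both nodes. 'redeliverOnJoin' mirrors
// the bug: already-recorded messages are delivered again on the new
// node's next doOutput. Returns the id each node gave the final message.
std::pair<size_t, size_t> runScenario(bool redeliverOnJoin) {
    Node n1, n2;
    n1.deliver(7);
    n2.joinFrom(n1);
    if (redeliverOnJoin) n2.deliver(7);  // extra delivery records (the bug)
    n1.deliver(3);
    n2.deliver(3);
    return std::make_pair(n1.records.back(), n2.records.back());
}
```

In the buggy run the two nodes give the final message different delivery ids; with re-delivery suppressed the ids agree on both nodes.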

Comment 15 Gordon Sim 2009-12-21 20:14:33 UTC
Created attachment 379703 [details]
suggested fix?

Comment 16 Alan Conway 2009-12-23 03:00:32 UTC
The suggested fix works, committed as r893175.

Comment 17 Alan Conway 2009-12-23 17:44:04 UTC
The fix improves the situation but does not eliminate the problem.

The fix solves the test case from comment 14, and it makes it much harder to reproduce the problem in the test from comment 10, but eventually I see the same problem. (Without the fix I see the problem in under 1 minute; with the fix it takes up to 10 minutes to reproduce.)

Comment 18 Alan Conway 2009-12-23 21:05:52 UTC
Adding another patch resolves the problem fully; it has been committed on the mrg_1.1.x branch:

    Backport from r888874, required to complete the fix for 541927
    
    QPID-2253 -  Cluster node shuts down with inconsistent error.
    
    Add a missing memberUpdate on the transition to CATCHUP mode.
    The inconsistent error was caused because the newly updated member
    did not have its membership updated and so was missing an failover
    update message that the existing members sent to a new client.
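The effect of the missing memberUpdate can be shown with a toy membership model (not the actual cluster code; `Cluster`, `enterCatchup`, and `sendFailoverUpdate` are hypothetical stand-ins). A joiner absent from the membership map when a failover update goes out never receives it, which is the inconsistency the patch removes.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy membership model: only brokers present in 'membership' receive
// failover updates. A joiner missing from the map at send time never
// sees the update, leaving the nodes inconsistent.
struct Cluster {
    std::set<std::string> membership;
    std::set<std::string> gotFailoverUpdate;

    void memberUpdate(const std::string& id) { membership.insert(id); }

    // The fix: do the membership update as part of entering CATCHUP mode.
    void enterCatchup(const std::string& joiner, bool withFix) {
        if (withFix) memberUpdate(joiner);
    }

    void sendFailoverUpdate() {
        for (std::set<std::string>::const_iterator it = membership.begin();
             it != membership.end(); ++it)
            gotFailoverUpdate.insert(*it);
    }
};
```

Without the memberUpdate the joiner misses the failover update; with it, the joiner receives the update along with the existing members.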

Comment 21 Jaromir Hradilek 2010-10-08 14:06:49 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Newly added members now update their membership properly. As a result, new clients no longer miss failover update messages, and no longer cause the cluster to shut down because of the inconsistencies between the nodes.

Comment 23 errata-xmlrpc 2010-10-14 16:02:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html

