Bug 541927

Summary: Persistent cluster problems after reboot -f
Product: Red Hat Enterprise MRG
Component: qpid-cpp
Reporter: Alan Conway <aconway>
Assignee: Alan Conway <aconway>
QA Contact: Jeff Needle <jneedle>
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 1.1.6
Target Milestone: 1.3
Hardware: All
OS: Linux
CC: freznice, frzhang, iboverma, jneedle, mcressma, tao
Doc Type: Bug Fix
Doc Text:
Newly added members now update their membership properly. As a result, new clients no longer miss failover update messages, and no longer cause the cluster to shut down because of the inconsistencies between the nodes.
Clones: 601236 (view as bug list)
Bug Blocks: 601236
Last Closed: 2010-10-14 16:02:17 UTC
Attachments:
- broker 1 log
- broker 2 log
- Testing source codes
- scripts for creating exchanges and queues
- scripts for cleaning up queues and exchanges
- /etc/sysconfig/qpidd on one node
- steps to run the test
- Source codes and other documents
- complete error logs on both nodes
- alternate (simpler) reproducer
- suggested fix?

Description Alan Conway 2009-11-27 16:05:30 UTC
Created attachment 374245 [details]
broker 1 log

See the attached message files. When the "max-queue-count" limit is reached, qpidd on node 1 aborted and left the cluster. Node 2 reported a "Channel exception: not-attached: Channel 2 is not attached" error. The consumer client blocked and could not consume any messages. (The issue is easy to reproduce.)

    In my opinion, when the "max-queue-count" limit is reached, a "resource-limit-exceeded" exception should be delivered to the sender client so that it can pause for a while and resend the messages, and the consumer client should always be able to consume messages. It seems the "resource-limit-exceeded" was also raised on the qpidd server and broke something there, so that no messages were put onto the channel used by the consumer client. Is this a bug? qpidd on the server is 0.5.752581-26.el5; qpidc used is 0.5.752581-34.el5.

Comment 1 Alan Conway 2009-11-27 16:06:23 UTC
Created attachment 374246 [details]
broker 2 log

Comment 2 Qianfeng Zhang 2009-11-30 01:48:32 UTC
Created attachment 374658 [details]
Testing source codes

The code is somewhat complicated, but there is a README.TXT inside explaining what it does.

Comment 3 Alan Conway 2009-11-30 14:53:30 UTC
Please add a detailed step-by-step description of how to reproduce the bug including:
 - options to qpidd, how many qpidd running.
 - sequence to run clients, with exact command line parameters

Comment 4 Qianfeng Zhang 2009-12-01 01:20:01 UTC
Created attachment 374939 [details]
scripts for creating exchanges and queues

scripts for creating exchanges and queues

Comment 5 Qianfeng Zhang 2009-12-01 01:21:17 UTC
Created attachment 374940 [details]
scripts for cleaning up queues and exchanges

scripts for cleaning up queues and exchanges

Comment 6 Qianfeng Zhang 2009-12-01 01:24:35 UTC
Created attachment 374941 [details]
/etc/sysconfig/qpidd on one node

The /etc/sysconfig/qpidd on the other node is almost the same, except that it has

    --cluster-url=amqp:tcp:192.168.100.152:5672

Comment 7 Qianfeng Zhang 2009-12-01 01:38:07 UTC
Created attachment 374942 [details]
steps to run the test

Also, please refer to the README.TXT in the source code for what the example code does.

Comment 8 Alan Conway 2009-12-01 17:28:45 UTC
I am not able to reproduce the problem. I see broker logs similar to your broker 2 log on all brokers: resource-limit-exceeded followed by channel-not-attached.

Note the channel-not-attached error is correct. An exception automatically closes the session, you need to create a new session to continue sending. The channel-not-attached error is the result of trying to use the old session which has been closed.
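The session life-cycle described above can be sketched with a small self-contained model. This is not the qpid client API, just a toy simulation (the Broker, Session, and max_queue_count names are illustrative): the first exception detaches the session, any further use of that session raises not-attached, and only a freshly created session works again.

```python
# Toy model (not the qpid API) of why channel-not-attached follows
# resource-limit-exceeded: the first exception detaches the session,
# so reusing that session fails until the client opens a new one.

class ResourceLimitExceeded(Exception):
    pass

class NotAttached(Exception):
    pass

class Broker:
    """Holds the queue; max_queue_count stands in for qpidd's limit."""
    def __init__(self, max_queue_count):
        self.max_queue_count = max_queue_count
        self.queue = []

class Session:
    def __init__(self, broker):
        self.broker = broker
        self.attached = True

    def send(self, msg):
        if not self.attached:
            raise NotAttached("channel is not attached")
        if len(self.broker.queue) >= self.broker.max_queue_count:
            self.attached = False          # the exception closes this session
            raise ResourceLimitExceeded("max-queue-count reached")
        self.broker.queue.append(msg)

    def receive(self):
        if not self.attached:
            raise NotAttached("channel is not attached")
        return self.broker.queue.pop(0)

def send_all(broker, messages):
    """A sender that recovers by opening a fresh session each time."""
    session = Session(broker)
    for msg in messages:
        while True:
            try:
                session.send(msg)
                break
            except ResourceLimitExceeded:
                Session(broker).receive()  # stand-in for a consumer draining one message
                session = Session(broker)  # old session is dead; open a new one
```

A sender that instead retried on the old session would only ever see NotAttached, which matches the broker 2 log.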

You mention you are mixing versions 0.5.752581-26.el5 and 0.5.752581-34.el5. Can you update everything to -34, just to eliminate that as a possible cause?

Can you try adding these qpidd options: 
   --log-enable=info+ --log-enable=debug+:cluster
and attach the full logs from each of the brokers?

Comment 9 Alan Conway 2009-12-02 13:47:46 UTC
From frzhang:
    Thank you very much!

    1) I could not reproduce the following failure. Probably because I changed my code too quickly and could not roll back to the point that led to the issue:

         "Nov 26 13:20:46 dellpc2 qpidd[2337]: 2009-nov-26 13:20:46 critical 192.168.99.11:2337(READY/error) error 72594429 did not occur on 192.168.99.12:2505
          Nov 26 13:20:46 dellpc2 qpidd[2337]: 2009-nov-26 13:20:46 error Error delivering frames: Aborted by local failure that did not occur on all replicas"

Comment 10 Qianfeng Zhang 2009-12-10 02:27:53 UTC
Created attachment 377362 [details]
Source codes and  other documents 

Attached is the latest source code to reproduce the issue. The .tgz includes all the source code, an /etc/qpidd.conf, a README.txt describing the code, a run_test.txt showing how to run the test, and some shell scripts to create exchanges and queues. To reproduce the issue:
      1) run the client applications according to the description in run_test.txt
      2) #> reboot -f      ;forcibly reboot one node and wait for it to start up
      3) #> qpid-cluster   ;check whether the status of the cluster is normal
      4) #> reboot -f      ;forcibly reboot the other node; repeat the whole sequence and the issue will occur

Comment 11 Qianfeng Zhang 2009-12-10 02:43:31 UTC
As described in run_test.txt, the testing environment (all client and server nodes) uses MRG 1.2 on RHEL 5.4.

Comment 12 Qianfeng Zhang 2009-12-10 03:35:38 UTC
Created attachment 377366 [details]
complete error logs on both nodes

/var/log/messages from node1 and node2. This time the issue occurred on node2, while node1 was still running normally. After several "reboot -f" cycles of node1 and node2, node2 left the cluster shortly after starting up. The time of the failure is "Dec 10 11:14:57".

Comment 13 Alan Conway 2009-12-17 14:52:34 UTC
I can reproduce this problem, I'm working on a solution now.

Comment 14 Gordon Sim 2009-12-21 19:21:20 UTC
Created attachment 379684 [details]
alternate (simpler) reproducer

1. start first node
2. start attached test consumer
3. send 7 messages to test-queue: (e.g. for m in `seq 1 7`; do echo msg_$m; done | ./sender)
4. start second node for cluster
5. send further 3 messages to test-queue: (e.g. for m in `seq 8 10`; do echo msg_$m; done | ./sender)

At this point the nodes are no longer consistent w.r.t. the contents of test-queue. The first node has no messages on the queue, the second has three messages on the queue. This inconsistency could be used to engineer a failure such as that reported above.

The issue appears to be that the messages that are on the queue at the time the second node joins (i.e. all sent messages in the above example, as they are not acquired until the tenth message is received) are delivered to the subscriber on the second node on the next doOutput, even though they were already delivered and recorded. Thus extra delivery records are created on that second node and the delivery ids get out of sync. This then means that the acquire fails to acquire all the messages, creating an inconsistency on the queue.
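The desync can be illustrated with a toy model (this is not qpid code; Node, publish, and acquire_up_to are illustrative names). On update the joining node receives the queue contents and the existing delivery records; the bug is that it then delivers the still-queued messages a second time, creating duplicate records with fresh delivery ids, so acquiring "up to" a given id covers different messages on each node.

```python
# Toy model of the delivery-id desync: the buggy join re-delivers
# messages that were already delivered and recorded during the update.

class Node:
    def __init__(self):
        self.queue = []       # messages not yet acquired
        self.records = []     # (delivery_id, message) pairs
        self.next_id = 1

    def deliver(self, msg):
        self.records.append((self.next_id, msg))
        self.next_id += 1

    def publish(self, msg):
        self.queue.append(msg)
        self.deliver(msg)

    def acquire_up_to(self, delivery_id):
        # the consumer acquires every message recorded at or below this id
        acquired = {m for i, m in self.records if i <= delivery_id}
        self.queue = [m for m in self.queue if m not in acquired]

def join(existing, buggy):
    new = Node()
    # update: the new node receives the queue and the delivery records
    new.queue = list(existing.queue)
    new.records = list(existing.records)
    new.next_id = existing.next_id
    if buggy:
        # bug: still-queued messages are delivered *again*, creating
        # duplicate records with fresh delivery ids
        for m in list(new.queue):
            new.deliver(m)
    return new
```

Replaying the reproducer (7 messages, join, 3 more messages, acquire up to id 10) leaves the first node's queue empty and three messages stranded on the second node, matching the observation above.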

Comment 15 Gordon Sim 2009-12-21 20:14:33 UTC
Created attachment 379703 [details]
suggested fix?

Comment 16 Alan Conway 2009-12-23 03:00:32 UTC
The suggested fix works; committed as r893175.

Comment 17 Alan Conway 2009-12-23 17:44:04 UTC
The fix improves the situation but does not eliminate the problem.

The fix solves the test case from comment 14, and it makes it much harder to reproduce the problem in the test from comment 10, but eventually I see the same problem (without the fix I see the problem in under 1 minute; with the fix it takes up to 10 minutes to reproduce).

Comment 18 Alan Conway 2009-12-23 21:05:52 UTC
Adding another patch resolves the problem fully; it has been committed on the mrg_1.1.x branch:

    Backport from r888874, required to complete the fix for 541927
    
    QPID-2253 -  Cluster node shuts down with inconsistent error.
    
    Add a missing memberUpdate on the transition to CATCHUP mode.
    The inconsistent error was caused because the newly updated member
    did not have its membership updated and so was missing a failover
    update message that the existing members sent to a new client.
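The mechanism in the commit message can be sketched as a toy model (not qpid code; Member, member_update, and the flag name are illustrative): a joining node that skips the membership update on its CATCHUP transition has a stale view, so it sends a new client fewer failover updates than every other node does, and that mismatch surfaces as the inconsistent error.

```python
# Toy model of the fix: skipping memberUpdate on CATCHUP leaves the
# newcomer's membership stale, so its failover update to a new client
# differs from every other node's, and the cluster flags the mismatch.

class Member:
    def __init__(self, name):
        self.name = name
        self.membership = set()

    def member_update(self, names):
        self.membership = set(names)

    def failover_update_size(self):
        # each node tells a new client about every member it knows of
        return len(self.membership)

def form_cluster(names):
    cluster = [Member(n) for n in names]
    for m in cluster:
        m.member_update(names)
    return cluster

def join(cluster, name, member_update_on_catchup):
    newcomer = Member(name)
    cluster.append(newcomer)
    names = {m.name for m in cluster}
    for m in cluster[:-1]:
        m.member_update(names)       # existing members always learn of the join
    if member_update_on_catchup:     # the missing call the patch adds
        newcomer.member_update(names)
    return newcomer

def consistent(cluster):
    # all nodes must send identical failover updates, or one shuts down
    return len({m.failover_update_size() for m in cluster}) == 1
```

With the flag off the newcomer reports zero known members while its peers report three, reproducing the inconsistency; with the flag on, all nodes agree.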

Comment 21 Jaromir Hradilek 2010-10-08 14:06:49 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Newly added members now update their membership properly. As a result, new clients no longer miss failover update messages, and no longer cause the cluster to shut down because of the inconsistencies between the nodes.

Comment 23 errata-xmlrpc 2010-10-14 16:02:17 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0773.html