Bug 684783 - Clustered QPID Broker crash
Summary: Clustered QPID Broker crash
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: 1.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Alan Conway
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2011-03-14 13:48 UTC by Lester Claudio
Modified: 2018-11-14 13:53 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-11 20:00:32 UTC
Target Upstream Version:


Attachments
Error log file for clustered VM (5.27 KB, text/plain)
2011-03-14 13:48 UTC, Lester Claudio

Description Lester Claudio 2011-03-14 13:48:36 UTC
Created attachment 484188
Error log file for clustered VM

Description of problem:
There seems to be an issue with running a broker cluster on two separate VM guests.

We ran the qpid daemon process on two separate virtual machine guests. The brokers were clustered using cluster-name="MRGVMCluster". We left the brokers up and running overnight. After a while, one of the clustered brokers shut down because it thought it was the last broker standing.

From the log:

2011-03-13 11:05:47 warning LinkRegistry timer woken up 516ms late
2011-03-13 11:05:50 trace cluster(192.168.0.157:4072 READY) DLVR: Frame[BEbe; channel=0; {ClusterConfigChangeBody: members=\x9D\x00\xA8\xC0\x00\x00\x0F\xE8; joined=\x9E\x00\xA8\xC0\x00\x00\x10Z; left=; }] control 192.168.0.157:4072-0
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) configuration change: 192.168.0.157:4072
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) Members left: 192.168.0.158:4186
2011-03-13 11:05:50 info cluster(192.168.0.157:4072 READY) member update: 192.168.0.157:4072(member) frameSeq=1013
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) last broker standing, update queue policies
2011-03-13 11:05:50 trace cluster(192.168.0.157:4072 READY) DLVR: Frame[BEbe; channel=0; {ClusterConfigChangeBody: members=; joined=\x9D\x00\xA8\xC0\x00\x00\x0F\xE8; left=; }] control 192.168.0.157:4072-0
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) configuration change:
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) Members left: 192.168.0.157:4072
2011-03-13 11:05:50 info cluster(192.168.0.157:4072 READY) member update: frameSeq=1013
2011-03-13 11:05:54 trace Sending cluster timer wakeup ManagementAgent::periodicProcessing
2011-03-13 11:05:54 trace MCAST Event[192.168.0.157:4072-0 Frame[BEbe; channel=0; {ClusterTimerWakeupBody: name=ManagementAgent::periodicProcessing; }]]
2011-03-13 11:05:54 debug Exception constructed: Cannot mcast to CPG group "MRGVMCluster": not exist (12)
2011-03-13 11:05:54 critical Multicast error: Cannot mcast to CPG group "MRGVMCluster": not exist (12)
2011-03-13 11:05:54 notice cluster(192.168.0.157:4072 LEFT) leaving cluster "MRGVMCluster"
2011-03-13 11:05:54 debug Shutting down CPG
2011-03-13 11:05:54 notice Shut down

Version-Release number of selected component (if applicable):
MRG 1.3.2

How reproducible:
Can be reproduced by standing up two separate VMs and running the brokers with a similar configuration:

auth=no
trace=yes
debug=yes

pid-dir=/var/local/qpid-clustering/broker1
data-dir=/var/local/qpid-clustering/broker1
log-output=/var/local/qpid-clustering/broker1/qpidd.log
cluster-name="MRGVMCluster"
cluster-size=2


Steps to Reproduce:
1. Create 2 VM guests
2. Install MRG 1.3.2 bits
3. Run a clustered broker on each VM using the above qpidd.conf configuration (example startup commands are sketched after these steps).
4. Leave the brokers running for a long time ( > 8 hours )
5. One of the brokers will eventually shut down
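
For reference, startup on each guest is just a matter of pointing qpidd at the configuration above. The following is only a sketch: paths and file names are placeholders, and it assumes openais is already up on both guests before the brokers are started.

# On each VM guest (broker1 shown; adjust the directories for the second broker)
mkdir -p /var/local/qpid-clustering/broker1
service openais start                      # cluster membership layer must be running first
qpidd --config /etc/qpidd.conf --daemon    # qpidd.conf holds the settings shown above

# Confirm both members have joined before leaving the brokers unattended
qpid-cluster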
  
Actual results:
One of the brokers shuts down due to management heartbeat traffic.

Expected results:
Broker should not go down.

Additional info:

If you add the following to the qpidd.conf file, the brokers appear to behave as expected. The issue with this is that we can't use qpid-tool or qpid-config.

# Test if management is the culprit for VM crash
mgmt-enable=no

Comment 1 Alan Conway 2011-03-14 19:03:17 UTC
From the log messages posted above it appears the broker shut down due to problems communicating with CPG, not from management traffic. Was there anything in the openais or corosync logs?

There have been several bugs recently fixed for issues with management and clustering. They all involve a broker exiting with a critical message containing "invalid argument". Was this seen in any of the broker logs? All the fixes will be in the coming MRG 2.0 release.
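
For anyone checking the existing logs, both questions can be covered with a quick grep (a sketch only; the broker log path is taken from the configuration in the description, and openais/corosync normally log to syslog unless configured otherwise):

# cluster membership layer
grep -iE 'totem|cpg|openais|corosync' /var/log/messages

# broker-side failure mode mentioned above
grep -i 'invalid argument' /var/local/qpid-clustering/broker1/qpidd.log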

Comment 2 Alan Conway 2011-03-15 14:38:54 UTC
From Steve Dake:

[Does corosync work on VMs?]

Yes, but we have some speculation that there may be problems with single-CPU
VMs and deadlock (deadlock triggered by running corosync at realtime
priority). Another possibility could be an OOM situation.

Ask them to rerun the tests with cpu=2 for the VMs.

Ask them to rerun the tests with the current CPU settings but with the -p
command-line option to corosync:

corosync: 
usage:
        -f     : Start application in foreground.
        -p     : Do not set process priority.
        -v     : Display version and SVN revision of Corosync and exit.
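
Put concretely, the two reruns could look like this (a sketch only; it assumes the guests are managed by libvirt, so the domain name is a placeholder):

# 1) Give each guest a second vCPU, then restart it
virsh edit mrg-vm-1            # set <vcpu>2</vcpu> in the domain XML

# 2) Or keep one vCPU but run the cluster stack without realtime priority
corosync -f -p                 # foreground, no process priority (per the usage above)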

Comment 4 Lester Claudio 2011-03-16 19:11:52 UTC
Alan, 
   We have version openais-0.80.6-28.el5 for openais. I will look into how the VMs were created and get back to you. As it stands now, even when I put mgmt-pub-interval=15 in the configuration, the brokers send management traffic every second.

I have the logs, etc., but they are really big. I will try to post them in a minute.

Comment 5 Alan Conway 2011-03-23 15:53:15 UTC
Are there any active clients connected to the broker? Are you running sesame or some other management tool that might be making requests every second? In your test scenario, what clients (including management tools or agents) are running?

Can you give some step-by-step details of how I can reproduce this: the VM setup, the qpidd config, openais config, any running services (e.g. sesame) that interact with qpidd and command lines for any clients you run during the test.

Comment 6 Lester Claudio 2011-03-23 16:32:52 UTC
Alan,
Q: Are there any active clients connected to the broker?
    
Answer: We initially run qpid-tool and qpid-cluster to make sure the cluster is up and running. After that we leave the brokers alone.  
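
For context, that check is a one-shot connection to the broker (the host is a placeholder; the port matches the qpidd.conf shown further down):

qpid-cluster localhost:5681    # prints the cluster name and the member brokers
qpid-tool localhost:5681       # interactive; e.g. 'list broker', then 'quit'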

Q: Are you running sesame or some other management tool that might be making requests every second?

Answer: We are not running any management tools.

Q: In your test scenario, what clients (including management tools or agents) are running?

Answer: No clients are connected to these brokers.

The qpidd.conf file looks like this:

# "qpidd --help" or "man qpidd" for more details.
cluster-mechanism=ANONYMOUS
port=5681
auth=no
cluster-name="poccluster"
cluster-size=2

log-to-file=/tmp/qpid-clustering/bora/qpidd.log
data-dir=/tmp/qpid-clustering/bora/

trace=yes
debug=yes
log-enable=info+
log-enable=debug+:cluster
log-enable=trace+:management
mgmt-pub-interval=15


The openais.conf:

# Please read the openais.conf.5 manual page

totem {
	version: 2
	secauth: off
	threads: 0
	interface {
		ringnumber: 0
		bindnetaddr: 192.168.53.0
		mcastaddr: 226.94.1.1
		mcastport: 5405
	}
}

# This directive controls how OpenAIS logs its messages. All variables here
# are optional.
logging {
        # These three options control where log messages are sent. Logs can be
        # sent to two or all three. The three options are: 'to_file',
        # 'to_syslog' and 'to_stderr'. All three can either be 'yes' or 'no'.
        # When set to 'yes', logs are sent to the relative destination. The
        # default is to write to the syslog and to stderr.
        #
        # In version 0.70, there were the following options:
        # logoutput: syslog
        # logoutput: stderr
        # logoutput: file
        # NOTE: These are deprecated and should not be used.

        # Default is 'no'. If set, 'logfile' must be set below.
        to_file: yes

        # Default is 'yes'.
        to_syslog: yes

        # Default is 'yes'.
        to_stderr: yes

        # When 'logoutput: file' is set, this is required. It is the full
        # path and file name to write the logs to.
        logfile: /var/log/bova-ais.log

        # Setting this to 'on', the default, generates a lot of debug messages
        # in the log. It is generally not advised unless you are tracing a
        # specific bug.
        debug: on

        # Setting this to 'on', the default, will add a time stamp to log
        # messages.
        timestamp: on

        # Setting this to 'on' will replace the logger name in the log entries
        # with the file and line generating the log entry. The default is
        # 'off'.
        fileline: off

        # When writing to syslog, this sets the syslog facility to use. Valid
        # options are:
        # daemon, local0, local1, local2, local3, local4, local5, local6 and
        # local7
        # The default is 'daemon'.
        syslog_facility: daemon

        # This is an optional directive that controls detailed logging
        # features. Generally, this is only needed by developers.
        #logger {
                # This specifies the identity logging being specified.
                # MADI: What?
                #ident: ?

                # This enables or disables debug log messages for the component
                # identified above. The default is 'off'.
                #debug: off

                # This specifies which tags should be logged for this
                # component. This is only valid when debug is enabled above.
                # Multiple tags are specified with a pipe (|) as the logical
                # OR separator. The default is 'none'.
                #tags: enter|return|trace1|trace2
        #}
}


amf {
	mode: disabled
}


I will update the BZ with the VM setup.

Lester

Comment 13 Alan Conway 2011-05-02 18:13:59 UTC
The failure in their logs was due to a network partition or "split brain". I am working with the customer to set up a qpid cluster that uses CMAN to manage a quorum and avoid this condition.
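
For the record, a rough sketch of that setup (it assumes the MRG broker build exposes the cluster-cman option and that CMAN is configured identically on both nodes; node names are placeholders, and quorum votes/fencing still have to be chosen for the real deployment):

/etc/cluster/cluster.conf (skeleton):

<cluster name="poccluster" config_version="1">
  <clusternodes>
    <clusternode name="vm-bora" nodeid="1" votes="1"/>
    <clusternode name="vm-bova" nodeid="2" votes="1"/>
  </clusternodes>
</cluster>

qpidd.conf addition (assumed option name from the MRG cluster module):

cluster-cman=yes

Check quorum state before starting the brokers:

cman_tool status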

