Created attachment 484188 [details]
Error log file for clustered VM

Description of problem:
There appears to be an issue with running a broker cluster on two separate VM guests. We ran the qpid daemon process on two separate virtual machine guests, clustered using cluster-name="MRGVMCluster", and left the brokers up and running overnight. After a while, one of the clustered brokers shuts down because it thinks it is the last broker standing. From the log:

2011-03-13 11:05:47 warning LinkRegistry timer woken up 516ms late
2011-03-13 11:05:50 trace cluster(192.168.0.157:4072 READY) DLVR: Frame[BEbe; channel=0; {ClusterConfigChangeBody: members=\x9D\x00\xA8\xC0\x00\x00\x0F\xE8; joined=\x9E\x00\xA8\xC0\x00\x00\x10Z; left=; }] control 192.168.0.157:4072-0
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) configuration change: 192.168.0.157:4072
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) Members left: 192.168.0.158:4186
2011-03-13 11:05:50 info cluster(192.168.0.157:4072 READY) member update: 192.168.0.157:4072(member) frameSeq=1013
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) last broker standing, update queue policies
2011-03-13 11:05:50 trace cluster(192.168.0.157:4072 READY) DLVR: Frame[BEbe; channel=0; {ClusterConfigChangeBody: members=; joined=\x9D\x00\xA8\xC0\x00\x00\x0F\xE8; left=; }] control 192.168.0.157:4072-0
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) configuration change:
2011-03-13 11:05:50 notice cluster(192.168.0.157:4072 READY) Members left: 192.168.0.157:4072
2011-03-13 11:05:50 info cluster(192.168.0.157:4072 READY) member update: frameSeq=1013
2011-03-13 11:05:54 trace Sending cluster timer wakeup ManagementAgent::periodicProcessing
2011-03-13 11:05:54 trace MCAST Event[192.168.0.157:4072-0 Frame[BEbe; channel=0; {ClusterTimerWakeupBody: name=ManagementAgent::periodicProcessing; }]]
2011-03-13 11:05:54 debug Exception constructed: Cannot mcast to CPG group "MRGVMCluster": not exist (12)
2011-03-13 11:05:54 critical Multicast error: Cannot mcast to CPG group "MRGVMCluster": not exist (12)
2011-03-13 11:05:54 notice cluster(192.168.0.157:4072 LEFT) leaving cluster "MRGVMCluster"
2011-03-13 11:05:54 debug Shutting down CPG
2011-03-13 11:05:54 notice Shut down

Version-Release number of selected component (if applicable):
MRG 1.3.2

How reproducible:
Can be reproduced by standing up two separate VMs and running the brokers with a similar configuration:

auth=no
trace=yes
debug=yes
pid-dir=/var/local/qpid-clustering/broker1
data-dir=/var/local/qpid-clustering/broker1
log-output=/var/local/qpid-clustering/broker1/qpidd.log
cluster-name="MRGVMCluster"
cluster-size=2

Steps to Reproduce:
1. Create 2 VM guests.
2. Install the MRG 1.3.2 bits.
3. Run a clustered broker on each VM using the above qpidd.conf configuration.
4. Leave the brokers running for a long time (> 8 hours).
5. One of the brokers will eventually shut down.

Actual results:
One of the brokers shuts down due to management heartbeat traffic.

Expected results:
The broker should not go down.

Additional info:
If the following is added to the qpidd.conf file, the brokers appear to behave as expected. The problem with this workaround is that we can't use qpid-tool or qpid-config.

# Test if management is the culprit for VM crash
mgmt-enable=no
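For reference, a minimal sketch of how the brokers are brought up and checked on each guest; the config path and broker address are assumptions (qpidd reads /etc/qpidd.conf by default), and openais must already be running for the cluster plugin to join:

# Start the cluster infrastructure first, then the broker (config path assumed)
service openais start
qpidd --config /etc/qpidd.conf --daemon

# Confirm both members have joined the cluster (default broker address assumed)
qpid-cluster localhost:5672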
From the log messages posted above, it appears the broker shut down due to problems communicating with CPG, not because of management traffic. Was there anything in the openais or corosync logs? There have been several bugs fixed recently for issues with management and clustering. They all involve a broker exiting with a critical message containing "invalid argument". Was this seen in any of the broker logs? All of the fixes will be in the upcoming MRG 2.0 release.
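If it helps, a quick way to check for that is something along these lines (log paths taken from the configs in this report; adjust to your setup):

grep -i "invalid argument" /var/local/qpid-clustering/broker1/qpidd.log
grep -iE "openais|corosync" /var/log/messages | grep -iE "error|fail"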
From Steve Dake:
[Does corosync work on VMs?] Yes, but we suspect there may be problems with single-CPU VMs and deadlock (deadlock triggered by running corosync at realtime priority). Another possibility is an OOM situation. Ask them to rerun the tests with cpu=2 for the VMs. Also ask them to rerun with the current CPU settings but with the -p command line option to corosync:

corosync usage:
 -f : Start application in foreground.
 -p : Do not set process priority.
 -v : Display version and SVN revision of Corosync and exit.
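A rough sketch of rerunning the test along those lines; the corosync options come from the usage text above, while the libvirt command and guest name are assumptions about how the VMs are managed:

# Run corosync in the foreground without setting realtime priority
corosync -f -p

# Give each guest a second vCPU (libvirt example; takes effect on next boot)
virsh setvcpus <guest-name> 2 --config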
Alan,

We have version openais-0.80.6-28.el5 for openais. I will look into how the VMs were created and get back to you. As it stands now, even with mgmt-pub-interval=15 in the configuration, the brokers send management traffic every second. I have the logs etc., but they are really big. I will try to post them in a minute.
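One crude way to confirm how often management updates are actually being published, assuming management trace logging is enabled as in the configs in this report, is to compare the timestamps of successive periodicProcessing entries in the broker log (path taken from the description; adjust to your setup):

grep "ManagementAgent::periodicProcessing" /var/local/qpid-clustering/broker1/qpidd.log | tail -20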
Are there any active clients connected to the broker? Are you running sesame or some other management tool that might be making requests every second? In your test scenario, what clients (including management tools or agents) are running? Can you give step-by-step details of how I can reproduce this: the VM setup, the qpidd config, the openais config, any running services (e.g. sesame) that interact with qpidd, and the command lines for any clients you run during the test?
Alan,

Q: Are there any active clients connected to the broker?
A: We initially run qpid-tool and qpid-cluster to make sure the cluster is up and running. After that we leave the brokers alone.

Q: Are you running sesame or some other management tool that might be making requests every second?
A: We are not running any management tools.

Q: In your test scenario, what clients (including management tools or agents) are running?
A: No clients are connected to these brokers.

The qpidd.conf file looks like this:

# "qpidd --help" or "man qpidd" for more details.
cluster-mechanism=ANONYMOUS
port=5681
auth=no
cluster-name="poccluster"
cluster-size=2
log-to-file=/tmp/qpid-clustering/bora/qpidd.log
data-dir=/tmp/qpid-clustering/bora/
trace=yes
debug=yes
log-enable=info+
log-enable=debug+:cluster
log-enable=trace+:management
mgmt-pub-interval=15

The openais.conf:

# Please read the openais.conf.5 manual page
totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.53.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}

# This directive controls how OpenAIS logs its messages. All variables here
# are optional.
logging {
    # These three options control where log messages are sent. Logs can be
    # sent to two or all three. The three options are: 'to_file',
    # 'to_syslog' and 'to_stderr'. All three can either be 'yes' or 'no'.
    # When set to 'yes', logs are sent to the relative destination. The
    # default is to write to the syslog and to stderr.
    #
    # In version 0.70, there were the following options:
    #   logoutput: syslog
    #   logoutput: stderr
    #   logoutput: file
    # NOTE: These are deprecated and should not be used.

    # Default is 'no'. If set, 'logfile' must be set below.
    to_file: yes

    # Default is 'yes'.
    to_syslog: yes

    # Default is 'yes'.
    to_stderr: yes

    # When 'logoutput: file' is set, this is required. It is the full
    # path and file name to write the logs to.
    logfile: /var/log/bova-ais.log

    # Setting this to 'on', the default, generates a lot of debug messages
    # in the log. It is generally not advised unless you are tracing a
    # specific bug.
    debug: on

    # Setting this to 'on', the default, will add a time stamp to log
    # messages.
    timestamp: on

    # Setting this to 'on' will replace the logger name in the log entries
    # with the file and line generating the log entry. The default is
    # 'off'.
    fileline: off

    # When writing to syslog, this sets the syslog facility to use. Valid
    # options are:
    #   daemon, local0, local1, local2, local3, local4, local5, local6 and
    #   local7
    # The default is 'daemon'.
    syslog_facility: daemon

    # This is an optional directive that controls detailed logging
    # features. Generally, this is only needed by developers.
    #logger {
    #    # This specifies the identity logging being specified.
    #    # MADI: What?
    #    #ident: ?
    #
    #    # This enables or disables debug log messages for the component
    #    # identified above. The default is 'off'.
    #    #debug: off
    #
    #    # This specifies which tags should be logged for this
    #    # component. This is only valid when debug is enabled above.
    #    # Multiple tags are specified with a pipe (|) as the logical
    #    # OR separator. The default is 'none'.
    #    #tags: enter|return|trace1|trace2
    #}
}

amf {
    mode: disabled
}

I will update the BZ with the VM setup.

Lester
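Given the bindnetaddr above, one quick sanity check on each guest is to confirm that an interface actually sits on that subnet and that the cluster still shows both members (these commands are suggestions, not taken from the report; the port comes from the qpidd.conf above):

ip addr show | grep "inet 192.168.53."
qpid-cluster localhost:5681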
The failure in their logs was due to a network partition or "split brain". I am working with the customer to set up a qpid cluster that uses CMAN to manage quorum and avoid this condition.
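For reference, a minimal sketch of the two pieces involved; the node names, nodeids, and the exact qpidd option name are assumptions for this build (check "qpidd --help" on a broker built with CMAN support), and fencing still has to be configured for the quorum to be meaningful:

<!-- /etc/cluster/cluster.conf: two-node CMAN quorum sketch -->
<cluster name="MRGVMCluster" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="vm-guest-1" nodeid="1"/>
    <clusternode name="vm-guest-2" nodeid="2"/>
  </clusternodes>
</cluster>

# qpidd.conf: have the broker consult CMAN quorum (option name assumed;
# verify against "qpidd --help" for this build)
cluster-cman=yes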