Description of problem:
AMQP messages can have a TTL and expire when that time is up. The current broker uses the local system clock to determine expiry, so clock skew between hosts can cause inconsistencies in the cluster.

How reproducible:
Difficult to reproduce; requires hosts with deliberately skewed clocks. It is a real race condition, however.

Additional info:
Cluster members need to exchange time messages for timed events so there is an agreed "cluster time" relative to CPG message delivery.
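To illustrate the "cluster time" idea, here is a minimal Python sketch. It is not the broker's implementation. It assumes every node sees the same totally ordered stream of events (as CPG delivery provides) and that time readings arrive as ordinary messages in that stream. The names `Node`, `deliver`, and the event tuples are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster member that tracks an agreed cluster time (hypothetical model)."""
    name: str
    cluster_time: float = 0.0                   # advanced only by delivered time messages
    queue: dict = field(default_factory=dict)   # message id -> expiry in cluster time
    log: list = field(default_factory=list)     # outcome of each consume attempt

    def deliver(self, event):
        """Apply one event from the totally ordered stream."""
        kind = event[0]
        if kind == "time":
            # Cluster time comes from the delivered time message,
            # never from this node's own system clock.
            self.cluster_time = max(self.cluster_time, event[1])
        elif kind == "publish":
            _, msg_id, ttl = event
            self.queue[msg_id] = self.cluster_time + ttl
        elif kind == "consume":
            _, msg_id = event
            expiry = self.queue.pop(msg_id, None)
            delivered = expiry is not None and self.cluster_time < expiry
            self.log.append((msg_id, "delivered" if delivered else "expired"))


# The same ordered stream is delivered to both nodes.
stream = [
    ("time", 100.0),
    ("publish", "m1", 5.0),   # expires at cluster time 105.0
    ("publish", "m2", 5.0),   # expires at cluster time 105.0
    ("time", 104.0),
    ("consume", "m1"),        # consumed before expiry
    ("time", 106.0),
    ("consume", "m2"),        # consumed after expiry
]

node_a, node_b = Node("A"), Node("B")
for event in stream:
    node_a.deliver(event)
    node_b.deliver(event)

print(node_a.log)
print(node_b.log)
```

Because expiry is decided against a value carried in the ordered stream, both nodes reach the same verdict for every message regardless of how their system clocks differ.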
Test info: The issue is that existing cluster nodes each calculate TTL expiry independently, so there is a small window in which clients on one node can see different results from clients on another node: one client's action occurs just before the message expires according to its node, the other's just after according to its node. This is very difficult to test, since the timing conditions may not occur and everything may appear fine. The best hope for testing the TTL might be to set a TTL as part of a "stress test" in which a cluster is subjected to intense activity for a long period of time. Because such tests are time consuming, it may be best to combine the features needing testing into a single stress test that can exercise multiple potential problems at once.
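The window described above can be shown with a small Python sketch. It is an illustration only: the `is_expired` helper, the timestamps, and the 0.2-second skew are made-up values, not taken from the broker.

```python
def is_expired(publish_time, ttl, local_now):
    """Expiry check as a node would make it using only its own clock."""
    return local_now >= publish_time + ttl


PUBLISH_TIME = 1000.0   # hypothetical timestamp recorded when the message was enqueued
TTL = 5.0               # seconds; the message expires at 1005.0
SKEW = 0.2              # hypothetical: node B's clock runs 0.2 s ahead of node A's

true_now = 1004.9               # the same real instant on both hosts
node_a_now = true_now           # node A's clock reading
node_b_now = true_now + SKEW    # node B's clock reading, slightly ahead

a_says = is_expired(PUBLISH_TIME, TTL, node_a_now)
b_says = is_expired(PUBLISH_TIME, TTL, node_b_now)
print("node A expired:", a_says)
print("node B expired:", b_says)
```

At that instant node A would still deliver the message while node B would treat it as expired. The disagreement only exists while the real time is within the skew of the expiry point, which is why an ordinary test run rarely hits it.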
Fixed in revision 742774
Testing method:
* use two (virtual) machines: A and B
* set up OpenAIS in /etc/ais/openais.conf and run it on both A and B
* ensure /root/.qpidd is empty
    # rm -rf /root/.qpidd
* on both machines run qpidd (order is not important):
    # qpidd -t --auth=no --cluster-name="test"
* on machine A run perftest in another console
    root@A:~# perftest
* on machine B run date to adjust the clock
    root@B:~# date `date +%m%d`0000.00
* if nothing happens, try setting the clock back
    root@B:~# ntpdate time.fi.muni.cz
* both qpidd daemons should be interrupted:
    on machine A with "Cannot mcast to CPG group ahoj: access denied."
    on machine B with "Segmentation fault"

For this I used the stable (1.1) versions of qpidd-cluster and qpidc-perftest:
    qpidd-cluster-0.4.732838-1.el5
    qpidc-perftest-0.4.732838-1.el5

Alan, is this the same as what you have been experiencing?
I am not able to reproduce anything similar on the latest 1.1.1 candidate:
    qpidd-0.4.744917-1.el5
    qpidc-perftest-0.4.744917-1.el5

This was with two brokers running in a cluster while the following script was running on one of them (B in the previous example).
----------------------------------------------------
while true
do
  date `date +%m%d`$(((($RANDOM)%14)+10))00.00
  sleep 1
  ntpdate time.englab.brq.redhat.com
  sleep 1
done
----------------------------------------------------
I had not tried altering the clocks while a test is running. It is not clear to me from your comment above: is it working correctly with the latest candidate?
Sorry for the confusing wording. Yes, it is working correctly with the latest candidate.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHEA-2009-0434.html