Red Hat Bugzilla – Bug 467878
Cluster to support message TTL
Last modified: 2009-04-21 12:17:36 EDT
Description of problem:
AMQP messages can have a TTL, and expire when that time is up. The current broker uses the local system clock to determine the timeout. Clock skew between cluster members could cause inconsistencies in the cluster.
Difficult to reproduce: it requires hosts with deliberately skewed clocks. It is a real race condition, however.
Cluster members need to exchange time messages for timed events so there is an agreed "cluster time" relative to CPG message delivery.
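A minimal sketch of the "cluster time" idea, not the actual Qpid implementation. It assumes every node receives the same events in the same order (as CPG delivery provides) and that some member periodically multicasts a "time" event. The names ClusterClock, Node, deliver and replay are hypothetical. Expiry is decided only against the agreed clock, never against the local system clock, so every node that replays the same event stream ends up with the same queue contents.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    body: str
    expires_at: float  # expiry instant in cluster time, not local wall-clock time


@dataclass
class ClusterClock:
    """Hypothetical clock that advances only on totally ordered 'time' events."""
    now: float = 0.0

    def on_time_event(self, timestamp: float) -> None:
        # Never move backwards, even if the sender's local clock did.
        self.now = max(self.now, timestamp)


@dataclass
class Node:
    clock: ClusterClock = field(default_factory=ClusterClock)
    queue: list = field(default_factory=list)

    def deliver(self, event) -> None:
        """Apply one event from the ordered stream (assumed identical on all nodes)."""
        kind, payload = event
        if kind == "time":
            self.clock.on_time_event(payload)
            # Expire against cluster time only; the local clock is never consulted.
            self.queue = [m for m in self.queue if m.expires_at > self.clock.now]
        elif kind == "publish":
            body, ttl = payload
            self.queue.append(Message(body, self.clock.now + ttl))


def replay(events):
    """Replay an ordered event stream on a fresh node and return surviving bodies."""
    node = Node()
    for event in events:
        node.deliver(event)
    return [m.body for m in node.queue]
```

Because a node's state depends only on the delivered event stream, two nodes with arbitrarily skewed system clocks still agree on which messages have expired.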
Test info: Existing cluster nodes each calculate TTL expiry independently, so there is a small window in which clients on one node can see different results from clients on another node: one client's action occurs just before expiry according to its node, while the other's occurs just after expiry according to its node.
This is very difficult to test, since the timing conditions may not occur and everything may appear fine.
Best hope for testing the TTL might be to set TTL as part of a "stress test" where a cluster is subjected to intense activity for a long period of time. Due to the time consuming nature of such tests, it may be best to compile features needing testing into a single stress test that can test multiple potential problems at once.
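To illustrate why the window is so narrow and hard to hit, here is a small sketch that models each node's local clock as true time plus a fixed skew. This is an assumption for illustration only; the function names are hypothetical and nothing here calls into qpidd. The nodes disagree about expiry only during an interval whose length equals the difference between the largest and smallest skew.

```python
def is_expired(local_now: float, expires_at: float) -> bool:
    """Expiry check as a node would do it using only its own clock."""
    return local_now >= expires_at


def observe(true_time: float, skew_by_node: dict, expires_at: float) -> dict:
    """Return each node's verdict at the same instant of true time."""
    return {
        node: is_expired(true_time + skew, expires_at)
        for node, skew in skew_by_node.items()
    }


def disagreement_window(skew_by_node: dict, expires_at: float) -> tuple:
    """Half-open interval [start, end) of true time in which the nodes disagree."""
    skews = skew_by_node.values()
    # The node with the largest skew expires the message first,
    # the node with the smallest skew expires it last.
    return (expires_at - max(skews), expires_at - min(skews))


# Example: node B runs a quarter of a second ahead of node A.
skews = {"A": 0.0, "B": 0.25}
window = disagreement_window(skews, expires_at=100.0)
inside = observe(99.9, skews, expires_at=100.0)
```

With a quarter-second skew the nodes disagree for only a quarter of a second per message, which is consistent with the observation that a long stress run is the most realistic way to hit the condition.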
Fixed in revision 742774
* use two (virtual) machines: A and B
* set up OpenAIS in /etc/ais/openais.conf and run it on both A and B
* ensure /root/.qpidd is empty
# rm -rf /root/.qpidd
* on both machines run qpidd (order is not important):
# qpidd -t --auth=no --cluster-name="test"
* on machine A run perftest in another console
* on machine B use date to change the system clock
root@B:~# date `date +%m%d`0000.00
* if nothing happens, try setting the clock back to the correct time
root@B:~# ntpdate time.fi.muni.cz
* both qpidd daemons should then terminate:
on machine A with "Cannot mcast to CPG group ahoj: access denied."
on machine B with "Segmentation fault"
For this I used stable (1.1) versions of qpidd-cluster and qpidc-perftest:
Alan, is this the same as what you have been experiencing?
I am not able to reproduce anything similar on the latest 1.1.1 candidate,
even though I was running two brokers in a cluster
and this script was running on one of them (B in the previous example):
date `date +%m%d`$(((($RANDOM)%14)+10))00.00
I had not tried altering the clocks while a test is running. I'm not clear from your comment above: is it working correctly with the latest candidate?
Sorry for the confusing wording.
Yes, it is working correctly with latest candidate.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.