Bug 467878 - Cluster to support message TTL
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: qpid-cpp
Version: beta
Platform: All Linux
Priority: high  Severity: high
Target Milestone: 1.1.1
Assigned To: Alan Conway
QA Contact: Kim van der Riet
Reported: 2008-10-21 10:05 EDT by Alan Conway
Modified: 2009-04-21 12:17 EDT
Doc Type: Bug Fix
Last Closed: 2009-04-21 12:17:36 EDT
Description Alan Conway 2008-10-21 10:05:16 EDT
Description of problem:

AMQP messages can have a TTL, and expire when that time is up. The current broker uses the local system clock to determine the timeout. Clock skew between hosts could cause inconsistent expiry decisions across the cluster.
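A minimal sketch (not the qpid-cpp implementation; all names here are hypothetical) of why local-clock expiry is racy: two nodes evaluating the same message's TTL against skewed local clocks can reach opposite conclusions.

```python
# Hypothetical illustration: each node checks expiry against its own clock.
def is_expired(enqueue_time, ttl, local_clock):
    """Return True if the message has expired according to local_clock."""
    return local_clock >= enqueue_time + ttl

enqueue = 1000.0            # message enqueued at "true" time 1000 s
ttl = 5.0                   # message should expire at 1005 s
true_time = 1004.9          # just before the true expiry
skew_a, skew_b = 0.0, 0.5   # node B's clock runs 0.5 s fast

print(is_expired(enqueue, ttl, true_time + skew_a))  # node A: False
print(is_expired(enqueue, ttl, true_time + skew_b))  # node B: True
```

In that window, a client on node A can still consume the message while a client on node B sees it as already expired.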

How reproducible:

Difficult to reproduce; requires hosts with deliberately skewed clocks. It is, however, a real race condition.

Additional info:

Cluster members need to exchange time messages for timed events so there is an agreed "cluster time" relative to CPG message delivery.
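The idea above can be sketched as follows (a hypothetical model, not the actual qpid-cpp code): because CPG delivers messages to all members in the same total order, a time value carried in a multicast message gives every node an identical "cluster time" to evaluate timed events against, regardless of local clock skew.

```python
from dataclasses import dataclass

@dataclass
class ClockTick:
    cluster_time: float  # time stamped by one sender, multicast to all members

class Node:
    """Hypothetical cluster member tracking agreed cluster time."""
    def __init__(self):
        self.cluster_time = 0.0

    def deliver(self, tick):
        # CPG total-order delivery: every member sees the same tick sequence,
        # so every member holds the same cluster_time after each delivery.
        self.cluster_time = tick.cluster_time

    def is_expired(self, enqueue_time, ttl):
        # Expiry is judged against cluster time, never the local clock.
        return self.cluster_time >= enqueue_time + ttl

a, b = Node(), Node()
for node in (a, b):
    node.deliver(ClockTick(cluster_time=1005.0))

# Both nodes make the same decision, whatever their local clocks say.
print(a.is_expired(1000.0, 5.0), b.is_expired(1000.0, 5.0))
```

The design choice is that consistency matters more than precision: a message may expire slightly early or late on every node, but it expires at the same logical point for all of them.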
Comment 1 Alan Conway 2008-10-31 13:56:59 EDT
Test info: the issue is that existing cluster nodes each calculate TTL expiry independently, so there is a small window in which clients on one node can see different results from clients on another node: one client's actions occur just before the message expires according to its node, the other's just after according to its node.

This is very difficult to test, since the timing conditions may not occur and everything may appear fine.

The best hope for testing the TTL might be to set a TTL as part of a "stress test" in which a cluster is subjected to intense activity for a long period. Given the time-consuming nature of such tests, it may be best to combine features needing testing into a single stress test that exercises multiple potential problems at once.
Comment 2 Alan Conway 2009-02-10 08:24:34 EST
Fixed in revision 742774
Comment 4 Jan Sarenik 2009-03-03 05:32:17 EST
Testing method:

  * use two (virtual) machines: A and B
  * set up OpenAIS in /etc/ais/openais.conf and run it on both A and B
  * ensure /root/.qpidd is empty
    # rm -rf /root/.qpidd
  * on both machines run qpidd (order is not important):
    # qpidd -t --auth=no --cluster-name="test"
  * on machine A run perftest in another console
    root@A:~# perftest
  * on machine B run date and adjust clock
    root@B:~# date `date +%m%d`0000.00
  * if nothing happens, then try to update the clock back
    root@B:~# ntpdate time.fi.muni.cz
  * both qpidd daemons should be interrupted,
      on machine A with "Cannot mcast to CPG group ahoj: access denied."
      on machine B with "Segmentation fault"

For this I used stable (1.1) versions of qpidd-cluster and qpidc-perftest:
  qpidd-cluster-0.4.732838-1.el5
  qpidc-perftest-0.4.732838-1.el5

Alan, is this the same as what you have been experiencing?
Comment 5 Jan Sarenik 2009-03-03 06:51:48 EST
I am not able to reproduce anything similar on the latest 1.1.1 candidate:
  qpidd-0.4.744917-1.el5
  qpidc-perftest-0.4.744917-1.el5

This was despite running two brokers in a cluster, with the following script running on one of them (B in the previous example).

----------------------------------------------------
while true
do
	# jump the clock to a random hour between 10:00 and 23:00 today
	date `date +%m%d`$(((($RANDOM)%14)+10))00.00
	sleep 1
	# then step it back to the correct time via NTP
	ntpdate time.englab.brq.redhat.com
	sleep 1
done
----------------------------------------------------
Comment 6 Alan Conway 2009-03-03 09:47:42 EST
I had not tried altering the clocks while a test was running. I'm not clear from your comment above: is it working correctly with the latest candidate?
Comment 7 Jan Sarenik 2009-03-03 10:02:12 EST
Sorry for confusing wording.
Yes, it is working correctly with the latest candidate.
Comment 9 errata-xmlrpc 2009-04-21 12:17:36 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html
