467878 – Cluster to support message TTL

Summary: Cluster to support message TTL

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	qpid-cpp
Sub Component:
Version:	beta
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.1.1
Target Release:	---
Assignee:	Alan Conway
QA Contact:	Kim van der Riet
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-10-21 14:05 UTC by Alan Conway
Modified:	2009-04-21 16:17 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-04-21 16:17:36 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2009:0434	0	normal	SHIPPED_LIVE	Red Hat Enterprise MRG Messaging and Grid Version 1.1.1	2009-04-21 16:15:50 UTC

Description Alan Conway 2008-10-21 14:05:16 UTC

Description of problem:

AMQP messages can have a TTL (time to live) and expire when that time is up. The current broker uses the local system clock to determine expiry, so clock skew between hosts could cause inconsistencies in the cluster.

How reproducible:

Difficult to reproduce; it requires hosts with deliberately skewed clocks. It is a real race condition, however.

Additional info:

Cluster members need to exchange time messages for timed events so there is an agreed "cluster time" relative to CPG message delivery.
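
As an illustration of the "cluster time" idea above, here is a minimal Python sketch. It is not the qpid-cpp implementation. It assumes that clock ticks are multicast through the same totally ordered stream as every other event, and that each member advances its notion of time only when a tick is delivered. All names (ClusterNode, deliver, replay) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterNode:
    """Hypothetical cluster member; state changes only via deliver()."""
    name: str
    local_clock_skew: float            # seconds; deliberately never read for expiry
    cluster_time: float = 0.0          # advanced only by delivered ticks
    queue: dict = field(default_factory=dict)    # message id -> expiry (cluster time)
    expired: list = field(default_factory=list)  # ids expired, in order

    def deliver(self, event):
        """Apply one event from the totally ordered multicast stream."""
        kind = event[0]
        if kind == "tick":
            self.cluster_time = event[1]
            # Collect first, then delete, to avoid mutating while iterating.
            due = [m for m, exp in self.queue.items() if exp <= self.cluster_time]
            for msg_id in due:
                del self.queue[msg_id]
                self.expired.append(msg_id)
        elif kind == "enqueue":
            _, msg_id, ttl = event
            self.queue[msg_id] = self.cluster_time + ttl
        elif kind == "consume":
            _, msg_id = event
            self.queue.pop(msg_id, None)


def replay(events, nodes):
    """Deliver every event to every node in the same order."""
    for event in events:
        for node in nodes:
            node.deliver(event)


events = [
    ("tick", 100.0),
    ("enqueue", "m1", 5.0),
    ("enqueue", "m2", 5.0),
    ("tick", 104.0),
    ("consume", "m1"),
    ("tick", 106.0),
    ("consume", "m2"),
]
a = ClusterNode("A", local_clock_skew=0.0)
b = ClusterNode("B", local_clock_skew=-3600.0)  # an hour behind, yet irrelevant
replay(events, [a, b])
print(a.expired, b.expired)
```

Because expiry depends only on the delivered event order, both nodes reach the same decision regardless of how far apart their system clocks are.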

Comment 1 Alan Conway 2008-10-31 17:56:59 UTC

Test info: the issue is that existing cluster nodes each calculate TTL expiry independently, so there is a small window in which clients on one node can see different results from clients on another node. This happens when one client's action occurs just before the message expires according to its node, and the other's occurs just after according to its node.
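
The window described above can be sketched in a few lines of Python. This is an illustration only: the expiry instant, the 0.5 second skew and the function names are assumptions chosen for the example, not values taken from the broker.

```python
def is_expired(expiry, true_time, clock_skew):
    """A node's view of expiry, judged by its own (possibly skewed) clock."""
    return true_time + clock_skew >= expiry


EXPIRY = 1000.0   # instant at which the message's TTL runs out
SKEW_A = 0.0      # node A's clock is taken as the reference
SKEW_B = -0.5     # node B's clock runs 0.5 s behind A's (assumed value)


def views(true_time):
    """Return (A's view, B's view) of whether the message has expired."""
    return (is_expired(EXPIRY, true_time, SKEW_A),
            is_expired(EXPIRY, true_time, SKEW_B))


for t in (999.0, 1000.2, 1001.0):
    a_view, b_view = views(t)
    status = "agree" if a_view == b_view else "DISAGREE"
    print(t, a_view, b_view, status)
```

The nodes disagree only for actions that land inside the skew-wide interval after the expiry instant, which is why the problem rarely shows up in ordinary testing.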

This is very difficult to test, since the timing conditions may not occur and everything may appear fine.

The best hope for testing the TTL might be to set a TTL as part of a "stress test" in which a cluster is subjected to intense activity for a long period of time. Because such tests are time-consuming, it may be best to combine the features needing testing into a single stress test that can exercise multiple potential problems at once.
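
To give a rough feel for why a long stress run is needed, the following Python sketch estimates how often a randomly timed client action would land in the disagreement window. The numbers (10 ms skew, actions spread over 10 s around the expiry instant) are purely illustrative assumptions, and count_disagreements is a hypothetical helper.

```python
import random


def count_disagreements(trials, skew, window, seed=0):
    """Count simulated actions on which two skewed nodes disagree about expiry.

    Each action happens at a uniformly random offset (seconds) from the true
    expiry instant, within +/- window/2. Node B's clock is `skew` seconds
    behind node A's.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        offset = rng.uniform(-window / 2, window / 2)
        expired_on_a = offset >= 0
        expired_on_b = offset - skew >= 0
        if expired_on_a != expired_on_b:
            hits += 1
    return hits


trials = 100_000
hits = count_disagreements(trials, skew=0.01, window=10.0)
print(f"{hits} of {trials} simulated actions hit the window")
```

Under these assumptions roughly one action in a thousand (skew divided by window) falls in the window, so only sustained, heavy activity is likely to expose an inconsistency.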

Comment 2 Alan Conway 2009-02-10 13:24:34 UTC

Fixed in revision 742774

Comment 4 Jan Sarenik 2009-03-03 10:32:17 UTC

Testing method:

  * use two (virtual) machines: A and B
  * set up OpenAIS in /etc/ais/openais.conf and run it on both A and B
  * ensure /root/.qpidd is empty
    # rm -rf /root/.qpidd
  * on both machines run qpidd (order is not important):
    # qpidd -t --auth=no --cluster-name="test"
  * on machine A run perftest in another console
    root@A:~# perftest
  * on machine B run date and adjust clock
    root@B:~# date `date +%m%d`0000.00
  * if nothing happens, then try to update the clock back
    root@B:~# ntpdate time.fi.muni.cz
  * both qpidd daemons should be interrupted:
      on machine A with "Cannot mcast to CPG group ahoj: access denied."
      on machine B with "Segmentation fault"

For this I used stable (1.1) versions of qpidd-cluster and qpidc-perftest:
  qpidd-cluster-0.4.732838-1.el5
  qpidc-perftest-0.4.732838-1.el5

Alan, is this the same as what you have been experiencing?

Comment 5 Jan Sarenik 2009-03-03 11:51:48 UTC

I am not able to reproduce anything similar on the latest 1.1.1 candidate:
  qpidd-0.4.744917-1.el5
  qpidc-perftest-0.4.744917-1.el5

This was with two brokers running in a cluster while
the following script was running on one of them
(B in the previous example).

----------------------------------------------------
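while true
do
	# Set the clock to today's date at a random hour (10-23), minute 00
	date `date +%m%d`$(((($RANDOM)%14)+10))00.00
	sleep 1
	# Restore the correct time from the NTP server
	ntpdate time.englab.brq.redhat.com
	sleep 1
done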
----------------------------------------------------
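
For readers unfamiliar with the date argument built by the script above, this Python sketch reproduces the same arithmetic. The argument has the form MMDDhhmm.ss: today's month and day, a random hour from 10 to 23, then minute 00 and second 00. The function name is hypothetical, and the 0..32767 range is that of bash's $RANDOM.

```python
import random
from datetime import date


def skewed_date_argument(today, rng):
    """Build the MMDDhhmm.ss string the shell loop passes to `date`."""
    # bash $RANDOM yields 0..32767; % 14 + 10 maps it to hours 10..23.
    hour = rng.randrange(32768) % 14 + 10
    return f"{today:%m%d}{hour}00.00"


rng = random.Random()
print(skewed_date_argument(date.today(), rng))
```

So each iteration moves B's clock to an arbitrary hour of the current day before ntpdate corrects it again one second later.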

Comment 6 Alan Conway 2009-03-03 14:47:42 UTC

I had not tried altering the clocks while a test is running. It's not clear to me from your comment above: is it working correctly with the latest candidate?

Comment 7 Jan Sarenik 2009-03-03 15:02:12 UTC

Sorry for the confusing wording.
Yes, it is working correctly with the latest candidate.

Comment 9 errata-xmlrpc 2009-04-21 16:17:36 UTC

An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHEA-2009-0434.html
