598653 – Update persistent cluster documentation

Summary: Update persistent cluster documentation

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	Messaging_Programming_Reference
Sub Component:
Version:	beta
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	1.3
Target Release:	---
Assignee:	Jonathan Robie
QA Contact:	ecs-bugs
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	579805 592358
Depends On:
Blocks:

Reported:	2010-06-01 19:32 UTC by Alan Conway
Modified:	2013-08-06 00:54 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:

Description Alan Conway 2010-06-01 19:32:07 UTC

Here's proposed text for the persistent cluster section:

7.6. Persistence in High Availability Message Clusters

Persistence and clustering are two different ways to provide reliability. Most systems that use a cluster do not enable persistence, but you can do so if you want to ensure that messages are not lost even if the last broker in a cluster fails. All members of a cluster must be transient or all must be persistent; mixed clusters are not allowed. Each broker in a persistent cluster has its own independent replica of the cluster's state in its store.

7.6.1 Clean and dirty stores.

When a broker leaves a running cluster because it is stopped, it crashes, or its host crashes, its store is marked "dirty" because it may be out of date compared to the brokers still in the cluster.

If the cluster is reduced to a single broker, its store is marked "clean" since it is the only broker making updates. If the cluster is shut down with the command "qpid-cluster -k" then all the stores are marked clean.

When a cluster is initially formed, brokers with clean stores read from their stores. Brokers with dirty stores, or brokers that join after the cluster is running, discard their old stores and initialize a new store with an update from one of the running brokers.

Discarded stores are copied to a backup directory. The active store is in <data-dir>/rhm. Backup stores are in <data-dir>/_cluster.bak.<nnnn>/rhm, where <nnnn> is a 4-digit number. A higher number means a more recent backup.
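
As a minimal sketch of the naming rule above, the following shell function prints the most recent backup store under a data directory. The function name latest_backup is hypothetical, and the sketch assumes the suffix is always exactly four digits, so that the shell's sorted glob order matches numeric order.

```shell
# Print the path of the most recent backup store under a data directory.
# Assumes backups are named _cluster.bak.<nnnn> with a 4-digit suffix,
# so the sorted glob expansion visits them from oldest to newest.
latest_backup() {
    data_dir=$1
    latest=""
    for d in "$data_dir"/_cluster.bak.[0-9][0-9][0-9][0-9]; do
        if [ -d "$d/rhm" ]; then
            latest=$d/rhm
        fi
    done
    if [ -z "$latest" ]; then
        return 1    # no backup store found
    fi
    printf '%s\n' "$latest"
}

# Example (the path is a placeholder): latest_backup /var/lib/qpidd
```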

7.6.1 Starting a persistent cluster.

When starting a persistent cluster broker, set the cluster-size option to the number of brokers in the cluster. This allows the brokers to wait until the entire cluster is running so that they can synchronize their stored state.
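
For illustration, the sketch below prints a broker command line for a three-broker cluster so it can be reviewed before being run on each host. The cluster name, size, and data directory are placeholder values, and broker_cmd is a hypothetical helper. It assumes the option is passed to qpidd as --cluster-size alongside --cluster-name and --data-dir; check `qpidd --help` on your installation for the exact spelling.

```shell
# Placeholder values; adjust to your deployment.
CLUSTER_NAME="example-cluster"
CLUSTER_SIZE=3                 # number of brokers expected in the cluster
DATA_DIR="/var/lib/qpidd"

# Print the command to run on every broker host (does not start a broker).
broker_cmd() {
    printf 'qpidd --cluster-name=%s --cluster-size=%s --data-dir=%s\n' \
        "$CLUSTER_NAME" "$CLUSTER_SIZE" "$DATA_DIR"
}

broker_cmd
```

Every broker in the cluster would be given the same cluster-size value, so that each one waits for the full membership before proceeding.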

The cluster can start if
- all members have empty stores or
- at least one member has a clean store.

All members of the new cluster will be initialized with the state from a clean store.

7.6.2 Stopping a persistent cluster.

To cleanly shut down a persistent cluster, use the command "qpid-cluster -k". This causes all brokers to synchronize their state and mark their stores as "clean" so they can be used when the cluster restarts.

7.6.3 Starting a persistent cluster with no clean store.

If the cluster has previously had a total failure and there are no clean stores, then the brokers will fail to start with the log message "Cannot recover, no clean store." If this happens, you can start the cluster by marking one of the stores "clean" as follows:

1. Move the latest store backup into place in the broker's data directory. The backup directory names end in a 4-digit number; the latest backup has the highest number:
cd <data-dir>
mv rhm rhm.bak
cp -a _cluster.bak.<nnnn>/rhm .

2. Mark the store as clean:
qpid-cluster-store -c <data-dir>

Now you can start the cluster. All members will be initialized from the store you marked as clean.
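
The two steps above can be combined into one script, sketched here under stated assumptions. The function name recover_store and the STORE_TOOL variable are hypothetical; STORE_TOOL exists only so the procedure can be rehearsed with a stub in place of qpid-cluster-store. Unlike the manual steps, the sketch refuses to run if rhm.bak already exists, because mv would otherwise move the store inside the existing directory.

```shell
# Hypothetical override: set STORE_TOOL to a stub to rehearse the procedure.
STORE_TOOL=${STORE_TOOL:-qpid-cluster-store}

recover_store() {
    data_dir=$1
    backup=""
    # The sorted glob visits backups oldest to newest (4-digit suffix assumed).
    for d in "$data_dir"/_cluster.bak.[0-9][0-9][0-9][0-9]; do
        if [ -d "$d/rhm" ]; then
            backup=$d/rhm
        fi
    done
    if [ -z "$backup" ]; then
        echo "no backup store found under $data_dir" >&2
        return 1
    fi
    if [ -e "$data_dir/rhm.bak" ]; then
        echo "$data_dir/rhm.bak already exists; move it aside first" >&2
        return 1
    fi
    # Step 1: set the current store aside and copy the latest backup into place.
    if [ -e "$data_dir/rhm" ]; then
        mv "$data_dir/rhm" "$data_dir/rhm.bak" || return 1
    fi
    cp -a "$backup" "$data_dir/" || return 1
    # Step 2: mark the restored store as clean.
    "$STORE_TOOL" -c "$data_dir"
}
```

Run it with the data directory as its only argument, on the one broker whose store is to be marked clean, before starting the cluster.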

7.6.4 Isolated failures in a persistent cluster.

A broker in a persistent cluster may encounter errors that other brokers in the cluster do not; if this happens, the broker shuts itself down to avoid making the cluster state inconsistent. For example, a disk failure on one node will result in that node shutting down. Running out of storage capacity can also cause a node to shut down, because the brokers may not run out of storage at exactly the same point even if they have similar storage configurations. To avoid unnecessary broker shutdowns, make sure the queue policy size of each durable queue is less than the capacity of the journal for the queue.
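
The sizing rule in the last sentence can be checked with simple arithmetic, sketched below. This assumes the journal's nominal capacity is the number of journal files multiplied by the file size in pages, with 64 KiB per page; that page size is an assumption to verify against your store's documentation. The function names are hypothetical. The usable capacity may be somewhat lower than the nominal figure, so leave some headroom.

```shell
# Assumed journal page size (64 KiB); verify against your store's documentation.
PAGE_BYTES=65536

# $1 = number of journal files, $2 = size of each file in pages
journal_capacity_bytes() {
    echo $(( $1 * $2 * PAGE_BYTES ))
}

# $1 = queue policy size in bytes, $2 = file count, $3 = file size in pages
# Succeeds only if the policy size is strictly below the nominal capacity.
policy_fits_journal() {
    [ "$1" -lt "$(journal_capacity_bytes "$2" "$3")" ]
}

# Example: 8 files of 24 pages give 8 * 24 * 65536 = 12582912 bytes (12 MiB),
# so a 10 MiB policy size fits:
#   policy_fits_journal 10485760 8 24 && echo "policy size fits"
```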

Comment 1 Alan Conway 2010-06-01 19:32:33 UTC

*** Bug 579805 has been marked as a duplicate of this bug. ***

Comment 2 Alan Conway 2010-06-01 19:32:39 UTC

*** Bug 592358 has been marked as a duplicate of this bug. ***

Comment 3 Jonathan Robie 2010-06-13 01:14:05 UTC

Added to User's Guide.

Comment 4 Frantisek Reznicek 2010-06-14 10:06:19 UTC

Waiting for the refresh of rhm-docs package, current rhm-docs-0.7.946106-1 does not contain the description yet.

Comment 5 Frantisek Reznicek 2010-06-22 09:00:17 UTC

The documentation above was added to the documentation package rhm-docs-0.7.955296-1.el5.noarch.rpm.

-> VERIFIED
