Bug 1118098 - The storage cluster maintenance job schedule needs to be configurable
Summary: The storage cluster maintenance job schedule needs to be configurable
Keywords:
Status: NEW
Alias: None
Product: RHQ Project
Classification: Other
Component: Core Server, Storage Node
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: GA
Target Release: RHQ 4.13
Assignee: RHQ Project Maintainer
QA Contact: Mike Foley
URL:
Whiteboard:
Depends On:
Blocks: 1118099
 
Reported: 2014-07-10 02:00 UTC by John Sanda
Modified: 2014-07-24 15:04 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1118099
Environment:
Last Closed:



Description John Sanda 2014-07-10 02:00:02 UTC
Description of problem:
There is a Quartz job, StorageClusterReadRepairJob, that performs weekly cluster maintenance. It generates snapshots on each node and runs anti-entropy repair on each node. The job is scheduled to run at 12:30 AM every Sunday, and the schedule is not configurable. This can be problematic for some users because it might conflict with other maintenance windows in which the RHQ server is restarted. 12:30 AM Sunday might be fine as a default, but the schedule needs to be configurable so that users can run the job during their own off-peak hours, whenever those might be.
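
A minimal sketch of one way the schedule could be made configurable, assuming the Quartz 2.x fluent API and a hypothetical system property name (rhq.storage.maintenance.cron); the default cron expression preserves the current 12:30 AM Sunday behavior, and the job class is passed in only to keep the sketch self-contained:

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;

public class StorageMaintenanceScheduling {

    // 12:30 AM every Sunday - the schedule that is currently hard coded
    private static final String DEFAULT_CRON = "0 30 0 ? * SUN";

    // Hypothetical property name; RHQ would pick its own configuration mechanism
    private static final String CRON_PROP = "rhq.storage.maintenance.cron";

    /**
     * Schedules the weekly maintenance job using a cron expression read from
     * a system property, falling back to the current 12:30 AM Sunday default.
     */
    public static void scheduleMaintenanceJob(Scheduler scheduler,
                                              Class<? extends Job> jobClass) throws SchedulerException {
        String cron = System.getProperty(CRON_PROP, DEFAULT_CRON);

        JobDetail job = JobBuilder.newJob(jobClass)
                .withIdentity("StorageClusterReadRepairJob", "StorageMaintenance")
                .build();

        Trigger trigger = TriggerBuilder.newTrigger()
                .withIdentity("StorageClusterReadRepairTrigger", "StorageMaintenance")
                .withSchedule(CronScheduleBuilder.cronSchedule(cron))
                .build();

        scheduler.scheduleJob(job, trigger);
    }
}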

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 John Sanda 2014-07-24 15:04:10 UTC
I am not absolutely certain we need to run scheduled repair on a weekly basis. This is from an email that I sent to the rhq-devel list:

"Repair is process in Cassandra in which data is made consistent across replicas. There are two kinds - read repair and anti-entropy repair. The former happens automatically in the background on queries. The latter is done via JMX. Although nodes can remain operational while anti-entropy repair, it is very resource intensive and can take a long time to run. It can easily be on the order of hours. The Cassandra docs recommend running regularly, scheduled anti-entropy within gc_grace_seconds, which is the time to wait before Cassandra garbage collections tombstones (i.e., deletion markers). The reason for running it within gc_grace_seconds is to ensure deletes get propagated and to prevent them from being undone. gc_grace_seconds is configured per keyspace and defaults to 10 days. We set gc_grace_seconds to 8 days, and we run anti-entropy repair weekly in a Quartz job named StorageClusterReadRepairJob.

After some investigation I am now of the opinion that we do not need to run a scheduled repair job. As long as replicas are up, data will be consistent between them. If we have a cluster where the nodes never go down, then there is no need to run anti-entropy repair with respect to data consistency. Of course nodes do go down, and Cassandra has another mechanism, called hinted handoff, that comes into play. When a target replica is down, the coordinator node (the one receiving the request) stores a hint of the mutation that was intended for that replica. When the replica comes back up, it receives the hints, making it consistent with the other replicas.

There is a maximum amount of time a node can be down while other nodes will still store hints for it. This is defined by the max_hint_window_in_ms property in cassandra.yaml, and it defaults to 3 hours. If a node is down longer than that, then other nodes assume the down node is dead unless and until it comes back up. So if we do not run scheduled repair and a node is down for more than max_hint_window_in_ms, then we need to run a full repair on that node when it comes back up to account for any dropped hints.

As for deletes, I do not think we need to be concerned, for a couple of reasons. First, we are dealing with append-only data, where each column is only ever written once and never updated. Secondly, we write all data with a TTL. In the event some metric data was deleted on one replica and still alive on another, we know that it has the TTL set and will expire; therefore, we do not need to worry about deletes being undone."
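
For reference, a minimal sketch of the two settings the email describes - an 8-day gc_grace_seconds and append-only writes with a TTL - using the DataStax Java driver (2.x); the contact point, keyspace, and table here are illustrative placeholders, not RHQ's actual storage schema:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class GcGraceAndTtlSketch {

    public static void main(String[] args) {
        // Contact point, keyspace, and table are placeholders for illustration only.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");

        // gc_grace_seconds = 8 days (8 * 24 * 3600 = 691200), the value mentioned above
        session.execute("CREATE TABLE IF NOT EXISTS demo.raw_metrics ("
                + "schedule_id int, time timestamp, value double, "
                + "PRIMARY KEY (schedule_id, time)) "
                + "WITH gc_grace_seconds = 691200");

        // Append-only write with a TTL; the column expires on its own,
        // so a delete can never be "undone" by a missed repair.
        session.execute("INSERT INTO demo.raw_metrics (schedule_id, time, value) "
                + "VALUES (1, '2014-07-24 00:00:00', 3.14) USING TTL 604800"); // 7-day TTL

        cluster.close();
    }
}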

I will continue investigating to determine whether or not we need this Quartz job. If not, then I will create another BZ to track the work for scheduling repair when a node has been down longer than max_hint_window_in_ms.
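
A rough sketch of the kind of check that follow-up BZ might cover, assuming the server records when a storage node went down and when it came back up (that tracking is hypothetical here and not shown):

import java.util.concurrent.TimeUnit;

public class HintWindowCheck {

    // Default max_hint_window_in_ms from cassandra.yaml: 3 hours
    private static final long MAX_HINT_WINDOW_MS = TimeUnit.HOURS.toMillis(3);

    /**
     * Decides whether a node that just came back up needs a full repair.
     * The down/up timestamps would come from the server's own tracking of
     * storage node availability (hypothetical in this sketch).
     */
    public static boolean needsFullRepair(long downSinceMillis, long upAgainMillis) {
        long downtimeMillis = upAgainMillis - downSinceMillis;
        // If the node was down longer than the hint window, other nodes may have
        // dropped hints for it, so only a full repair restores consistency.
        return downtimeMillis > MAX_HINT_WINDOW_MS;
    }

    public static void main(String[] args) {
        long cameBackUp = System.currentTimeMillis();
        long wentDown = cameBackUp - TimeUnit.HOURS.toMillis(5);
        System.out.println("Full repair needed: " + needsFullRepair(wentDown, cameBackUp));
    }
}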

