Description of problem:
From email I sent to rhq-devel list,
"Repair is process in Cassandra in which data is made consistent across replicas. There are two kinds - read repair and anti-entropy repair. The former happens automatically in the background on queries. The latter is done via JMX. Although nodes can remain operational while anti-entropy repair, it is very resource intensive and can take a long time to run. It can easily be on the order of hours. The Cassandra docs recommend running regularly, scheduled anti-entropy within gc_grace_seconds, which is the time to wait before Cassandra garbage collections tombstones (i.e., deletion markers). The reason for running it within gc_grace_seconds is to ensure deletes get propagated and to prevent them from being undone. gc_grace_seconds is configured per keyspace and defaults to 10 days. We set gc_grace_seconds to 8 days, and we run anti-entropy repair weekly in a Quartz job named StorageClusterReadRepairJob.
After some investigation I am now of the opinion that we do not need to run a scheduled repair job. As long as replicas are up, data will be consistent between them. If we have cluster where the nodes never go down, then there is no need to run anti-entropy repair with respect to data consistency. Of course nodes do go down. Cassandra has another mechanism called hinted handoff that comes into play. When the target replica is down, the coordinator node (the one receiving the request), stores a hint of the mutation that is intended for that replica. When the replica comes back up, it will receive the hints, making it consistent with other replicas.
There is a maximum amount of time a node can be down and other nodes will store hints. The is defined by the max_hint_window_in_ms property in cassandra.yaml, and it defaults to 3 hours. If a node is down longer than that, then other nodes assume the down node is dead unless and until it comes back up. So if we do not run scheduled repair and if a node is down for more than max_hint_window_in_ms, then need to run a full repair on the node when it comes back up to account for any dropped hints.
As for deletes, I do not think we need to be concerned for a couple reasons. First, we are dealing with append-only data, where each column is only ever written once and never updated. Secondly, we write all data with a TTL. In the event some metric data was deleted on one replica, and still alive on the other, we know that it has the TTL set and will expire; therefore, we do not need to worry about deletes being undone."
I am not suggesting we will never have a need to run anti-entropy repair. In fact, we do run anti-entropy repair whenever the replication_factor changes when nodes are added to or removed from the cluster. And when we store other data in Cassandra, we may very well need to run scheduled anti-entropy on it.
There are few things then that need to be done for this BZ.
1) Get rid of the StorageClusterReadRepairJob Quartz job
2) Add support for updating max_hint_window_in_ms as a cluster-wide maintenance task
3) Run anti-entropy repair on a node that has come back up and has been down longer than max_hint_window_in_ms
4) Reduce gc_grace_seconds
5) Make gc_grace_seconds configurable (maybe?)
Version-Release number of selected component (if applicable):
Steps to Reproduce:
We should also switch to do snapshot repairs to avoid the performance spikes that are incurred from all replicas doing a validation compaction at the same time.