Red Hat Bugzilla – Bug 1017432
Increase permissions_validity_in_ms setting for storage node
Last modified: 2014-01-02 15:38:08 EST
+++ This bug was initially created as a clone of Bug #1017372 +++
Description of problem:
The storage node uses org.apache.cassandra.auth.CassandraAuthorizer for authorization checks. This imposes a non-trivial amount of overhead because an authorization check is performed for each read/write request. To mitigate that overhead, permissions are cached locally. The lifetime of a cache entry is set by the permissions_validity_in_ms property in cassandra.yaml, which defaults to two seconds.
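For reference, the relevant entry in cassandra.yaml looks roughly like the following (a sketch; the exact surrounding comment text varies by Cassandra version):

```yaml
# How long to cache entries in the permissions cache before re-querying
# system_auth.permissions. Fetching permissions can be expensive when
# CassandraAuthorizer is in use. Cassandra's default is 2000 ms.
permissions_validity_in_ms: 2000
```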
When a node comes under heavy load, I have on several occasions started seeing read timeout exceptions, even on writes. This is because the authorization check very frequently has to query the system_auth.permissions table. The exceptions in rhq-storage.log look like:
ERROR [Native-Transport-Requests:1101] 2013-10-07 14:06:52,730 ErrorMessage.java (line 210) Unexpected exception during request
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 0 responses.
I want to set permissions_validity_in_ms to five minutes, which substantially reduces the overhead of the authorization checks without letting permissions become overly stale.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
--- Additional comment from John Sanda on 2013-10-09 14:17:02 EDT ---
I committed the change to master, but I set the cache lifetime to 10 minutes rather than 5, since I had been testing with 10 minutes.
master commit hash: d61b7ed441b25
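The committed change amounts to raising this value in the storage node's cassandra.yaml. A sketch of the resulting setting (10 minutes expressed in milliseconds):

```yaml
# Cache permissions for 10 minutes (600000 ms) instead of the 2000 ms
# default, to cut down on queries against system_auth.permissions
# when the node is under heavy load.
permissions_validity_in_ms: 600000
```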
Commit pushed to release/jon3.2.x branch.
commit hash: c4018b21d3af
Moving to ON_QA for testing in the next build.
Created attachment 824558
no time out exceptions in rhq-storage.log -> http://d.pr/f/FHE9
1. Installed storage, server, and agent on a slow environment (IP1)
2. Installed and started a storage node on IP2, connected to IP1
3. Ran a repair operation on the storage node on IP1 (expecting a timeout; no timeout was visible)