Bug 1029668

Summary: Operation take snapshot on storage service failed
Product: [Other] RHQ Project
Component: Core Server, Operations, Storage Node
Version: 4.9
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: unspecified
Target Milestone: GA
Target Release: RHQ 4.10
Hardware: Unspecified
OS: Unspecified
Reporter: Stefan Negrea <snegrea>
Assignee: Stefan Negrea <snegrea>
QA Contact: Mike Foley <mfoley>
CC: fbrychta, hrupp, snegrea
Doc Type: Bug Fix
Type: Bug
Clone Of: 1029095
Bug Blocks: 1029095
Last Closed: 2014-03-25 21:02:45 UTC

Description Stefan Negrea 2013-11-12 21:25:38 UTC
+++ This bug was initially created as a clone of Bug #1029095 +++

Description of problem:
The take snapshot operation, which is scheduled by default to run every Sunday at 12:30 AM (https://docs.jboss.org/author/display/RHQ/Backup+and+Restore), failed.

Version-Release number of selected component (if applicable):
Version: 3.2.0.ER5
Build Number: 2cb2bc9:225c796

How reproducible:
1/1

Steps to Reproduce:
No exact repro steps are available. I had the following setup:
machine1: JON server, JON agent and storage node
machine2: JON agent and storage node

Actual results:
Found the following problems with the storage node on machine2.
A StorageNodeSnapshotFailure alert was shown on the StorageService resource, and the following exception appeared in the JON server log:
00:30:01,251 ERROR [org.rhq.enterprise.server.operation.ResourceOperationJob] (RHQScheduler_Worker-3) Failed to execute scheduled operation [ResourceOperationSchedule: resource=[Resource[id=10263, uuid=16c9d18a-bbae-431a-8445-85ac34d1801d, type={RHQStorage}StorageService, key=org.apache.cassandra.db:type=StorageService, name=Storage Service]],job-name=[rhq-resource-10263--1783761045-1384061400594], job-group=[rhq-resource-10263], operation-name=[takeSnapshot], subject=[Subject[id=1,name=admin]], description=[Run by StorageNodeManagerBean]]: org.rhq.enterprise.server.authz.PermissionException: The session ID for user [admin] is invalid!: invocation: method=public org.rhq.core.domain.util.PageList<org.rhq.core.domain.operation.ResourceOperationHistory> org.rhq.enterprise.server.operation.OperationManagerBean.findResourceOperationHistoriesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.ResourceOperationHistoryCriteria),context-data={}
        at org.rhq.enterprise.server.authz.RequiredPermissionsInterceptor.buildPermissionException(RequiredPermissionsInterceptor.java:164) [rhq-server.jar:4.9.0.JON320ER5]
.
.


Additional info:
The password for the storage nodes was updated and the storage nodes were restarted several times on Friday.
Manual execution of the take snapshot operation as user rhqadmin was successful.
Logs are attached.

Comment 1 Stefan Negrea 2013-11-12 21:35:01 UTC
The password change for the storage cluster is not related to the error reported above. Storage nodes are managed over the JMX interface, which does not require CQL credentials.

While the error was reported on a weekly maintenance job, it could occur on any operation response related to storage nodes. The code fix makes use of a tested API designed for re-attaching user sessions.
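To illustrate the failure mode and the general shape of the fix, here is a minimal sketch. It is not the RHQ code and does not reproduce the commit: SessionSketch, Subject, SessionRegistry, reattach and findHistorySafely are hypothetical names invented for this example. It assumes that a scheduled job keeps a stored subject whose session may have expired by the time the operation response is handled, and that refreshing the session before the session-checked call avoids the "session ID is invalid" error.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a session-checked service; none of these types are RHQ classes. */
final class SessionSketch {

    /** A user name plus the session ID it currently carries (null if none). */
    static final class Subject {
        final String name;
        final Integer sessionId;

        Subject(String name, Integer sessionId) {
            this.name = name;
            this.sessionId = sessionId;
        }
    }

    /** Hypothetical stand-in for the server's session bookkeeping. */
    static final class SessionRegistry {
        private final Map<Integer, String> live = new HashMap<>();
        private int nextId = 1;

        /** Creates a new live session for the given user. */
        Subject login(String name) {
            int id = nextId++;
            live.put(id, name);
            return new Subject(name, id);
        }

        /** Simulates a session timing out. */
        void expire(Subject s) {
            if (s.sessionId != null) {
                live.remove(s.sessionId);
            }
        }

        boolean isValid(Subject s) {
            return s.sessionId != null && s.name.equals(live.get(s.sessionId));
        }

        /** Returns the subject unchanged if its session is live, else one with a fresh session. */
        Subject reattach(Subject s) {
            return isValid(s) ? s : login(s.name);
        }
    }

    /** Session-checked call, analogous to the history lookup that failed in the log. */
    static String findHistory(SessionRegistry reg, Subject caller) {
        if (!reg.isValid(caller)) {
            throw new SecurityException("The session ID for user [" + caller.name + "] is invalid!");
        }
        return "history for " + caller.name;
    }

    /** The pattern: refresh the stored subject before using it in a deferred callback. */
    static String findHistorySafely(SessionRegistry reg, Subject stored) {
        return findHistory(reg, reg.reattach(stored));
    }
}
```

In this sketch, calling findHistory with a subject whose session has expired throws, while findHistorySafely succeeds because it re-attaches the session first. This matches the observation that a manual run under a freshly logged-in user worked while the scheduled run failed.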

To trigger this issue manually, invoke StorageNodeManager.runClusterMaintenance(); from the CLI. This is equivalent to the weekly job.

For regression testing, any test case that makes use of storage operations (adding, removing, maintenance) would exercise the code change.



master branch commit:

https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?id=d184f7ea4b7238032a6ed04d26ca0ac6776c5f25