1029668 – Operation take snapshot on storage service failed

Summary: Operation take snapshot on storage service failed

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	RHQ Project
Classification:	Other
Component:	Core Server, Operations, Storage Node
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	GA
Target Release:	RHQ 4.10
Assignee:	Stefan Negrea
QA Contact:	Mike Foley
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1029095

Reported:	2013-11-12 21:25 UTC by Stefan Negrea
Modified:	2014-03-25 21:02 UTC
CC List:	3 users
Fixed In Version:
Clone Of:	1029095
Environment:
Last Closed:	2014-03-25 21:02:45 UTC
Embargoed:

Description Stefan Negrea 2013-11-12 21:25:38 UTC

+++ This bug was initially created as a clone of Bug #1029095 +++

Description of problem:
The take snapshot operation, which is scheduled by default for every Sunday at 12:30 AM (https://docs.jboss.org/author/display/RHQ/Backup+and+Restore), failed.

Version-Release number of selected component (if applicable):
Version: 3.2.0.ER5
Build Number: 2cb2bc9:225c796

How reproducible:
1/1

Steps to Reproduce:
No exact repro steps available. I had the following setup:
machine1: JON server, JON agent and storage node
machine2: JON agent and storage node

Actual results:
Found the following problems with the storage node on machine2.
A StorageNodeSnapshotFailure alert was shown on the StorageService resource, and the following exception appeared in the JON server log:
00:30:01,251 ERROR [org.rhq.enterprise.server.operation.ResourceOperationJob] (RHQScheduler_Worker-3) Failed to execute scheduled operation [ResourceOperationSchedule: resource=[Resource[id=10263, uuid=16c9d18a-bbae-431a-8445-85ac34d1801d, type={RHQStorage}StorageService, key=org.apache.cassandra.db:type=StorageService, name=Storage Service]],job-name=[rhq-resource-10263--1783761045-1384061400594], job-group=[rhq-resource-10263], operation-name=[takeSnapshot], subject=[Subject[id=1,name=admin]], description=[Run by StorageNodeManagerBean]]: org.rhq.enterprise.server.authz.PermissionException: The session ID for user [admin] is invalid!: invocation: method=public org.rhq.core.domain.util.PageList<org.rhq.core.domain.operation.ResourceOperationHistory> org.rhq.enterprise.server.operation.OperationManagerBean.findResourceOperationHistoriesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.ResourceOperationHistoryCriteria),context-data={}
        at org.rhq.enterprise.server.authz.RequiredPermissionsInterceptor.buildPermissionException(RequiredPermissionsInterceptor.java:164) [rhq-server.jar:4.9.0.JON320ER5]
.
.


Additional info:
The password for the storage nodes was updated and the storage nodes were restarted several times on Friday.
Manual execution of the take snapshot operation as user rhqadmin was successful.
Logs attached.

Comment 1 Stefan Negrea 2013-11-12 21:35:01 UTC

The password change for the storage cluster is not related to the error reported above. Storage nodes are managed over the JMX interface, which does not require CQL credentials.

While the error was reported on a weekly maintenance job, it could occur on any operation response related to storage nodes. The code fix makes use of a tested API designed for re-attaching user sessions.
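
To illustrate the failure mode and the kind of fix described above, here is a minimal, self-contained Java sketch. It is not the RHQ code: Subject, SessionStore, ensureValid and findOperationHistories are hypothetical stand-ins invented for this example, and the assumption that the stored session had expired by the time the operation response was processed is an inference from the "session ID ... is invalid" message, not something confirmed in this report.

```java
import java.util.HashMap;
import java.util.Map;

public class SessionReattachSketch {

    /** Hypothetical stand-in for a user identity that carries a session ID. */
    static final class Subject {
        final String name;
        Integer sessionId; // null when no session is attached

        Subject(String name) {
            this.name = name;
        }
    }

    /** Hypothetical in-memory session store with a fixed timeout. */
    static final class SessionStore {
        private final Map<Integer, Long> expiryBySessionId = new HashMap<>();
        private final long timeoutMillis;
        private int nextId = 1;

        SessionStore(long timeoutMillis) {
            this.timeoutMillis = timeoutMillis;
        }

        /** Attaches a fresh session to the subject and returns the subject. */
        Subject attach(Subject subject, long nowMillis) {
            int id = nextId++;
            expiryBySessionId.put(id, nowMillis + timeoutMillis);
            subject.sessionId = id;
            return subject;
        }

        /** True only if the subject has a session that has not expired yet. */
        boolean isValid(Subject subject, long nowMillis) {
            if (subject.sessionId == null) {
                return false;
            }
            Long expiry = expiryBySessionId.get(subject.sessionId);
            return expiry != null && nowMillis < expiry;
        }

        /** Re-attaches a session only when the current one is no longer valid. */
        Subject ensureValid(Subject subject, long nowMillis) {
            return isValid(subject, nowMillis) ? subject : attach(subject, nowMillis);
        }
    }

    /** Hypothetical guarded call that rejects subjects with an invalid session. */
    static String findOperationHistories(SessionStore store, Subject subject, long nowMillis) {
        if (!store.isValid(subject, nowMillis)) {
            throw new SecurityException("The session ID for user [" + subject.name + "] is invalid!");
        }
        return "histories for " + subject.name;
    }

    public static void main(String[] args) {
        SessionStore store = new SessionStore(1_000);
        Subject admin = store.attach(new Subject("admin"), 0);

        // Assumed scenario: the operation response arrives after the session timed out.
        long later = 5_000;
        try {
            findOperationHistories(store, admin, later);
        } catch (SecurityException e) {
            System.out.println("without re-attach: " + e.getMessage());
        }

        // Re-attaching first gives the guarded call a valid session.
        Subject refreshed = store.ensureValid(admin, later);
        System.out.println("with re-attach: " + findOperationHistories(store, refreshed, later));
    }
}
```

Time is passed in explicitly so the sketch behaves deterministically. In a real server the session store would use the system clock, and the re-attach step would go through whatever session API the server already provides.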

To trigger this issue manually, invoke StorageNodeManager.runClusterMaintenance(); from the CLI. This is equivalent to the weekly job.

For regression testing, any test case that makes use of storage operations (adding, removing, maintenance) would exercise the code change.



master branch commit:

https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?id=d184f7ea4b7238032a6ed04d26ca0ac6776c5f25
