Bug 1029668 - Operation take snapshot on storage service failed
Status: CLOSED CURRENTRELEASE
Product: RHQ Project
Classification: Other
Component: Core Server, Operations, Storage Node
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: GA
Target Release: RHQ 4.10
Assigned To: Stefan Negrea
QA Contact: Mike Foley
Depends On:
Blocks: 1029095
Reported: 2013-11-12 16:25 EST by Stefan Negrea
Modified: 2014-03-25 17:02 EDT
CC List: 3 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1029095
Environment:
Last Closed: 2014-03-25 17:02:45 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments: None

Description Stefan Negrea 2013-11-12 16:25:38 EST
+++ This bug was initially created as a clone of Bug #1029095 +++

Description of problem:
The take snapshot operation, which is scheduled by default to run every Sunday at 12:30 AM (https://docs.jboss.org/author/display/RHQ/Backup+and+Restore), failed.

Version-Release number of selected component (if applicable):
Version: 3.2.0.ER5
Build Number: 2cb2bc9:225c796

How reproducible:
1/1

Steps to Reproduce:
No exact repro steps are available. I had the following setup:
machine1: JON server, JON agent, and storage node
machine2: JON agent and storage node

Actual results:
Found the following problem with the storage node on machine2.
A StorageNodeSnapshotFailure alert was shown on the StorageService resource, and the following exception appeared in the JON server log:
00:30:01,251 ERROR [org.rhq.enterprise.server.operation.ResourceOperationJob] (RHQScheduler_Worker-3) Failed to execute scheduled operation [ResourceOperationSchedule: resource=[Resource[id=10263, uuid=16c9d18a-bbae-431a-8445-85ac34d1801d, type={RHQStorage}StorageService, key=org.apache.cassandra.db:type=StorageService, name=Storage Service]],job-name=[rhq-resource-10263--1783761045-1384061400594], job-group=[rhq-resource-10263], operation-name=[takeSnapshot], subject=[Subject[id=1,name=admin]], description=[Run by StorageNodeManagerBean]]: org.rhq.enterprise.server.authz.PermissionException: The session ID for user [admin] is invalid!: invocation: method=public org.rhq.core.domain.util.PageList<org.rhq.core.domain.operation.ResourceOperationHistory> org.rhq.enterprise.server.operation.OperationManagerBean.findResourceOperationHistoriesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.ResourceOperationHistoryCriteria),context-data={}
        at org.rhq.enterprise.server.authz.RequiredPermissionsInterceptor.buildPermissionException(RequiredPermissionsInterceptor.java:164) [rhq-server.jar:4.9.0.JON320ER5]
.
.


Additional info:
The password for the storage nodes was updated and the storage nodes were restarted several times on Friday.
Manual execution of the take snapshot operation as user rhqadmin was successful.
Logs attached.
Comment 1 Stefan Negrea 2013-11-12 16:35:01 EST
The password change for the storage cluster is not related to the error reported above. Storage nodes are managed over the JMX interface, which does not require CQL credentials.

While the error was reported on a weekly maintenance job, it could occur on any operation response related to storage nodes. The code fix makes use of a tested API designed for re-attaching user sessions.
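
As a rough illustration of the failure mode and of what re-attaching a session means, here is a minimal, self-contained Java sketch. It is not the RHQ code from the commit below: every class and method name in it (Subject, SessionManager, reattach, findOperationHistories) is a hypothetical stand-in, and the assumption that the job's stored subject carries a session ID that is no longer valid when the operation response is processed is inferred from the "session ID for user [admin] is invalid" message in the log above.

```java
import java.util.HashMap;
import java.util.Map;

final class SessionReattachSketch {

    /** Hypothetical stand-in for a user identity that carries a session ID. */
    static final class Subject {
        final String name;
        final int sessionId;

        Subject(String name, int sessionId) {
            this.name = name;
            this.sessionId = sessionId;
        }
    }

    /** Hypothetical in-memory session registry; not the RHQ implementation. */
    static final class SessionManager {
        private final Map<Integer, String> liveSessions = new HashMap<>();
        private int nextId = 1;

        /** Creates a new live session for the named user. */
        Subject login(String name) {
            int id = nextId++;
            liveSessions.put(id, name);
            return new Subject(name, id);
        }

        /** Drops the subject's session, as a timeout or restart might. */
        void invalidate(Subject subject) {
            liveSessions.remove(subject.sessionId);
        }

        boolean isValid(Subject subject) {
            return subject.name.equals(liveSessions.get(subject.sessionId));
        }

        /** Returns the subject unchanged if its session is live, otherwise issues a new session. */
        Subject reattach(Subject subject) {
            return isValid(subject) ? subject : login(subject.name);
        }
    }

    /** Simulates a secured call that rejects callers whose session is invalid. */
    static String findOperationHistories(SessionManager sessions, Subject caller) {
        if (!sessions.isValid(caller)) {
            throw new IllegalStateException(
                "The session ID for user [" + caller.name + "] is invalid!");
        }
        return "histories for " + caller.name;
    }

    public static void main(String[] args) {
        SessionManager sessions = new SessionManager();
        Subject stored = sessions.login("admin"); // subject saved along with the schedule
        sessions.invalidate(stored);              // session is gone before the job fires

        try {
            findOperationHistories(sessions, stored);
        } catch (IllegalStateException e) {
            System.out.println("without re-attach: " + e.getMessage());
        }

        Subject fresh = sessions.reattach(stored);
        System.out.println("with re-attach: " + findOperationHistories(sessions, fresh));
    }
}
```

In this sketch the secured call fails for the stale subject and succeeds once the subject is passed through reattach first, which is the general shape of the fix described above.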

To trigger this issue manually, invoke StorageNodeManager.runClusterMaintenance() from the CLI. This is equivalent to the weekly job.

For regression testing, any test case that makes use of storage operations (adding, removing, maintenance) would exercise the code change.



master branch commit:

https://git.fedorahosted.org/cgit/rhq/rhq.git/commit/?id=d184f7ea4b7238032a6ed04d26ca0ac6776c5f25
