Bug 1029095 - Operation take snapshot on storage service failed
Operation take snapshot on storage service failed
Status: CLOSED CURRENTRELEASE
Product: JBoss Operations Network
Classification: JBoss
Component: Core Server, Operations, Storage Node (Show other bugs)
JON 3.2
Unspecified Unspecified
unspecified Severity medium
: CR01
: JON 3.2.0
Assigned To: Stefan Negrea
Mike Foley
:
Depends On: 1029668
Blocks: 1012435
  Show dependency treegraph
 
Reported: 2013-11-11 11:25 EST by Filip Brychta
Modified: 2014-01-02 15:38 EST (History)
4 users (show)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1029668 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
logs (489.69 KB, application/x-gzip)
2013-11-11 11:25 EST, Filip Brychta
no flags Details
operation history (164.27 KB, image/png)
2013-11-22 07:11 EST, Filip Brychta
no flags Details
complete exception (17.26 KB, text/x-log)
2013-11-22 07:13 EST, Filip Brychta
no flags Details

  None (edit)
Description Filip Brychta 2013-11-11 11:25:21 EST
Created attachment 822495 [details]
logs

Description of problem:
Take snapshot operation scheduled by default to every Sunday at 12:30 AM (https://docs.jboss.org/author/display/RHQ/Backup+and+Restore) failed.

Version-Release number of selected component (if applicable):
Version :	
3.2.0.ER5
Build Number :	
2cb2bc9:225c796

How reproducible:
1/1

Steps to Reproduce:
No exact repro steps available. I had following set up:
machine1: Jon server, jon agent and storage node 
machine2: Jon agent and storage node

Actual results:
Found following problems with storage node on machine2.
StorageNodeSnapshotFailure alert shown on StorageService resource and following exception in jon server log:
00:30:01,251 ERROR [org.rhq.enterprise.server.operation.ResourceOperationJob] (RHQScheduler_Worker-3) Failed to execute scheduled operation [ResourceOperationSchedule: resource=[Resource[id=10263, uuid=16c9d18a-bbae-431a-8445-85ac34d1801d, type={RHQStorage}StorageService, key=org.apache.cassandra.db:type=StorageService, name=Storage Service]],job-name=[rhq-resource-10263--1783761045-1384061400594], job-group=[rhq-resource-10263], operation-name=[takeSnapshot], subject=[Subject[id=1,name=admin]], description=[Run by StorageNodeManagerBean]]: org.rhq.enterprise.server.authz.PermissionException: The session ID for user [admin] is invalid!: invocation: method=public org.rhq.core.domain.util.PageList<org.rhq.core.domain.operation.ResourceOperationHistory> org.rhq.enterprise.server.operation.OperationManagerBean.findResourceOperationHistoriesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.ResourceOperationHistoryCriteria),context-data={}
        at org.rhq.enterprise.server.authz.RequiredPermissionsInterceptor.buildPermissionException(RequiredPermissionsInterceptor.java:164) [rhq-server.jar:4.9.0.JON320ER5]
.
.


Additional info:
Password for storage nodes was updated and storage nodes were restarted several times on Friday.
Manual execution of take snapshot operation under user rhqadmin was successful. 
Logs attached.
Comment 1 Stefan Negrea 2013-11-12 16:35:36 EST
The password change for the storage cluster is not related to the error reported above. The management for the storage nodes is done over the JMX interface which does not require CQL crendentials.

While the error was reported on a weekly maintenance job, it could occur on any operation response related to storage nodes. The code fix makes use of a tested API designed for re-attaching user sessions.

For manually triggering this issue, invoke StorageNodeManager.runClusterMaintenance(); from the CLI. This is equivalent to the weekly job. 

For regression testing, any test cases that makes use of storage operations (adding, removing, maintenance) would exercise the code change.
Comment 3 Simeon Pinder 2013-11-19 10:49:13 EST
Moving to ON_QA as available for testing with new brew build.
Comment 4 Filip Brychta 2013-11-20 06:35:37 EST
Moving back to modified. Commit from comment 2 didn't make it to rc/jon3.2.0.ER6
Comment 5 Simeon Pinder 2013-11-21 23:45:01 EST
Moving to ER07 target milestone and MODIFIED as initially rejected by QE because of bad ER6 build but put into varied states(MODIFIED & ASSIGNED).
Comment 6 Simeon Pinder 2013-11-22 00:01:58 EST
Moving to ON_QA as available to test in ER7 and later brew builds.
Comment 7 Filip Brychta 2013-11-22 07:10:31 EST
Used following scenario:
1- installed and started jon to server1 (./rhqctl install --start)
2- installed second storage node to server2 (./rhqctl install --storage --agent-preference="rhq.agent.server.bind-address=10.16.23.54")
3- both storage nodes are up and running (available)
4- invoked StorageNodeManager.runClusterMaintenance();

Result:
Error in jon server log:
06:39:43,786 ERROR [org.rhq.enterprise.server.operation.ResourceOperationJob] (RHQScheduler_Worker-5) Failed to execute scheduled operation [ResourceOperationSchedule: resource=[Resource[id=10020, uuid=40a03a48-cb10-419b-8ffd-b2729476d002, type={RHQStorage}StorageService, key=org.apache.cassandra.db:type=StorageService, name=Storage Service]],job-name=[rhq-resource-10020--1783761045-1385120382841], job-group=[rhq-resource-10020], operation-name=[takeSnapshot], subject=[Subject[id=1,name=admin]], description=[Run by StorageNodeManagerBean]]: org.rhq.enterprise.server.authz.PermissionException: The session ID for user [admin] is invalid!: invocation: method=public org.rhq.core.domain.util.PageList<org.rhq.core.domain.operation.ResourceOperationHistory> org.rhq.enterprise.server.operation.OperationManagerBean.findResourceOperationHistoriesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.ResourceOperationHistoryCriteria),context-data={}


Complete exception attached.
Take snapshot operation in StorageService's operation history remains in status In Progress. See attached screen shot
Comment 8 Filip Brychta 2013-11-22 07:11:08 EST
Created attachment 827708 [details]
operation history
Comment 9 Filip Brychta 2013-11-22 07:13:46 EST
Created attachment 827711 [details]
complete exception
Comment 11 Filip Brychta 2013-11-22 09:58:23 EST
Clarification:
Attached screen shot of operation history shows more then 1 operation. This is not relevant for this issue. The screen shot is there just to illustrate that operation is In Progress and 'Date Submitted' column shows not yet started. 

Scenario in comment 7 should be followed. So even the very first invocation of StorageNodeManager.runClusterMaintenance(); causes this issue. Sorry for confusing screen shot.
Comment 12 mark yarborough 2013-11-22 10:45:14 EST
What is time gap between steps 2 and 4 in comment 7?
Comment 13 John Mazzitelli 2013-11-22 12:39:04 EST
talked with stefan, john s, jay s and we think we might have a solution.

commited to master these two commits:

c9b36a8f543968c2a172e1e79322f525e27253cc (check if the session is overlord and skip if so)
32f0b9caab102c598f58df781cac474c52bed79b (synchronize getOverlord)

We think the first (c9b) is OK for JON 3.2 inclusion. We will not cherry pick the second (32f) since it might have ramifications we aren't sure of thus we don't want to risk it in 3.2

cherry picked c9b to release/jon3.2.x branch: 13e8b144a59eb585ad168634654cb072eee280a7
Comment 14 John Mazzitelli 2013-11-22 14:39:18 EST
release/jon3.2.x commit: 892b278064e03d2b22ea05f4c68275b1d47435d7

this is a cherry pick of something that is related that needed to come over from master. Cherry picked 5c7499fd0e887f0608ca8e53b8abd7922d9ded85
Comment 15 Filip Brychta 2013-11-25 03:32:18 EST
(In reply to mark yarborough from comment #12)
> What is time gap between steps 2 and 4 in comment 7?

Step 4 was executed when both storage node resources were available and all children resources were discovered and available.
Comment 16 Filip Brychta 2013-11-25 03:48:31 EST
Unable to reproduce on master 
Version :	
4.10.0-SNAPSHOT
Build Number :	
92464e1
Comment 17 Simeon Pinder 2013-12-03 18:19:41 EST
Moving to ON_QA for testing in latest(CR1) brew build.
Comment 18 Filip Brychta 2013-12-04 04:40:34 EST
Verified on:
Version :	
3.2.0.CR1
Build Number :	
6ecd678:d0dc0b6

Note You need to log in before you can comment on or make changes to this bug.