Bug 1029095

Summary: Operation take snapshot on storage service failed
Product: [JBoss] JBoss Operations Network
Reporter: Filip Brychta <fbrychta>
Component: Core Server, Operations, Storage Node
Assignee: Stefan Negrea <snegrea>
Status: CLOSED CURRENTRELEASE
QA Contact: Mike Foley <mfoley>
Severity: medium
Priority: unspecified
Version: JON 3.2
CC: fbrychta, mazz, myarboro, snegrea
Target Milestone: CR01
Target Release: JON 3.2.0
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Cloned to: 1029668
Bug Depends On: 1029668
Bug Blocks: 1012435
Attachments: logs, operation history, complete exception

Description Filip Brychta 2013-11-11 16:25:21 UTC
Created attachment 822495 [details]
logs

Description of problem:
The take snapshot operation, scheduled by default for every Sunday at 12:30 AM (https://docs.jboss.org/author/display/RHQ/Backup+and+Restore), failed.
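For context, these jobs run through Quartz (the RHQScheduler_Worker threads in the log below). A minimal sketch, assuming a standard Quartz dependency and not taken from RHQ's actual scheduling code, of the cron expression that encodes the documented "Sunday at 12:30 AM" default:

import java.text.ParseException;
import java.util.Date;
import org.quartz.CronExpression;

public class DefaultSnapshotSchedule {
    public static void main(String[] args) throws ParseException {
        // Quartz cron fields: second minute hour day-of-month month day-of-week.
        // "0 30 0 ? * SUN" fires at 00:30 (12:30 AM) every Sunday.
        CronExpression weekly = new CronExpression("0 30 0 ? * SUN");
        System.out.println("Next run: " + weekly.getNextValidTimeAfter(new Date()));
    }
}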

Version-Release number of selected component (if applicable):
Version: 3.2.0.ER5
Build Number: 2cb2bc9:225c796

How reproducible:
1/1

Steps to Reproduce:
No exact reproduction steps are available. The setup was:
machine1: JON server, JON agent, and storage node
machine2: JON agent and storage node

Actual results:
Found the following problems with the storage node on machine2:
a StorageNodeSnapshotFailure alert was shown on the StorageService resource, and the following exception appeared in the JON server log:
00:30:01,251 ERROR [org.rhq.enterprise.server.operation.ResourceOperationJob] (RHQScheduler_Worker-3) Failed to execute scheduled operation [ResourceOperationSchedule: resource=[Resource[id=10263, uuid=16c9d18a-bbae-431a-8445-85ac34d1801d, type={RHQStorage}StorageService, key=org.apache.cassandra.db:type=StorageService, name=Storage Service]],job-name=[rhq-resource-10263--1783761045-1384061400594], job-group=[rhq-resource-10263], operation-name=[takeSnapshot], subject=[Subject[id=1,name=admin]], description=[Run by StorageNodeManagerBean]]: org.rhq.enterprise.server.authz.PermissionException: The session ID for user [admin] is invalid!: invocation: method=public org.rhq.core.domain.util.PageList<org.rhq.core.domain.operation.ResourceOperationHistory> org.rhq.enterprise.server.operation.OperationManagerBean.findResourceOperationHistoriesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.ResourceOperationHistoryCriteria),context-data={}
        at org.rhq.enterprise.server.authz.RequiredPermissionsInterceptor.buildPermissionException(RequiredPermissionsInterceptor.java:164) [rhq-server.jar:4.9.0.JON320ER5]
.
.


Additional info:
The password for the storage nodes was updated and the storage nodes were restarted several times on Friday.
Manual execution of the take snapshot operation as user rhqadmin was successful.
Logs are attached.

Comment 1 Stefan Negrea 2013-11-12 21:35:36 UTC
The password change for the storage cluster is not related to the error reported above. Management of the storage nodes is done over the JMX interface, which does not require CQL credentials.
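To illustrate what "management over JMX" means here, a self-contained sketch that invokes the same org.apache.cassandra.db:type=StorageService MBean named in the resource key above. The host, port, and absence of JMX credentials are assumptions about the environment; Cassandra's StorageService MBean exposes takeSnapshot(String tag, String... keyspaces):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SnapshotOverJmx {
    public static void main(String[] args) throws Exception {
        // Host and port are assumptions -- adjust to your storage node's JMX endpoint.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7299/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // The same MBean as the resource key in the exception above.
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");
            // takeSnapshot(tag, keyspaces...) -- an empty array snapshots all keyspaces.
            mbs.invoke(storageService, "takeSnapshot",
                    new Object[] { "manual-test", new String[0] },
                    new String[] { String.class.getName(), String[].class.getName() });
        }
    }
}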

Although the error was reported on a weekly maintenance job, it could occur on any operation response related to storage nodes. The code fix uses a tested API designed for re-attaching user sessions.
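The re-attach API itself is not named in this report, so the following is only a minimal, self-contained sketch of the pattern with invented names (SessionService, withLiveSession), not the RHQ code:

// All names here are hypothetical illustrations of the re-attach pattern,
// not RHQ's actual API.
interface SessionService {
    boolean isSessionValid(Subject subject);
    Subject reattach(Subject subject); // issue a fresh session for the same user
}

class Subject {
    final String name;
    final int sessionId;
    Subject(String name, int sessionId) { this.name = name; this.sessionId = sessionId; }
}

class OperationInvoker {
    private final SessionService sessions;

    OperationInvoker(SessionService sessions) { this.sessions = sessions; }

    // Before invoking a scheduled operation, make sure the stored session is
    // still live; an expired one triggers the PermissionException seen above.
    Subject withLiveSession(Subject subject) {
        return sessions.isSessionValid(subject) ? subject : sessions.reattach(subject);
    }
}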

To trigger this issue manually, invoke StorageNodeManager.runClusterMaintenance(); from the CLI. This is equivalent to the weekly job.

For regression testing, any test case that makes use of storage operations (adding, removing, maintenance) will exercise the code change.

Comment 3 Simeon Pinder 2013-11-19 15:49:13 UTC
Moving to ON_QA as available for testing with the new brew build.

Comment 4 Filip Brychta 2013-11-20 11:35:37 UTC
Moving back to MODIFIED. The commit from comment 2 didn't make it into rc/jon3.2.0.ER6.

Comment 5 Simeon Pinder 2013-11-22 04:45:01 UTC
Moving to the ER07 target milestone and MODIFIED; this was initially rejected by QE because of the bad ER6 build but had been put into varied states (MODIFIED & ASSIGNED).

Comment 6 Simeon Pinder 2013-11-22 05:01:58 UTC
Moving to ON_QA as available to test in ER7 and later brew builds.

Comment 7 Filip Brychta 2013-11-22 12:10:31 UTC
Used the following scenario:
1- installed and started JON on server1 (./rhqctl install --start)
2- installed a second storage node on server2 (./rhqctl install --storage --agent-preference="rhq.agent.server.bind-address=10.16.23.54")
3- both storage nodes are up and running (available)
4- invoked StorageNodeManager.runClusterMaintenance();

Result:
Error in the JON server log:
06:39:43,786 ERROR [org.rhq.enterprise.server.operation.ResourceOperationJob] (RHQScheduler_Worker-5) Failed to execute scheduled operation [ResourceOperationSchedule: resource=[Resource[id=10020, uuid=40a03a48-cb10-419b-8ffd-b2729476d002, type={RHQStorage}StorageService, key=org.apache.cassandra.db:type=StorageService, name=Storage Service]],job-name=[rhq-resource-10020--1783761045-1385120382841], job-group=[rhq-resource-10020], operation-name=[takeSnapshot], subject=[Subject[id=1,name=admin]], description=[Run by StorageNodeManagerBean]]: org.rhq.enterprise.server.authz.PermissionException: The session ID for user [admin] is invalid!: invocation: method=public org.rhq.core.domain.util.PageList<org.rhq.core.domain.operation.ResourceOperationHistory> org.rhq.enterprise.server.operation.OperationManagerBean.findResourceOperationHistoriesByCriteria(org.rhq.core.domain.auth.Subject,org.rhq.core.domain.criteria.ResourceOperationHistoryCriteria),context-data={}


The complete exception is attached.
The take snapshot operation in the StorageService's operation history remains in status In Progress. See the attached screenshot.

Comment 8 Filip Brychta 2013-11-22 12:11:08 UTC
Created attachment 827708 [details]
operation history

Comment 9 Filip Brychta 2013-11-22 12:13:46 UTC
Created attachment 827711 [details]
complete exception

Comment 11 Filip Brychta 2013-11-22 14:58:23 UTC
Clarification:
The attached screenshot of the operation history shows more than one operation. This is not relevant to this issue; the screenshot is there just to illustrate that the operation is In Progress and that the 'Date Submitted' column shows not yet started.

The scenario in comment 7 should be followed, so even the very first invocation of StorageNodeManager.runClusterMaintenance(); causes this issue. Sorry for the confusing screenshot.

Comment 12 mark yarborough 2013-11-22 15:45:14 UTC
What is the time gap between steps 2 and 4 in comment 7?

Comment 13 John Mazzitelli 2013-11-22 17:39:04 UTC
Talked with Stefan, John S, and Jay S, and we think we might have a solution.

Committed these two commits to master:

c9b36a8f543968c2a172e1e79322f525e27253cc (check if the session is overlord and skip if so)
32f0b9caab102c598f58df781cac474c52bed79b (synchronize getOverlord)

We think the first (c9b) is OK for JON 3.2 inclusion. We will not cherry-pick the second (32f) since it might have ramifications we aren't sure of, so we don't want to risk it in 3.2.

Cherry-picked c9b to the release/jon3.2.x branch: 13e8b144a59eb585ad168634654cb072eee280a7
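For illustration, a minimal, self-contained sketch of the idea behind the two commits, with invented names rather than the actual RHQ code: the first fix skips the normal session re-attach path when the subject is the overlord, and the second serializes access to the cached overlord so concurrent scheduler threads cannot race:

class OverlordSessions {
    static final class Subject {
        final String name;
        final int sessionId;
        Subject(String name, int sessionId) { this.name = name; this.sessionId = sessionId; }
    }

    private Subject overlord;
    private int nextSessionId = 1;

    // c9b36a8: if the subject is the overlord, skip the normal session
    // re-attach logic and return the internally managed overlord instead.
    Subject ensureSession(Subject subject) {
        if ("overlord".equals(subject.name)) {
            return getOverlord();
        }
        return subject; // ordinary subjects go through the usual re-attach path
    }

    // 32f0b9c: synchronized so two scheduler workers cannot both decide the
    // cached session is missing/stale and replace it out from under each other.
    synchronized Subject getOverlord() {
        if (overlord == null) {
            overlord = new Subject("overlord", nextSessionId++);
        }
        return overlord;
    }
}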

Comment 14 John Mazzitelli 2013-11-22 19:39:18 UTC
release/jon3.2.x commit: 892b278064e03d2b22ea05f4c68275b1d47435d7

This is a cherry-pick of a related change that needed to come over from master: 5c7499fd0e887f0608ca8e53b8abd7922d9ded85

Comment 15 Filip Brychta 2013-11-25 08:32:18 UTC
(In reply to mark yarborough from comment #12)
> What is time gap between steps 2 and 4 in comment 7?

Step 4 was executed when both storage node resources were available and all child resources were discovered and available.

Comment 16 Filip Brychta 2013-11-25 08:48:31 UTC
Unable to reproduce on master.
Version: 4.10.0-SNAPSHOT
Build Number: 92464e1

Comment 17 Simeon Pinder 2013-12-03 23:19:41 UTC
Moving to ON_QA for testing in the latest (CR1) brew build.

Comment 18 Filip Brychta 2013-12-04 09:40:34 UTC
Verified on:
Version: 3.2.0.CR1
Build Number: 6ecd678:d0dc0b6