Bug 1232869

Summary: Repair resource operation in storage node plugin should handle timeouts
Product: [JBoss] JBoss Operations Network
Reporter: John Sanda <jsanda>
Component: Storage Node
Assignee: John Sanda <jsanda>
Status: CLOSED WONTFIX
QA Contact: Mike Foley <mfoley>
Severity: medium
Priority: unspecified
Version: JON 3.3.0
CC: fbrychta, loleary, theute
Target Release: JON 3.3.5
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2015-11-17 00:05:03 UTC
Bug Blocks: 1200594    

Description John Sanda 2015-06-17 17:09:05 UTC
Description of problem:
The repair resource operation does not handle timeouts. Because repair can be a long-running task, the operation sometimes times out, and the resource operation is then reported as having failed. The plugin needs to check whether the thread running the repair operation has been interrupted. I do not think it is possible to cancel the actual repair being performed by the storage node. In the case of a failed deployment, the user will have to rerun it, so I do not think it makes sense for the resource operation to continue running in the event of a timeout. We can also provide more accurate results about what actually happened.
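A rough sketch of the interrupt check I have in mind (the class, method, and repairKeyspace() helper names are illustrative only, not the actual plugin code):

import java.util.List;

import org.rhq.core.pluginapi.operation.OperationResult;

// Sketch only: repairKeyspace() stands in for the existing synchronous
// per-keyspace repair call; names are illustrative, not the real plugin API.
public class RepairSketch {

    public OperationResult runRepair(List<String> keyspaces) {
        OperationResult result = new OperationResult();
        for (String keyspace : keyspaces) {
            if (Thread.currentThread().isInterrupted()) {
                // The operation thread was interrupted by the timeout. The repair
                // already submitted to Cassandra cannot be cancelled from here,
                // so stop and report exactly how far we got.
                result.setErrorMessage("Repair timed out before keyspace " + keyspace
                    + " was processed; the in-progress Cassandra repair may still be running.");
                return result;
            }
            repairKeyspace(keyspace); // existing blocking repair call (assumed)
        }
        result.setSimpleResult("Repair completed for all keyspaces");
        return result;
    }

    private void repairKeyspace(String keyspace) {
        // placeholder for the synchronous MBean invocation
    }
}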


Comment 1 Libor Zoubek 2015-09-16 14:44:53 UTC
John,

I am not sure I understand this bug. In StorageNodeComponent I can see that we repair the rhq keyspace first and then system_auth, and it is all synchronous. Where exactly can the timeout happen?

First, I would change the order of the keyspaces: system_auth is going to be quick, and if the user cancels the repair (e.g. by turning off some of the nodes?) we can at least still talk to the nodes.

Do you suggest that StorageNodeComponent should execute the MBean operations asynchronously and then start polling for the result (assuming it is possible to query another MBean about the running repair)?
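Something along these lines is what I mean, assuming Cassandra exposes an asynchronous repair operation and some way to query its status (the "forceRepairAsync" operation name and the isRepairFinished() helper below are assumptions, not verified API):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Sketch of the "invoke asynchronously, then poll" idea. The MBean operation
// name and the status check are assumptions about what Cassandra exposes.
public class AsyncRepairSketch {

    private static final long POLL_INTERVAL_MS = 30_000L;

    public boolean repairAndPoll(MBeanServerConnection connection, String keyspace) throws Exception {
        ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");

        // Start the repair without blocking the operation thread on the MBean call.
        Integer commandId = (Integer) connection.invoke(storageService, "forceRepairAsync",
            new Object[] { keyspace }, new String[] { String.class.getName() });

        // Poll for completion, honoring the interrupt that OperationManager
        // delivers when the resource operation times out.
        while (!isRepairFinished(connection, commandId)) {
            if (Thread.currentThread().isInterrupted()) {
                return false; // timed out; caller reports partial progress
            }
            try {
                Thread.sleep(POLL_INTERVAL_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    private boolean isRepairFinished(MBeanServerConnection connection, Integer commandId) {
        // hypothetical: query repair status, e.g. via a status attribute or JMX notification
        return true;
    }
}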

Comment 2 John Sanda 2015-09-17 02:24:18 UTC
I am referring to the resource operation timeout that is set in StorageNodeOperationsHandlerBean.runRepair(Subject subject, StorageNode storageNode, Configuration parameters, String operation, int timeout). We call that with a timeout value of 6 hours. In some situations, repair takes well beyond six hours. There is no way to cancel the actual repair tasks running in Cassandra.

Libor, if a resource operation times out, do you know whether it will be reported as a failure regardless of whether the operation actually completes successfully? If that is in fact the case, then StorageNodeComponent should abort the repair operation, because regardless of the actual outcome the operation will be reported as a failure due to the timeout.

Comment 3 Libor Zoubek 2015-09-18 07:56:56 UTC
When the timeout is reached, OperationManager interrupts the thread that invokes the operation and waits for the result.

If there is a way to cancel a running repair via an MBean call, we can start catching InterruptedException within the StorageNodeComponent code and cancel the task.
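A minimal sketch of that, assuming the StorageService MBean exposes something like a terminate-repair operation (the operation name and the repairKeyspace() helper below are assumptions, not verified API):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Sketch only: cancel the storage node's repair if our thread is interrupted
// by the operation timeout. Names are illustrative, not the actual API.
public class CancelOnInterruptSketch {

    public void repair(MBeanServerConnection connection, String keyspace) throws Exception {
        ObjectName storageService = new ObjectName("org.apache.cassandra.db:type=StorageService");
        try {
            repairKeyspace(keyspace); // blocking call; may be interrupted by the operation timeout
        } catch (InterruptedException e) {
            // Timed out: ask Cassandra to cancel the running repair sessions,
            // then re-assert the interrupt so the operation is reported as timed out.
            connection.invoke(storageService, "forceTerminateAllRepairSessions",
                new Object[0], new String[0]);
            Thread.currentThread().interrupt();
        }
    }

    private void repairKeyspace(String keyspace) throws InterruptedException {
        // placeholder for the existing synchronous repair invocation
    }
}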

Comment 5 Larry O'Leary 2015-11-17 00:05:03 UTC
Bug 1232847 changed the operation timeout for the repair operation from 6 hours to 7 days. This should allow a more accurate representation of the current state of the repair operation and no longer report failures due to timeouts.

The work represented by the contents of this BZ is out-of-scope for the JBoss ON maintenance stream. Therefore, this BZ is being closed as WONTFIX. 

Discussion of this topic covered the concept of a retry or storage node operation queue which could better handle situations where an expected and required operation failed or its state was undetermined. This is something that can be considered in the new upstream Hawkular project and is being tracked in the community tracker: https://issues.jboss.org/browse/HAWKULAR-801