Bug 1232869
Summary: | Repair resource operation in storage node plugin should handle timeouts | ||
---|---|---|---|
Product: | [JBoss] JBoss Operations Network | Reporter: | John Sanda <jsanda> |
Component: | Storage Node | Assignee: | John Sanda <jsanda> |
Status: | CLOSED WONTFIX | QA Contact: | Mike Foley <mfoley> |
Severity: | medium | Docs Contact: | |
Priority: | unspecified | ||
Version: | JON 3.3.0 | CC: | fbrychta, loleary, theute |
Target Milestone: | --- | ||
Target Release: | JON 3.3.5 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2015-11-17 00:05:03 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 1200594 |
Description
John Sanda
2015-06-17 17:09:05 UTC
John, I am not sure I understand this bug. In StorageNodeComponent I can see we repair the rhq keyspace first, then system_auth, and it is all synchronous. Where exactly can a timeout happen? First, I'd change the order of the keyspaces: system_auth is going to be quick, and if the user cancels the repair (i.e. by turning off some of the nodes?) we can at least talk to the nodes. Do you suggest that StorageNodeComponent should execute MBean operations asynchronously and then start polling for the result (assuming it's possible to query another MBean about a running repair)?

I am referring to the resource operation timeout that is set in StorageNodeOperationsHandlerBean.runRepair(Subject subject, StorageNode storageNode, Configuration parameters, String operation, int timeout). We call that with a timeout value of 6 hours. In some situations, repair takes well beyond six hours. There is no way to cancel the actual repair tasks running in Cassandra. Libor, if a resource operation times out, do you know if it will be reported as a failure regardless of whether or not the operation actually completes successfully? If that is in fact the case, then StorageNodeComponent should abort the repair operation because, regardless of the actual outcome, the operation will be reported as a failure due to a timeout.

When the timeout is reached, OperationManager interrupts the thread which invokes the operation and waits for the result. If there is a way to cancel a running "repair" operation via an MBean call, we can start catching InterruptedException within the StorageNodeComponent code and cancel the task.

Bug 1232847 changed the operation timeout for the repair operation from 6 hours to 7 days. This should allow a more accurate representation of the current state of the repair operation and no longer report failures due to timeouts. The work represented by the contents of this BZ is out-of-scope for the JBoss ON maintenance stream. Therefore, this BZ is being closed as WONTFIX.
Discussion of this topic covered the concept of a retry or storage node operation queue which could better handle situations where an expected and required operation failed or its state was undetermined. This is something that can be considered in the new upstream Hawkular project and is being tracked in the community tracker: https://issues.jboss.org/browse/HAWKULAR-801
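
For illustration only, the interrupt-and-cancel idea raised in the comments above could look roughly like the following sketch. It is not the RHQ or JBoss ON code: `RepairTask`, `Outcome`, and `runWithTimeout` are hypothetical names, and the `cancel()` hook is an assumption, since the discussion states there is no known way to cancel a repair that is already running in Cassandra. The caller plays the role described for OperationManager (wait up to the timeout, then interrupt the invoking thread), and the worker plays the role suggested for StorageNodeComponent (catch InterruptedException and cancel).

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class RepairTimeoutSketch {

    /** Hypothetical stand-in for the blocking repair call plus an assumed cancel hook. */
    interface RepairTask {
        void run() throws InterruptedException; // blocks until the repair finishes
        void cancel();                          // assumption: may not exist for Cassandra repair
    }

    enum Outcome { SUCCESS, CANCELLED, FAILED }

    /** Runs the task on a worker thread and interrupts it once the timeout elapses. */
    static Outcome runWithTimeout(RepairTask task, long timeout, TimeUnit unit) {
        AtomicReference<Outcome> outcome = new AtomicReference<>(Outcome.FAILED);
        Thread worker = new Thread(() -> {
            try {
                task.run();
                outcome.set(Outcome.SUCCESS);
            } catch (InterruptedException e) {
                // The timeout fired: ask the underlying system to stop the work.
                task.cancel();
                outcome.set(Outcome.CANCELLED);
            } catch (RuntimeException e) {
                outcome.set(Outcome.FAILED);
            }
        }, "repair-worker");
        worker.start();
        try {
            // join(0) would wait forever, so wait at least one millisecond.
            worker.join(Math.max(1, unit.toMillis(timeout)));
            if (worker.isAlive()) {
                worker.interrupt();
                // Waits for the cancel hook to finish; a task that ignores
                // interruption would block here indefinitely.
                worker.join();
            }
        } catch (InterruptedException e) {
            // The calling thread itself was interrupted: pass it on and give up.
            worker.interrupt();
            Thread.currentThread().interrupt();
            return Outcome.FAILED;
        }
        return outcome.get();
    }

    public static void main(String[] args) {
        RepairTask slow = new RepairTask() {
            public void run() throws InterruptedException { Thread.sleep(60_000); }
            public void cancel() { System.out.println("cancel requested"); }
        };
        System.out.println(runWithTimeout(slow, 100, TimeUnit.MILLISECONDS));
    }
}
```

Reporting a distinct `CANCELLED` outcome, rather than a generic failure, is one way to avoid the ambiguity discussed above, where a timed-out operation is recorded as failed regardless of what actually happened to the repair.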