Description of problem:
Cluster-wide read repair runs during the deploy/undeploy processes. The timeout for the repair operation is hard-coded to 6 hours in the StorageNodeOperationHandlersBean.runRepair(Subject subject) method. There are times when the resource operation can take much longer to complete. In those situations the resource operation times out, but repair continues to run, and the user is left with no option other than retrying the deployment.

All resource operations have a timeout. If it is not defined in the operation parameters or in the plugin metadata, then the plugin container default is used. Since repair can take an arbitrarily long time, we should set the timeout to something really high, such as a week. It might also be good to make the timeout configurable. This would have to be exposed through the storage cluster settings in some way.
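The fallback order described above (operation parameters, then plugin metadata, then the plugin container default) could be sketched roughly as follows. This is only an illustration, not the RHQ implementation: the class OperationTimeouts, its method names, and the 10-minute stand-in for the container default are all hypothetical. Only the 6-hour and one-week values come from this report.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; none of these names exist in RHQ.
final class OperationTimeouts {

    // The hard-coded repair timeout described in this report.
    static final long OLD_REPAIR_TIMEOUT_SECONDS = TimeUnit.HOURS.toSeconds(6);

    // The "something really high, such as a week" value proposed in this report.
    static final long PROPOSED_REPAIR_TIMEOUT_SECONDS = TimeUnit.DAYS.toSeconds(7);

    // Assumption: a placeholder for the plugin container default, which is not stated here.
    static final long ASSUMED_CONTAINER_DEFAULT_SECONDS = TimeUnit.MINUTES.toSeconds(10);

    private OperationTimeouts() {
    }

    // Picks the first timeout that is defined, in the order given in the description:
    // operation parameters, then plugin metadata, then the container default.
    static long resolveTimeoutSeconds(Long fromOperationParams, Long fromPluginMetadata,
                                      long containerDefaultSeconds) {
        if (fromOperationParams != null && fromOperationParams > 0) {
            return fromOperationParams;
        }
        if (fromPluginMetadata != null && fromPluginMetadata > 0) {
            return fromPluginMetadata;
        }
        return containerDefaultSeconds;
    }

    // Sketch of a configurable repair timeout: a value from a (hypothetical)
    // storage cluster setting wins, otherwise the one-week default applies.
    static long repairTimeoutSeconds(Long configuredSeconds) {
        if (configuredSeconds != null && configuredSeconds > 0) {
            return configuredSeconds;
        }
        return PROPOSED_REPAIR_TIMEOUT_SECONDS;
    }
}
```

With a scheme like this, passing the repair timeout explicitly with the operation parameters means neither the plugin metadata nor the container default is consulted, so a long repair would not be cut off by a shorter default.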
branch: master
link: https://github.com/rhq-project/rhq/commit/d9a7ed8c6
time: 2015-09-22 19:37:01 +0200
commit: d9a7ed8c64919d9252d5461c508256847f0c7b65
author: Libor Zoubek - lzoubek
message: Bug 1232847 - Increase timeout for repair operation

    Increase timeout for all long running storage node operations to 7 days.
Cherry-picked to release/jon3.3.x:

commit a540faaf2d84be66dcddd6c7f533d6cb71e4ef9b
Author: Libor Zoubek <lzoubek>
Date: Wed Sep 16 13:02:17 2015 +0200

    Bug 1232847 - Increase timeout for repair operation

    Increase timeout for all long running storage node operations to 7 days.

    (cherry picked from commit d9a7ed8c64919d9252d5461c508256847f0c7b65)
Moving to ER01, as the fix was already cherry-picked.
Moving to ON_QA as available to test with the following build: https://brewweb.devel.redhat.com/buildinfo?buildID=460382 *Note: jon-server-patch-3.3.0.GA.zip maps to ER01 build of jon-server-3.3.0.GA-update-04.zip.
Moving target milestone to ER02 to retest after latest Cassandra changes.
Moving to ON_QA as available to test with the following build: https://brewweb.devel.redhat.com//buildinfo?buildID=461043 *Note: jon-server-patch-3.3.0.GA.zip maps to ER02 build of jon-server-3.3.0.GA-update-04.zip.
Verified on:
Version : 3.3.0.GA Update 04
Build Number : e9ed05b:aa79ebd

A repair operation started from the CLI via StorageNodeManager.runClusterMaintenance() is still in progress after 15 hours, which confirms that the timeout has been increased.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2015-1947.html