Bug 1105742 - Storage node can get stuck in MAINTENANCE mode if cluster maintenance is executed when one or more agents are unavailable or restarting
Summary: Storage node can get stuck in MAINTENANCE mode if cluster maintenance is exec...
Alias: None
Product: JBoss Operations Network
Classification: JBoss
Component: Storage Node
Version: JON 3.2.1
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: JON 4.0.0
Assignee: Michael Burman
QA Contact: Mike Foley
Depends On: 1120418
Reported: 2014-06-07 00:01 UTC by Larry O'Leary
Modified: 2019-08-05 14:50 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2019-08-05 14:50:55 UTC
Type: Bug

System | ID | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1105743 | None | CLOSED | Viewing operation details for operations with no results causes Globally uncaught exception in UI - java.lang.IllegalArg... | 2019-08-05 14:49:53 UTC
Red Hat Knowledge Base (Solution) | 911393 | None | None | None | Never

Internal Links: 1105743

Description Larry O'Leary 2014-06-07 00:01:44 UTC
Description of problem:
Running the weekly storage cluster maintenance job, or running storage cluster maintenance manually, can put a storage node into what appears to be an unrecoverable MAINTENANCE state if the node's agent is offline due to a synchronized restart.

For example, if agent processes are automatically restarted once a week, the restart can coincide with the storage cluster auto-maintenance job, which also runs once a week. The storage node is then left in the MAINTENANCE operation mode and, as a result, its cluster status is reported as DOWN.
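
To illustrate the collision described above, the following sketch checks whether two recurring weekly windows (an agent restart and a maintenance run) overlap. It is written in JavaScript, the JBoss ON CLI scripting language. The function names and the example schedule (an agent restart at Sunday 02:00 lasting 10 minutes, maintenance assumed to take 30 minutes) are hypothetical and are not JBoss ON defaults.

```javascript
// Hypothetical illustration only: both schedules are modeled as weekly
// windows measured in minutes since Sunday 00:00. The example times are
// invented; they are not JBoss ON defaults.
const MINUTES_PER_WEEK = 7 * 24 * 60;

// Convert a day of week (0 = Sunday) and a wall-clock time to week minutes.
function toWeekMinutes(day, hour, minute) {
  return day * 24 * 60 + hour * 60 + minute;
}

// True if the half-open windows [startA, startA + lenA) and
// [startB, startB + lenB) overlap on the weekly cycle, including windows
// that wrap past Saturday midnight. Lengths must be positive and at most
// one week.
function weeklyWindowsOverlap(startA, lenA, startB, lenB) {
  const a = ((startA % MINUTES_PER_WEEK) + MINUTES_PER_WEEK) % MINUTES_PER_WEEK;
  const b = ((startB % MINUTES_PER_WEEK) + MINUTES_PER_WEEK) % MINUTES_PER_WEEK;
  // Two circular windows overlap exactly when one starts inside the other.
  const aToB = (b - a + MINUTES_PER_WEEK) % MINUTES_PER_WEEK;
  const bToA = (a - b + MINUTES_PER_WEEK) % MINUTES_PER_WEEK;
  return aToB < lenA || bToA < lenB;
}

// Hypothetical agent restart: Sunday 02:00, lasting 10 minutes.
const agentRestart = { start: toWeekMinutes(0, 2, 0), length: 10 };

// Maintenance starting Sunday 02:05 (assumed 30 minutes long) collides.
console.log(weeklyWindowsOverlap(agentRestart.start, agentRestart.length, toWeekMinutes(0, 2, 5), 30)); // true
// Maintenance starting Sunday 03:00 does not.
console.log(weeklyWindowsOverlap(agentRestart.start, agentRestart.length, toWeekMinutes(0, 3, 0), 30)); // false
```

Comparing start points around the weekly cycle, rather than comparing raw start and end values, keeps windows that wrap past Saturday midnight correct.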

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.  Install and start a JBoss ON 3.2 system.
2.  Install a second agent and storage node.
3.  Verify that both storage nodes are in inventory and are UP/NORMAL.
4.  Shut down one of the agents.
5.  Invoke the following JBoss ON CLI command:

        ./rhq-cli.sh -u rhqadmin -p rhqadmin -c 'StorageNodeManager.runClusterMaintenance()'

Actual results:
The storage node whose agent was unavailable has its cluster status reported as DOWN, and its operation mode is MAINTENANCE.

Expected results:
No error or bad state associated with the node.

Additional info:
Although it is understandable that maintenance cannot complete while the agent is unavailable, this situation should be temporary. As soon as the agent comes back online, cluster maintenance should resume. In other words, the error state should be reported only while the agent is down.
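
The expected behavior above amounts to remembering that maintenance is still pending for a node whose agent is offline and finishing it when the agent reconnects. The following is a minimal JavaScript sketch of that idea, written as a toy model under stated assumptions: `makeNode`, `runMaintenanceSketch`, `completeMaintenance`, `onAgentAvailable`, and the `maintenancePending` flag are hypothetical names, not the JBoss ON `StorageNodeManager` API, and the model is not the actual JBoss ON implementation.

```javascript
// Hypothetical model of the behavior requested in this report; the names
// below are invented and do not correspond to the JBoss ON API.
function makeNode(address, agentAvailable) {
  return { address, agentAvailable, operationMode: "NORMAL", maintenancePending: false };
}

// The report states that a node in MAINTENANCE shows cluster status DOWN;
// this simplified model treats any mode other than NORMAL as DOWN.
function clusterStatus(node) {
  return node.operationMode === "NORMAL" ? "UP" : "DOWN";
}

// Placeholder for the real maintenance work; here it only resets the state.
function completeMaintenance(node) {
  node.maintenancePending = false;
  node.operationMode = "NORMAL";
}

// Run maintenance on every node. Nodes whose agent is offline stay in
// MAINTENANCE with the work remembered as pending instead of being dropped.
function runMaintenanceSketch(nodes) {
  for (const node of nodes) {
    node.operationMode = "MAINTENANCE";
    node.maintenancePending = true;
    if (node.agentAvailable) {
      completeMaintenance(node);
    }
  }
}

// When an agent reconnects, finish any maintenance it missed, so the
// MAINTENANCE/DOWN state lasts only as long as the agent was down.
function onAgentAvailable(node) {
  node.agentAvailable = true;
  if (node.maintenancePending) {
    completeMaintenance(node);
  }
}

const nodes = [makeNode("node1", true), makeNode("node2", false)];
runMaintenanceSketch(nodes);
console.log(nodes.map((n) => `${n.address}: ${n.operationMode}/${clusterStatus(n)}`).join(", "));
// → node1: NORMAL/UP, node2: MAINTENANCE/DOWN
onAgentAvailable(nodes[1]);
console.log(`${nodes[1].address}: ${nodes[1].operationMode}/${clusterStatus(nodes[1])}`);
// → node2: NORMAL/UP
```

The key design choice in this sketch is that an unavailable agent turns maintenance into deferred work rather than a terminal state, which is what would keep a node from being stuck in MAINTENANCE after its agent returns.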

The fact that the node is stuck in MAINTENANCE also seems to indicate a degraded cluster. Automatic maintenance should not leave the cluster in such a state.

Comment 1 John Sanda 2014-08-29 12:20:38 UTC
Bumping the target release due to time constraints. Work has been started though in the storage_workflow branch.

Comment 3 Filip Brychta 2019-08-05 14:50:55 UTC
JBoss ON is coming to the end of its product life cycle. For more information regarding this transition, see https://access.redhat.com/articles/3827121.
This bug report/request is being closed. If you feel this issue should not be closed or requires further review, please create a new bug report against the latest supported JBoss ON 3.3 version.
