Bug 1105742

Summary:	Storage node can get stuck in MAINTENANCE mode if cluster maintenance is executed when one or more agents are unavailable or restarting
Product:	[JBoss] JBoss Operations Network
Reporter:	Larry O'Leary <loleary>
Component:	Storage Node
Assignee:	Michael Burman <miburman>
Status:	CLOSED EOL
QA Contact:	Mike Foley <mfoley>
Severity:	medium
Docs Contact:
Priority:	unspecified
Version:	JON 3.2.1
CC:	fbrychta
Target Milestone:	---
Target Release:	JON 4.0.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Story Points:	---
Clone Of:
Environment:
Last Closed:	2019-08-05 14:50:55 UTC
Type:	Bug
Bug Depends On:	1120418
Bug Blocks:

Description Larry O'Leary 2014-06-07 00:01:44 UTC

Description of problem:
Storage cluster maintenance, whether run by the weekly job or invoked manually, can put a storage node into what appears to be an unrecoverable maintenance state if the node's agent is offline at the time, for example due to a synchronized restart.

For example, if agent processes are auto-restarted once a week, the restart can coincide with the storage cluster auto maintenance job, which also runs once a week. The result is that the storage node is left in an operation mode of MAINTENANCE and therefore has its cluster status reported as DOWN.

Version-Release number of selected component (if applicable):
3.2.1

How reproducible:
Always

Steps to Reproduce:
1. Install and start a JBoss ON 3.2 system.
2. Install a second agent and storage node.
3. Verify that both storage nodes are in inventory and are UP/NORMAL.
4. Shut down one of the agents.
5. Invoke the following JBoss ON CLI command:

./rhq-cli.sh -u rhqadmin -p rhqadmin -c 'StorageNodeManager.runClusterMaintenance()'

Actual results:
The storage node running on the agent that was unavailable has its cluster status reported as DOWN and its operation mode reported as MAINTENANCE.

Expected results:
No error or bad state associated with the node.

Additional info:
Although it is understandable that maintenance cannot complete while the agent is unavailable, this situation should be temporary. As soon as the agent comes back online, cluster maintenance should continue. In other words, the error state should only be reported/reflected while the agent is down.

The fact that the node is stuck in MAINTENANCE also seems to indicate a desegregated cluster. Auto maintenance shouldn't cause such situations.
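
The expected behavior described above can be sketched roughly as follows. This is only an illustrative model in Python, not JBoss ON code: the `StorageNode` and `MaintenanceCoordinator` classes, the `pending` set, and the `on_agent_available` callback are all hypothetical names invented for this sketch. It assumes the server can be notified when an agent reconnects, and it stands in a simple mode change for the real per-node maintenance work.

```python
from dataclasses import dataclass, field
from enum import Enum


class OperationMode(Enum):
    NORMAL = "NORMAL"
    MAINTENANCE = "MAINTENANCE"


@dataclass
class StorageNode:
    # Hypothetical model of a storage node and the agent that manages it.
    name: str
    agent_available: bool = True
    mode: OperationMode = OperationMode.NORMAL


@dataclass
class MaintenanceCoordinator:
    # Hypothetical coordinator; NOT the actual JBoss ON implementation.
    nodes: list
    pending: set = field(default_factory=set)

    def run_cluster_maintenance(self):
        for node in self.nodes:
            node.mode = OperationMode.MAINTENANCE
            if node.agent_available:
                self._complete(node)
            else:
                # Agent is down: remember the node instead of abandoning it.
                self.pending.add(node.name)

    def on_agent_available(self, node):
        # Assumed hook, called when an agent comes back online.
        node.agent_available = True
        if node.name in self.pending:
            self.pending.discard(node.name)
            self._complete(node)

    def _complete(self, node):
        # Placeholder for the real maintenance work on the node.
        node.mode = OperationMode.NORMAL


node_a = StorageNode("node-a")
node_b = StorageNode("node-b", agent_available=False)
coordinator = MaintenanceCoordinator([node_a, node_b])

coordinator.run_cluster_maintenance()
# node_b remains in MAINTENANCE only while its agent is down.
coordinator.on_agent_available(node_b)
# node_b is back in NORMAL once the deferred maintenance has completed.
```

The point of the sketch is that a node whose agent is unreachable is tracked as pending rather than left in MAINTENANCE indefinitely, so the bad state lasts only as long as the agent outage.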

Comment 1 John Sanda 2014-08-29 12:20:38 UTC

Bumping the target release due to time constraints. Work has started, though, in the storage_workflow branch.

Comment 3 Filip Brychta 2019-08-05 14:50:55 UTC

JBoss ON is coming to the end of its product life cycle. For more information regarding this transition, see https://access.redhat.com/articles/3827121.
This bug report/request is being closed. If you feel this issue should not be closed or requires further review, please create a new bug report against the latest supported JBoss ON 3.3 version.