Bug 1272328

Summary: Storage node in maintenance operation mode cannot be undeployed
Product: [JBoss] JBoss Operations Network
Reporter: bkramer <bkramer>
Component: Storage Node
Assignee: Michael Burman <miburman>
Status: CLOSED ERRATA
QA Contact: Filip Brychta <fbrychta>
Severity: medium
Priority: low
Version: JON 3.3.2
CC: bkramer, fbrychta, jsanda, loleary, spinder
Target Milestone: CR02
Keywords: Triaged
Target Release: JON 3.3.8
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Type: Bug
Last Closed: 2017-02-16 18:44:37 UTC
Bug Depends On: 1272329

Description bkramer 2015-10-16 06:43:01 UTC
Description of problem:
A storage node that was initially deployed by mistake, with cluster status 'NORMAL', availability 'UP', but operation mode 'MAINTENANCE', cannot be undeployed.

Currently, the method that undeploys storage nodes is:

*******************************************************
    public void undeployStorageNode(Subject subject, StorageNode storageNode) {
        StorageNodeCriteria c = new StorageNodeCriteria();
        c.addFilterId(storageNode.getId());
        c.fetchResource(true);
        List<StorageNode> storageNodes = storageNodeManager.findStorageNodesByCriteria(subject, c);
        if (storageNodes.isEmpty()) {
            throw new RuntimeException("Storage node not found, can not undeploy " + storageNode);
        }
        storageNode = storageNodes.get(0);

        switch (storageNode.getOperationMode()) {
        case INSTALLED:
            storageNodeManager.resetInNewTransaction();
            storageNodeOperationsHandler.uninstall(subject, storageNode);
            break;
        case ANNOUNCE:
        case BOOTSTRAP:
            storageNodeManager.resetInNewTransaction();
            storageNodeOperationsHandler.unannounceStorageNode(subject, storageNode);
            break;
        case ADD_MAINTENANCE:
        case NORMAL:
        case DECOMMISSION:
            storageNodeManager.resetInNewTransaction();
            storageNodeOperationsHandler.decommissionStorageNode(subject, storageNode);
            break;
        case REMOVE_MAINTENANCE:
            storageNodeManager.resetInNewTransaction();
            storageNodeOperationsHandler.performRemoveNodeMaintenance(subject, storageNode);
            break;
        case UNANNOUNCE:
            storageNodeManager.resetInNewTransaction();
            storageNodeOperationsHandler.unannounceStorageNode(subject, storageNode);
            break;
        case UNINSTALL:
            storageNodeManager.resetInNewTransaction();
            storageNodeOperationsHandler.uninstall(subject, storageNode);
            break;
        default:
            // TODO what do we do with/about maintenance mode
            throw new RuntimeException("Cannot undeploy " + storageNode);
        }
    }
*******************************************************

So, if a storage node is in maintenance mode, we just throw an exception and users get stuck with a node they don't want.

Version-Release number of selected component (if applicable):
JBoss ON 3.3.2

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:
The attempt to undeploy the storage node fails with the "Cannot undeploy..." message.

Expected results:
Storage node is properly undeployed and no exception is thrown.


Additional info:

Comment 3 John Sanda 2016-06-07 19:37:50 UTC
We set the operation mode to MAINTENANCE when we run the weekly repair job. I don't think we use that operation mode for anything else. Undeploying a node while repair is running would cause the repair job to fail. That in and of itself is not a big problem, except that there is no mechanism in place to automatically rerun the repair job. I think it would be good if we can first figure out, and hopefully fix, getting stuck in MAINTENANCE mode. bkramer, do you have steps to reproduce getting stuck in maintenance mode?

Comment 4 bkramer 2016-06-09 08:09:22 UTC
(In reply to John Sanda from comment #3)
> We set the operation mode to MAINTENANCE when we run the weekly repair job.
> I don't think we use that operation mode for anything else. Undeploying a
> node while repair is running would cause the repair job to fail. That in and
> of itself is not a big problem except that there is no mechanism in place to
> automatically rerun the repair job. I think it would be good if we can first
> figure out and hopefully fix getting stuck in MAINTENANCE mode. bkramer, do
> you have steps to reproduce getting stuck in maintenance mode?

John, unfortunately I don't have steps to reproduce this issue. Going back to the original issue, it seems that this storage node/server/agent was initially installed by mistake and then was shut down and not used for more than 6 months.

However, when users tried to remove the unused storage node, they discovered that it cannot be done because it is in maintenance operation mode. They tried to start the JON storage node and then remove it, but this didn't work. Then they ran the runClusterMaintenance() operation, but with no luck.

Finally, they manually changed the operation_mode in the database to NORMAL:

UPDATE rhq_storage_node 
    SET operation_mode = 'NORMAL',
        resource_op_hist_id = NULL, 
        error_msg = NULL 
    WHERE id IN ( SELECT id FROM rhq_storage_node WHERE operation_mode = 'MAINTENANCE' AND resource_id IS NOT NULL);

and after that they were able to remove the storage node.

Comment 6 Larry O'Leary 2016-06-09 15:53:08 UTC
@John,

Getting stuck in MAINTENANCE mode is a very common issue for storage nodes. It has been a long-running pain point in JBoss ON 3.2 and later. If, for example, you add a new node and one of the existing nodes is down or its agent is down, you end up stuck in maintenance mode.

In many cases we have the user fix/repair the issue that caused maintenance to happen and then run the cluster maintenance job. But in the case of a node that has been deleted or is not available due to a bad IP or host name, you can't undeploy.

Comment 7 John Sanda 2016-06-10 03:15:39 UTC
If decommissioning a node while read repair is running does not introduce any instability in the cluster, then we could simply allow a node in maintenance mode to be undeployed. I am still investigating whether or not this is a safe course of action.

There is another flag we use during cluster maintenance: the rhq_storage_node.maintenance_pending column. Prior to running repair, we set it to true for each node. This is used to implement a data structure for tracking the remaining nodes for which repair needs to be run. When repair completes successfully, the field is set back to false. If a node is stuck in maintenance mode but the maintenance pending flag for all other nodes is false, we could safely allow the node to be undeployed. This would be an easy change. I was reviewing the relevant code in StorageNodeOperationsHandlerBean that sets the maintenance pending flag. It looks like we only set it back to false when the repair completes successfully. I am inclined to say that is a bug. Regardless of the result, I think we should be setting it to false as long as there is a result. In terms of code changes, that would also be pretty straightforward. Of course, if we have multiple nodes stuck with their maintenance pending flag set to true, then this won't work. If that is the case, then I think we might want to consider providing a CLI operation to take a node out of maintenance mode.

Comment 8 Larry O'Leary 2016-06-10 14:19:33 UTC
I am assuming your idea would be to allow a decommission/undeploy of a storage node that is in maintenance mode as long as all other nodes have their maintenance pending flag set to false? 

If that is the case, the CLI option seems like an ideal method to provide a workaround for instances where other nodes have their maintenance pending flag set to true. In those cases the expectation would be a manual fix (node repair + properly updating seed lists and authentication lists), followed by executing the CLI operation to set maintenance pending to false, followed by decommission of the storage node that should never have been added or is in a permanently broken/down state.

If my understanding is correct, this seems like a good plan.

Comment 9 John Sanda 2016-06-13 14:01:05 UTC
(In reply to Larry O'Leary from comment #8)
> I am assuming your idea would be to allow a decommission/undeploy of a
> storage node that is in maintenance mode as long as all other nodes have
> their maintenance pending flag set to false? 

Correct. Based on my initial review of the code, some additional changes are needed to ensure that the maintenance pending flag gets reset back to false when the repair resource operation completes for each node. It looks like we only reset it when the operation completes successfully.

> 
> If that is the case, the CLI option seems like a ideal method to provide a
> workaround for instances in where other nodes have their maintenance pending
> flag set to true. In those cases the expectation would be a manual fix (node
> repair + properly updating seed lists and authentication lists) followed by
> executing the CLI operation to set maintenance pending to false followed by
> decommission of the storage node that should never have been added or is in
> a permanently broken/down state.
> 
> If my understanding is correct, this seems like a good plan.

If we make the changes with the maintenance pending flag, we should be able to run the undeploy operation, which will take care of updating seed lists and auth lists. We need an easy, reliable way to determine if cluster maintenance is in fact running. One option would be to store something in the system settings table. If that flag is false, we don't have to worry about the maintenance pending flags.

The undeploy operation is a workflow that consists of several resource operations. The most notable of those is decommission. It performs the necessary work to remove the node from the cluster. Other nodes are informed that the node is leaving the cluster, and they take ownership (if necessary) of the leaving node's data. Decommission has to be run on the node that is being removed. If the node cannot be started, then we need to follow a different procedure. If by some chance the node never properly joined the cluster, such that running `nodetool status` does not list it along with the other storage nodes, then we may need to follow different steps as well. I want to point these things out to make sure we cover our bases.

Comment 13 John Sanda 2017-01-25 16:04:03 UTC
Larry,

Do you think that the CLI operation previously discussed would be sufficient?

Comment 14 Larry O'Leary 2017-01-25 20:15:58 UTC
Yes.

As long as the CLI method was limited to a super-user role, I think this would be a fine option.

No schema changes though.

Comment 18 Simeon Pinder 2017-02-08 06:03:18 UTC
Moving to ON_QA as available for test with the latest build:

JON 3.3.8 CR01 artifacts are available for test from here:
http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/114/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip
 *Note: jon-server-patch-3.3.0.GA.zip maps to CR01 build of
 jon-server-3.3.0.GA-update-08.zip.

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=537179

Comment 19 Simeon Pinder 2017-02-11 00:19:35 UTC
Moving to ON_QA as available for testing with the following build:
 http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/117/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip
 *Note: jon-server-patch-3.3.0.GA.zip maps to CR02 build of
 jon-server-3.3.0.GA-update-08.zip.

Comment 22 errata-xmlrpc 2017-02-16 18:44:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0285.html