Bug 1272328
| Summary: | Storage node in maintenance operation mode cannot be undeployed | | |
|---|---|---|---|
| Product: | [JBoss] JBoss Operations Network | Reporter: | bkramer <bkramer> |
| Component: | Storage Node | Assignee: | Michael Burman <miburman> |
| Status: | CLOSED ERRATA | QA Contact: | Filip Brychta <fbrychta> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | JON 3.3.2 | CC: | bkramer, fbrychta, jsanda, loleary, spinder |
| Target Milestone: | CR02 | Keywords: | Triaged |
| Target Release: | JON 3.3.8 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-02-16 18:44:37 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1272329 | | |
| Bug Blocks: | | | |
Description (reported by bkramer, 2015-10-16 06:43:01 UTC)
Comment (John Sanda):

We set the operation mode to MAINTENANCE when we run the weekly repair job. I don't think we use that operation mode for anything else. Undeploying a node while repair is running would cause the repair job to fail. That in and of itself is not a big problem, except that there is no mechanism in place to automatically rerun the repair job. I think it would be good if we can first figure out, and hopefully fix, getting stuck in MAINTENANCE mode. bkramer, do you have steps to reproduce getting stuck in maintenance mode?

Comment (bkramer):

(In reply to John Sanda from comment #3)

John, unfortunately I don't have steps to reproduce this issue. Going back to the original issue: it seems that this storage node/server/agent was initially installed by mistake, then shut down and not used for more than 6 months. However, when users tried to remove the unused storage node, they discovered that it could not be done because it was in maintenance operation mode. They tried to start the JON storage node and then remove it, but this didn't work. They then ran the runClusterMaintenance() operation, but with no luck. Finally, they manually changed the operation_mode in the database to NORMAL:

```sql
UPDATE rhq_storage_node
   SET operation_mode = 'NORMAL',
       resource_op_hist_id = NULL,
       error_msg = NULL
 WHERE id IN (SELECT id
                FROM rhq_storage_node
               WHERE operation_mode = 'MAINTENANCE'
                 AND resource_id IS NOT NULL);
```

and after that they were able to remove the storage node.
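The manual workaround can be exercised against a throwaway in-memory SQLite copy of the table. The schema here is inferred from the columns named in the UPDATE statement, and the rows are made-up examples, not real JON data:

```python
import sqlite3

# In-memory stand-in for the rhq_storage_node table (schema inferred
# from the UPDATE statement; rows are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE rhq_storage_node (
    id INTEGER PRIMARY KEY,
    operation_mode TEXT,
    resource_op_hist_id INTEGER,
    error_msg TEXT,
    resource_id INTEGER)""")
conn.executemany(
    "INSERT INTO rhq_storage_node VALUES (?, ?, ?, ?, ?)",
    [(1, "NORMAL", None, None, 100),
     (2, "MAINTENANCE", 42, "repair failed", 101),  # stuck node
     (3, "MAINTENANCE", None, None, None)])         # no resource attached

# The workaround from the comment: reset only MAINTENANCE nodes that
# still have a resource attached.
conn.execute("""UPDATE rhq_storage_node
    SET operation_mode = 'NORMAL', resource_op_hist_id = NULL, error_msg = NULL
    WHERE id IN (SELECT id FROM rhq_storage_node
                 WHERE operation_mode = 'MAINTENANCE'
                   AND resource_id IS NOT NULL)""")

modes = dict(conn.execute("SELECT id, operation_mode FROM rhq_storage_node"))
print(modes)  # node 2 is reset to NORMAL; node 3 (no resource_id) is untouched
```

Note that the `resource_id IS NOT NULL` filter means orphaned rows without an attached resource are deliberately left alone.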
Comment (Larry O'Leary):

@John, getting stuck in MAINTENANCE mode is a very common issue for storage nodes. It has been a long-running pain point in JBoss ON 3.2 and later. If, for example, you add a new node while one of the existing nodes is down, or its agent is down, you end up stuck in maintenance mode. In many cases we have the user fix/repair the issue that caused maintenance to happen and then run the cluster maintenance job. But in the case of a node that has been deleted, or is not available due to a bad IP or host name, you can't undeploy.

Comment (John Sanda):

If decommissioning a node while read repair is running does not introduce any instability in the cluster, then we could simply allow a node in maintenance mode to be undeployed. I am still investigating whether or not this is a safe course of action.

There is another flag we use during cluster maintenance: the rhq_storage_node.maintenance_pending column. Prior to running repair, we set it to true for each node. This is used to implement a data structure for tracking the remaining nodes for which repair needs to be run. When repair completes successfully, the field is set back to false. If a node is stuck in maintenance mode but the maintenance pending flag for all other nodes is false, we could safely allow the node to be undeployed. This would be an easy change.

I was reviewing the relevant code in StorageNodeOperationsHandlerBean that sets the maintenance pending flag. It looks like we only set it back to false when the repair completes successfully. I am inclined to say that is a bug. Regardless of the result, I think we should be setting it to false as long as there is a result. In terms of code changes, that would also be pretty straightforward.

Of course, if multiple nodes are stuck with their maintenance pending flag set to true, then this won't work. If that is the case, then I think we might want to consider providing a CLI operation to take a node out of maintenance mode.
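The proposed guard can be sketched as a small predicate. The function and record layout are illustrative, not the actual StorageNodeOperationsHandlerBean code; only the rule itself (undeploy a MAINTENANCE node only when no other node still has maintenance pending) comes from the discussion:

```python
# Sketch of the proposed rule: a node stuck in MAINTENANCE may be
# undeployed only if no *other* node still has maintenance_pending set,
# i.e. no repair work is outstanding elsewhere. Field names mirror the
# rhq_storage_node columns; the helper itself is hypothetical.

def can_undeploy(node_id, nodes):
    """nodes: list of dicts with 'id', 'operation_mode', 'maintenance_pending'."""
    target = next(n for n in nodes if n["id"] == node_id)
    if target["operation_mode"] != "MAINTENANCE":
        return True  # normal undeploy path, no special handling needed
    return not any(n["maintenance_pending"] for n in nodes if n["id"] != node_id)

cluster = [
    {"id": 1, "operation_mode": "NORMAL", "maintenance_pending": False},
    {"id": 2, "operation_mode": "MAINTENANCE", "maintenance_pending": True},
    {"id": 3, "operation_mode": "NORMAL", "maintenance_pending": False},
]
print(can_undeploy(2, cluster))  # True: all *other* nodes are done
```

If node 3 still had its flag set, the predicate would return False, which is exactly the multi-node-stuck case where a CLI escape hatch would be needed.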
Comment (Larry O'Leary):

I am assuming your idea would be to allow a decommission/undeploy of a storage node that is in maintenance mode as long as all other nodes have their maintenance pending flag set to false? If that is the case, the CLI option seems like an ideal method to provide a workaround for instances where other nodes have their maintenance pending flag set to true. In those cases the expectation would be a manual fix (node repair plus properly updating seed lists and authentication lists), followed by executing the CLI operation to set maintenance pending to false, followed by decommission of the storage node that should never have been added or is in a permanently broken/down state. If my understanding is correct, this seems like a good plan.

Comment (John Sanda):

(In reply to Larry O'Leary from comment #8)
> I am assuming your idea would be to allow a decommission/undeploy of a
> storage node that is in maintenance mode as long as all other nodes have
> their maintenance pending flag set to false?

Correct. Based on my initial review of the code, some additional changes are needed to ensure that the maintenance pending flag gets reset back to false when the repair resource operation completes for each node. It looks like we only reset it when the operation completes successfully.
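The end-to-end workaround described above can be sketched as plain data manipulation. The node records and the reset step are illustrative stand-ins; in the real product the reset would go through the proposed super-user CLI operation, not direct state mutation:

```python
# Sketch of the CLI-workaround flow: after manually repairing a node
# whose maintenance_pending flag was never reset, clear the stale flag,
# then re-check whether the broken node may be decommissioned.
# All names and records here are hypothetical.

nodes = {
    1: {"operation_mode": "NORMAL", "maintenance_pending": True},      # stale flag
    2: {"operation_mode": "MAINTENANCE", "maintenance_pending": True}, # node to remove
}

# Step 1 (proposed CLI operation): a super-user clears the stale flag
# after verifying that repair has actually finished on node 1.
nodes[1]["maintenance_pending"] = False

# Step 2: the undeploy guard from the discussion; node 2 may be removed
# once no *other* node has maintenance pending.
others_pending = any(n["maintenance_pending"] for nid, n in nodes.items() if nid != 2)
print("can undeploy node 2:", not others_pending)  # True after the reset
```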
Comment (John Sanda):

If we make the changes around the maintenance pending flag, we should be able to run the undeploy operation, which will take care of updating seed lists and auth lists. We need an easy, reliable way to determine whether cluster maintenance is in fact running. One option would be to store a flag in the system settings table; if that flag is false, we don't have to worry about the maintenance pending flags at all.

The undeploy operation is a workflow that consists of several resource operations. The most notable of those is decommission, which performs the necessary work to remove the node from the cluster: the other nodes are informed that the node is leaving, and they take ownership (if necessary) of the leaving node's data. Decommission has to be run on the node that is being removed. If the node cannot be started, then we need to follow a different procedure. If by some chance the node never properly joined the cluster, such that running `nodetool status` does not list it along with the other storage nodes, then we may need to follow different steps as well. I want to point these things out to make sure we cover our bases.

Larry, do you think that the CLI operation previously discussed would be sufficient?

Comment (Larry O'Leary):

Yes. As long as the CLI method is limited to a super-user role, I think this would be a fine option. No schema changes though.

Comment:

Moving to ON_QA as available for test with the latest build. JON 3.3.8 CR01 artifacts are available for test from here:

http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/114/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip

*Note: jon-server-patch-3.3.0.GA.zip maps to the CR01 build of jon-server-3.3.0.GA-update-08.zip.
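The decision points called out above (does the node start, and does `nodetool status` list it) can be sketched as a small dispatcher. The procedure labels are placeholders, not actual JON resource operations; only the branching mirrors the comment:

```python
# Sketch of the removal decision tree described in the comment.
# The inputs and returned procedure names are illustrative.

def removal_procedure(node_starts, in_nodetool_status):
    if not in_nodetool_status:
        # The node never properly joined the ring, so decommission
        # (which streams data to surviving replicas) is moot.
        return "remove-without-decommission"
    if not node_starts:
        # Decommission must run *on* the leaving node, so a node that
        # cannot be started needs an alternate path driven from a
        # surviving node.
        return "remove-dead-node"
    # Normal path: run decommission on the leaving node, then update
    # seed lists and auth lists on the remaining nodes.
    return "undeploy-with-decommission"

print(removal_procedure(True, True))   # undeploy-with-decommission
print(removal_procedure(False, True))  # remove-dead-node
```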
Comment:

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=537179

Moving to ON_QA as available for testing with the following build:

http://download.eng.bos.redhat.com/brewroot/packages/org.jboss.on-jboss-on-parent/3.3.0.GA/117/maven/org/jboss/on/jon-server-patch/3.3.0.GA/jon-server-patch-3.3.0.GA.zip

*Note: jon-server-patch-3.3.0.GA.zip maps to the CR02 build of jon-server-3.3.0.GA-update-08.zip.

Comment:

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2017-0285.html