Red Hat Bugzilla – Bug 692933
Ability to set agent into maintenance mode
Last modified: 2012-11-29 09:24:24 EST
Feature Request: A large customer has the need to "pause/stop" all alerts/operations on all agents from firing while their server is in a maintenance mode (regular maintenance, upgrades, troubleshooting some issues, etc). It would be nice if this could be triggered from the JON Web UI.
The way we recommend doing this is to use the "maintenance mode" (MM) feature for the server. In other words, you don't shut down, disable or pause the agent but rather you put the server(s) in MM.
This frees you up from having to touch your (potentially large number of) agents - rather you just flip the mode of your servers, which will be far fewer in number. Usually you only have a few servers to handle hundreds of agents, so you just tell those few servers to go into MM.
To do this, go to the Administration>Server page and select your servers and press "SET MAINTENANCE". All your agents' messages will no longer be processed until at least one server goes back to normal mode (which you do by pressing the "SET NORMAL" button on that same UI page).
If this satisfies your use-case, please close this issue. Otherwise, please explain why server maintenance mode doesn't do what you need and how a new "agent maintenance mode" would solve the problem. As it stands right now, I would consider this issue a non-problem, since we already have a UI page that lets you turn maintenance mode on and off for servers. That effectively addresses the use-case mentioned in this issue (which is: don't have agent messages trigger alerts and their resulting operation notifications).
er.. of course, I just read your description more closely and it sounds like you have server MM turned on already.
so, it appears I'm confused about what you want :)
please provide more details.
The way I understand it is that they want some "granularity", so that only certain agents go into maintenance mode. Setting it through the JON HA Server screen shuts down ALL agents that are connected to that JON server.
So for example, they have 2 JON servers (A & B) and have 10 agents connected to each JON server.
Now say only 5 out of the 10 agents connected to JON Server A are going into maintenance mode because the servers those agents run on are undergoing maintenance.
So, they don't want to put JON server A into MM because that will put all 10 agents into MM where in actuality, only 5 need to go in MM.
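The two-server example above can be sketched as a small model. This is only an illustration of the granularity being asked for, not RHQ/JON code: the `Agent` class, the `in_maintenance` flag and the `silenced_agents` helper are all hypothetical names made up for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    server: str                   # JON server the agent is connected to
    in_maintenance: bool = False  # hypothetical per-agent flag (the requested feature)


def silenced_agents(agents, servers_in_mm):
    """Return the names of agents whose messages would not be processed."""
    return [a.name for a in agents
            if a.server in servers_in_mm or a.in_maintenance]


# 2 JON servers (A and B), 10 agents connected to each.
agents = [Agent(f"{s}-agent{i}", s) for s in ("A", "B") for i in range(1, 11)]

# Server-level MM on server A: all 10 of its agents are silenced.
server_level = silenced_agents(agents, servers_in_mm={"A"})

# Hypothetical agent-level MM: only the 5 agents whose hosts are under maintenance.
for a in agents[:5]:
    a.in_maintenance = True
agent_level = silenced_agents(agents, servers_in_mm=set())

print(len(server_level), len(agent_level))
```

With server-level MM the other 5 agents on server A are silenced even though their hosts are healthy, which is the gap the customer is pointing at.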
Does this make sense? Thanks!
Also, sorry this is opened as a bug, it really is NOT a bug, but a "Feature Request" but I couldn't find any option in bugzilla to specify that it is a "Feature Request" and not an issue. Feel free to point me to the right place or way to enter this feature request.
What if you shut down the agent and restart it when you want it to begin monitoring again?
I'm trying to figure out what you'd get by keeping the agent running but not monitoring anything.
There might be a way to do what you want (short of shutting down the agent and restarting it when ready). You can issue the prompt command "pc stop" (via the agent resource's operation "Execute Prompt Command"). That would stop the plugin container and all plugins, but the agent itself remains running. When ready, you can then execute the "pc start" command to restart the agent's plugin container and all plugins (thus getting it to start monitoring again).
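To make the intended effect of the two prompt commands concrete, here is a toy state model. It is a sketch only: `ToyAgent` and its attributes are hypothetical, it does not call any RHQ API, and it does not model what happens when the command is issued through an operation that itself runs inside the plugin container.

```python
class ToyAgent:
    """Hypothetical stand-in for an agent; not RHQ code."""

    def __init__(self):
        self.process_running = True      # the agent process itself
        self.plugin_container_up = True  # the plugins that do the monitoring

    def is_monitoring(self):
        return self.process_running and self.plugin_container_up

    def prompt(self, command):
        """Handle the two prompt commands mentioned above; return monitoring state."""
        if command == "pc stop":
            self.plugin_container_up = False
        elif command == "pc start":
            self.plugin_container_up = True
        else:
            raise ValueError(f"unsupported in this sketch: {command!r}")
        return self.is_monitoring()


agent = ToyAgent()
after_stop = agent.prompt("pc stop")    # monitoring paused, process still alive
still_running = agent.process_running
after_start = agent.prompt("pc start")  # monitoring resumes
```

The point of the model is that "pc stop" pauses monitoring without ending the agent process, so no restart of the agent is needed to resume.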
Thanks for the suggestion. Though it may be possible, but that would be a chore if they have many agents? I guess that's one of the reasons they are requesting something that can be done through the JON Gui?
(In reply to comment #6)
> Thanks for the suggestion. Though it may be possible, but that would be a chore
> if they have many agents? I guess that's one of the reasons they are requesting
> something that can be done through the JON Gui?
This can be done through the GUI and should be relatively easy. Create a compatible group that contains all your "RHQ Agent" resources (I assume all agents are imported into inventory - if they are not, then you can't do this - you'd have to import them). You can easily create this group via a group definition (aka dynagroup) - the group evaluation expression would be something like:
resource.resourceType.pluginName = RHQAgent
resource.resourceType.typeName = RHQ Agent
Then in the compatible group view, Operations tab, you invoke the "Execute Prompt Command" - specifically the commands you'd execute via that Operation would be "pc stop" or "pc start" as explained in the earlier comment #5.
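As a rough illustration of what the two-line group expression selects, the sketch below treats each line as a "key = value" condition and keeps only the resources that satisfy all of them. This is an assumption-laden toy, not how RHQ evaluates dynagroup expressions: `parse_expression`, `matches` and the sample `inventory` (including its resource names and the non-agent plugin and type values) are made up for this example.

```python
def parse_expression(text):
    """Parse 'dotted.key = value' lines into a dict of required values."""
    conditions = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        conditions[key.strip()] = value.strip()
    return conditions


def matches(resource, conditions):
    """A resource is a member only if every condition holds."""
    return all(resource.get(key) == value for key, value in conditions.items())


EXPRESSION = """
resource.resourceType.pluginName = RHQAgent
resource.resourceType.typeName = RHQ Agent
"""

# Hypothetical inventory; only the two agent resources should match.
inventory = [
    {"name": "agent-host1",
     "resource.resourceType.pluginName": "RHQAgent",
     "resource.resourceType.typeName": "RHQ Agent"},
    {"name": "host1",
     "resource.resourceType.pluginName": "Platforms",
     "resource.resourceType.typeName": "Linux"},
    {"name": "agent-host2",
     "resource.resourceType.pluginName": "RHQAgent",
     "resource.resourceType.typeName": "RHQ Agent"},
]

conditions = parse_expression(EXPRESSION)
group = [r["name"] for r in inventory if matches(r, conditions)]
print(group)
```

Because both conditions must hold, the resulting group contains a single resource type, which is what makes it a compatible group whose Operations tab can offer the agent operations.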
Thanks John. Let me bring this back to the customer and see if this fulfils their request
Customer responded, and it looks like this suggestion (pc stop/start) together with the group definition will meet their needs. Their only remaining request is whether we could offer "pc stop" and "pc start" as spelled-out commands to choose from when running an operation on the agent.
Circling back on this topic, customer reported that they were unable to run the "pc" command (start or stop) on the agent in the compatible group. I confirmed it as well with a test on my local machine with 1 RHQ agent. I can do a "pc status" but when doing a "pc stop", I get a "failure" with the following error:
java.lang.RuntimeException: Call to [org.rhq.plugins.agent.AgentServerComponent.invokeOperation()] with args [[executePromptCommand, Configuration[id=12094]]] was rudely interrupted.
at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:452)
at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invoke(ResourceContainer.java:434)
at $Proxy64.invokeOperation(Unknown Source)
at org.rhq.core.pc.operation.OperationInvocation.run(OperationInvocation.java:217)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:636)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1042)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:257)
at java.util.concurrent.FutureTask.get(FutureTask.java:119)
at org.rhq.core.pc.inventory.ResourceContainer$ResourceComponentInvocationHandler.invokeInNewThreadWithLock(ResourceContainer.java:446)
... 6 more
but, it does show on the agent command line that "The plugin container has been stopped".
However, running a "pc start" command results in the following error:
org.rhq.core.clientapi.agent.PluginContainerException: Failed to submit invocation request. resource=, operation=[executePromptCommand], jobId=[rhq-resource-10002--1981725742-1309981009433_=_rhq-resource-10002_=_1309981009491]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at $Proxy0.execute(Unknown Source)
Caused by: java.lang.NullPointerException: [Warning] null
... 26 more
Side note: the agent and server are registered under a loopback address; not sure if that made any difference.
Try using the agent operation "restartPluginContainer".
There is also a "restart" operation on the agent - this restarts not only the PC but the agent core internals as well (such as the comm layer).
Supporting this feature would require some enhancement to the agent-side code.