Red Hat Bugzilla – Bug 536250
Last modified: 2013-09-01 06:06:09 EDT
We need to be able to mark resources as administratively down.
This new AvailabilityType.ADMIN_DOWN should get a blue icon.
This is needed to support e.g. maintenance windows and SLAs, and also to compute correct availability percentages when a resource is down on purpose.
One example is a host with several network interfaces where only a subset is used. We don't want all the other interfaces to always show as errors, since the admin marked them as down on purpose.
This is also needed because we are not able to keep a service out of inventory (if the network interface is uninventoried, it is silently re-added after the next scan).
Another use of this is when a user uses our stuff to shut down a resource. It feels weird to tell RHQ to stop an AS instance and then see it listed as a problem resource.
Perhaps add the ability for a user to toggle whether or not a DOWN availability is a problem. This would involve adding a new boolean IS_PROBLEM column to RHQ_AVAILABILITY, which would default to true. The GUI could provide a way for the user to flip it to false. The problem resources portlet would then only display Resources whose current avail is DOWN and IS_PROBLEM=true.
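The portlet rule proposed here can be sketched in a few lines of Python. This is only an illustration of the filter, not RHQ code: the `Availability` record stands in for a RHQ_AVAILABILITY row, and the names `Availability`, `is_problem` and `problem_resources` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Availability:
    """Hypothetical stand-in for one current RHQ_AVAILABILITY row."""
    resource: str
    avail_type: str          # "UP" or "DOWN"
    is_problem: bool = True  # proposed IS_PROBLEM column, defaulting to true


def problem_resources(current):
    """Return the resources that are DOWN and still flagged as a problem."""
    return [a.resource for a in current
            if a.avail_type == "DOWN" and a.is_problem]


current = [
    Availability("eth0", "UP"),
    Availability("eth1", "DOWN"),                    # unexpected outage
    Availability("eth2", "DOWN", is_problem=False),  # user flipped the flag
]
print(problem_resources(current))
```

Only eth1 is reported; eth2 is still recorded as DOWN, but it no longer shows up as a problem resource.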
We could implement something similar for alerts.
we would still want to record *what* is happening on the box in real time, even during a maintenance window -- the theory being that the sys admin who is making the modification / upgrade will flip the maintenance window switch, muck with the managed resource, and then flip the switch back. if we continue to monitor the managed resource during this time, then the sys admin can use RHQ to watch the availability and metrics. RHQ will then be his indicator for when things have in fact returned to a steady state and he can turn off the maintenance window.
it would be better if we could support an independent maintenance-window timeline for the resource. then, the availability and metric graphs could get an overlay to indicate - regardless of what the metrics (normal, spiking, flat-lined) and availability (red, green, in between) were - that this was a maintenance window. i'm thinking the overlay would be a mostly transparent tile that would span the entire height of the metric display charts (including availability and events timelines).
maintenance windows for resources would, of course, disable any server-side processing: alert definitions, operation schedules, scheduled content pushes (ok, this doesn't exist yet ;).
I understand the reasoning about the live view during maintenance windows.
Still, there are resources, especially services like network adapters, that are down not because of a failure but simply because they are not needed. Those should not show DOWN availability, as they are down on purpose and not down on error.
And a group with one resource being ADMIN_DOWN should not take this resource into account when computing group availability.
Also, if an SLA says that a resource needs to be x% up outside of maintenance windows, we need to clearly know when maintenance is on - and it does not matter if the resource is up or down in that window.
hmm...interesting. i think that "down on purpose" and "maintenance window" share a lot of the same characteristics:
* we still want to have live information about them
* we still want to know that "red" is OK during this period
* we don't want either of these states to contribute to breached SLAs
i still think that a separate time-line is the best approach. it provides the most accurate information, and doesn't take away the possibility of calculating SLAs correctly. there are plenty of algorithms for computing aggregate information based on concurrent events along the same logical time-line.
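To illustrate how a separate maintenance time-line still allows correct SLA numbers, here is a minimal Python sketch. It assumes both time-lines are available as lists of (start, end) pairs in the same time unit, and that the "up" intervals do not overlap each other and lie inside the reporting period. The function names are hypothetical, not part of RHQ.

```python
def subtract(intervals, cuts):
    """Remove every cut window from every interval; all are (start, end) pairs."""
    result = []
    for start, end in intervals:
        pieces = [(start, end)]
        for c_start, c_end in cuts:
            next_pieces = []
            for p_start, p_end in pieces:
                if c_end <= p_start or c_start >= p_end:
                    next_pieces.append((p_start, p_end))  # no overlap
                    continue
                if p_start < c_start:
                    next_pieces.append((p_start, c_start))  # part before the cut
                if c_end < p_end:
                    next_pieces.append((c_end, p_end))      # part after the cut
            pieces = next_pieces
        result.extend(pieces)
    return result


def total(intervals):
    """Summed length of a list of (start, end) intervals."""
    return sum(end - start for start, end in intervals)


def sla_uptime_percent(period, up_intervals, maintenance):
    """Percentage of non-maintenance time during which the resource was up."""
    measured = total(subtract([period], maintenance))
    if measured == 0:
        return None  # the whole period was a maintenance window
    up = total(subtract(up_intervals, maintenance))
    return 100.0 * up / measured


# The resource is down from 40 to 60, entirely inside the window 30..70.
print(sla_uptime_percent((0, 100), [(0, 40), (60, 100)], [(30, 70)]))
```

The availability time-line keeps the real red/green data for the live view, while the SLA figure ignores whatever happened inside the window: the raw uptime in this example is 80%, but the SLA uptime is 100%.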
as for group availability, there's already an enhancement request for that. instead of trying to come up with complex logic to determine what the appropriate icon is to show on the resource browser, the idea is to break it down and show composite information instead:
X - up
Y - down
Z - unknown
maybe we need to add more types:
W - in maintenance
V - off-line (a.k.a down on purpose)
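The composite breakdown listed above could be computed along these lines. This is a rough Python sketch with made-up state names (`UP`, `DOWN`, `UNKNOWN`, `MAINTENANCE`, `OFFLINE`) and hypothetical helper functions, not RHQ's group availability code. Leaving off-line members out of the percentage is an assumption based on the earlier comment that a resource that is down on purpose should not count towards group availability.

```python
from collections import Counter

# Hypothetical state names for the five categories in the list above.
STATES = ("UP", "DOWN", "UNKNOWN", "MAINTENANCE", "OFFLINE")


def group_summary(member_states):
    """Count how many group members are in each state."""
    counts = Counter(member_states)
    return {state: counts.get(state, 0) for state in STATES}


def group_percent_up(member_states):
    """Share of members that are UP, ignoring members that are OFFLINE."""
    counted = [s for s in member_states if s != "OFFLINE"]
    if not counted:
        return None  # nothing left to measure
    return 100.0 * counted.count("UP") / len(counted)


members = ["UP", "UP", "DOWN", "OFFLINE"]
print(group_summary(members))
print(group_percent_up(members))
```

With one of four members off-line, the percentage is computed over the remaining three, so the group is two thirds up rather than one half.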
i'm going to link this case to the feature enhancement requests concerning the display of group availability.
This bug was previously known as http://jira.rhq-project.org/browse/RHQ-620
This bug relates to RHQ-1494
This bug incorporates RHQ-673
mass move to component = core server
The new Resource Disable/Enable feature, and the associated DISABLED
avail type, responds to this BZ.
It is in master.
A separate BZ will be generated to handle a separate RFE, which is the
ability for resource component code (agent plugin code) to be able to
disable/enable its corresponding resource.
Bulk closing of items that are on_qa and in old RHQ releases, which are out for a long time and where the issue has not been re-opened since.