Bug 1281667 - [RFE] Add object classifications and system impact in events
Summary: [RFE] Add object classifications and system impact in events
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.5
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Nobody
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2015-11-13 04:31 UTC by Marcus West
Modified: 2020-04-01 14:51 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed:
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments
Sorted events list (23.54 KB, text/plain)
2015-11-16 08:57 UTC, Pavel Zhukov
no flags

Description Marcus West 2015-11-13 04:31:06 UTC
Summary:

Customers require a way of filtering alerts so that 'critical' ones can be dealt with immediately (e.g. by paging the on-call staff).



From the current documentation, we have a list of event codes:

  https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.5/html/Technical_Guide/appe-Event_Codes.html#Event_Codes

However, no 'level' is assigned to each event code, and the brief description alone often does not make clear how important an error is.

Customers using enterprise monitoring systems (e.g. OVO, BMC, Nagios, Zabbix) may have many different resources to monitor.  Some sort of classification system would make it easier to triage alerts appropriately, e.g.:

   a. CRITICAL - Immediate risk to workloads or to their high availability. I need to fix it immediately (e.g. VM is not responding, storage is not responding).
   I will invoke the "someone on-call will be woken up to fix it" process.

   b. WARNING - Doesn't require immediate fixing, but I'd still like it to show up (e.g. one interface in an active/passive bond has gone down).
  
   c. NORMAL - Everything else.

The code [1] already has a basic classification system ('ERROR' and 'WARNING'), but it needs to be refined further.  For example, both of the following are 'ERROR':

  VDS_AUTO_FENCE_STATUS_FAILED(540, AuditLogSeverity.ERROR),
  VDS_INSTALL_FAILED(505, AuditLogSeverity.ERROR),

The first error might leave HA VMs unable to restart, and would therefore be critical.  The second, a failure to install a host that is not in use yet, would not be considered critical.
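
A rough sketch of the distinction being asked for, for illustration only. The `Severity` and `Impact` enums, the `shouldPage` helper and the impact values assigned below are assumptions made up for this example; none of them exist in ovirt-engine. Only the two event names and codes come from [1].

```java
// Illustrative sketch only: these types are hypothetical and are not part of ovirt-engine.
final class EventImpactSketch {

    // Stand-in for the existing severity seen in AuditLogType.java.
    enum Severity { NORMAL, WARNING, ERROR }

    // Hypothetical system-impact level requested by this RFE.
    enum Impact { NORMAL, WARNING, CRITICAL }

    // The two events quoted above. Both are ERROR today; the impact column is an assumption.
    enum Event {
        VDS_AUTO_FENCE_STATUS_FAILED(540, Severity.ERROR, Impact.CRITICAL),
        VDS_INSTALL_FAILED(505, Severity.ERROR, Impact.WARNING);

        final int code;
        final Severity severity;
        final Impact impact;

        Event(int code, Severity severity, Impact impact) {
            this.code = code;
            this.severity = severity;
            this.impact = impact;
        }
    }

    // A monitoring system would page the on-call person only for CRITICAL impact.
    static boolean shouldPage(Event event) {
        return event.impact == Impact.CRITICAL;
    }

    public static void main(String[] args) {
        for (Event e : Event.values()) {
            System.out.println(e.code + " " + e
                    + " severity=" + e.severity
                    + " impact=" + e.impact
                    + " page=" + shouldPage(e));
        }
    }
}
```

With something like this, a monitoring system could tell the two 'ERROR' events apart without parsing the description text.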

We acknowledge that customers will have their own setup, which affects the criticality of specific alerts.  It would be good for the documentation to be organized by conceptual user function, rather than programmatically as in [1], e.g.:

Table 1:
If you use X functionality (Say, VDI), monitor the following events
Event A... <EVENT NAME> <Sample Blurb you will see in the description (so the monitoring guys can apply heuristic to it)>
Event B... <EVENT NAME> <Sample Blurb>
Event C... <EVENT NAME> <Sample Blurb>

Table 2:
For Y functionality (say, infrastructure operations monitoring - DCs, hosts, storage domains etc), monitor the following events
Event A... <EVENT NAME> <Sample Blurb (so the monitoring guys can apply heuristic to it)>
Event B... <EVENT NAME> <Sample Blurb>
Event C... <EVENT NAME> <Sample Blurb>

Table 3:
For Z functionality (Say, networking), monitor the following events:
... ... ...
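
The same tables could also be expressed as a simple lookup, sketched below. The class, the map keys and the grouping of events are hypothetical; `EVENT_A` etc. are the placeholders from the tables above, and placing the two host events from [1] under "infrastructure" is an assumption. The networking list is left empty because the table above does not specify it.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: the grouping below is an assumption, not an official list.
final class MonitoringTablesSketch {

    // Functional area -> events worth monitoring for customers using that functionality.
    static final Map<String, List<String>> EVENTS_BY_FUNCTION = Map.of(
            "vdi", List.of("EVENT_A", "EVENT_B", "EVENT_C"),
            "infrastructure", List.of("VDS_AUTO_FENCE_STATUS_FAILED", "VDS_INSTALL_FAILED"),
            "networking", List.of() // not specified in the description
    );

    // Returns the events for a functional area, or an empty list for an unknown area.
    static List<String> eventsToMonitor(String function) {
        return EVENTS_BY_FUNCTION.getOrDefault(function, List.of());
    }

    public static void main(String[] args) {
        System.out.println(eventsToMonitor("infrastructure"));
    }
}
```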



[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/AuditLogType.java

Comment 4 Pavel Zhukov 2015-11-16 08:57:56 UTC
Created attachment 1094778
Sorted events list

Comment 8 Julie 2015-11-18 01:27:52 UTC
Sorting SNMP events by severity has been added to the Admin Guide already (BZ# 1269720).

Comment 12 Michal Skrivanek 2020-03-19 15:42:21 UTC
We didn't get to this bug for more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed so I'm suggesting to close it.
If you feel this needs to be addressed and want to work on it please remove cond nack and target accordingly.

Comment 14 Michal Skrivanek 2020-04-01 14:48:57 UTC
ok, closing. Please reopen if still relevant/you want to work on it.

Comment 15 Michal Skrivanek 2020-04-01 14:51:55 UTC
ok, closing. Please reopen if still relevant/you want to work on it.

