Bug 1281667 - [RFE] Add object classifications and system impact in events
Status: NEW
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.5.5
Hardware: All
OS: Linux
Priority: medium
Severity: high
Assigned To: nobody nobody
Keywords: FutureFeature
Reported: 2015-11-12 23:31 EST by Marcus West
Modified: 2018-01-08 08:56 EST
CC List: 14 users

Doc Type: Enhancement
Type: Bug
oVirt Team: Infra


Attachments
Sorted events list (23.54 KB, text/plain)
2015-11-16 03:57 EST, Pavel Zhukov

Description Marcus West 2015-11-12 23:31:06 EST
Summary:

Customers require a way of filtering alerts so that 'critical' ones can be dealt with immediately (for example, by paging the on-call staff).

From the current documentation, we have a list of event codes:

  https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.5/html/Technical_Guide/appe-Event_Codes.html#Event_Codes

But there is no 'level' assigned to each event.  From the brief description alone, it is sometimes difficult to tell how important an error is.

Customers using enterprise monitoring systems (e.g., OVO, BMC, Nagios, Zabbix) may have many different resources to monitor.  Some sort of classification system would make it easier to triage alerts appropriately, for example:

   a. CRITICAL - Immediate risk to workloads or to their high availability. I need to fix it immediately (e.g., VM is not responding, storage is not responding).
   I will invoke the "someone on-call will be woken up to fix it" process.

   b. WARNING - Doesn't require immediate fixing, but I'd still like it to show up (e.g., one interface in an active/passive bond has gone down).

   c. NORMAL - Everything else.
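
As a rough illustration of the three levels above, a monitoring integration could model them as an ordered enum and page only on the highest level. This is a sketch, not existing ovirt-engine code: `SystemImpact`, `shouldPage` and `atLeast` are hypothetical names chosen for this example.

```java
/** Hypothetical impact levels, ordered from least to most severe. */
enum SystemImpact {
    NORMAL, WARNING, CRITICAL;

    /** Only CRITICAL events start the "wake someone up" process. */
    boolean shouldPage() {
        return this == CRITICAL;
    }

    /** True if this level is at or above the given threshold. */
    boolean atLeast(SystemImpact threshold) {
        return compareTo(threshold) >= 0;
    }
}
```

A monitoring rule could then filter with something like `impact.atLeast(SystemImpact.WARNING)` to decide what is displayed, and `impact.shouldPage()` to decide what is escalated.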

In the code [1] we can see that a basic classification already exists ('ERROR' and 'WARNING'), but it needs to be refined further.  For example, both of the following are 'ERROR':

  VDS_AUTO_FENCE_STATUS_FAILED(540, AuditLogSeverity.ERROR),
  VDS_INSTALL_FAILED(505, AuditLogSeverity.ERROR),

The first error might leave HA VMs unable to restart, and thus would be critical.  The second, a failure to install a host that is not yet in use, would not be considered critical.
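
One way to picture the refinement is a lookup that layers an impact level on top of the existing event codes. The sketch below is an assumption for illustration only: `EventImpact` and `ImpactLookup` are hypothetical names, and the WARNING level given to code 505 is a guess at a sensible value (the request only says it is not critical). The codes 540 and 505 are the ones quoted above.

```java
import java.util.Map;

/** Hypothetical impact levels layered on top of AuditLogSeverity. */
enum EventImpact { NORMAL, WARNING, CRITICAL }

final class ImpactLookup {
    // Keyed by numeric event code; unlisted codes fall back to NORMAL.
    private static final Map<Integer, EventImpact> BY_CODE = Map.of(
        540, EventImpact.CRITICAL,  // VDS_AUTO_FENCE_STATUS_FAILED
        505, EventImpact.WARNING);  // VDS_INSTALL_FAILED (assumed level)

    private ImpactLookup() { }

    static EventImpact forCode(int code) {
        return BY_CODE.getOrDefault(code, EventImpact.NORMAL);
    }
}
```

Keeping the impact in a separate table, rather than changing the severity itself, would let customers override individual entries to match their own setup.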

We acknowledge that each customer has their own setup, which affects the criticality of specific alerts.  It would be good for the documentation to be organized by conceptual user function, rather than by how the events are laid out in the code ([1]), for example:

Table 1:
If you use X functionality (say, VDI), monitor the following events:
Event A... <EVENT NAME> <Sample blurb you will see in the description (so the monitoring team can apply heuristics to it)>
Event B... <EVENT NAME> <Sample blurb>
Event C... <EVENT NAME> <Sample blurb>

Table 2:
For Y functionality (say, infrastructure operations monitoring: data centers, hosts, storage domains, etc.), monitor the following events:
Event A... <EVENT NAME> <Sample blurb (so the monitoring team can apply heuristics to it)>
Event B... <EVENT NAME> <Sample blurb>
Event C... <EVENT NAME> <Sample blurb>

Table 3:
For Z functionality (say, networking), monitor the following events:
... ... ...
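
The same grouping could also be exposed in machine-readable form, so monitoring tools can subscribe by function. The following is a minimal sketch under stated assumptions: `FunctionalArea` and `EventsByFunction` are hypothetical names, and placing the two host events quoted earlier under INFRASTRUCTURE is an illustrative guess. The other areas are left empty because the request does not name their events.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical functional areas, mirroring Tables 1 to 3. */
enum FunctionalArea { VDI, INFRASTRUCTURE, NETWORKING }

final class EventsByFunction {
    private static final Map<FunctionalArea, List<String>> EVENTS =
        new EnumMap<>(FunctionalArea.class);

    static {
        // Assumed grouping: both events concern hosts.
        EVENTS.put(FunctionalArea.INFRASTRUCTURE,
            List.of("VDS_AUTO_FENCE_STATUS_FAILED", "VDS_INSTALL_FAILED"));
    }

    private EventsByFunction() { }

    /** Event names to monitor for an area; empty if none are listed. */
    static List<String> eventsFor(FunctionalArea area) {
        return EVENTS.getOrDefault(area, List.of());
    }
}
```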



[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/AuditLogType.java
Comment 4 Pavel Zhukov 2015-11-16 03:57 EST
Created attachment 1094778
Sorted events list
Comment 8 Julie 2015-11-17 20:27:52 EST
Sorting SNMP events by severity has already been added to the Admin Guide (BZ#1269720).
