Bug 1281667

Summary: [RFE] Add object classifications and system impact in events
Product: Red Hat Enterprise Virtualization Manager
Reporter: Marcus West <mwest>
Component: ovirt-engine
Assignee: Nobody <nobody>
Status: CLOSED DEFERRED
QA Contact:
Severity: high
Docs Contact:
Priority: medium
Version: 3.5.5
CC: adahms, gscott, jko, lsurette, mwest, nobody, peli, pzhukov, Rhev-m-bugs, srevivo, ykaul
Target Milestone: ---
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Enhancement
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description: Sorted events list
  Flags: none

Description Marcus West 2015-11-13 04:31:06 UTC
Summary:

Customers require a way of filtering alerts so that 'critical' ones can be dealt with immediately (for example, by paging the on-call person).



From the current documentation, we have a list of event codes:

  https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.5/html/Technical_Guide/appe-Event_Codes.html#Event_Codes

However, no 'level' is assigned to each event code, and the brief description is often not enough to tell how important an error is.

Customers using enterprise monitoring systems (e.g. OVO, BMC, Nagios, Zabbix) may have many different resources to monitor. A classification system would make it easier to triage alerts appropriately, for example:

   a. CRITICAL - Immediate risk to workloads or to their high availability; must be fixed immediately (e.g. VM is not responding, storage is not responding).
   This invokes the "someone on-call will be woken up to fix it" process.

   b. WARNING - Does not require immediate fixing, but should still show up (e.g. one interface in an active/passive bond has gone down).

   c. NORMAL - Everything else.
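
A minimal sketch of the three-level scale above, written as a Java enum so that a monitoring integration could map each level to an action. The type `TriageLevel`, its action strings, and the `parse` helper are hypothetical names for illustration only; nothing like this exists in ovirt-engine today.

```java
// Hypothetical triage scale taken from the a/b/c list above.
// These names are illustrative and are NOT part of ovirt-engine.
enum TriageLevel {
    CRITICAL("page on-call"),
    WARNING("show on dashboard"),
    NORMAL("log only");

    private final String action;

    TriageLevel(String action) {
        this.action = action;
    }

    // What a monitoring system might do with an alert at this level.
    String action() {
        return action;
    }

    // Lenient parser: unknown or missing names fall back to NORMAL,
    // matching the "everything else" bucket in the proposal.
    static TriageLevel parse(String name) {
        if (name == null) {
            return NORMAL;
        }
        try {
            return valueOf(name.trim().toUpperCase(java.util.Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return NORMAL;
        }
    }
}
```

Falling back to NORMAL for unrecognized input is a design choice of this sketch; a stricter integration might prefer to reject unknown levels so that misconfigured rules are noticed.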

In the code [1] a coarse classification already exists ('ERROR' and 'WARNING'), but it needs to be refined further.  For example, both of the following are 'ERROR':

  VDS_AUTO_FENCE_STATUS_FAILED(540, AuditLogSeverity.ERROR),
  VDS_INSTALL_FAILED(505, AuditLogSeverity.ERROR),

The first error might leave HA VMs unable to restart, and would therefore be critical.  The second, a failure to install a host that is not yet in use, would not be considered critical.
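
One way to picture the refinement is a second "impact" axis next to the existing severity. The sketch below is an assumption-laden illustration, not the engine's code: `Severity` is a simplified stand-in for `AuditLogSeverity`, while `Impact`, `ClassifiedEvent`, `byCode` and `shouldPage` are hypothetical. Only the two event names and codes come from the lines quoted above, and the choice of WARNING (rather than NORMAL) for the install failure is a guess.

```java
// Simplified stand-in for the engine's AuditLogSeverity (only the levels named above).
enum Severity { WARNING, ERROR }

// Hypothetical second axis: operational impact on running workloads.
enum Impact { CRITICAL, WARNING, NORMAL }

// Illustrative subset of AuditLogType with an extra, hypothetical impact argument.
enum ClassifiedEvent {
    // Fencing failure may leave HA VMs unable to restart: treated as critical.
    VDS_AUTO_FENCE_STATUS_FAILED(540, Severity.ERROR, Impact.CRITICAL),
    // Install failure on a host not yet in use: not critical (WARNING is an assumption).
    VDS_INSTALL_FAILED(505, Severity.ERROR, Impact.WARNING);

    final int code;
    final Severity severity;
    final Impact impact;

    ClassifiedEvent(int code, Severity severity, Impact impact) {
        this.code = code;
        this.severity = severity;
        this.impact = impact;
    }

    // Look up an event by numeric code; returns null if the code is not in this subset.
    static ClassifiedEvent byCode(int code) {
        for (ClassifiedEvent e : values()) {
            if (e.code == code) {
                return e;
            }
        }
        return null;
    }

    // Only critical-impact events should wake someone up.
    boolean shouldPage() {
        return impact == Impact.CRITICAL;
    }
}
```

With this shape, both events keep their 'ERROR' severity, but a monitoring rule keyed on `shouldPage()` would page for code 540 and not for code 505.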

We acknowledge that each customer has their own setup, which affects the criticality of specific alerts.  It would be good for the documentation to be broken up by conceptual user function, rather than programmatically as in [1], e.g.:

Table 1:
If you use X functionality (say, VDI), monitor the following events:
Event A... <EVENT NAME> <Sample Blurb you will see in the description (so the monitoring team can apply heuristics to it)>
Event B... <EVENT NAME> <Sample Blurb>
Event C... <EVENT NAME> <Sample Blurb>

Table 2:
For Y functionality (say, infrastructure operations monitoring: DCs, hosts, storage domains, etc.), monitor the following events:
Event A... <EVENT NAME> <Sample Blurb (so the monitoring team can apply heuristics to it)>
Event B... <EVENT NAME> <Sample Blurb>
Event C... <EVENT NAME> <Sample Blurb>

Table 3:
For Z functionality (say, networking), monitor the following events:
... ... ...
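
The per-function tables could also be kept as data and rendered on demand. The following is only a sketch under stated assumptions: `Area`, `WatchEntry` and `WatchList` are hypothetical names, the blurbs are left as the `<Sample Blurb>` placeholder, and filing the two known host events under INFRASTRUCTURE is a guess made for illustration.

```java
// Hypothetical functional areas, taken from the example tables above.
enum Area { VDI, INFRASTRUCTURE, NETWORKING }

// One table row: which area it belongs to, the event name, and a sample blurb.
final class WatchEntry {
    final Area area;
    final String eventName;
    final String sampleBlurb;

    WatchEntry(Area area, String eventName, String sampleBlurb) {
        this.area = area;
        this.eventName = eventName;
        this.sampleBlurb = sampleBlurb;
    }
}

final class WatchList {
    // Assumption: both known host events are filed under INFRASTRUCTURE.
    // The blurbs are placeholders, not real engine messages.
    static final WatchEntry[] ENTRIES = {
        new WatchEntry(Area.INFRASTRUCTURE, "VDS_AUTO_FENCE_STATUS_FAILED", "<Sample Blurb>"),
        new WatchEntry(Area.INFRASTRUCTURE, "VDS_INSTALL_FAILED", "<Sample Blurb>"),
    };

    // Render one table in the plain-text layout proposed above.
    static String render(Area area) {
        StringBuilder sb = new StringBuilder();
        sb.append("For ").append(area).append(" functionality, monitor the following events:\n");
        for (WatchEntry e : ENTRIES) {
            if (e.area == area) {
                sb.append(e.eventName).append(' ').append(e.sampleBlurb).append('\n');
            }
        }
        return sb.toString();
    }
}
```

Calling `WatchList.render(Area.INFRASTRUCTURE)` would list the two host events; an area with no entries yields only the heading line.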



[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/AuditLogType.java

Comment 4 Pavel Zhukov 2015-11-16 08:57:56 UTC
Created attachment 1094778 [details]
Sorted events list

Comment 8 Julie 2015-11-18 01:27:52 UTC
Sorting SNMP events by severity has been added to the Admin Guide already (BZ# 1269720).

Comment 12 Michal Skrivanek 2020-03-19 15:42:21 UTC
We didn't get to this bug for more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed, so I suggest closing it.
If you feel this needs to be addressed and want to work on it please remove cond nack and target accordingly.

Comment 14 Michal Skrivanek 2020-04-01 14:48:57 UTC
ok, closing. Please reopen if still relevant/you want to work on it.

Comment 15 Michal Skrivanek 2020-04-01 14:51:55 UTC
ok, closing. Please reopen if still relevant/you want to work on it.