Summary:

Customers require a way of filtering alerts so that 'critical' ones can be dealt with immediately (page sent to the on-call person, etc.).

The current documentation provides a list of event codes:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/3.5/html/Technical_Guide/appe-Event_Codes.html#Event_Codes

However, no 'level' is assigned to each event, and the brief description alone often does not make clear how important an event is.

Customers using enterprise monitoring systems (e.g. OVO, BMC, Nagios, Zabbix) may have many different resources to monitor. A classification system would make it easier to triage alerts appropriately, e.g.:

a. CRITICAL - Immediate risk to workloads or to their high availability. I need to fix it immediately (e.g. VM is not responding, storage is not responding) and will invoke the "someone on call is woken up to fix it" process.
b. WARNING - Doesn't require immediate fixing, but I'd still like it to show up (e.g. one interface in an active/passive bond has gone down).
c. NORMAL - Everything else.

In the code [1] a classification of sorts already exists ('ERROR' and 'WARNING'), but it needs to be refined further. For example, both of the following are 'ERROR':

    VDS_AUTO_FENCE_STATUS_FAILED(540, AuditLogSeverity.ERROR),
    VDS_INSTALL_FAILED(505, AuditLogSeverity.ERROR),

The first error might leave HA VMs unable to restart, and thus would be critical. The second, failing to install a host that is not yet in use, would not be considered critical.

We acknowledge that customers will have their own setups, which affects the criticality of specific alerts. It would be good for the documentation to be organized by conceptual user function rather than programmatically (as in [1]), e.g.:

Table 1: If you use X functionality (say, VDI), monitor the following events:

Event A... <EVENT NAME> <Sample blurb you will see in the description (so the monitoring team can apply heuristics to it)>
Event B... <EVENT NAME> <Sample blurb>
Event C... <EVENT NAME> <Sample blurb>

Table 2: For Y functionality (say, infrastructure operations monitoring: DCs, hosts, storage domains, etc.), monitor the following events:

Event A... <EVENT NAME> <Sample blurb (so the monitoring team can apply heuristics to it)>
Event B... <EVENT NAME> <Sample blurb>
Event C... <EVENT NAME> <Sample blurb>

Table 3: For Z functionality (say, networking), monitor the following events:

...
...
...

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/AuditLogType.java
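
To illustrate the kind of triage requested above, here is a minimal sketch of how a monitoring-side filter might refine the engine's severity into the CRITICAL / WARNING / NORMAL levels. This is not how the engine or any monitoring product does it. The two event entries are the ones quoted from AuditLogType in this report; the `SITE_OVERRIDES` table, the `DEFAULT_BY_SEVERITY` fallback, and the `triage` function are hypothetical names invented for illustration, and the fallback mapping is an assumption that each site would tune.

```python
from enum import Enum


class Level(Enum):
    CRITICAL = "CRITICAL"  # page the on-call person
    WARNING = "WARNING"    # should show up, no immediate fix needed
    NORMAL = "NORMAL"      # everything else


# Event name -> (event code, engine severity), copied from the two
# AuditLogType entries quoted in this report.
ENGINE_EVENTS = {
    "VDS_AUTO_FENCE_STATUS_FAILED": (540, "ERROR"),
    "VDS_INSTALL_FAILED": (505, "ERROR"),
}

# Hypothetical site-specific overrides: a failed auto-fence may leave HA VMs
# unable to restart, so this site treats it as CRITICAL.
SITE_OVERRIDES = {
    "VDS_AUTO_FENCE_STATUS_FAILED": Level.CRITICAL,
}

# Assumed fallback when no override exists: engine ERROR and WARNING both
# map to WARNING, and anything else maps to NORMAL.
DEFAULT_BY_SEVERITY = {
    "ERROR": Level.WARNING,
    "WARNING": Level.WARNING,
}


def triage(event_name: str) -> Level:
    """Return the site-level classification for an engine event name."""
    if event_name in SITE_OVERRIDES:
        return SITE_OVERRIDES[event_name]
    _code, severity = ENGINE_EVENTS.get(event_name, (None, None))
    return DEFAULT_BY_SEVERITY.get(severity, Level.NORMAL)
```

With these tables, `triage("VDS_AUTO_FENCE_STATUS_FAILED")` returns `Level.CRITICAL`, while `triage("VDS_INSTALL_FAILED")` returns `Level.WARNING`, even though the engine marks both as ERROR. Keeping the overrides in a separate table reflects the point above that each customer's setup changes which alerts are critical.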
Created attachment 1094778: Sorted events list
Sorting SNMP events by severity has been added to the Admin Guide already (BZ# 1269720).
We haven't gotten to this bug in more than 2 years, and it's not being considered for the upcoming 4.4. It's unlikely that it will ever be addressed, so I suggest closing it. If you feel this needs to be addressed and want to work on it, please remove cond nack and target accordingly.
OK, closing. Please reopen if this is still relevant or you want to work on it.