Bug 1877478

Summary: [RFE] collect network metrics in DWH ( rx and tx drop )
Product: Red Hat Enterprise Virtualization Manager
Reporter: Andrea Perotti <aperotti>
Component: ovirt-engine-dwh
Assignee: Aviv Litman <alitman>
Status: CLOSED ERRATA
QA Contact: Pavel Novotny <pnovotny>
Severity: high
Priority: medium
Version: 4.4.1
CC: acernek, alitman, amusil, dholler, emarcus, gdeolive, lsurette, mburman, mhicks, mkalinin, mperina, pelauter, sbonazzo, sradco, srevivo
Target Milestone: ovirt-4.4.7
Keywords: FutureFeature, Rebase, Reopened, ZStream
Target Release: 4.4.7
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ovirt-engine-4.4.7 ovirt-engine-dwh-4.4.7.1
Doc Type: No Doc Update
Last Closed: 2021-07-22 15:12:18 UTC
Type: Bug
oVirt Team: Metrics
Bug Blocks: 1789090, 1937714, 1938964

Description Andrea Perotti 2020-09-09 17:43:35 UTC
1. What is the nature and description of the request?  

Since RHV is a complete virtualization solution, it is expected to be able to monitor the status of the virtualization infrastructure and its components, especially the hypervisors, in great detail, in order to help virtualization admins maintain the infrastructure and react properly in case of misbehaviour.

This includes the status of the network and of the network ports of each hypervisor, as has already been started with the "network usage" alert.
Any misbehaviour or hint of malfunction should be reported to RHV Manager, which should be able to collect historical data for those metrics and eventually alert when a threshold is exceeded (number of lost packets, TCP ACK/NAK/retransmits).
This data should be exposed via the API as well, so that third-party virtualization monitoring tools can benefit from it (such as IBM Tivoli Monitoring for Virtual Environments [0]).
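
To make the "alert when a threshold is exceeded" idea concrete, here is a minimal sketch in Python. It is not RHV Manager code: the NicSample structure, the function names and the 1% default threshold are all hypothetical, and it assumes that dropped packets are not already included in the rx_packets counter (this can vary by driver).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NicSample:
    """One reading of cumulative interface counters (hypothetical structure)."""
    rx_packets: int
    rx_dropped: int


def rx_drop_ratio(prev: NicSample, curr: NicSample) -> float:
    """Fraction of inbound packets dropped between two samples.

    Assumption: rx_dropped is NOT counted inside rx_packets, so the
    interval total is delivered + dropped.
    """
    dropped = curr.rx_dropped - prev.rx_dropped
    delivered = curr.rx_packets - prev.rx_packets
    if dropped < 0 or delivered < 0:
        # Counters went backwards (e.g. host reboot): ignore this interval.
        return 0.0
    total = dropped + delivered
    return dropped / total if total else 0.0


def should_alert(prev: NicSample, curr: NicSample, threshold: float = 0.01) -> bool:
    """True when the drop ratio in the interval exceeds the threshold."""
    return rx_drop_ratio(prev, curr) > threshold
```

Working on deltas between two samples rather than on the raw cumulative counters means a one-off burst of drops long ago does not keep the alert raised forever.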

2. Why would you need this? (List the business requirements here)

In case of a hardware fault, intermittent issues, or misconfiguration on the network, RHV admins should have the information they need to spot anomalies and focus their troubleshooting efforts.

3. How would you like to achieve this? (List the functional requirements here)  

a. Display more detailed network statistics from the RHV Hosts directly on the RHV Portal
b. Get alerts from RHV Manager in case of critical network misbehaviour
c. Retrieve the same data via the RHV Manager API, so that external monitoring tools can collect it


4. For each functional requirement listed, specify how you can test to confirm the requirement is successfully implemented.   

a) simulate constant packet loss/retransmission (e.g. via netem [1]) and ensure that the data is collected by RHV Manager and the values are saved into DWH
b) when a) occurs, show it as a visible alert in the GUI and ensure it is possible to instruct ovirt-notifier to send email about that event
c) make sure the instantaneous packet loss/retransmission data is exposed via the API as well, like hypervisor CPU or memory
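
For test a), the netem setup could look roughly like the sketch below. It only builds the tc command lines; the helper name and the interface name in the usage note are hypothetical, and the commands need root on a disposable test host.

```python
from typing import Dict, List


def netem_loss_commands(iface: str, loss_percent: float) -> Dict[str, List[str]]:
    """Build tc argv lists that add and remove a netem packet-loss qdisc.

    Helper name and return layout are illustrative only.
    """
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be between 0 and 100")
    loss = f"{loss_percent:g}%"  # e.g. 5 -> "5%", 0.5 -> "0.5%"
    return {
        # tc qdisc add dev <iface> root netem loss <N>%
        "apply": ["tc", "qdisc", "add", "dev", iface, "root", "netem", "loss", loss],
        # tc qdisc del dev <iface> root
        "clear": ["tc", "qdisc", "del", "dev", iface, "root"],
    }
```

On the test host this would be run with something like subprocess.run(netem_loss_commands("eth0", 5)["apply"], check=True), followed by the "clear" command once the test is done.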

5. Do you have any specific timeline dependencies ?  

As soon as possible: this is a big deficiency, and fixing it would greatly benefit all RHV users.

6. Would you be able to assist in testing this functionality if implemented?

yes
 

[0] https://www.ibm.com/support/pages/ibm-tivoli-monitoring-virtual-environment-linux-kernel-based-virtual-machines-agent
[1] https://wiki.linuxfoundation.org/networking/netem#emulating_wide_area_network_delays

Comment 2 Sandro Bonazzola 2020-09-10 08:00:19 UTC
If the request is to get network failure notifications in the engine you may be interested in https://www.ovirt.org/develop/release-management/features/gluster/nagios-integration.html

Comment 3 Dominik Holler 2020-10-29 16:08:59 UTC
> Since RHV is a complete virtualization solution, it is expected to be able to monitor the status of the virtualization infrastructure and its components, especially the hypervisors, in great detail, in order to help virtualization admins maintain the infrastructure and react properly in case of misbehaviour.

Why is it required to use RHV, instead of a dedicated monitoring software?
Even if we added full-featured network monitoring to RHV, this would not address monitoring of other areas.

Comment 6 Martin Perina 2021-02-04 13:31:24 UTC
It's better to use dedicated monitoring software for advanced network monitoring; it doesn't make sense to reimplement that functionality in RHV.

Comment 27 Pavel Novotny 2021-07-08 16:02:07 UTC
Verified in 
ovirt-engine-4.4.7.6-0.11.el8ev.noarch
ovirt-engine-dwh-4.4.7.3-1.el8ev.noarch

The Rx/Tx dropped packets statistics are now collected and displayed in several dashboards in Grafana. See bug 1937714 comment 6 for details.
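
For anyone wanting to cross-check the dashboard values against a host, the kernel exposes the same kind of per-interface drop counters in /proc/net/dev. The sketch below parses that file's text; it is only an illustration and does not claim to be how the engine or DWH gathers the data. The function name is hypothetical.

```python
from typing import Dict


def parse_drop_counters(proc_net_dev: str) -> Dict[str, Dict[str, int]]:
    """Extract rx/tx drop counters per interface from /proc/net/dev text.

    After the "name:" prefix each line has 16 numeric columns:
    8 receive columns (drop is the 4th) followed by 8 transmit
    columns (drop is the 4th of those, i.e. the 12th overall).
    """
    counters: Dict[str, Dict[str, int]] = {}
    for line in proc_net_dev.splitlines()[2:]:  # skip the two header lines
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue  # not an interface line
        counters[name.strip()] = {
            "rx_dropped": int(fields[3]),
            "tx_dropped": int(fields[11]),
        }
    return counters
```

On a Linux host it can be fed with open("/proc/net/dev").read(); comparing two readings taken some time apart gives the drops in that interval.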

Comment 31 errata-xmlrpc 2021-07-22 15:12:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: RHV Manager (ovirt-engine) security update [ovirt-4.4.7]), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2865