Bug 876697 - [RFE] Report Storage IOPS cumulative values per VM per disk via metric store.
Summary: [RFE] Report Storage IOPS cumulative values per VM per disk via metric store.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine-metrics
Version: 3.0.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.2.2
Target Release: ---
Assignee: Shirly Radco
QA Contact: Lukas Svaty
URL:
Whiteboard:
Duplicates: 879590
Depends On: ovirt_report_storage_iops_accum_per_vm
Blocks: 880593 1168021
 
Reported: 2012-11-14 18:22 UTC by Josh Carter
Modified: 2021-09-09 11:31 UTC
CC: 24 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Clones: ovirt_report_storage_iops_accum_per_vm
Environment:
Last Closed: 2018-05-15 17:35:57 UTC
oVirt Team: Metrics
Target Upstream Version:
Embargoed:
sherold: Triaged+
lsvaty: testing_plan_complete+


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 880611 1 None None None 2022-03-14 10:24:24 UTC
Red Hat Knowledge Base (Solution) 329093 0 None None None Never
Red Hat Product Errata RHEA-2018:1488 0 None None None 2018-05-15 17:37:49 UTC
oVirt gerrit 38759 0 None ABANDONED core: storing iops into db 2021-01-27 09:25:06 UTC
oVirt gerrit 38760 0 None ABANDONED core: use offsets to storing iops counts surviving device hotplugs/restarts. 2021-01-27 09:25:07 UTC

Internal Links: 880611

Comment 11 Sean Cohen 2014-07-20 06:21:00 UTC
*** Bug 879590 has been marked as a duplicate of this bug. ***

Comment 20 Yaniv Kaul 2017-06-06 17:11:29 UTC
fromani - does the collectd virt plugin (2?) have this?

Comment 21 Francesco Romani 2017-06-12 10:55:09 UTC
(In reply to Yaniv Kaul from comment #20)
> fromani - does the collectd virt plugin (2?) have this?

Not directly. The virt plugin reports the operations completed so far (https://github.com/collectd/collectd/blob/master/src/virt.c#L454).
We either need to configure collectd to aggregate the results somehow and report the OPS (https://collectd.org/wiki/index.php/Plugin:Aggregation), or to do the computation on the engine side.

Comment 22 Francesco Romani 2017-06-16 08:26:35 UTC
Clearing needinfo; answered in https://bugzilla.redhat.com/show_bug.cgi?id=876697#c21

Comment 23 Yaniv Kaul 2017-06-16 19:48:01 UTC
(In reply to Francesco Romani from comment #21)
> (In reply to Yaniv Kaul from comment #20)
> > fromani - does the collectd virt plugin (2?) have this?
> 
> Not directly. The virt plugin reports the operations completed so far
> (https://github.com/collectd/collectd/blob/master/src/virt.c#L454).
> We either need to configure collectd to aggregate the results somehow and
> report the OPS (https://collectd.org/wiki/index.php/Plugin:Aggregation),
> or to do the computation on the engine side.

Cumulative is actually what this RFE is all about. Not really 'I/O Operations Per Second' (IOPS) but 'I/O OPerationS' (IOPs?) - quite confusing, but the bottom line is that they indeed seem to want the total number of operations.

Comment 24 Yaniv Kaul 2017-06-18 09:19:15 UTC
So looking at the VDSM code, we actually already collect quite a bit of data on both reads and writes (lib/vdsm/virt/vmstats.py):

if 'disks' in stat:
    for disk in stat['disks']:
        diskprefix = prefix + '.disk.' + disk
        diskinfo = stat['disks'][disk]

        data[diskprefix + '.read_latency'] = \
            diskinfo['readLatency']
        data[diskprefix + '.read_ops'] = \
            diskinfo['readOps']
        data[diskprefix + '.read_bytes'] = \
            diskinfo['readBytes']
        data[diskprefix + '.read_rate'] = \
            diskinfo['readRate']

        data[diskprefix + '.write_bytes'] = \
            diskinfo['writtenBytes']
        data[diskprefix + '.write_ops'] = \
            diskinfo['writeOps']
        data[diskprefix + '.write_latency'] = \
            diskinfo['writeLatency']
        data[diskprefix + '.write_rate'] = \
            diskinfo['writeRate']

Engine (backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/vdsbroker/VdsProperties.java) takes some of them:
    public static final String vm_disk_read_rate = "readRate";
    public static final String vm_disk_write_rate = "writeRate";
    public static final String vm_disk_read_latency = "readLatency";
    public static final String vm_disk_write_latency = "writeLatency";

And it even calculates the latencies (in backend/manager/modules/vdsbroker/src/main/java/org/ovirt/engine/core/vdsbroker/vdsbroker/VdsBrokerObjectsBuilder.java).

So we need to add the ops, at least as raw numbers, and see what comes out.
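
For illustration, a minimal sketch (plain Python, not actual Vdsm or Engine code; the sample key and prefix are hypothetical) of how a consumer could split the flat per-disk keys emitted by the vmstats.py snippet above back into (disk, metric) pairs:

    # Split a flat per-disk stat key such as "<prefix>.disk.vda.read_ops"
    # (built above as prefix + '.disk.' + disk) back into its parts.
    # Assumes the disk name itself contains no dots.
    def parse_disk_key(key):
        head, sep, rest = key.partition('.disk.')
        if not sep:
            return None  # not a per-disk stat key
        disk, _, metric = rest.partition('.')
        return disk, metric

    # Example (hypothetical prefix):
    # parse_disk_key('vm.abc123.disk.vda.read_ops') -> ('vda', 'read_ops')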

Comment 32 Yaniv Bronhaim 2017-08-13 09:30:54 UTC
This comes from the _disk_iops_bytes function in lib/vdsm/virt/vmstats.py.
Francesco or Arik can probably give more detailed info - I see it is called on every sample interval and counts the read/write I/O bytes the VM performs until the next check.

Comment 33 Francesco Romani 2017-09-22 14:42:04 UTC
On every sampling interval, Vdsm indeed reads (among others) those stats from libvirt bulk stats:

  "block.<num>.rd.reqs" - number of read requests as unsigned long long.
  "block.<num>.wr.reqs" - number of write requests as unsigned long long.
  "block.<num>.fl.reqs" - total flush requests as unsigned long long.

Those are absolute values, i.e. the total number of I/O operations at any given time.
Vdsm prefers to return absolute values and to leave the computation of the rate to the client (e.g. the Engine); this is considered safer.

Looking at vmstats.py, it seems that Vdsm is sending absolutes, not rates, to the metric store. It should be simple to convert the values from absolutes to rates, thus adding the "per second" part.

HTH,
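
To make the last point concrete, here is a minimal sketch (illustrative only, not Vdsm or Engine code; all names are hypothetical) of deriving a per-second rate from two samples of an absolute counter, with a guard for counters that reset when a device is hot-unplugged or the VM restarts:

    def ops_per_second(prev_ops, curr_ops, prev_ts, curr_ts):
        # prev_ops/curr_ops: absolute counters (e.g. block.<num>.rd.reqs);
        # prev_ts/curr_ts: sample timestamps in seconds.
        interval = curr_ts - prev_ts
        if interval <= 0:
            return 0.0
        delta = curr_ops - prev_ops
        if delta < 0:
            # Counter went backwards: a device hotplug or VM restart reset
            # it, so count only the operations seen since the reset.
            delta = curr_ops
        return delta / interval

    # Example: 3000 read ops at t=0s, 4500 at t=60s -> 25.0 IOPS.
    # ops_per_second(3000, 4500, 0, 60) == 25.0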

Comment 35 Lukas Svaty 2017-12-08 14:56:51 UTC
Hi Francesco,

- do we support multiple disks, or is the number a sum over all of them?
- any difference for shared disks?
- I suspect we are gathering the info the same way for all storage types, right? (NFS, iSCSI, Gluster...)

At the moment the metrics store can only aggregate by VM, which is not sufficient; we need to be able to check SD IOPS for each VM disk.

Comment 36 Sandro Bonazzola 2017-12-20 13:59:10 UTC
oVirt 4.2.0 was released on Dec 20th 2017. Please consider re-targeting this bug to the next milestone.

Comment 37 Shirly Radco 2018-01-10 10:38:56 UTC
(In reply to Lukas Svaty from comment #35)
> Hi Francesco,
> 
> - do we support multiple disks, or is the number a sum over all of them?
> - any difference for shared disks?
> - I suspect we are gathering the info the same way for all storage types,
> right? (NFS, iSCSI, Gluster...)
> 
> At the moment the metrics store can only aggregate by VM, which is not
> sufficient; we need to be able to check SD IOPS for each VM disk.

We have the disk name as "collectd.type_instance". Why can't we group the vm_disk_write_ops and vm_disk_read_ops per disk name (collectd.type_instance)?

Comment 38 Francesco Romani 2018-01-10 14:40:29 UTC
(In reply to Lukas Svaty from comment #35)
> Hi Francesco,
> 
> - do we support multiple disks, or is the number a sum over all of them?
> - any difference for shared disks?
> - I suspect we are gathering the info the same way for all storage types,
> right? (NFS, iSCSI, Gluster...)
> 
> At the moment the metrics store can only aggregate by VM, which is not
> sufficient; we need to be able to check SD IOPS for each VM disk.

Sorry for the delayed answer.
- we report the stats per-disk
- in what sense do you mean 'shared'? Each VM reports the counters of all its disks, no matter how the backend is implemented.
If more than one VM accesses the same image (perhaps an ISO image in RO mode, maybe installation media?), each one will report its own counters. If you want or need the counters per-image rather than per-VM, that could be hard to do and Vdsm can't help here.
- correct, the storage type is not relevant here

Comment 39 Shirly Radco 2018-01-10 19:40:00 UTC
(In reply to Francesco Romani from comment #38)
> (In reply to Lukas Svaty from comment #35)
> > Hi Francesco,
> > 
> > - do we support multiple disks, or is the number a sum over all of them?
> > - any difference for shared disks?
> > - I suspect we are gathering the info the same way for all storage
> > types, right? (NFS, iSCSI, Gluster...)
> > 
> > At the moment the metrics store can only aggregate by VM, which is not
> > sufficient; we need to be able to check SD IOPS for each VM disk.
> 
> Sorry for the delayed answer.
> - we report the stats per-disk
> - in what sense do you mean 'shared'? Each VM reports the counters of all
> its disks, no matter how the backend is implemented.
> If more than one VM accesses the same image (perhaps an ISO image in RO
> mode, maybe installation media?), each one will report its own counters.

If more than one VM accesses the same image, they will report the same values, right?

> If you want or need the counters per-image rather than per-VM, that could
> be hard to do and Vdsm can't help here
> - correct, the storage type is not relevant here

Comment 40 Francesco Romani 2018-01-11 10:12:22 UTC
(In reply to Shirly Radco from comment #39)
> (In reply to Francesco Romani from comment #38)
> > (In reply to Lukas Svaty from comment #35)
> > > Hi Francesco,
> > > 
> > > - do we support multiple disks, or is the number a sum over all of
> > > them?
> > > - any difference for shared disks?
> > > - I suspect we are gathering the info the same way for all storage
> > > types, right? (NFS, iSCSI, Gluster...)
> > > 
> > > At the moment the metrics store can only aggregate by VM, which is
> > > not sufficient; we need to be able to check SD IOPS for each VM disk.
> > 
> > Sorry for the delayed answer.
> > - we report the stats per-disk
> > - in what sense do you mean 'shared'? Each VM reports the counters of
> > all its disks, no matter how the backend is implemented.
> > If more than one VM accesses the same image (perhaps an ISO image in RO
> > mode, maybe installation media?), each one will report its own counters.
> 
> If more than one VM accesses the same image, they will report the same
> values, right?

It depends on the usage patterns in the VMs.
Let's consider this scenario: we have VM A and VM B that boot from the same shared image. Once the boot is done, they stop at the user login.

In this simple scenario, we can rightfully expect that up to the login screen the counters reported by both VMs will be the same, because the usage pattern of the disk will be exactly the same, thus the number of read operations will be the same (actually, also the order of reads, but that is not reported).

In the same example, after the users log in, the usage pattern, and thus the disk usage pattern and the I/O counters, will become unpredictable, so we cannot expect them to be the same.

The takeaway is that the I/O values are always reported per-VM, never per-image.

Comment 41 Yaniv Kaul 2018-03-19 08:41:24 UTC
Shirly, what's the latest here? Is it going to make it to 4.2.2?

Comment 42 Shirly Radco 2018-03-20 11:02:14 UTC
As discussed with Yaniv Lavi, the collectd per-VM, per-disk metrics from the virt plugin are sufficient for this RFE.

collectd.virt.disk_ops.read
collectd.virt.disk_ops.write

These metrics should allow the user to quickly and easily identify the VMs with high I/O utilization, as requested in this RFE.

The disk name is saved as collectd.type_instance.
The VM name is saved as collectd.plugin_instance.


As for being able to aggregate storage IOPS per VM, it is not possible at this stage, since the disk name is not the drive name - it is what libvirt guesses - and therefore it can't be aggregated across VMs. The user can log in to the VM in order to check the drive name.
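
As an illustration of how these fields could be used (a sketch only; the record layout below is an assumption made for the example, not the actual metrics-store schema), grouping samples by VM (collectd.plugin_instance) and disk (collectd.type_instance) to rank high-I/O VMs might look like:

    from collections import defaultdict

    # Hypothetical flattened records as they might arrive from collectd.
    samples = [
        {"plugin_instance": "vm01", "type_instance": "vda",
         "name": "collectd.virt.disk_ops.read", "value": 120345},
        {"plugin_instance": "vm01", "type_instance": "vda",
         "name": "collectd.virt.disk_ops.write", "value": 98012},
        {"plugin_instance": "vm02", "type_instance": "sda",
         "name": "collectd.virt.disk_ops.read", "value": 450221},
    ]

    # Sum read + write ops per (VM, disk) pair.
    totals = defaultdict(int)
    for s in samples:
        totals[(s["plugin_instance"], s["type_instance"])] += s["value"]

    # Highest I/O first.
    for (vm, disk), ops in sorted(totals.items(), key=lambda kv: -kv[1]):
        print("%s/%s: %d ops" % (vm, disk, ops))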

Comment 43 Lukas Svaty 2018-03-21 16:10:54 UTC
The aggregation part will be solved in a separate BZ. Verified in ovirt-engine-metrics-1.1.3.3-1.el7ev.noarch.

Comment 46 errata-xmlrpc 2018-05-15 17:35:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488

Comment 47 Franta Kust 2019-05-16 13:05:18 UTC
BZ<2>Jira Resync

