Bug 1181653

Summary: RFE: qemu: support block-set-write-threshold
Product: [Community] Virtualization Tools
Component: libvirt
Status: CLOSED CURRENTRELEASE
Reporter: Francesco Romani <fromani>
Assignee: Eric Blake <eblake>
CC: crobinso, eblake, libvirt-maint, rbalakri
Doc Type: Bug Fix
Type: Bug
Clones: 1181659
Bug Blocks: 1139217
Last Closed: 2017-07-17 16:40:31 UTC

Description Francesco Romani 2015-01-13 14:37:03 UTC
Description of problem:
Add an event to report if a block device usage exceeds a threshold. The threshold should be configurable, and the event should report the affected block device.

Rationale for the RFE
Management applications such as oVirt (http://www.ovirt.org) make extensive use of thin-provisioned disk images.
To let the guest run flawlessly and not be unnecessarily paused, oVirt sets a watermark and automatically resizes the image once the watermark is reached or exceeded.

To detect the watermark crossing, the management application currently has no choice but to aggressively poll the disk's highest written sector, using virDomainGetBlockInfo or the recently added bulk stats equivalent.

However, oVirt needs to poll very frequently. In general this usage leads to unnecessary system load, and it gets even worse at scale: scenarios with hundreds of VMs are no longer unusual.
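
For illustration, a minimal sketch of this polling loop with the libvirt-python bindings (the domain name, device name and watermark value below are placeholders, not oVirt code):

  import libvirt

  MiB = 1024 * 1024
  WATERMARK_LIMIT = 512 * MiB   # placeholder free-space watermark

  conn = libvirt.open('qemu:///system')
  dom = conn.lookupByName('example-vm')   # placeholder domain

  # virDomainGetBlockInfo is exposed as virDomain.blockInfo() and returns
  # (capacity, allocation, physical) for the given disk.
  capacity, allocation, physical = dom.blockInfo('vda')

  # The management application has to repeat this for every disk of every
  # VM on a short interval just to notice that free space is running low.
  if physical - allocation < WATERMARK_LIMIT:
      print('vda is close to its watermark, extension needed')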

A patch for QEMU implementing a disk usage threshold event was posted on qemu-devel, reviewed and acked.
Once accepted, libvirt should expose this event.

This BZ entry is to track libvirt support.

Additional info:
QEMU upstream bug: https://bugs.launchpad.net/qemu/+bug/1338957?comments=all
It includes a link to the QEMU API.

Comment 1 Eric Blake 2015-01-13 15:57:26 UTC
Exposing the new event should be easy; the hard part will be figuring out an interface for the user to request that the event should happen.  I'm suspecting we need a new API (and thus can't rebase this to happen any sooner than RHEL 7.2), that lets a user register a size to use to trigger a threshold event.
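
On the event side, a rough sketch of how a management application could consume it through the libvirt-python bindings, assuming the event is wired into the usual domain-event machinery (the event ID and callback signature below follow what eventually landed upstream as VIR_DOMAIN_EVENT_ID_BLOCK_THRESHOLD, see comment 5):

  import libvirt

  def on_threshold(conn, dom, dev, path, threshold, excess, opaque):
      # dev is the disk target (e.g. 'vda'); excess is how far the guest
      # wrote past the registered threshold.
      print('%s: %s crossed threshold %d by %d bytes'
            % (dom.name(), dev, threshold, excess))

  libvirt.virEventRegisterDefaultImpl()
  conn = libvirt.open('qemu:///system')
  conn.domainEventRegisterAny(None,
                              libvirt.VIR_DOMAIN_EVENT_ID_BLOCK_THRESHOLD,
                              on_threshold, None)

  while True:
      libvirt.virEventRunDefaultImpl()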

Comment 2 Francesco Romani 2015-01-13 15:59:19 UTC
(In reply to Eric Blake from comment #1)
> Exposing the new event should be easy; the hard part will be figuring out an
> interface for the user to request that the event should happen.  I'm
> suspecting we need a new API (and thus can't rebase this to happen any
> sooner than RHEL 7.2), that lets a user register a size to use to trigger a
> threshold event.

RHEL 7.2 should be fine for us (= oVirt/RHEV).

Comment 3 Francesco Romani 2015-01-14 09:31:48 UTC
(In reply to Eric Blake from comment #1)
> Exposing the new event should be easy; the hard part will be figuring out an
> interface for the user to request that the event should happen.  I'm
> suspecting we need a new API (and thus can't rebase this to happen any
> sooner than RHEL 7.2), that lets a user register a size to use to trigger a
> threshold event.

Regarding the API, I'll briefly describe what oVirt currently does.

The task is done by VDSM, the oVirt node management daemon.
Periodically, each disk of each VM is sampled; disks whose image format is not qcow2 ('cow' in VDSM terms) or which are not backed by block devices are immediately skipped.

(Python-ish pseudo-code follows)
for each disk
- grab blockInfo:
  capacity, alloc, physical = virDomainGetBlockInfo(drive.path, 0)

- check if the drive should be extended or not

  def _shouldExtendVolume(self, drive, capacity, alloc, physical):
     # always use the freshest data
     nextPhysSize = physical + drive.VOLWM_CHUNK_MB * constants.MEGAB

     # NOTE: the intent of this check is to prevent faulty images from
     # tricking qemu into requesting extremely large extensions (BZ#998443).
     # The definitive check would probably be comparing the allocated
     # space with capacity + format_overhead. However, given that:
     #
     # - format_overhead is tricky to compute (it depends on a few
     #   assumptions that may change in the future, e.g. cluster size)
     # - currently we only allow extending by one chunk at a time
     #
     # the current check compares alloc with the next volume size.
     # Note that alloc cannot be directly compared with the volume's
     # physical size, as it also includes clusters not yet written
     # (pending).
     if alloc > nextPhysSize:
         pause_vm_using_the_disk()
         raise Exception

     return physical - alloc < drive.watermarkLimit()

- the drive's watermarkLimit is expressed as a percentage of the drive's
  apparent size, possibly adjusted to accommodate live storage migration.

  constants:

 VOLWM_CHUNK_MB = config.getint('irs', 'volume_utilization_chunk_mb')  # default: 1024
 VOLWM_FREE_PCT = 100 - config.getint('irs', 'volume_utilization_percent')  # default: 50
 VOLWM_CHUNK_REPLICATE_MULT = 2  # Chunk multiplier during replication

  def volExtensionChunk(drive):
    """
    Returns the volume extension chunks (used for the thin provisioning
    on block devices). The value is based on the vdsm configuration but
    can also dynamically change according to the VM needs (e.g. increase
    during a live storage migration).
    """
    if drive.isDiskReplicationInProgress():
        return drive.VOLWM_CHUNK_MB * drive.VOLWM_CHUNK_REPLICATE_MULT
    return drive.VOLWM_CHUNK_MB

  def watermarkLimit(drive):
    """
    Returns the watermark limit, when the LV usage reaches this limit an
    extension is in order (thin provisioning on block devices).
    """
    return (drive.VOLWM_FREE_PCT * volExtensionChunk(drive) *
            constants.MEGAB / 100)

HTH
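
Mapping that logic onto the requested event, the watermark would only need to be registered once per extension instead of being re-checked on every poll. A rough sketch, reusing the VDSM-style names from the pseudo-code above (drive.name as the disk target is assumed for illustration) together with the virDomain.setBlockThreshold() call that eventually landed (comment 5):

  def register_threshold(dom, drive, physical):
      # Fire the event once the highest written offset crosses
      # (physical size - free-space watermark), which is the same
      # condition the polling loop checks today.
      threshold = physical - watermarkLimit(drive)
      dom.setBlockThreshold(drive.name, threshold)   # drive.name e.g. 'vda'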

Comment 4 Eric Blake 2015-05-19 12:48:02 UTC
Current libvirt proposal for exposing this:
https://www.redhat.com/archives/libvir-list/2015-May/msg00580.html

Comment 5 Eric Blake 2017-04-11 13:54:14 UTC
Now upstream in v3.2.0, culminating with this commit:

commit 91c3d430c96ca365ae40bf922df3e4f83295e331
Author: Peter Krempa <pkrempa>
Date:   Thu Mar 16 14:37:56 2017 +0100

    qemu: stats: Display the block threshold size in bulk stats
    
    Management tools may want to check whether the threshold is still set if
    they missed an event. Add the data to the bulk stats API where they can
    also query the current backing size at the same time.
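
For completeness, a minimal sketch of how the pieces fit together with libvirt >= 3.2.0 via the libvirt-python bindings (domain name, device name and threshold value are placeholders):

  import libvirt

  conn = libvirt.open('qemu:///system')
  dom = conn.lookupByName('example-vm')

  # Request the event once the guest writes past 9 GiB on vda.
  dom.setBlockThreshold('vda', 9 * 1024 * 1024 * 1024)

  # The registered value is reported back in the bulk stats as
  # 'block.<num>.threshold', next to the allocation and physical size,
  # so a management application that missed the event can re-check it.
  stats = conn.domainListGetStats([dom], libvirt.VIR_DOMAIN_STATS_BLOCK)
  _, block_stats = stats[0]
  for key, value in sorted(block_stats.items()):
      if key.endswith(('.name', '.allocation', '.threshold')):
          print(key, value)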