Bug 1545732 - [RFE] Provide a way to customize the poll-max-ns property of each iothread
Summary: [RFE] Provide a way to customize the poll-max-ns property of each iothread
Keywords:
Status: CLOSED DUPLICATE of bug 1688090
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: John Ferlan
QA Contact: Han Han
URL:
Whiteboard:
Duplicates: 1536398 (view as bug list)
Depends On:
Blocks: 1477664 1651787 1688090
 
Reported: 2018-02-15 14:42 UTC by Sergio Lopez
Modified: 2023-09-07 19:03 UTC (History)
CC List: 20 users

Fixed In Version: libvirt-5.0.0-1.el8
Doc Type: Enhancement
Doc Text:
Feature: Add the capability to display and set the dynamic IOThread polling values poll-max-ns, poll-grow, and poll-shrink.
Reason: Allow greater control of values that can affect performance.
Result (from what I wrote for the docs): The 'virsh domstats $DOM --iothread' command option and the 'virsh iothreadset $DOM' command were added; the former returns IOThread statistics if available. IOThread polling is a timing mechanism that lets the hypervisor grant a longer period of time in which the guest keeps performing operations on the CPU being used by the IOThread. The higher the value of poll-max-ns, the longer the guest keeps that CPU, which may affect other host threads using the CPU. The poll-grow and poll-shrink values give the hypervisor a mechanism to add or remove polling time within the confines of 0 and poll-max-ns. For QEMU, poll-grow is multiplied by the polling interval, while poll-shrink is used as a divisor. When not provided, QEMU may double the polling time until poll-max-ns is reached. When poll-shrink is 0 (zero), QEMU may reset the polling interval to 0 until it finds its "sweet spot". Setting poll-grow too large may cause frequent fluctuation of the polling time; this can be tempered by a high poll-shrink to reduce the polling interval. For example, a poll-grow of 3 triples the polling time, which could quickly exceed poll-max-ns, while a poll-shrink of 10 would cut that polling time back down.
Clone Of:
Clones: 1688090 (view as bug list)
Environment:
Last Closed: 2019-04-04 02:38:01 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments

Description Sergio Lopez 2018-02-15 14:42:10 UTC
Description of problem:

In QEMU, the default value for poll-max-ns is 32000 (32us). Recent tests have demonstrated that setting a large value (1000000, i.e. 1000us) for poll-max-ns cuts roughly 30us off the average latency on certain workloads, at the expense of the IOThread burning more CPU cycles.

In some scenarios where IO latency is critical and there's enough CPU to spend, it may be desirable to customize this value.

To make this option available to layered products, the first step is implementing it in libvirt.
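
For reference, poll-max-ns is a property of QEMU's iothread object, so the behaviour can already be tried by hand ahead of any libvirt support. A minimal sketch, assuming a guest started directly with QEMU (the object id and the 1000us value are only illustrative):

  -object iothread,id=iothread0,poll-max-ns=1000000

or, on an already running guest, via the HMP monitor (the object path is likewise illustrative):

  (qemu) qom-set /objects/iothread0 poll-max-ns 1000000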


Results:

------------------------------------
| virtio-scsi+iothread aio=threads |
------------------------------------

randread: (groupid=0, jobs=1): err= 0: pid=1287: Thu Feb 15 06:51:33 2018
   read: IOPS=3192, BW=12.5MiB/s (13.1MB/s)(125MiB/10001msec)
    clat (usec): min=248, max=3439, avg=305.51, stdev=54.66
     lat (usec): min=249, max=3440, avg=306.80, stdev=54.86

randread: (groupid=0, jobs=4): err= 0: pid=1305: Thu Feb 15 06:53:08 2018
   read: IOPS=14.9k, BW=58.2MiB/s (61.1MB/s)(582MiB/10001msec)
    clat (usec): min=227, max=3167, avg=261.11, stdev=27.89
     lat (usec): min=228, max=3169, avg=262.31, stdev=27.89

randread: (groupid=0, jobs=8): err= 0: pid=1323: Thu Feb 15 06:53:49 2018
   read: IOPS=29.4k, BW=115MiB/s (120MB/s)(1147MiB/10002msec)
    clat (usec): min=44, max=2477, avg=265.04, stdev=22.99
     lat (usec): min=45, max=2478, avg=266.27, stdev=23.00

randread: (groupid=0, jobs=16): err= 0: pid=1299: Thu Feb 15 06:42:53 2018
   read: IOPS=56.2k, BW=219MiB/s (230MB/s)(2195MiB/10003msec)
    clat (usec): min=40, max=3650, avg=277.11, stdev=47.52
     lat (usec): min=41, max=3651, avg=278.40, stdev=47.61

--------------------------------------------------------
| virtio-scsi+iothread aio=threads poll-max-ns=1000000 |
--------------------------------------------------------

randread: (groupid=0, jobs=1): err= 0: pid=1385: Thu Feb 15 06:49:28 2018
   read: IOPS=3979, BW=15.5MiB/s (16.3MB/s)(155MiB/10001msec)
    clat (usec): min=239, max=2319, avg=244.63, stdev=19.86
     lat (usec): min=240, max=2320, avg=245.75, stdev=19.87

randread: (groupid=0, jobs=4): err= 0: pid=1379: Thu Feb 15 06:47:22 2018
   read: IOPS=15.7k, BW=61.3MiB/s (64.2MB/s)(613MiB/10002msec)
    clat (usec): min=115, max=3395, avg=247.89, stdev=23.20
     lat (usec): min=116, max=3410, avg=249.09, stdev=23.34

randread: (groupid=0, jobs=8): err= 0: pid=1357: Thu Feb 15 06:46:36 2018
   read: IOPS=31.7k, BW=124MiB/s (130MB/s)(1239MiB/10001msec)
    clat (usec): min=35, max=3067, avg=245.16, stdev=27.21
     lat (usec): min=36, max=3068, avg=246.37, stdev=27.24

randread: (groupid=0, jobs=16): err= 0: pid=1337: Thu Feb 15 06:45:49 2018
   read: IOPS=63.5k, BW=248MiB/s (260MB/s)(2479MiB/10001msec)
    clat (usec): min=34, max=3356, avg=244.64, stdev=53.25
     lat (usec): min=35, max=3357, avg=245.89, stdev=53.26

Comment 4 John Ferlan 2018-04-16 20:49:05 UTC
It seems this particular knob was attempted and rejected upstream once before:

https://www.redhat.com/archives/libvir-list/2017-February/msg01047.html

Comment 5 Sergio Lopez 2018-04-17 06:03:27 UTC
(In reply to John Ferlan from comment #4)
> This particular knob it seems was attempted and rejected upstream once
> before:
> 
> https://www.redhat.com/archives/libvir-list/2017-February/msg01047.html

I've tried to resurrect that thread [1] but so far have received no answer.

I do understand the idea of having as few knobs as possible, trying to make the software clever enough to take the best approach on behalf of the user. But in this particular case there is no "best" approach, only a trade-off between physical CPU time and I/O latency.

There's no way QEMU or libvirt can make that decision by themselves. This is something users must decide for themselves (possibly with help/hints from layered products) on a per-VM basis, taking into account the nature of the workload they expect to support.

As Daniel was the one who expressed his concerns about this knob, I'd like to ask him again about this.

Comment 6 Sergio Lopez 2018-04-17 06:04:25 UTC
Forgot to add the reference.

[1] https://www.redhat.com/archives/libvir-list/2018-March/msg01649.html

Comment 7 Daniel Berrangé 2018-04-30 11:48:35 UTC
What's written in that thread still reflects my views in this area. While guest admins have some knowledge of their VM's workload, they have limited, often zero, knowledge of the hosts' workloads or storage stack. From an OSP product POV this knob is completely impractical to expose to users for this reason. In RHEV it might be more practical to use if the guest admin is the same person as the host admin and the workloads on a host are static, but it is still an unsatisfactory interface because the admin has to just pick values out of the air and hope they're actually measuring the right thing. We don't even provide good help to admins to explain how to accurately benchmark storage. I find this kind of tuning of internal implementation details of low-level QEMU components impractical to use / unsatisfactory even for higher-level mgmt apps.

Comment 8 Sergio Lopez 2018-05-04 14:28:24 UTC
Daniel,

That position makes sense from OSP's perspective. But in oVirt deployments it is very common that a single group of admins is responsible for both VMs and hosts. oVirt can easily abstract poll-max-ns behind a group of options along the lines of [low|medium|high|custom], as it already does with many other options and tunables.

Also, RHV's documentation can easily instruct users on how to choose an appropriate value by measuring the latency of their storage with "fio", as sketched below.

So, I think the sensible course of action here is to expose poll-max-ns in libvirt, allowing oVirt to introduce this *very* convenient optimization for its users. If OSP is not interested in this, Nova can simply ignore it.
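
For illustration, such a measurement could be as simple as a short fio run against the storage in question; the parameters below are assumptions on my part, purely to sketch the idea:

# fio --name=randread --rw=randread --ioengine=libaio --direct=1 --bs=4k \
      --runtime=10 --time_based --numjobs=4 --group_reporting --filename=/dev/vdb

and then comparing the reported 'clat'/'lat' averages against the latency target for the workload.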

Sergio (slp).

Comment 10 Michal Skrivanek 2018-09-13 07:19:28 UTC
(In reply to Daniel Berrange from comment #7)
> I find this kind of tuning of internal impl details of
> low level QEMU components impractical to use / unsatisfactory even for
> higher level mgmt apps.

tuned, then? But then you quite likely don't want it for all guests. Without exposing it to management apps I'm not sure what other solution would be practical short-term. RHV has performance profiles per VM, so the complexity of low-level knobs is not that important (or rather, this parameter is no worse than many others; see the Hyper-V enlightenment settings).
For us, libvirt is a tool, not the end-user solution; therefore we need lower-level knobs even when they don't make sense to libvirt itself.

Comment 13 John Ferlan 2018-09-25 14:05:35 UTC
(In reply to Sergio Lopez from comment #0)
> Description of problem:
> 
> In QEMU, the default value for poll-max-ns is 32000 (32us). Recent tests
> have demonstrated that setting large values (1000000 == 1000us) for
> poll-max-ns helps cutting down 30us of the average latency on certain
> workloads, at the expense of having the IOThread burning more CPU cycles.

I've been asked to look at this once again, but first I wanted to provide some feedback. I'm not a performance expert, but I do have an availability-, performance-, and fix-based product/solution in my background, so I do understand the concepts and trade-offs.

First - the data below is hard to read/comprehend based on the information provided. It wasn't even clear what tooling was used, but a quick Google search did provide some results:

    clat: This is the time to complete handling the I/O request; at that point the data is at least in the page cache for buffered I/O.

    lat: The time it takes until the request has been processed. The fio HOWTO says: "This is the time from when IO leaves fio and when it gets completed."

And all of this is for reads only, which, as anyone with performance experience knows, paints only part of the picture. Once writes enter the mix (as they would in any real environment), they can have a fairly dramatic effect on the output and on perceived performance.

Second - the "groupid=#, jobs=#" data leads me to assume that only "jobs" matters, and that the number perhaps relates more to vCPUs than to host CPUs.

Third - there is no description of the test setup: the type of host used, memory, CPUs, volume/storage types, what else is occurring on the system, etc. Statistics can be generated to prove almost any theory; however, I have to assume to some degree that everything was done on one system and then try to interpret the results.

Fourth - I'm left with a series of "questions":

  1. How would one "interpret" any of the data to be better for them? 
  2. How would they find such data and then know how or what to tweak? 
  3. Could the difference in data values be altered by any other means?
  4. You've gone from 32us to 1000us - what happens at 500us? What happens at 2000us? Could 100us help one customer, but not be enough for another?
  5. How would anyone know without having/using trial and error? 
  6. At what inflection point does the extra waiting become good/bad if the default of 32us were changed in QEMU to, say, 50, 100, 250, etc.?
  7. How do we effectively communicate what is the best value? 
  8. How do we communicate what tools to use, what to look for, etc?
 
The answers perhaps don't matter as much for the specifics of this test as for the generality of the problem you're trying to fix. Just supplying a knob to turn, without supplying some amount of detail as to how best to use it and for what type of environment, is perhaps not very useful.

Think in terms of how to document what value to use and how one determines the best way to get there. It may well be that this type of change is for a very specific customer with a very specific workload, and that for the general-purpose customer altering the value is deemed a bad thing.

If altering this value has such a dramatic effect, did QEMU pick too small a default? Has approaching QEMU to raise that value been attempted? Of course, consider the downside of that for every other IOThread consumer.

> 
> In some scenarios where IO latency is critical and there's enough CPU to
> spend, it may be desirable to customize this value.
> 
> To make this option possible for layered products, the first step is having
> it implemented in libvirt.

If creating a mechanism for purely dynamic alteration of the value is feasible, then maybe that's the approach to take. That assumes, of course, that "somehow" the consumer is informed there is a problem that needs to be fixed. That is, rather than "saving" any value in an XML file, a knob is added that just modifies the running system's values.

Since it's not stated in the test setup/results, I wonder how much of an effect changing the <cputune> values for iothread_period, iothread_quota, and iothreadsched would have. It seems to me that dedicating or pinning IOThreads to specific CPUs for specific workloads could achieve results similar to "generally allowing" IOThreads to consume more CPU time per quantum; see the illustrative fragment below.
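
For illustration, those existing knobs live under <cputune> in the domain XML; a fragment along these lines (the CPU number, period/quota and scheduler values are made up for the example) pins IOThread 1 to a host CPU, leaves its CFS quota uncapped, and gives it a real-time scheduler:

  <cputune>
    <iothreadpin iothread='1' cpuset='2'/>
    <iothread_period>100000</iothread_period>
    <iothread_quota>-1</iothread_quota>
    <iothreadsched iothreads='1' scheduler='fifo' priority='1'/>
  </cputune>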

> 
> 
> Results:
> 
> ------------------------------------
> | virtio-scsi+iothread aio=threads |
> ------------------------------------
> 
> randread: (groupid=0, jobs=1): err= 0: pid=1287: Thu Feb 15 06:51:33 2018
>    read: IOPS=3192, BW=12.5MiB/s (13.1MB/s)(125MiB/10001msec)
>     clat (usec): min=248, max=3439, avg=305.51, stdev=54.66
>      lat (usec): min=249, max=3440, avg=306.80, stdev=54.86
> 
> randread: (groupid=0, jobs=4): err= 0: pid=1305: Thu Feb 15 06:53:08 2018
>    read: IOPS=14.9k, BW=58.2MiB/s (61.1MB/s)(582MiB/10001msec)
>     clat (usec): min=227, max=3167, avg=261.11, stdev=27.89
>      lat (usec): min=228, max=3169, avg=262.31, stdev=27.89
> 
> randread: (groupid=0, jobs=8): err= 0: pid=1323: Thu Feb 15 06:53:49 2018
>    read: IOPS=29.4k, BW=115MiB/s (120MB/s)(1147MiB/10002msec)
>     clat (usec): min=44, max=2477, avg=265.04, stdev=22.99
>      lat (usec): min=45, max=2478, avg=266.27, stdev=23.00
> 
> randread: (groupid=0, jobs=16): err= 0: pid=1299: Thu Feb 15 06:42:53 2018
>    read: IOPS=56.2k, BW=219MiB/s (230MB/s)(2195MiB/10003msec)
>     clat (usec): min=40, max=3650, avg=277.11, stdev=47.52
>      lat (usec): min=41, max=3651, avg=278.40, stdev=47.61
> 
> --------------------------------------------------------
> | virtio-scsi+iothread aio=threads poll-max-ns=1000000 |
> --------------------------------------------------------
> 
> randread: (groupid=0, jobs=1): err= 0: pid=1385: Thu Feb 15 06:49:28 2018
>    read: IOPS=3979, BW=15.5MiB/s (16.3MB/s)(155MiB/10001msec)
>     clat (usec): min=239, max=2319, avg=244.63, stdev=19.86
>      lat (usec): min=240, max=2320, avg=245.75, stdev=19.87
> 
> randread: (groupid=0, jobs=4): err= 0: pid=1379: Thu Feb 15 06:47:22 2018
>    read: IOPS=15.7k, BW=61.3MiB/s (64.2MB/s)(613MiB/10002msec)
>     clat (usec): min=115, max=3395, avg=247.89, stdev=23.20
>      lat (usec): min=116, max=3410, avg=249.09, stdev=23.34
> 
> randread: (groupid=0, jobs=8): err= 0: pid=1357: Thu Feb 15 06:46:36 2018
>    read: IOPS=31.7k, BW=124MiB/s (130MB/s)(1239MiB/10001msec)
>     clat (usec): min=35, max=3067, avg=245.16, stdev=27.21
>      lat (usec): min=36, max=3068, avg=246.37, stdev=27.24
> 
> randread: (groupid=0, jobs=16): err= 0: pid=1337: Thu Feb 15 06:45:49 2018
>    read: IOPS=63.5k, BW=248MiB/s (260MB/s)(2479MiB/10001msec)
>     clat (usec): min=34, max=3356, avg=244.64, stdev=53.25
>      lat (usec): min=35, max=3357, avg=245.89, stdev=53.26

Allow me to "reformat" the above *and* provide some of my feedback. Hopefully I've transposed the numbers properly - I have double checked, but still could have made a mistake.

"read" row:
  NB: The "Pct Chg" is the Delta divided by the 32us value, where a positive result means the 1000us run had a higher value than the 32us run.


         data   32us  1000us  Delta  Pct Chg
-------+------+-------+-------+------+--------
g0j1   | IOPS | 3192  | 3979  | 787  | 25
       | BW   | 12.5  | 15.5  | 3    | 24
       | ?    | 13.1  | 16.3  | 3.2  | 24
       | ?    | 125   | 155   | 30   | 24
-------+------+-------+-------+------+--------
g0j4   | IOPS | 14.9K | 15.7K | 0.8K | 5.3
       | BW   | 58.2  | 61.3  | 3.1  | 5.3
       | ?    | 61.1  | 64.2  | 3.1  | 5.1
       | ?    | 582   | 613   | 31   | 5.3
-------+------+-------+-------+------+--------
g0j8   | IOPS | 29.4K | 31.7K | 2.3K | 7.8
       | BW   | 115   | 124   | 9    | 7.8
       | ?    | 120   | 130   | 10   | 10
       | ?    | 1147  | 1239  | 92   | 8
-------+------+-------+-------+------+--------
g0j16  | IOPS | 56.2K | 63.5K | 7.3K | 13
       | BW   | 219   | 248   | 29   | 13.2
       | ?    | 230   | 260   | 30   | 13
       | ?    | 2195  | 2479  | 284  | 13
-------+------+-------+-------+------+--------

So let's start with the obvious - assuming "job: 1" is just one job running, then allowing more CPU time per transaction is going to have increased throughput. Of course the cost of that on everything else hasn't been categorized. Still, I'd have to assume that isn't the target being chased.

I also note that the 1000us results are very linear with respect to the change between j1, j4, j8, and j16. That is, 3979 * 16 = 63.6K, whereas 3192 * 16 = 51K. It's interesting to note that the j4 vs. j1 results are a little more than 4 times the percent change; however, j8 vs. j4 or j1 perhaps starts to show an inflection point at 8%. This leads me to believe there were 8 CPUs in the environment. That belief is carried into the j16 vs. j8, j4, or j1 results, where the 13% may indicate saturation. It's even more noticeable in lat/clat.

Personally, it seems there's a far greater gain from adding more jobs than from changing to 1000us, except of course for the j1 output. Results may be interpreted differently based on user perception/expectation as jobs increase. That is, 8% isn't "dramatic" - I'd think 50% would be more in the dramatic category. Of course that's just my numerical interpretation.

clat: 
  NB: I took the "liberty" of rounding numbers up

         data   32us  1000us  Delta Pct Chg
-------+-------+------+------+-----+--------
g0j1   | min   | 248  | 239  | 9   | -4
       | max   | 3439 | 2319 | 1120| -33
       | avg   | 307  | 245  | 62  | -20
       | stdev | 55   | 20   | 35  | -62
-------+-------+------+------+-----+--------
g0j4   | min   | 227  | 115  | 112 | -49
       | max   | 3167 | 3395 | 228 | 7
       | avg   | 261  | 248  | 13  | -5
       | stdev | 28   | 23   | 5   | -18
-------+-------+------+------+-----+--------
g0j8   | min   | 44   | 35   | 9   | -20
       | max   | 2477 | 3067 | 590 | 24
       | avg   | 265  | 245  | 20  | -8
       | stdev | 23   | 27   | 4   | 17
-------+-------+------+------+-----+--------
g0j16  | min   | 40   | 34   | 6   | -15
       | max   | 3650 | 3356 | 294 | -8
       | avg   | 277  | 245  | 32  | -12
       | stdev | 48   | 53   | 5   | 10
-------+-------+------+------+-----+--------

Unlike the "read" data, this data is a bit harder to analyze because it's not 100% clear which is the "most important" piece of data.

Still, for these results the "best gain" appears to be at j8, especially compared to the j16 "stdev" results. Again, the j1 results would be "expected" if you're giving more CPU time to a single job. Once those jobs start competing, though, it seems you eventually run into saturation after 8 jobs. But that's just a "gut feel" based purely on mathematical interpretation.

What is perhaps interesting is that the 'avg' for the 1000us runs doesn't deviate with the number of jobs. Compare that to the 32us runs, which outside of the j1 results show a slight drift. Taking that in conjunction with the 'stdev' results may prove something, but numerically speaking it doesn't appear too different until, of course, 16 jobs are running.

The 'min' and 'max' data don't seem to mean much; I would say they are "expected" given the job count.

In the long run, understanding what is "most important" would perhaps help interpret the results. It may well be that the j16 results are felt to be great, but comparing apples to apples it seems to me that just adding more jobs can achieve similar results.

For completeness, the following data is reinterpreted in the same manner; however, there's really no statistical difference from the 'clat' data. I would consider the differences statistically unimportant.

lat:

         data   32us  1000us  Delta Pct Chg
-------+-------+------+------+-----+--------
g0j1   | min   | 249  | 240  | 9   | -4
       | max   | 3440 | 2320 | 1120| -33
       | avg   | 307  | 246  | 61  | -20
       | stdev | 55   | 20   | 35  | -62
-------+-------+------+------+-----+--------
g0j4   | min   | 228  | 116  | 112 | -49
       | max   | 3169 | 3410 | 241 | 8
       | avg   | 261  | 249  | 13  | -5
       | stdev | 28   | 23   | 5   | -18
-------+-------+------+------+-----+--------
g0j8   | min   | 45   | 36   | 9   | -20
       | max   | 2478 | 3068 | 590 | 24
       | avg   | 266  | 246  | 20  | -8
       | stdev | 23   | 28   | 5   | 22
-------+-------+------+------+-----+--------
g0j16  | min   | 41   | 35   | 6   | -15
       | max   | 3651 | 3357 | 294 | -8
       | avg   | 279  | 246  | 32  | -12
       | stdev | 48   | 53   | 5   | 10
-------+-------+------+------+-----+--------

Comment 14 John Ferlan 2018-10-07 13:05:58 UTC
I have partially resurrected and reworked Pavel's work and posted a series upstream to get "recent" feedback:

https://www.redhat.com/archives/libvir-list/2018-October/msg00456.html

It is, I believe, a compromise of sorts between the desire to provide some knob and the argument that the knob is too low-level and that we don't want some default values to live in perpetuity.

The details are in the series, but essentially it's a dynamic mechanism to change the polling values for an IOThread.

Comment 15 John Ferlan 2018-11-19 17:29:09 UTC
After successful review, I have pushed the series upstream. The change involves a new API, virDomainSetIOThreadParams, which allows the live-only setting of poll_max_ns, poll_grow, and poll_shrink.

The virsh tool was augmented to:

   1. Add an '--iothread' qualifier to the 'domstats' command in order to display the current poll* values (if available):

# virsh domstats $DOM --iothread
Domain: '$DOM'
  iothread.count=6
  iothread.1.poll-max-ns=32768
  iothread.1.poll-grow=0
  iothread.1.poll-shrink=0
  iothread.3.poll-max-ns=32768
  iothread.3.poll-grow=0
  iothread.3.poll-shrink=0
  iothread.4.poll-max-ns=32768
  iothread.4.poll-grow=0
  iothread.4.poll-shrink=0
  iothread.5.poll-max-ns=32768
  iothread.5.poll-grow=0
  iothread.5.poll-shrink=0
  iothread.6.poll-max-ns=32768
  iothread.6.poll-grow=0
  iothread.6.poll-shrink=0
  iothread.2.poll-max-ns=32768
  iothread.2.poll-grow=0
  iothread.2.poll-shrink=0

   2. Add a new 'iothreadset' command which takes a thread id and the various poll parameters, e.g.:

# virsh iothreadset $DOM 3 --poll-max-ns 50000 --poll-grow 3 --poll-shrink 6

# virsh domstats $DOM --iothread
Domain: '$DOM'
  iothread.count=6
  iothread.1.poll-max-ns=32768
  iothread.1.poll-grow=0
  iothread.1.poll-shrink=0
  iothread.3.poll-max-ns=50000
  iothread.3.poll-grow=3
  iothread.3.poll-shrink=6
  iothread.4.poll-max-ns=32768
  iothread.4.poll-grow=0
  iothread.4.poll-shrink=0
  iothread.5.poll-max-ns=32768
  iothread.5.poll-grow=0
  iothread.5.poll-shrink=0
  iothread.6.poll-max-ns=32768
  iothread.6.poll-grow=0
  iothread.6.poll-shrink=0
  iothread.2.poll-max-ns=32768
  iothread.2.poll-grow=0
  iothread.2.poll-shrink=0


$ git show

commit 11ceedcda085fe30cf7baff65ab32a35c6d67097
Author: John Ferlan <jferlan>
Date:   Fri Oct 5 08:53:09 2018 -0400

    tools: Add virsh iothreadset command
    
    Add a command to allow for setting various dynamic IOThread polling
    interval scope (poll-max-ns, poll-grow, and poll-shrink). Describe
    the values in the virsh.pod in as generic terms as possible. The
    more specific QEMU algorithm has been divulged in the previous patch.
    
    Based heavily on code originally posted by Pavel Hrdina
    <phrdina>, but altered to only provide one command
    and to not managed a poll disabled state.
    
    Signed-off-by: John Ferlan <jferlan>
    ACKed-by: Michal Privoznik <mprivozn>


$ git describe 11ceedcda085fe30cf7baff65ab32a35c6d67097
v4.9.0-123-g11ceedcda0
$
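
As a cross-check (not part of the series, just a sketch), the same values can also be read straight out of QEMU over QMP; the exact fields returned depend on the QEMU version:

# virsh qemu-monitor-command $DOM --pretty '{"execute":"query-iothreads"}'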

Comment 18 John Ferlan 2018-12-03 18:36:47 UTC
Clearing the needsinfo since patches supporting the feature were pushed upstream.

Beyond that... yes, RHEL 7.7 is the "target", but since there was a need to create a new API, putting it into the 7.6 z-stream would perhaps "go against" accepted practices.

Comment 20 Ademar Reis 2018-12-17 16:27:58 UTC
*** Bug 1536398 has been marked as a duplicate of this bug. ***

Comment 23 jiyan 2019-03-13 02:52:22 UTC
Hi John,
I have noticed that this command has been introduced in RHEL-8, and I found the following issue.
Could you please help check whether it is a bug?

Description:
The VM does not keep its configured iothread polling values ('iothread stats') when started again after a managedsave operation

How reproducible:
100%

Version:
kernel-4.18.0-77.el8.x86_64
qemu-kvm-3.1.0-18.module+el8+2834+fa8bb6e2.x86_64
libvirt-5.0.0-6.module+el8+2860+4e0fe96a.x86_64

Steps:
1. Prepare a shut-off VM with iothread ids configured and start the VM
# virsh domstate avocado-vt-vm1
shut off

# virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started

# virsh dumpxml avocado-vt-vm1 |grep iothread
  <iothreads>3</iothreads>
  <iothreadids>
    <iothread id='1'/>
    <iothread id='4'/>
    <iothread id='2'/>
  </iothreadids>

2. Configure the following parameters for iothread 1 and check whether they take effect
# virsh iothreadset avocado-vt-vm1 1 --poll-max-ns 2147483647 --poll-grow 2147483647 --poll-shrink 2147483647 

# virsh domstats avocado-vt-vm1 --iothread
Domain: 'avocado-vt-vm1'
  iothread.count=3
  iothread.4.poll-max-ns=32768
  iothread.4.poll-grow=0
  iothread.4.poll-shrink=0
  iothread.1.poll-max-ns=2147483647
  iothread.1.poll-grow=2147483647
  iothread.1.poll-shrink=2147483647
  iothread.2.poll-max-ns=32768
  iothread.2.poll-grow=0
  iothread.2.poll-shrink=0

3. Managedsave the VM and start it again, then check the iothread stats
# virsh managedsave avocado-vt-vm1
Domain avocado-vt-vm1 state saved by libvirt

# virsh start avocado-vt-vm1
Domain avocado-vt-vm1 started

# virsh domstats avocado-vt-vm1 --iothread
Domain: 'avocado-vt-vm1'
  iothread.count=3
  iothread.4.poll-max-ns=32768
  iothread.4.poll-grow=0
  iothread.4.poll-shrink=0
  iothread.1.poll-max-ns=32768
  iothread.1.poll-grow=0
  iothread.1.poll-shrink=0
  iothread.2.poll-max-ns=32768
  iothread.2.poll-grow=0
  iothread.2.poll-shrink=0

Actual results:
As step-3 shows

Expected results:
# man virsh
       managedsave domain [--bypass-cache] [{--running | --paused}] [--verbose]
           Save and destroy (stop) a running domain, so it can be restarted from the same state at a later time.  When the virsh start command is
           next run for the domain, it will automatically be started from this saved state.  

So in step 3, the iothread stats info should be kept the same as before the managedsave operation.

Comment 24 jiyan 2019-03-13 04:09:20 UTC
Hi John,
BTW, I think it would be better to give the exact range of these three parameters in the man page (poll-max-ns: 2147483647; poll-grow: 2147483647; poll-shrink: 0-2147483647).
Besides, the following error message is not clear enough:
# virsh iothreadset avocado-vt-vm1 1
error: params in virDomainSetIOThreadParams must not be NULL

Comment 25 John Ferlan 2019-03-13 11:51:31 UTC
Presence in RHEL-8 has more to do with which libvirt upstream release was rebased into RHEL-8. This BZ is meant for RHEL-7 support.

To answer the question from Comment 23 - setting the values is meant to be a purely dynamic operation, thus only affecting the currently running guest. Saving, destroying, stopping, etc. in any manner will reset the values to their defaults, so your expectation about managedsave does not hold. This is "more or less" described in the iothreadset man page by:

If I<--live> is specified, affect a running guest. If the guest is not running an error is returned.
If I<--current> is specified or I<--live> is not specified, then handle as if I<--live> was specified.

But it is more clearly stated in the virDomainSetIOThreadParams description; see https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainSetIOThreadParams
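
In practice that means a management layer (or the admin) has to re-apply the tuning after every managedsave/start cycle; roughly, reusing the values from comment 23 for illustration:

# virsh managedsave avocado-vt-vm1
# virsh start avocado-vt-vm1
# virsh iothreadset avocado-vt-vm1 1 --poll-max-ns 2147483647 --poll-grow 2147483647 --poll-shrink 2147483647

after which 'virsh domstats avocado-vt-vm1 --iothread' shows the re-applied values for iothread 1 again.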



To answer the second question from Comment 24 - the problem with providing a range is that at some point in the future that range could change and then the documentation is wrong. This is also described in the API documentation.

Comment 26 gaojianan 2019-03-27 06:05:03 UTC
(In reply to John Ferlan from comment #25)
> Presence in RHEL-8 has more to do with which libvirt upstream release was
> rebased into RHEL-8. This bz is meant for RHEL-7 support
> 
> To answer the question from Comment 23 - the setting of the values is meant
> to be purely dynamic setting thus only affecting the currently running
> guest. Saving, destroying, stopping, etc. in any manner would reset the
> values to their defaults. So your assertion about managedsave will not be
> true. This is "more or less" described in iothreadset man page by:
> 
> If I<--live> is specified, affect a running guest. If the guest is not
> running an error is returned.
> If I<--current> is specified or I<--live> is not specified, then handle as
> if I<--live> was specified.
> 
> But more clearly stated in the virDomainSetIOThreadParams description, see
> https://libvirt.org/html/libvirt-libvirt-domain.
> html#virDomainSetIOThreadParams

Can we explain this part in detail in the man page, or give the documentation link in the man page?
I have heard many discussions about this issue so far. If we don't modify the description in the man page, I think there will be many other people asking you about this question in the future. It would also be more convenient for users to understand.

Comment 27 John Ferlan 2019-03-27 20:47:07 UTC
Not saying "too much" was kind of an engineering trade-off/decision made. This particular feature had been previously rejected upstream mainly because it's not easy to describe how the poll* values should be set and a concern over saving a value in an XML file that may at some point in the future be a detriment. So going with dynamic only was a "fair trade off".

I understand your point about the convenience of the man page over the API documentation, but personally I have a feeling someone would ask how best to set the values before wondering about save/restore.

Adding something extra to the man page is outside the bounds of this BZ, but it's still important, so I've posted the following upstream; let's see what feedback I get:

https://www.redhat.com/archives/libvir-list/2019-March/msg01898.html

Comment 28 Xuesong Zhang 2019-04-04 02:38:01 UTC
This BZ has already been cloned to RHEL 8.0 AV, so closing this BZ.

*** This bug has been marked as a duplicate of bug 1688090 ***

