Bug 1515877 - Unable to define QoS for the 10Gbit interface
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: General
Version: 4.1.7.6
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ovirt-4.2.4
Assignee: eraviv
QA Contact: Michael Burman
Duplicates: 1316568 1464047
Reported: 2017-11-21 14:23 UTC by Arman
Modified: 2018-12-14 19:38 UTC (History)

Fixed In Version: ovirt-engine-4.2.4.2
Last Closed: 2018-06-26 08:34:18 UTC
oVirt Team: Network
rule-engine: ovirt-4.2+
ylavi: exception+
mburman: testing_plan_complete+
mburman: testing_ack+


Attachments
10Gbit interface detected on the host with right speed (66.46 KB, image/png)
2017-11-21 14:23 UTC, Arman


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Knowledge Base (Solution) | 3529041 | 0 | None | None | None | 2018-12-14 19:36:11 UTC
oVirt gerrit | 90799 | 0 | master | MERGED | core: QoS - support larger bit rates | 2018-05-29 09:02:17 UTC
oVirt gerrit | 90813 | 0 | master | MERGED | core: extend QoS max bit rate to 34Gbit/sec | 2018-05-29 09:03:26 UTC
oVirt gerrit | 91223 | 0 | master | MERGED | webadmin: fix VM network QoS displayed units | 2018-05-29 09:03:28 UTC
oVirt gerrit | 91245 | 0 | master | MERGED | api-model: Network QoS - explain bit rate units | 2018-06-04 12:11:18 UTC
oVirt gerrit | 91303 | 0 | master | MERGED | core: fix VM network QoS bit rate conversion | 2018-05-29 08:43:17 UTC
oVirt gerrit | 91765 | 0 | ovirt-engine-4.2 | MERGED | core: fix VM network QoS bit rate conversion | 2018-05-30 08:31:01 UTC
oVirt gerrit | 91766 | 0 | ovirt-engine-4.2 | MERGED | core: QoS - support larger bit rates | 2018-05-30 08:31:06 UTC
oVirt gerrit | 91767 | 0 | ovirt-engine-4.2 | MERGED | core: extend QoS max bit rate to 34Gbit/sec | 2018-05-30 08:31:11 UTC
oVirt gerrit | 91768 | 0 | ovirt-engine-4.2 | MERGED | webadmin: fix VM network QoS displayed units | 2018-05-30 08:31:15 UTC
oVirt gerrit | 91920 | 0 | model_4.2 | MERGED | api-model: Network QoS - explain bit rate units | 2018-06-04 12:24:34 UTC

Description Arman 2017-11-21 14:23:21 UTC
Created attachment 1356729
10Gbit interface detected on the host with right speed

Description of problem:
We have hosts with 10Gbit interfaces. I am not sure whether this is a bug or a missing feature.
Defining a network for the VMs with QoS > 1Gbit fails.



A VM with a NIC on the 10Gbit network shows 100-400% saturation under high load.

Expected behavior:
1) Add a 10G interface to the VM.
1.1) Set the performance limit on the VM to 8Gbit instead of the default 1Gbit.
2) Define QoS at 80% of the host's 10Gbit NIC.
3) Profit: a 10Gbit interface on the VM without polluting the logs with network-overload messages.

Version-Release number of selected component (if applicable): 4.1.7.6

Comment 1 Dan Kenigsberg 2017-11-21 16:24:31 UTC
Seems like you're seeing a dup of bug 1316568. But your definition is more concise.



 Yevgeny Zaspitsky 2016-07-04 13:27:06 IDT

Looks like the overflow is on the engine side. It holds values as Integers, which are signed 32-bit in Java.
The ul.m2 and rt.m2 values are multiplied by 8 before sending to VDSM, so that could be an additional limitation.
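
To illustrate the kind of overflow described above, here is a minimal Python sketch that wraps a value the way a signed 32-bit Java int would. The helper name to_int32 and the sample rates are hypothetical and are not taken from the engine code; the exact arithmetic engine performs before sending values to VDSM is not shown in this report, so the bit/s conversion below is only an assumption.

```python
INT32_MAX = 2**31 - 1  # same value as Java's Integer.MAX_VALUE


def to_int32(value: int) -> int:
    """Wrap an arbitrary integer into the signed 32-bit range (two's complement)."""
    return (value + 2**31) % 2**32 - 2**31


# Hypothetical example: rates given in Mbit/s, expressed in bit/s (assumed unit).
for mbps in (1_000, 3_000, 10_000):
    bits_per_sec = mbps * 1_000_000
    # Print the rate, its bit/s value, the wrapped 32-bit value,
    # and whether the bit/s value exceeds the 32-bit maximum.
    print(mbps, bits_per_sec, to_int32(bits_per_sec), bits_per_sec > INT32_MAX)
```

Under these assumptions, 1 Gbit/s still fits in 32 bits, while larger rates wrap to a different and possibly negative number.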

Comment 2 Dan Kenigsberg 2017-11-21 16:25:22 UTC
*** Bug 1316568 has been marked as a duplicate of this bug. ***

Comment 3 Sandro Bonazzola 2018-06-01 07:14:48 UTC
Moving back to post because referenced patch https://gerrit.ovirt.org/#/c/91245/ is still open.

Comment 4 Yaniv Kaul 2018-06-07 09:50:27 UTC
(In reply to Sandro Bonazzola from comment #3)
> Moving back to post because referenced patch
> https://gerrit.ovirt.org/#/c/91245/ is still open.

Now I assume it can move to MODIFIED.

Comment 5 Michael Burman 2018-06-10 14:33:51 UTC
Eitan, Dan,
This fix has caused a regression for us. The values now passed to libvirt do not match what was configured in the UI.

This was discovered during the first automation run on the new build on tier2.

For example, in the UI I set:
In Average=10
In Peak=1000
In Burst=100

Out Average=10
Out Peak=1000
Out Burst=100

In the XML I get:

<bandwidth>
  <inbound average='1250' peak='125000' burst='102400'/>
  <outbound average='1250' peak='125000' burst='102400'/>
</bandwidth>

My question is: should I report a new regression bug that blocks this one, or should I fail this bug? Either way, this can't be verified as it is now.

Comment 6 eraviv 2018-06-10 18:20:20 UTC
The conversion performed by engine was fixed and that's why you have a 'regression':

Engine expects the user to enter rate values in Mbps (megabits per second), as declared in the UI and in engine config, but before the fix engine multiplied the user-provided rates by 128.
Since libvirt declares it expects kilobytes [1], there are two possibilities:
1. there was an error in the units declared by engine for user input.
2. the multiplication factor was wrong
Since the documentation [2] declares that user input should be in mbps as well, option 1 is not valid.

So this fix changes the multiplication factor to 125: from mbps to KB. Therefore:
10 -> 1250
1000 -> 125000


[1] https://libvirt.org/formatnetwork.html#elementQoS
[2] https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.2/html-single/administration_guide/#Creating_a_Host_Network_Quality_of_Service_Entry
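
The conversion described in this comment can be sketched as follows. This is only an illustration of the arithmetic, not the engine code: the function and constant names are hypothetical, and a kilobyte is assumed to be 1000 bytes.

```python
OLD_FACTOR = 128  # factor used before the fix, per the comment above
NEW_FACTOR = 125  # 1 Mbit = 1,000,000 bits = 125,000 bytes = 125 KB


def mbps_to_kilobytes_per_sec(mbps: int, factor: int = NEW_FACTOR) -> int:
    """Convert a rate in megabits per second to kilobytes per second."""
    return mbps * factor


for rate in (10, 1000):
    # Print the input rate, the value under the old factor, and under the new one.
    print(rate, mbps_to_kilobytes_per_sec(rate, OLD_FACTOR), mbps_to_kilobytes_per_sec(rate))
```

The old factor equals 1024 / 8, which presumably treated a megabit as 1024 kilobits; with the new factor, 10 maps to 1250 and 1000 maps to 125000.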

Comment 7 Michael Burman 2018-06-11 05:44:35 UTC
(In reply to eraviv from comment #6)

I see.
Eitan, next time you make such changes, please let us know. You made a lot of changes here, and some of them affect our automation tests, which expect specific values.
This change was not even part of the original bug reports. Looking more carefully now, this fix includes multiple changes for both host and VM QoS. I was not aware of that, and it requires more testing effort from me. This bug should have been split into several bugs, BTW. I also think the summary should be changed.

So, the things to test here are (notes for myself):
- Enable 10Gb host QoS on both engine and vdsm; verify that vdsm accepts the values and reports them
- Test 10Gb on hosts
- Enable 10Gb on VM QoS
- Test 10Gb on VMs
- Test the new bandwidth reported in the XML and change the automation test for VM QoS.

Comment 8 Michael Burman 2018-06-11 05:53:36 UTC
I also see that you changed the default engine-config values to:

MaxPeakNetworkQoSValue: 34359 version: general

MaxAverageNetworkQoSValue: 17179 version: general

As I wrote in the previous comment, this should have been split into at least 4 different bugs. Next time, please do so.

Comment 9 eraviv 2018-06-11 07:31:23 UTC
The fix for this bug contains several changes:

1. Set the default MaxPeakNetworkQoSValue in engine-config to 34359 Megabits per sec, the maximum QoS value supported by traffic control (see man tc #PARAMETERS). This change allows the user to set higher bit rates in engine.

2. Set the default MaxAverageNetworkQoSValue in engine-config to half the peak (as it was before the change), namely 17179 Megabits per sec.

3. Correct the conversion factor engine uses to convert user input of VM QoS rates from Megabits per sec to Kilobytes per sec, the unit required by libvirt. The conversion factor was 128 and is now 125 (1 Megabit = 125 Kilobytes).

4. Enable engine to process values larger than Java's Integer.MAX_VALUE. Before the fix, this limitation caused engine to send negative values to vdsm as Host QoS rates.

5. Enable engine db to store values larger than PostgreSQL's smallint for VM QoS. Before the fix, this limitation prevented engine from storing large VM QoS values.
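
The numbers in the list above can be cross-checked with a little arithmetic. The sketch below assumes that the traffic control limit comes from a 32-bit rate field expressed in bytes per second; this report does not state that explicitly, so treat it as an assumption. All constant names are hypothetical.

```python
TC_MAX_BYTES_PER_SEC = 2**32 - 1  # assumption: unsigned 32-bit rate in bytes/sec
POSTGRES_SMALLINT_MAX = 32_767    # upper bound of PostgreSQL's smallint type
JAVA_INT_MAX = 2**31 - 1          # Java's Integer.MAX_VALUE

# Largest whole number of Mbit/s that fits under the assumed tc limit.
max_peak_mbps = TC_MAX_BYTES_PER_SEC * 8 // 1_000_000
# The default average is half the peak, rounded down.
max_average_mbps = max_peak_mbps // 2

print(max_peak_mbps, max_average_mbps)
# The new peak no longer fits in a smallint column.
print(max_peak_mbps > POSTGRES_SMALLINT_MAX)
# The peak expressed in bit/s no longer fits in a signed 32-bit integer.
print(max_peak_mbps * 1_000_000 > JAVA_INT_MAX)
```

Under this assumption the results match the new defaults of 34359 and 17179, and they show why both the integer type and the database column type had to be widened along with the limits.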

Notes: 

1. The fix for both VM QoS and Host QoS was done under one bug because MaxPeakNetworkQoSValue and MaxAverageNetworkQoSValue influence both VM QoS and Host QoS; changing them without solving all of the above issues for both VM and Host QoS would have created regressions.

2. Verifications I performed for this bug during development:
- setting large values within the 34359 Mbps limit is possible for both types of QoS without any errors in engine or vdsm.
- making sure tc on the host reflects the values configured by engine (tc class show dev eth0)
- making sure libvirt accepts the bandwidth values passed to it by engine (dumpxml#bandwidth) and powers up the VM with them. If not, a traffic control error is returned to engine (engine.log).
- setting values higher than 34359 Mbps triggers errors.

Comment 10 Michael Burman 2018-06-11 08:33:17 UTC
OK, so based on this:

VM QoS UI - It is now possible to set 34359 Mbps for Peak and 17179 Mbps for Average.

Host QoS UI - It is now possible to set 34359 Mbps for Peak and 17179 Mbps for Average. Note that, unlike VM QoS, host QoS with the new default values validates Peak against 17179 Megabits. To set the maximum in Peak, we also need to raise the Average maximum via engine-config.

Host QoS - 

"hostQos": {
                "out": {
                    "rt": {
                        "m1": 0, 
                        "d": 0, 
                        "m2": 17179000000
                    }, 
                    "ul": {
                        "m1": 0, 
                        "d": 0, 
                        "m2": 34359000000
                    }, 
                    "ls": {
                        "m1": 0, 
                        "d": 0, 
                        "m2": 80


[root@orchid-vds1 ~]# tc class show dev enp4s0
class hfsc 1389: root 
class hfsc 1389:1388 parent 1389: leaf 1388: rt m1 0bit d 0us m2 17179Mbit ls m1 0bit d 0us m2 640bit ul m1 0bit d 0us m2 34359Mbit 

 "stp": "off", 
            "hostQos": {
                "out": {
                    "rt": {
                        "m1": 0, 
                        "d": 0, 
                        "m2": 10000000000
                    }, 
                    "ul": {
                        "m1": 0, 
                        "d": 0, 
                        "m2": 10000000000
                    }, 
                    "ls": {
                        "m1": 0, 
                        "d": 0, 
                        "m2": 80

[root@orchid-vds1 ~]# tc class show dev enp4s0
class hfsc 1389: root 
class hfsc 1389:1388 parent 1389: leaf 1388: rt m1 0bit d 0us m2 10Gbit ls m1 0bit d 0us m2 640bit ul m1 0bit d 0us m2 10Gbit 

VM QoS - conversion factor was: 128. now: 125

<bandwidth>
  <inbound average='1250000' peak='1875000' burst='1024000'/>
  <outbound average='1250000' peak='1875000' burst='1024000'/>
</bandwidth>

<bandwidth>
  <inbound average='2147375' peak='4294875' burst='102400'/>
  <outbound average='2147375' peak='4294875' burst='102400'/>
</bandwidth>

- burst is still reported incorrectly; this will be handled in BZ 1580285

Verified on - 4.2.4.2-0.1.el7_3 and vdsm-4.20.30-1.el7ev.x86_64
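
The average and peak values reported in this comment can be re-derived with a short sketch. The function names are hypothetical. The inputs of 10000 and 15000 Mbps for the first bandwidth block are inferred from the reported values rather than stated here, and burst is left out because it is tracked separately.

```python
def host_qos_bits_per_sec(mbps: int) -> int:
    """Host QoS rate as it appears in the reported hostQos m2 field (bit/s)."""
    return mbps * 1_000_000


def vm_qos_kilobytes_per_sec(mbps: int) -> int:
    """VM QoS rate as it appears in the libvirt <bandwidth> element (KB/s)."""
    return mbps * 125


# Host QoS: default maximums and the 10Gbit case.
print(host_qos_bits_per_sec(17179), host_qos_bits_per_sec(34359), host_qos_bits_per_sec(10000))
# VM QoS: inferred inputs for the first block, default maximums for the second.
print(vm_qos_kilobytes_per_sec(10000), vm_qos_kilobytes_per_sec(15000))
print(vm_qos_kilobytes_per_sec(17179), vm_qos_kilobytes_per_sec(34359))
```

The computed values agree with the m2 fields and with the average and peak attributes shown above.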

Comment 11 Sandro Bonazzola 2018-06-26 08:34:18 UTC
This bugzilla is included in oVirt 4.2.4 release, published on June 26th 2018.

Since the problem described in this bug report should be
resolved in oVirt 4.2.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 12 Dan Kenigsberg 2018-12-14 19:36:12 UTC
*** Bug 1600248 has been marked as a duplicate of this bug. ***

Comment 13 Dan Kenigsberg 2018-12-14 19:38:12 UTC
*** Bug 1464047 has been marked as a duplicate of this bug. ***

