Bug 1686274 - Cannot start a guest which has cachetune + memorytune settings
Summary: Cannot start a guest which has cachetune + memorytune settings
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 8.0
Assignee: Martin Kletzander
QA Contact: Luyao Huang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-07 07:16 UTC by Luyao Huang
Modified: 2020-11-14 06:59 UTC
CC: 11 users

Fixed In Version: libvirt-5.0.0-9.el8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-07 10:41:10 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments: none


Links:
Red Hat Product Errata RHBA-2019:2395 (last updated 2019-08-07 10:41:22 UTC)

Description Luyao Huang 2019-03-07 07:16:56 UTC
Description of problem:
Cannot start a guest which has cachetune + memorytune settings

Version-Release number of selected component (if applicable):
libvirt-5.0.0-6.module+el8+2860+4e0fe96a.x86_64

How reproducible:
100%

Steps to Reproduce:
1. prepare a host that supports Intel RDT
2. mount the resctrl filesystem:

# mount -t resctrl resctrl /sys/fs/resctrl
# echo "L3:0=0ff;1=0ff" > /sys/fs/resctrl/schemata

3. restart libvirtd
4. prepare an inactive guest which has cachetune + memorytune elements like this:

  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='3' unit='MiB'/>
      <cache id='1' level='3' type='both' size='3' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
    <memorytune vcpus='1'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
  </cputune>
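
(A note on how these sizes map to masks: per the pqos output in comment 3, the host CBM is 0x7ff, i.e. 11 bits.  Assuming roughly 1 MiB of L3 per bit on this host, the 3 MiB request rounds to 3 bits; with the default group holding 0x0ff, the free bits are 8-10, which is the 'L3:0=700;1=700' seen in the debug log below.)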

5. try to start the guest:

# virsh start vm1
error: Failed to start domain vm1
error: unsupported configuration: Not enough room for allocation of 20% bandwidth on node 0, available bandwidth 0%

check the libvirtd debug log:
2019-03-07 06:42:56.691+0000: 29277: debug : virResctrlAllocCreate:2400 : Writing resctrl schemata 'L3:0=700;1=700
' into '/sys/fs/resctrl/qemu-4-vm1-vcpus_2-4/schemata'


6. remove the memorytune settings and start the guest again:

# virsh dumpxml vm1
  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='3' unit='MiB'/>
      <cache id='1' level='3' type='both' size='3' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
  </cputune>


# virsh start vm1
Domain vm1 started

7. check the resctrl schemata; note that MB has the default value of 100%:

# cat /sys/fs/resctrl/qemu-5-vm1-vcpus_2-4/schemata 
    L3:0=700;1=700
    MB:0=100;1=100

libvirtd debug log:

2019-03-07 06:43:50.065+0000: 29277: debug : virResctrlAllocCreate:2400 : Writing resctrl schemata 'L3:0=700;1=700
' into '/sys/fs/resctrl/qemu-5-vm1-vcpus_2-4/schemata'
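
The kernel's default-fill behavior can be reproduced without libvirt; a minimal sketch, assuming MBA is enabled and the group name 'demo' is unused:

# mkdir /sys/fs/resctrl/demo
# echo "L3:0=700;1=700" > /sys/fs/resctrl/demo/schemata
# cat /sys/fs/resctrl/demo/schemata
    L3:0=700;1=700
    MB:0=100;1=100
# rmdir /sys/fs/resctrl/demo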


Actual results:

Cannot start a guest with a valid config in step 5.

Expected results:

The guest starts successfully in step 5, or a useful error is reported.

Additional info:

From the debug log you can see that libvirt only sets the L3 part and writes no MB line, but the kernel adds a default MB value of 100%, which causes the check in virResctrlAllocMemoryBandwidth to fail when libvirt tries to create another directory.
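
Based on that, the accounting that fails in step 5 appears to be roughly:

    node 0 total bandwidth:                       100%
    MB of qemu-4-vm1-vcpus_2-4 (kernel default): -100%
    remaining for vcpus_1:                          0%  < 20% requested -> error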

Comment 1 Martin Kletzander 2019-03-08 16:47:07 UTC
(In reply to Luyao Huang from comment #0)
Hi, thanks for reporting this.  I don't have access to a machine with MBA support, so I cannot check that, but I'm guessing you have yet another resctrl group (not the default /sys/fs/resctrl, but some subdirectory of it) which has 100% allocated.
It works roughly the same way as the L3 allocation.  Even though libvirt could just write `MB:0=20;1=20` in /schemata, that would not _guarantee_ that it gets the memory bandwidth allocated.  I, personally, am not sure how to interpret the documentation about bandwidth allocation, not only for libvirt but also for the kernel.  Let me know what allocations you have in the system; something like the output of:

  find /sys/fs/resctrl -name schemata | xargs head

or similar.

Comment 2 Luyao Huang 2019-03-11 07:49:31 UTC
(In reply to Martin Kletzander from comment #1)

Hi Martin,

I only mount resctrl once, under /sys/fs/resctrl, and this is the output with no active VM:

# find /sys/fs/resctrl -name schemata | xargs head
    L3:0=0ff;1=0ff
    MB:0=100;1=100

And the test result when more than one guest sets memorytune and the total bandwidth is > 100%:

# find /sys/fs/resctrl -name schemata | xargs head
==> /sys/fs/resctrl/qemu-15-vm2-vcpus_2-4/schemata <==
    L3:0=200;1=200
    MB:0=100;1=100

==> /sys/fs/resctrl/qemu-14-vm1-vcpus_2-4/schemata <==
    L3:0=100;1=100
    MB:0= 20;1= 40

==> /sys/fs/resctrl/schemata <==
    L3:0=0ff;1=0ff
    MB:0=100;1=100

==> /sys/fs/resctrl/qemu-14-vm1-vcpus_1/schemata <==
    L3:0=0ff;1=0ff
    MB:0= 20;1= 40

==> /sys/fs/resctrl/qemu-15-vm2-vcpus_1/schemata <==
    L3:0=0ff;1=0ff
    MB:0= 20;1= 40

You can see that the total bandwidth is over 100% and both guests can start successfully.

And I am not sure why libvirt checks the total memory bandwidth,
since MBA is a speed limitation, like the memory cgroup.

Also from the kernel resctrl document (https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt):

2. Same bandwidth percentage may mean different actual bandwidth
   depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4
thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although
they have same percentage bandwidth of 10%. This is simply because as
threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.

If we set 4 vcpus in one group with 10% bandwidth, they can consume 40% bandwidth.
In that case one VM can consume more than 100% bandwidth if the user puts all vcpus in one group,
so our MB total check does not make sense.

Obviously this is just my opinion; corrections are most welcome.


Thanks

Comment 3 Luyao Huang 2019-03-11 08:14:18 UTC
Another interesting issue is that merely deleting a directory
under /sys/fs/resctrl does not clear the data in the MSRs, and this will affect
the next directory we create:

1. prepare a guest and set MB in the XML:

  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='2' unit='MiB'/>
      <cache id='1' level='3' type='both' size='2' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
    <memorytune vcpus='2-4'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
    <memorytune vcpus='1'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
  </cputune>

2. start guest and check resctrl:

# find /sys/fs/resctrl -name schemata | xargs head
==> /sys/fs/resctrl/qemu-16-vm1-vcpus_2-4/schemata <==
    L3:0=100;1=100
    MB:0= 20;1= 40

==> /sys/fs/resctrl/schemata <==
    L3:0=0ff;1=0ff
    MB:0=100;1=100

==> /sys/fs/resctrl/qemu-16-vm1-vcpus_1/schemata <==
    L3:0=0ff;1=0ff
    MB:0= 20;1= 40

3. destroy guest and check resctrl:
# find /sys/fs/resctrl -name schemata | xargs head
    L3:0=0ff;1=0ff
    MB:0=100;1=100

4. use pqos to read the MSRs; you will find the data has not been cleared:

# pqos -s
NOTE:  Mixed use of MSR and kernel interfaces to manage
       CAT or CMT & MBM may lead to unexpected behavior.
WARN: resctl filesystem mounted! Using MSR interface may corrupt resctrl filesystem and cause unexpected behaviour
L3CA/MBA COS definitions for Socket 0:
    L3CA COS0 => MASK 0xff
    L3CA COS1 => MASK 0x100
    L3CA COS2 => MASK 0xff
    L3CA COS3 => MASK 0x200
    L3CA COS4 => MASK 0xff
    L3CA COS5 => MASK 0x7ff
    L3CA COS6 => MASK 0x7ff
    L3CA COS7 => MASK 0x7ff
    L3CA COS8 => MASK 0x7ff
    L3CA COS9 => MASK 0x7ff
    L3CA COS10 => MASK 0x7ff
    L3CA COS11 => MASK 0x7ff
    L3CA COS12 => MASK 0x7ff
    L3CA COS13 => MASK 0x7ff
    L3CA COS14 => MASK 0x7ff
    L3CA COS15 => MASK 0x7ff
    MBA COS0 => 100% available
    MBA COS1 => 20% available
    MBA COS2 => 20% available
    MBA COS3 => 100% available
    MBA COS4 => 20% available
    MBA COS5 => 100% available
    MBA COS6 => 100% available
    MBA COS7 => 100% available
L3CA/MBA COS definitions for Socket 1:
    L3CA COS0 => MASK 0xff
    L3CA COS1 => MASK 0x100
    L3CA COS2 => MASK 0xff
    L3CA COS3 => MASK 0x200
    L3CA COS4 => MASK 0xff
    L3CA COS5 => MASK 0x7ff
    L3CA COS6 => MASK 0x7ff
    L3CA COS7 => MASK 0x7ff
    L3CA COS8 => MASK 0x7ff
    L3CA COS9 => MASK 0x7ff
    L3CA COS10 => MASK 0x7ff
    L3CA COS11 => MASK 0x7ff
    L3CA COS12 => MASK 0x7ff
    L3CA COS13 => MASK 0x7ff
    L3CA COS14 => MASK 0x7ff
    L3CA COS15 => MASK 0x7ff
    MBA COS0 => 100% available
    MBA COS1 => 40% available
    MBA COS2 => 40% available
    MBA COS3 => 100% available
    MBA COS4 => 40% available
    MBA COS5 => 100% available
    MBA COS6 => 100% available
    MBA COS7 => 100% available

5. change the guest XML and remove the MB settings:

  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='2' unit='MiB'/>
      <cache id='1' level='3' type='both' size='2' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
  </cputune>

6. start guest and recheck resctrl:

# find /sys/fs/resctrl -name schemata | xargs head
==> /sys/fs/resctrl/schemata <==
    L3:0=0ff;1=0ff
    MB:0=100;1=100

==> /sys/fs/resctrl/qemu-17-vm1-vcpus_2-4/schemata <==
    L3:0=100;1=100
    MB:0= 20;1= 40

You can see that MB is still the same as last time.
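
For completeness: on a test box the stale MSR values can be pushed back to defaults with pqos's reset option, though it goes through the MSR interface that pqos itself warns about mixing with resctrl above:

# pqos -R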

Comment 4 Martin Kletzander 2019-03-11 08:59:23 UTC
Thanks for the detailed info.  I agree that a lot of it does not make sense.  The default values after removing and re-creating the groups are something that happens with CAT values as well, and it is why we need to write the defaults there even if they are not mentioned in the XML.  However, for CAT it is easy to do that, while for MBA it does not make sense.  Anyway, we should still fix this at least in a way that makes some sense, even if we cannot do it perfectly.  The other thing, not allowing bandwidth bigger than 100% for the whole machine, is wrong because a) it does not take the default group settings into account and b) it is only a limitation, not an allocation (although it is impossible to find that in any documentation).  The fact that it is per-thread is something I would not consider an issue after both (a) and (b) are handled.  I'll send a patch for those.

Comment 5 Martin Kletzander 2019-03-11 10:25:28 UTC
Patches proposed upstream:

https://www.redhat.com/archives/libvir-list/2019-March/msg00589.html

Comment 6 Luyao Huang 2019-03-12 02:49:53 UTC
(In reply to Martin Kletzander from comment #4)

Agreed, but I am still wondering if it is a kernel bug that the MSRs are not
reset after removing the directory.

(In reply to Martin Kletzander from comment #5)
> Patches proposed upstream:
> 
> https://www.redhat.com/archives/libvir-list/2019-March/msg00589.html

Thanks a lot for your quick fix.  Before creating a build to test it, I noticed
what may be a small problem in the first patch:

+    for (i = 0; i < src_bw->nbandwidths; i++) {
+        if (dst_bw->bandwidths[i]) {
+            *dst_bw->bandwidths[i] = 123;     <----here
+            continue;
+        }

I guess you forgot to remove this line (along with the curly braces) before sending the patch :)

Comment 7 Martin Kletzander 2019-03-12 08:56:20 UTC
(In reply to Luyao Huang from comment #6)
I believe it is.  However, it is the same as with CAT, where we had to add the workaround so as not to wait for the kernel fix to land.  If you submit the patch, though, I will fully support that ;)  And thanks for noting the thing in the patch; yes, that was there for testing.

Comment 8 Martin Kletzander 2019-03-12 09:31:27 UTC
Fixed upstream with v5.1.0-173-g408aeebcef1e..v5.1.0-175-gbf8c8755dc8a:
commit 408aeebcef1e81e55bebb4f2d47403d04ee16c0f
Author: Martin Kletzander <mkletzan>
Date:   Mon Mar 11 10:23:10 2019 +0100

    resctrl: Do not calculate free bandwidth for MBA
    
commit ceb6725d945490a7153cf8b10ad3cd972d3f1c16
Author: Martin Kletzander <mkletzan>
Date:   Mon Mar 11 11:13:32 2019 +0100

    resctrl: Set MBA defaults properly
    
commit bf8c8755dc8a6d53632b90aa79ba546594714264
Author: Martin Kletzander <mkletzan>
Date:   Tue Mar 12 09:53:48 2019 +0100

    resctrl: Fix testing line

Comment 11 Luyao Huang 2019-03-13 07:47:22 UTC
(In reply to Martin Kletzander from comment #7)

Hmmm... you mean submit a patch to the kernel?  I will give it a try, but don't expect too much ;)

Anyway, I will create a kernel bug to track this issue and see if I can get some
code-level support from the kernel developers.

Comment 13 Martin Kletzander 2019-03-13 09:28:13 UTC
(In reply to Luyao Huang from comment #11)
I meant "create the BZ", not "submit the patch", my bad, sorry.  I was doing too many things at once.

Moving to POST.

Comment 18 Luyao Huang 2019-06-14 03:07:28 UTC
Verified this bug with libvirt-5.0.0-10.module+el8.0.1+3363+49e420ce.x86_64:

S1: start a guest with cachetune + memorytune

1. prepare a host that supports Intel RDT
2. mount the resctrl filesystem:

# mount -t resctrl resctrl /sys/fs/resctrl
# echo "L3:0=0ff;1=0ff" > /sys/fs/resctrl/schemata

3. restart libvirtd
4. prepare an inactive guest which has cachetune + memorytune elements like this:

  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='3' unit='MiB'/>
      <cache id='1' level='3' type='both' size='3' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
    <memorytune vcpus='1'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
  </cputune>

5. start the guest:
# virsh start vm1
Domain vm1 started

6. check the guest's live XML:
# virsh dumpxml vm1
  <cputune>
    <cachetune vcpus='2-4' id='vcpus_2-4'>
      <cache id='0' level='3' type='both' size='3' unit='MiB'/>
      <cache id='1' level='3' type='both' size='3' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
    <memorytune vcpus='2-4' id='vcpus_2-4'>
      <node id='0' bandwidth='100'/>
      <node id='1' bandwidth='100'/>
    </memorytune>
    <memorytune vcpus='1' id='vcpus_1'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
  </cputune>

7. check the libvirtd debug log and verify that libvirt sets the default values:

2019-06-13 09:30:42.502+0000: 14354: debug : virResctrlAllocCreate:2414 : Writing resctrl schemata 'L3:0=700;1=700
MB:0=100;1=100
' into '/sys/fs/resctrl/qemu-3-vm1-vcpus_2-4/schemata'

2019-06-13 09:30:42.503+0000: 14354: debug : virResctrlAllocCreate:2414 : Writing resctrl schemata 'L3:0=0ff;1=0ff
MB:0=20;1=40
' into '/sys/fs/resctrl/qemu-3-vm1-vcpus_1/schemata'
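
The same values can be cross-checked in resctrl directly; expected output, assuming the group names from the debug log above:

# head /sys/fs/resctrl/qemu-3-vm1-vcpus_*/schemata
==> /sys/fs/resctrl/qemu-3-vm1-vcpus_1/schemata <==
    L3:0=0ff;1=0ff
    MB:0= 20;1= 40

==> /sys/fs/resctrl/qemu-3-vm1-vcpus_2-4/schemata <==
    L3:0=700;1=700
    MB:0=100;1=100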

Comment 20 errata-xmlrpc 2019-08-07 10:41:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2395

