Bug 1686274
| Summary: | Cannot start a guest which has cachetune + memorytune settings | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Advanced Virtualization | Reporter: | Luyao Huang <lhuang> |
| Component: | libvirt | Assignee: | Martin Kletzander <mkletzan> |
| Status: | CLOSED ERRATA | QA Contact: | Luyao Huang <lhuang> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 8.0 | CC: | dyuan, jdenemar, jinqi, jiyan, jsuchane, knoel, lhuang, mkletzan, rbalakri, xuzhang, yalzhang |
| Target Milestone: | rc | Keywords: | Upstream |
| Target Release: | 8.0 | Flags: | knoel: mirror+ |
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | libvirt-5.0.0-9.el8 | Doc Type: | If docs needed, set a value |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-08-07 10:41:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
(In reply to Luyao Huang from comment #0)

Hi, thanks for reporting this. I don't have access to a machine with MBA support, so I cannot check that, but I'm guessing you have yet another resctrl group (not the default /sys/fs/resctrl, but some subdirectory of that) which has 100% allocated. It works roughly the same way as the L3 allocation. Even though libvirt could just write `MB:0=20;1=20` in /schemata, that would not _guarantee_ that it gets the memory bandwidth allocated. I, personally, am not sure how to interpret the documentation about bandwidth allocation, not only for libvirt but also for the kernel. Let me know what allocations you have in the system; something like the output of:

find /sys/fs/resctrl -name schemata | xargs head

or similar.

(In reply to Martin Kletzander from comment #1)
> Hi, thanks for reporting this. I don't have access to a machine with MBA
> support, so I cannot check that, but I'm guessing you have yet another
> resctrl group (not the default /sys/fs/resctrl, but some subdirectory of
> that) which has 100% allocated. It works roughly the same way as the L3
> allocation. Even though libvirt could just write `MB:0=20;1=20` in
> /schemata, that would not _guarantee_ that it gets the memory bandwidth
> allocated. I, personally, am not sure how to interpret the documentation
> about bandwidth allocation, not only for libvirt but also for the kernel.
> Let me know what allocations you have in the system; something like the
> output of:
>
> find /sys/fs/resctrl -name schemata | xargs head
>
> or similar.

Hi Martin,

I only mount resctrl once, under /sys/fs/resctrl, and this is the output with no active VM:

# find /sys/fs/resctrl -name schemata | xargs head
L3:0=0ff;1=0ff
MB:0=100;1=100

And this is the result when more than one machine sets memorytune and the total bandwidth is > 100%:

# find /sys/fs/resctrl -name schemata | xargs head
==> /sys/fs/resctrl/qemu-15-vm2-vcpus_2-4/schemata <==
L3:0=200;1=200
MB:0=100;1=100

==> /sys/fs/resctrl/qemu-14-vm1-vcpus_2-4/schemata <==
L3:0=100;1=100
MB:0= 20;1= 40

==> /sys/fs/resctrl/schemata <==
L3:0=0ff;1=0ff
MB:0=100;1=100

==> /sys/fs/resctrl/qemu-14-vm1-vcpus_1/schemata <==
L3:0=0ff;1=0ff
MB:0= 20;1= 40

==> /sys/fs/resctrl/qemu-15-vm2-vcpus_1/schemata <==
L3:0=0ff;1=0ff
MB:0= 20;1= 40

You can see that the total bandwidth is over 100% and both guests start successfully. I am not sure why libvirt checks the total amount of memory bandwidth, since it is a speed limit, like the memory cgroup. Also, from the kernel resctrl document (https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt):

  2. Same bandwidth percentage may mean different actual bandwidth depending
     on # of threads:

     For the same SKU in #1, a 'single thread, with 10% bandwidth' and
     '4 thread, with 10% bandwidth' can consume upto 10GBps and 40GBps
     although they have same percentage bandwidth of 10%. This is simply
     because as threads start using more cores in an rdtgroup, the actual
     bandwidth may increase or vary although user specified bandwidth
     percentage is same.

If we set 4 vcpus in one group with 10% bandwidth, they can consume 40% of the bandwidth. In that case one VM can consume more than 100% of the bandwidth if the user puts all vcpus in one group, so our MB total check does not make much sense. Obviously this is just my opinion, corrections are most welcome.

Thanks

Another interesting issue is that merely deleting the directory
under /sys/fs/resctrl does not clear the data in the MSRs, and this affects
the next time we create the directory:
1. prepare a guest and set MB in xml:
  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='2' unit='MiB'/>
      <cache id='1' level='3' type='both' size='2' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
    <memorytune vcpus='2-4'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
    <memorytune vcpus='1'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
  </cputune>
2. start guest and check resctrl:
# find /sys/fs/resctrl -name schemata | xargs head
==> /sys/fs/resctrl/qemu-16-vm1-vcpus_2-4/schemata <==
L3:0=100;1=100
MB:0= 20;1= 40
==> /sys/fs/resctrl/schemata <==
L3:0=0ff;1=0ff
MB:0=100;1=100
==> /sys/fs/resctrl/qemu-16-vm1-vcpus_1/schemata <==
L3:0=0ff;1=0ff
MB:0= 20;1= 40
3. destroy guest and check resctrl:
# find /sys/fs/resctrl -name schemata | xargs head
L3:0=0ff;1=0ff
MB:0=100;1=100
4. use pqos to read the MSRs; you will find the data has not been cleared:
# pqos -s
NOTE: Mixed use of MSR and kernel interfaces to manage
CAT or CMT & MBM may lead to unexpected behavior.
WARN: resctl filesystem mounted! Using MSR interface may corrupt resctrl filesystem and cause unexpected behaviour
L3CA/MBA COS definitions for Socket 0:
L3CA COS0 => MASK 0xff
L3CA COS1 => MASK 0x100
L3CA COS2 => MASK 0xff
L3CA COS3 => MASK 0x200
L3CA COS4 => MASK 0xff
L3CA COS5 => MASK 0x7ff
L3CA COS6 => MASK 0x7ff
L3CA COS7 => MASK 0x7ff
L3CA COS8 => MASK 0x7ff
L3CA COS9 => MASK 0x7ff
L3CA COS10 => MASK 0x7ff
L3CA COS11 => MASK 0x7ff
L3CA COS12 => MASK 0x7ff
L3CA COS13 => MASK 0x7ff
L3CA COS14 => MASK 0x7ff
L3CA COS15 => MASK 0x7ff
MBA COS0 => 100% available
MBA COS1 => 20% available
MBA COS2 => 20% available
MBA COS3 => 100% available
MBA COS4 => 20% available
MBA COS5 => 100% available
MBA COS6 => 100% available
MBA COS7 => 100% available
L3CA/MBA COS definitions for Socket 1:
L3CA COS0 => MASK 0xff
L3CA COS1 => MASK 0x100
L3CA COS2 => MASK 0xff
L3CA COS3 => MASK 0x200
L3CA COS4 => MASK 0xff
L3CA COS5 => MASK 0x7ff
L3CA COS6 => MASK 0x7ff
L3CA COS7 => MASK 0x7ff
L3CA COS8 => MASK 0x7ff
L3CA COS9 => MASK 0x7ff
L3CA COS10 => MASK 0x7ff
L3CA COS11 => MASK 0x7ff
L3CA COS12 => MASK 0x7ff
L3CA COS13 => MASK 0x7ff
L3CA COS14 => MASK 0x7ff
L3CA COS15 => MASK 0x7ff
MBA COS0 => 100% available
MBA COS1 => 40% available
MBA COS2 => 40% available
MBA COS3 => 100% available
MBA COS4 => 40% available
MBA COS5 => 100% available
MBA COS6 => 100% available
MBA COS7 => 100% available
5. change guest xml and remove MB settings:
  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='2' unit='MiB'/>
      <cache id='1' level='3' type='both' size='2' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
  </cputune>
6. start guest and recheck resctrl:
# find /sys/fs/resctrl -name schemata | xargs head
==> /sys/fs/resctrl/schemata <==
L3:0=0ff;1=0ff
MB:0=100;1=100
==> /sys/fs/resctrl/qemu-17-vm1-vcpus_2-4/schemata <==
L3:0=100;1=100
MB:0= 20;1= 40
You can see that the MB values are still the same as last time.
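A possible manual workaround, shown here only as a sketch: explicitly write the host default MB line into the freshly created group, which should also rewrite the corresponding MSR. It assumes the defaults are L3:0=0ff;1=0ff and MB:0=100;1=100 as in /sys/fs/resctrl/schemata above, and uses the group path from step 6; this is essentially what the later libvirt fix automates.

# echo "MB:0=100;1=100" > /sys/fs/resctrl/qemu-17-vm1-vcpus_2-4/schemata   # reset MB for this group to the default
# head /sys/fs/resctrl/qemu-17-vm1-vcpus_2-4/schemata                      # MB line should now read 0=100;1=100

After this, pqos -s should also report 100% again for the class of service backing this group.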
Thanks for the detailed info. I agree that a lot of it does not make sense. The default values after removing and re-creating the groups are something that happens with CAT values as well, and that is why we need to write the defaults there even if they are not mentioned in the XML. However, for CAT it is easy to do that; with MBA it does not make sense. Anyway, we should still fix this at least in a way that will make some sense, even if we cannot do it perfectly. The other thing, not allowing more than 100% bandwidth for the whole machine, is wrong because a) it does not take the default group settings into account and b) it is only a limitation, not an allocation (although it is impossible to find that in any documentation). The fact that it is per-thread is something I would not consider an issue after both (a) and (b) are handled. I'll send a patch for those.

Patches proposed upstream:

https://www.redhat.com/archives/libvir-list/2019-March/msg00589.html

(In reply to Martin Kletzander from comment #4)
> Thanks for the detailed info. I agree that a lot of it does not make sense.
> The default values after removing and re-creating the groups are something
> that happens with CAT values as well, and that is why we need to write the
> defaults there even if they are not mentioned in the XML. However, for CAT
> it is easy to do that; with MBA it does not make sense. Anyway, we should
> still fix this at least in a way that will make some sense, even if we
> cannot do it perfectly. The other thing, not allowing more than 100%
> bandwidth for the whole machine, is wrong because a) it does not take the
> default group settings into account and b) it is only a limitation, not an
> allocation (although it is impossible to find that in any documentation).
> The fact that it is per-thread is something I would not consider an issue
> after both (a) and (b) are handled. I'll send a patch for those.

Agreed, but I am still wondering whether it is a bug in the kernel that the MSRs are not reset after removing the directory.

(In reply to Martin Kletzander from comment #5)
> Patches proposed upstream:
>
> https://www.redhat.com/archives/libvir-list/2019-March/msg00589.html

Thanks a lot for your quick fix. Before creating a build to test it, I noticed what may be a small problem in the first patch:

+    for (i = 0; i < src_bw->nbandwidths; i++) {
+        if (dst_bw->bandwidths[i]) {
+            *dst_bw->bandwidths[i] = 123;    <---- here
+            continue;
+        }

I guess you forgot to remove this line (together with the curly braces) before sending the patch :)

(In reply to Luyao Huang from comment #6)

I believe it is. However, it is the same as with CAT, where we had to do the workaround in order not to wait for the kernel fix to land. If you submit the patch, though, I will fully support that ;) And thanks for noting the thing in the patch, yes, that was there for testing.

Fixed upstream with v5.1.0-173-g408aeebcef1e..v5.1.0-175-gbf8c8755dc8a:
commit 408aeebcef1e81e55bebb4f2d47403d04ee16c0f
Author: Martin Kletzander <mkletzan>
Date:   Mon Mar 11 10:23:10 2019 +0100

    resctrl: Do not calculate free bandwidth for MBA

commit ceb6725d945490a7153cf8b10ad3cd972d3f1c16
Author: Martin Kletzander <mkletzan>
Date:   Mon Mar 11 11:13:32 2019 +0100

    resctrl: Set MBA defaults properly

commit bf8c8755dc8a6d53632b90aa79ba546594714264
Author: Martin Kletzander <mkletzan>
Date:   Tue Mar 12 09:53:48 2019 +0100

    resctrl: Fix testing line
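For reference, a quick smoke test of these changes on a build that contains them could look like the sketch below. It assumes two defined guests vm1 and vm2 whose memorytune values add up to more than 100% per node, as in comment #2, plus a cachetune-only group as in comment #3.

# virsh start vm1 && virsh start vm2               # both should start even though the MB total exceeds 100%
# find /sys/fs/resctrl -name schemata | xargs head # groups without memorytune should now carry an explicit default MB line

This matches the verification below, where the debug log shows libvirt writing MB:0=100;1=100 for the vcpus_2-4 group that has no memorytune element.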
(In reply to Martin Kletzander from comment #7)
> (In reply to Luyao Huang from comment #6)
> I believe it is. However, it is the same as with CAT, where we had to do
> the workaround in order not to wait for the kernel fix to land. If you
> submit the patch, though, I will fully support that ;) And thanks for
> noting the thing in the patch, yes, that was there for testing.

Hmmm... you mean submit a patch to the kernel? I will give it a try, but don't expect too much ;) Anyway, I will create a kernel bug to track this issue and see if I can get some code-level support from a kernel developer.

(In reply to Luyao Huang from comment #11)

I meant "create the BZ", not "submit the patch", my bad, sorry. I was doing too many things at once.

Moving to POST.

Verified this bug with libvirt-5.0.0-10.module+el8.0.1+3363+49e420ce.x86_64:
S1: start a guest with cachetune + memorytune
1. prepare a host that supports Intel RDT (a support-check sketch follows these verification steps)
2. mount resctrl dir
# mount -t resctrl resctrl /sys/fs/resctrl
# echo "L3:0=0ff;1=0ff" > /sys/fs/resctrl/schemata
3. restart libvirtd
4. prepare an inactive guest which has cachetune + memorytune elements like this:
  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='3' unit='MiB'/>
      <cache id='1' level='3' type='both' size='3' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
    <memorytune vcpus='1'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
  </cputune>
5. start guest
# virsh start vm1
Domain vm1 started
6. check guest live xml:
# virsh dumpxml vm1
  <cputune>
    <cachetune vcpus='2-4' id='vcpus_2-4'>
      <cache id='0' level='3' type='both' size='3' unit='MiB'/>
      <cache id='1' level='3' type='both' size='3' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
    <memorytune vcpus='2-4' id='vcpus_2-4'>
      <node id='0' bandwidth='100'/>
      <node id='1' bandwidth='100'/>
    </memorytune>
    <memorytune vcpus='1' id='vcpus_1'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
  </cputune>
7. check the libvirtd debug log and verify that libvirt sets the default values:
2019-06-13 09:30:42.502+0000: 14354: debug : virResctrlAllocCreate:2414 : Writing resctrl schemata 'L3:0=700;1=700
MB:0=100;1=100
' into '/sys/fs/resctrl/qemu-3-vm1-vcpus_2-4/schemata'
2019-06-13 09:30:42.503+0000: 14354: debug : virResctrlAllocCreate:2414 : Writing resctrl schemata 'L3:0=0ff;1=0ff
MB:0=20;1=40
' into '/sys/fs/resctrl/qemu-3-vm1-vcpus_1/schemata'
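For completeness, a host-side check for step 1 could look like the sketch below; the exact flag names in /proc/cpuinfo depend on the CPU and kernel, and /sys/fs/resctrl/info only exists once resctrl is mounted.

# grep -woE 'rdt_a|cat_l3|mba' /proc/cpuinfo | sort -u   # expect rdt_a, cat_l3 and mba on a CAT+MBA capable host
# mount | grep resctrl                                   # confirm the resctrl filesystem is mounted
# ls /sys/fs/resctrl/info                                # resource directories such as L3, L3_MON and MB should be listed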
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2395
Description of problem:
Cannot start a guest which has cachetune + memorytune settings

Version-Release number of selected component (if applicable):
libvirt-5.0.0-6.module+el8+2860+4e0fe96a.x86_64

How reproducible:
100%

Steps to Reproduce:
1. prepare a host that supports Intel RDT

2. mount the resctrl dir
# mount -t resctrl resctrl /sys/fs/resctrl
# echo "L3:0=0ff;1=0ff" > /sys/fs/resctrl/schemata

3. restart libvirtd

4. prepare an inactive guest which has cachetune + memorytune elements like this:

  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='3' unit='MiB'/>
      <cache id='1' level='3' type='both' size='3' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
    <memorytune vcpus='1'>
      <node id='0' bandwidth='20'/>
      <node id='1' bandwidth='40'/>
    </memorytune>
  </cputune>

5. try to start the guest:

# virsh start vm1
error: Failed to start domain vm1
error: unsupported configuration: Not enough room for allocation of 20% bandwidth on node 0, available bandwidth 0%

Check the libvirtd debug log:

2019-03-07 06:42:56.691+0000: 29277: debug : virResctrlAllocCreate:2400 : Writing resctrl schemata 'L3:0=700;1=700
' into '/sys/fs/resctrl/qemu-4-vm1-vcpus_2-4/schemata'

6. remove the memorytune settings and start the guest again:

# virsh dumpxml vm1
  <cputune>
    <cachetune vcpus='2-4'>
      <cache id='0' level='3' type='both' size='3' unit='MiB'/>
      <cache id='1' level='3' type='both' size='3' unit='MiB'/>
      <monitor level='3' vcpus='2-3'/>
      <monitor level='3' vcpus='4'/>
    </cachetune>
  </cputune>

# virsh start vm1
Domain vm1 started

7. check the resctrl schemata and note that MB has the default value of 100%:

# cat /sys/fs/resctrl/qemu-5-vm1-vcpus_2-4/schemata
L3:0=700;1=700
MB:0=100;1=100

libvirtd debug log:

2019-03-07 06:43:50.065+0000: 29277: debug : virResctrlAllocCreate:2400 : Writing resctrl schemata 'L3:0=700;1=700
' into '/sys/fs/resctrl/qemu-5-vm1-vcpus_2-4/schemata'

Actual results:
Cannot start the guest with a valid config in step 5.

Expected results:
The guest starts successfully in step 5, or a useful error is reported.

Additional info:
From the debug log you can see that libvirt only sets the L3 part and there is no MB line, but the kernel adds a default MB value of 100%, and this causes the check in virResctrlAllocMemoryBandwidth to fail when libvirt tries to create another directory.
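To illustrate the "Additional info" above without involving libvirt, here is a minimal sketch using a hypothetical group name "demo" (it assumes resctrl is mounted as in step 2): a newly created group gets an MB line with the kernel default of 100% even if only an L3 line is ever written to it, and this libvirt version's virResctrlAllocMemoryBandwidth apparently treats that 100% as already-allocated bandwidth, which matches the "available bandwidth 0%" error in step 5.

# mkdir /sys/fs/resctrl/demo
# echo "L3:0=700;1=700" > /sys/fs/resctrl/demo/schemata   # write only the L3 part, like libvirt does for a cachetune-only group
# cat /sys/fs/resctrl/demo/schemata                       # an MB:0=100;1=100 line is there anyway, supplied by the kernel
# rmdir /sys/fs/resctrl/demo                              # clean up the demo group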