Description of problem:
Cannot start a guest that has both cachetune and memorytune settings.

Version-Release number of selected component (if applicable):
libvirt-5.0.0-6.module+el8+2860+4e0fe96a.x86_64

How reproducible:
100%

Steps to Reproduce:
1. prepare a host that supports Intel RDT

2. mount the resctrl dir:
# mount -t resctrl resctrl /sys/fs/resctrl
# echo "L3:0=0ff;1=0ff" > /sys/fs/resctrl/schemata

3. restart libvirtd

4. prepare an inactive guest that has cachetune and memorytune elements like this:

<cputune>
  <cachetune vcpus='2-4'>
    <cache id='0' level='3' type='both' size='3' unit='MiB'/>
    <cache id='1' level='3' type='both' size='3' unit='MiB'/>
    <monitor level='3' vcpus='2-3'/>
    <monitor level='3' vcpus='4'/>
  </cachetune>
  <memorytune vcpus='1'>
    <node id='0' bandwidth='20'/>
    <node id='1' bandwidth='40'/>
  </memorytune>
</cputune>

5. try to start the guest:
# virsh start vm1
error: Failed to start domain vm1
error: unsupported configuration: Not enough room for allocation of 20% bandwidth on node 0, available bandwidth 0%

Check the libvirtd debug log:
2019-03-07 06:42:56.691+0000: 29277: debug : virResctrlAllocCreate:2400 : Writing resctrl schemata 'L3:0=700;1=700
' into '/sys/fs/resctrl/qemu-4-vm1-vcpus_2-4/schemata'

6. remove the memorytune settings and start the guest again:
# virsh dumpxml vm1
<cputune>
  <cachetune vcpus='2-4'>
    <cache id='0' level='3' type='both' size='3' unit='MiB'/>
    <cache id='1' level='3' type='both' size='3' unit='MiB'/>
    <monitor level='3' vcpus='2-3'/>
    <monitor level='3' vcpus='4'/>
  </cachetune>
</cputune>
# virsh start vm1
Domain vm1 started

7. check the resctrl schemata; note that MB has the default value of 100%:
# cat /sys/fs/resctrl/qemu-5-vm1-vcpus_2-4/schemata
L3:0=700;1=700
MB:0=100;1=100

libvirtd debug log:
2019-03-07 06:43:50.065+0000: 29277: debug : virResctrlAllocCreate:2400 : Writing resctrl schemata 'L3:0=700;1=700
' into '/sys/fs/resctrl/qemu-5-vm1-vcpus_2-4/schemata'

Actual results:
Cannot start the guest with a valid config in step 5.

Expected results:
The guest starts successfully in step 5, or a useful error is reported.

Additional info:
From the debug log you can see that libvirt only sets the L3 part and writes no MB line, but the kernel adds a default MB value of 100%, which causes the check in virResctrlAllocMemoryBandwidth to fail when trying to create another dir.
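The over-strict check can be modeled in a few lines of Python. This is only a sketch, not libvirt's actual code: `parse_mb` and `free_bandwidth` are hypothetical helpers approximating what virResctrlAllocMemoryBandwidth does, under the assumption that it treats every existing group's MB percentage as an exclusive allocation:

```python
def parse_mb(schemata: str) -> dict:
    # Extract the MB line of a resctrl schemata file,
    # e.g. "MB:0=100;1=100", into {node_id: percent}.
    for line in schemata.splitlines():
        if line.startswith("MB:"):
            return {int(k): int(v) for k, v in
                    (part.split("=") for part in line[3:].split(";"))}
    return {}

def free_bandwidth(existing_groups: list, node: int) -> int:
    # Over-strict model (assumption): subtract every group's MB
    # percentage from 100%, as if it were an exclusive allocation.
    return 100 - sum(parse_mb(s).get(node, 0) for s in existing_groups)

# libvirt wrote only the L3 line for the cachetune-only group, but the
# kernel filled in MB:0=100;1=100 for it ...
groups = ["L3:0=700;1=700\nMB:0=100;1=100"]

# ... so nothing is left for the 20% memorytune request on node 0:
print(free_bandwidth(groups, 0))  # 0 -> "available bandwidth 0%"
```

This reproduces the shape of the step 5 failure: the default filled in by the kernel, not any real consumer, is what exhausts the modeled pool.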
(In reply to Luyao Huang from comment #0)

Hi, thanks for reporting this. I don't have access to a machine with MBA support, so I cannot check that, but I'm guessing you have yet another resctrl group (not the default /sys/fs/resctrl, but some subdirectory of that) which has 100% allocated. It works roughly the same way as the L3 allocation. Even though libvirt could just write `MB:0=20;1=20` in /schemata, it would not _guarantee_ that it gets the memory bandwidth allocated. I, personally, am not sure how to interpret the documentation about bandwidth allocation, not only for libvirt, but also for the kernel.

Let me know what allocations you have in the system; something like the output of:

find /sys/fs/resctrl -name schemata | xargs head

or similar.
(In reply to Martin Kletzander from comment #1)
> (In reply to Luyao Huang from comment #0)
> Hi, thanks for reporting this. I don't have access to machine with MBA
> support, so I cannot check that, but I'm guessing you have yet another
> resctrl group (not the default /sys/fs/resctrl, but some subdirectory o
> that) which has 100% allocated.
> It works roughly the same way as the L3 allocation. Even though libvirt
> could just write `MB:0=20;1=20` in /schemata, it would not _guarantee_ that
> it gets the memory bandwidth allocated. I, personally, am not sure how to
> perceive the documentation about bandwidth allocation, not only for libvirt,
> but also for kernel. Let me know what are the allocations you have in the
> system; something like the output of:
>
> find /sys/fs/resctrl -name schemata | xargs head
>
> or similar.

Hi Martin,

I only mounted resctrl once, under /sys/fs/resctrl, and this is the output with no active VM:

# find /sys/fs/resctrl -name schemata | xargs head
L3:0=0ff;1=0ff
MB:0=100;1=100

And this is the result when more than one machine sets memorytune and the total bandwidth is over 100%:

# find /sys/fs/resctrl -name schemata | xargs head
==> /sys/fs/resctrl/qemu-15-vm2-vcpus_2-4/schemata <==
L3:0=200;1=200
MB:0=100;1=100

==> /sys/fs/resctrl/qemu-14-vm1-vcpus_2-4/schemata <==
L3:0=100;1=100
MB:0= 20;1= 40

==> /sys/fs/resctrl/schemata <==
L3:0=0ff;1=0ff
MB:0=100;1=100

==> /sys/fs/resctrl/qemu-14-vm1-vcpus_1/schemata <==
L3:0=0ff;1=0ff
MB:0= 20;1= 40

==> /sys/fs/resctrl/qemu-15-vm2-vcpus_1/schemata <==
L3:0=0ff;1=0ff
MB:0= 20;1= 40

You can see that the total bandwidth is over 100% and both guests start successfully. And I am not sure why libvirt checks the total amount of memory bandwidth, since it is a speed limitation, like the mem cgroup.

Also, from the kernel resctrl document (https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt):

2.
Same bandwidth percentage may mean different actual bandwidth depending on # of threads:

For the same SKU in #1, a 'single thread, with 10% bandwidth' and '4 thread, with 10% bandwidth' can consume upto 10GBps and 40GBps although they have same percentage bandwidth of 10%. This is simply because as threads start using more cores in an rdtgroup, the actual bandwidth may increase or vary although user specified bandwidth percentage is same.

If we set 4 vcpus in one group with 10% bandwidth, they can consume 40% of the bandwidth. In that case one VM can consume more than 100% bandwidth if the user puts all vcpus in one group, so our MB total check does not make sense. Obviously this is just my opinion; corrections are most welcome.

Thanks
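The quoted point can be put into numbers with a small sketch. The function name and the 100 GBps per-thread capacity are illustrative assumptions; only the "10% setting, 1 vs 4 threads, 10 vs 40 GBps" relationship comes from the kernel document:

```python
def max_consumption_gbps(threads: int, percent: int,
                         per_thread_gbps: float = 100.0) -> float:
    # The MBA percentage caps each thread independently, so a group's
    # actual ceiling scales with how many threads are running in it.
    return threads * per_thread_gbps * percent / 100.0

# Same 10% setting, very different ceilings (per the kernel doc example):
assert max_consumption_gbps(1, 10) == 10.0   # single thread: up to 10 GBps
assert max_consumption_gbps(4, 10) == 40.0   # four threads: up to 40 GBps
```

This is why summing the per-group percentages to 100% across the host, as libvirt did, does not bound actual memory bandwidth consumption.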
And another interesting issue is that merely deleting a dir under /sys/fs/resctrl does not clear the data in the MSRs, and this affects the next dir we create:

1. prepare a guest and set MB in the XML:

<cputune>
  <cachetune vcpus='2-4'>
    <cache id='0' level='3' type='both' size='2' unit='MiB'/>
    <cache id='1' level='3' type='both' size='2' unit='MiB'/>
    <monitor level='3' vcpus='2-3'/>
    <monitor level='3' vcpus='4'/>
  </cachetune>
  <memorytune vcpus='2-4'>
    <node id='0' bandwidth='20'/>
    <node id='1' bandwidth='40'/>
  </memorytune>
  <memorytune vcpus='1'>
    <node id='0' bandwidth='20'/>
    <node id='1' bandwidth='40'/>
  </memorytune>
</cputune>

2. start the guest and check resctrl:

# find /sys/fs/resctrl -name schemata | xargs head
==> /sys/fs/resctrl/qemu-16-vm1-vcpus_2-4/schemata <==
L3:0=100;1=100
MB:0= 20;1= 40

==> /sys/fs/resctrl/schemata <==
L3:0=0ff;1=0ff
MB:0=100;1=100

==> /sys/fs/resctrl/qemu-16-vm1-vcpus_1/schemata <==
L3:0=0ff;1=0ff
MB:0= 20;1= 40

3. destroy the guest and check resctrl:

# find /sys/fs/resctrl -name schemata | xargs head
L3:0=0ff;1=0ff
MB:0=100;1=100

4. use pqos to read the MSRs; you will find the data has not been cleared:

# pqos -s
NOTE: Mixed use of MSR and kernel interfaces to manage CAT or CMT & MBM may lead to unexpected behavior.
WARN: resctl filesystem mounted!
Using MSR interface may corrupt resctrl filesystem and cause unexpected behaviour
L3CA/MBA COS definitions for Socket 0:
    L3CA COS0 => MASK 0xff
    L3CA COS1 => MASK 0x100
    L3CA COS2 => MASK 0xff
    L3CA COS3 => MASK 0x200
    L3CA COS4 => MASK 0xff
    L3CA COS5 => MASK 0x7ff
    L3CA COS6 => MASK 0x7ff
    L3CA COS7 => MASK 0x7ff
    L3CA COS8 => MASK 0x7ff
    L3CA COS9 => MASK 0x7ff
    L3CA COS10 => MASK 0x7ff
    L3CA COS11 => MASK 0x7ff
    L3CA COS12 => MASK 0x7ff
    L3CA COS13 => MASK 0x7ff
    L3CA COS14 => MASK 0x7ff
    L3CA COS15 => MASK 0x7ff
    MBA COS0 => 100% available
    MBA COS1 => 20% available
    MBA COS2 => 20% available
    MBA COS3 => 100% available
    MBA COS4 => 20% available
    MBA COS5 => 100% available
    MBA COS6 => 100% available
    MBA COS7 => 100% available
L3CA/MBA COS definitions for Socket 1:
    L3CA COS0 => MASK 0xff
    L3CA COS1 => MASK 0x100
    L3CA COS2 => MASK 0xff
    L3CA COS3 => MASK 0x200
    L3CA COS4 => MASK 0xff
    L3CA COS5 => MASK 0x7ff
    L3CA COS6 => MASK 0x7ff
    L3CA COS7 => MASK 0x7ff
    L3CA COS8 => MASK 0x7ff
    L3CA COS9 => MASK 0x7ff
    L3CA COS10 => MASK 0x7ff
    L3CA COS11 => MASK 0x7ff
    L3CA COS12 => MASK 0x7ff
    L3CA COS13 => MASK 0x7ff
    L3CA COS14 => MASK 0x7ff
    L3CA COS15 => MASK 0x7ff
    MBA COS0 => 100% available
    MBA COS1 => 40% available
    MBA COS2 => 40% available
    MBA COS3 => 100% available
    MBA COS4 => 40% available
    MBA COS5 => 100% available
    MBA COS6 => 100% available
    MBA COS7 => 100% available

5. change the guest XML and remove the MB settings:

<cputune>
  <cachetune vcpus='2-4'>
    <cache id='0' level='3' type='both' size='2' unit='MiB'/>
    <cache id='1' level='3' type='both' size='2' unit='MiB'/>
    <monitor level='3' vcpus='2-3'/>
    <monitor level='3' vcpus='4'/>
  </cachetune>
</cputune>

6. start the guest and recheck resctrl:

# find /sys/fs/resctrl -name schemata | xargs head
==> /sys/fs/resctrl/schemata <==
L3:0=0ff;1=0ff
MB:0=100;1=100

==> /sys/fs/resctrl/qemu-17-vm1-vcpus_2-4/schemata <==
L3:0=100;1=100
MB:0= 20;1= 40

You can see that the MB values are still the same as last time.
Thanks for the detailed info. I agree that a lot of it does not make sense.

The default values after removing and re-creating the groups are something that happens with CAT values as well, and that is why we need to write the defaults there even if they are not mentioned in the XML. However, for CAT it is easy to do that; with MBA it does not make sense. Anyway, we should still fix this at least in a way that makes some sense, even if we cannot do it perfectly.

The other thing, not allowing bigger bandwidth than 100% for the whole machine, is wrong because a) it does not take the default group settings into account and b) it is only a limitation, not an allocation (although it is impossible to find that in any documentation). The fact that it is per-thread is something I would not consider an issue after both (a) and (b) are handled.

I'll send a patch for those.
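The "write the defaults" part of the planned fix can be sketched in a few lines. This is a hypothetical Python stand-in for libvirt's schemata formatting, not the actual code: always emit an MB line, filling in 100% for every node the XML does not mention, instead of relying on whatever the kernel (or stale MSR state) supplies:

```python
def format_schemata(l3_masks: dict, mb_requests: dict,
                    nodes: tuple = (0, 1)) -> str:
    # Cache line: only what was requested in the XML.
    l3 = "L3:" + ";".join(f"{n}={l3_masks[n]}" for n in sorted(l3_masks))
    # MBA line: always written, defaulting unmentioned nodes to 100%,
    # so values left over from a previously deleted group cannot leak in.
    mb = "MB:" + ";".join(f"{n}={mb_requests.get(n, 100)}" for n in nodes)
    return l3 + "\n" + mb + "\n"

# cachetune-only group: the MB defaults are written out explicitly
print(format_schemata({0: "700", 1: "700"}, {}))
# memorytune group for vcpu 1
print(format_schemata({0: "0ff", 1: "0ff"}, {0: 20, 1: 40}))
```

Together with dropping the free-bandwidth accounting for MB (a limitation, not an allocation), this matches the behavior later seen in the verification log.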
Patches proposed upstream: https://www.redhat.com/archives/libvir-list/2019-March/msg00589.html
(In reply to Martin Kletzander from comment #4)
> Thanks for the detailed info. I agree that lot of it does not make sense.
> The default values after removing and re-creating the groups is something
> that happens with CAT values as well and it is why we need to write the
> defaults there even if they are not mentioned in the XML However for CAT it
> is easy to do that, with MBA it does not make sense. Anyway, we should
> still fix this at least in a way that will make some sense, even if we
> cannot do that perfectly. The other thing of no allowing bigger bandwidth
> than 100% for the whole machine is wrong because a) it does not take the
> default group settings into account and b) it is only a limitation, not an
> allocation (although it is impossible to find that in any documentation).
> The fact that it is per-thread is something I would not consider an issue
> after both (a) and (b) are handled. I'll send a patch for those.

Agreed, but I am still wondering whether it is a kernel bug that the MSRs are not reset after the dir is removed.

(In reply to Martin Kletzander from comment #5)
> Patches proposed upstream:
>
> https://www.redhat.com/archives/libvir-list/2019-March/msg00589.html

Thanks a lot for your quick fix. Before creating a build to test it, I noticed what may be a small problem in the first patch:

+    for (i = 0; i < src_bw->nbandwidths; i++) {
+        if (dst_bw->bandwidths[i]) {
+            *dst_bw->bandwidths[i] = 123;    <---- here
+            continue;
+        }

I guess you forgot to remove this line (along with the curly braces) before sending the patch :)
(In reply to Luyao Huang from comment #6)

I believe it is. However, it is the same as with CAT, where we had to do the workaround in order not to wait for the kernel fix to land. If you submit the patch, though, I will fully support that ;)

And thanks for noting the thing in the patch; yes, that was there for testing.
Fixed upstream with v5.1.0-173-g408aeebcef1e..v5.1.0-175-gbf8c8755dc8a:

commit 408aeebcef1e81e55bebb4f2d47403d04ee16c0f
Author: Martin Kletzander <mkletzan>
Date:   Mon Mar 11 10:23:10 2019 +0100

    resctrl: Do not calculate free bandwidth for MBA

commit ceb6725d945490a7153cf8b10ad3cd972d3f1c16
Author: Martin Kletzander <mkletzan>
Date:   Mon Mar 11 11:13:32 2019 +0100

    resctrl: Set MBA defaults properly

commit bf8c8755dc8a6d53632b90aa79ba546594714264
Author: Martin Kletzander <mkletzan>
Date:   Tue Mar 12 09:53:48 2019 +0100

    resctrl: Fix testing line
(In reply to Martin Kletzander from comment #7)
> (In reply to Luyao Huang from comment #6)
> I believe it is. However it is the same with CAT where we had to do the
> workaround in order not to wait for the kernel fix to land. If you submit
> the patch, though, I will fully support that ;) And thanks for noting the
> thing in the patch, yes, that was there for testing.

Hmmm... you mean submit a patch to the kernel? I will give it a try, but don't expect too much ;) Anyway, I will file a kernel bug to track this issue and see if I can get some code-level support from a kernel developer.
(In reply to Luyao Huang from comment #11)

I meant "create the BZ", not "submit the patch"; my bad, sorry. I was doing too many things at once.

Moving to POST.
Verify this bug with libvirt-5.0.0-10.module+el8.0.1+3363+49e420ce.x86_64:

S1: start a guest with cachetune + memorytune

1. prepare a host that supports Intel RDT

2. mount the resctrl dir:
# mount -t resctrl resctrl /sys/fs/resctrl
# echo "L3:0=0ff;1=0ff" > /sys/fs/resctrl/schemata

3. restart libvirtd

4. prepare an inactive guest that has cachetune and memorytune elements like this:

<cputune>
  <cachetune vcpus='2-4'>
    <cache id='0' level='3' type='both' size='3' unit='MiB'/>
    <cache id='1' level='3' type='both' size='3' unit='MiB'/>
    <monitor level='3' vcpus='2-3'/>
    <monitor level='3' vcpus='4'/>
  </cachetune>
  <memorytune vcpus='1'>
    <node id='0' bandwidth='20'/>
    <node id='1' bandwidth='40'/>
  </memorytune>
</cputune>

5. start the guest:
# virsh start vm1
Domain vm1 started

6. check the guest live XML:
# virsh dumpxml vm1
<cputune>
  <cachetune vcpus='2-4' id='vcpus_2-4'>
    <cache id='0' level='3' type='both' size='3' unit='MiB'/>
    <cache id='1' level='3' type='both' size='3' unit='MiB'/>
    <monitor level='3' vcpus='2-3'/>
    <monitor level='3' vcpus='4'/>
  </cachetune>
  <memorytune vcpus='2-4' id='vcpus_2-4'>
    <node id='0' bandwidth='100'/>
    <node id='1' bandwidth='100'/>
  </memorytune>
  <memorytune vcpus='1' id='vcpus_1'>
    <node id='0' bandwidth='20'/>
    <node id='1' bandwidth='40'/>
  </memorytune>
</cputune>

7. check the libvirtd debug log and verify that libvirt sets the default values:
2019-06-13 09:30:42.502+0000: 14354: debug : virResctrlAllocCreate:2414 : Writing resctrl schemata 'L3:0=700;1=700
MB:0=100;1=100
' into '/sys/fs/resctrl/qemu-3-vm1-vcpus_2-4/schemata'
2019-06-13 09:30:42.503+0000: 14354: debug : virResctrlAllocCreate:2414 : Writing resctrl schemata 'L3:0=0ff;1=0ff
MB:0=20;1=40
' into '/sys/fs/resctrl/qemu-3-vm1-vcpus_1/schemata'
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2395