Bug 1976690
Summary: | virsh numatune cmd does not work well in rhel9 | |
---|---|---|---
Product: | Red Hat Enterprise Linux 9 | Reporter: | Jing Qi <jinqi>
Component: | libvirt | Assignee: | Michal Privoznik <mprivozn>
libvirt sub component: | General | QA Contact: | Jing Qi <jinqi>
Status: | CLOSED CURRENTRELEASE | Docs Contact: |
Severity: | high | |
Priority: | unspecified | CC: | jdenemar, jsuchane, mprivozn, phrdina, virt-maint, xuzhang
Version: | 9.0 | Keywords: | Regression, Upstream
Target Milestone: | beta | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | libvirt-7.6.0-1.el9 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
: | 1980430 (view as bug list) | |
Last Closed: | 2021-12-07 21:57:54 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: | 7.6.0
Embargoed: | | |
Bug Depends On: | 1980430 | |
Bug Blocks: | | |
Description
Jing Qi
2021-06-28 04:09:28 UTC
Michal, please have a look. Thanks.

The problem is that libvirt doesn't see the cpuset controller. I mean, when starting a VM we have this qemuProcessHook() function which does the following:

    if (virDomainNumatuneGetMode(h->vm->def->numa, -1, &mode) == 0) {
        if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT &&
            h->cfg->cgroupControllers & (1 << VIR_CGROUP_CONTROLLER_CPUSET) &&
            virCgroupControllerAvailable(VIR_CGROUP_CONTROLLER_CPUSET)) {
            /* Use virNuma* API iff necessary. Once set and child is exec()-ed,
             * there's no way for us to change it. Rely on cgroups (if available
             * and enabled in the config) rather than virNuma*. */
            VIR_DEBUG("Relying on CGroups for memory binding");
        } else {
            nodeset = virDomainNumatuneGetNodeset(h->vm->def->numa,
                                                  priv->autoNodeset, -1);

            if (virNumaSetupMemoryPolicy(mode, nodeset) < 0)
                goto cleanup;
        }
    }

And the problem here is that we take the else branch, simply because virCgroupControllerAvailable() returns false. Why? Because it looks into /sys/fs/cgroup/user.slice/user-1000.slice/session-*.scope/cgroup.controllers, where only a subset of controllers is available (in my case that's memory and pids). However, the machine slice has a broader set of controllers available:

    # cat /sys/fs/cgroup/machine.slice/machine-qemu*.scope/cgroup.controllers
    cpuset cpu io memory pids

Let me see if I can fix this.

The problem with taking the else branch is that the process will use the numa_* APIs to set NUMA affinity (before exec()-ing qemu), and those work on a different level than the cpuset controller. Thus, no matter how hard we try to set cpuset.mems, we won't change the NUMA affinity.

The assessment above is not completely correct. The libvirtd process will look into /sys/fs/cgroup/system.slice/libvirtd.service/cgroup.controllers, but that's not important as it will still see only the memory and pids controllers.

The code above has access to the VM's qemuDomainObjPrivate, so I'm wondering if we could use

    virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET)

instead of

    virCgroupControllerAvailable(VIR_CGROUP_CONTROLLER_CPUSET)

as it would look into the VM cgroup, where the cpuset controller should be enabled.
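As background for the check being discussed: on a cgroup-v2-only host, a controller can only be used from a cgroup if it is listed in that cgroup's cgroup.controllers file. The following is a minimal standalone sketch, written for this report and not part of libvirt, that contrasts what the calling process's own cgroup advertises with what the cgroup root advertises; that difference is what makes virCgroupControllerAvailable() return false for cpuset here.

    /* Standalone illustration (not libvirt code): compare the controllers
     * advertised by the calling process's own cgroup with those advertised
     * by the cgroup v2 root. On a systemd host, a service such as
     * libvirtd.service typically sees only "memory pids", while the root
     * lists cpuset as well. Build with: cc -o cgcheck cgcheck.c */
    #include <stdio.h>
    #include <stdbool.h>
    #include <string.h>

    /* Extract the unified-hierarchy path ("0::<path>") of the current process. */
    static bool own_cgroup_path(char *buf, size_t buflen)
    {
        FILE *fp = fopen("/proc/self/cgroup", "r");
        char line[256];
        bool found = false;

        if (!fp)
            return false;
        while (fgets(line, sizeof(line), fp)) {
            if (strncmp(line, "0::", 3) == 0) {
                line[strcspn(line, "\n")] = '\0';
                snprintf(buf, buflen, "%s", line + 3);
                found = true;
                break;
            }
        }
        fclose(fp);
        return found;
    }

    /* Naive check: is @ctrl listed in <cgroup>/cgroup.controllers?
     * (@cgpath of "" means the root cgroup.) */
    static bool has_controller(const char *cgpath, const char *ctrl)
    {
        char path[512];
        char list[512] = "";
        FILE *fp;

        snprintf(path, sizeof(path), "/sys/fs/cgroup%s/cgroup.controllers", cgpath);
        if (!(fp = fopen(path, "r")))
            return false;
        if (!fgets(list, sizeof(list), fp))
            list[0] = '\0';
        fclose(fp);
        return strstr(list, ctrl) != NULL;  /* substring match is enough for "cpuset" */
    }

    int main(void)
    {
        char own[256];

        if (own_cgroup_path(own, sizeof(own)))
            printf("own cgroup %s: cpuset %slisted\n",
                   own, has_controller(own, "cpuset") ? "" : "NOT ");
        printf("root cgroup: cpuset %slisted\n",
               has_controller("", "cpuset") ? "" : "NOT ");
        return 0;
    }

Run from inside a systemd service or user session, the first line typically reports cpuset as not listed, matching the behaviour described above.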
(In reply to Pavel Hrdina from comment #3)
> The assessment above is not completely correct. Process libvirtd will look
> into /sys/fs/cgroup/system.slice/libvirtd.service/cgroup.controllers but
> that's not important as it will still see only memory and pids controllers.

Oh, my bad. I had started libvirtd from the command line, thus the misleading path. Anyway, I started it via systemctl, attached strace, and this is the path I got:

    [pid 1399] openat(AT_FDCWD, "/proc/self/cgroup", O_RDONLY) = 3
    [pid 1399] newfstatat(3, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
    [pid 1399] openat(AT_FDCWD, "/sys/fs/cgroup/system.slice/cgroup.controllers", O_RDONLY) = 3

But you're correct that this is the libvirtd cgroup:

    # cat /proc/$(pgrep libvirtd)/cgroup
    0::/system.slice/libvirtd.service

Anyway, both files have only the 'memory pids' controllers:

    # tail /sys/fs/cgroup/system.slice/libvirtd.service/cgroup.controllers /sys/fs/cgroup/system.slice/cgroup.controllers
    ==> /sys/fs/cgroup/system.slice/libvirtd.service/cgroup.controllers <==
    memory pids

    ==> /sys/fs/cgroup/system.slice/cgroup.controllers <==
    memory pids

> The code above has access to VM qemuDomainObjPrivate so I'm wondering if we
> could use:
>
> virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET)
>
> instead of
>
> virCgroupControllerAvailable(VIR_CGROUP_CONTROLLER_CPUSET)
>
> as it would look into the VM cgroup where cpuset controller should be
> enabled.

Unfortunately, this won't be possible because priv->cgroup is NULL at this point. I mean, CGroup detection happens only after the hook is done. And even if it didn't, it surely can't happen before virCommandRun() (there's no process to detect CGroups for), and as soon as the child forks off, any modifications to priv->cgroup done in the parent (libvirtd) are invisible to the child (qemu).

But it looks like /sys/fs/cgroup/cgroup.controllers is the only file that contains cpuset:

    # tail $(find /sys/fs/cgroup/ -name cgroup.controllers)

    ==> /sys/fs/cgroup/cgroup.controllers <==
    cpuset cpu io memory hugetlb pids rdma misc

Of course, if there's a machine running then machine.slice mysteriously gains the cpuset controller too:

    ==> /sys/fs/cgroup/machine.slice/cgroup.controllers <==
    cpuset cpu io memory pids

and so do many other slices.
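The "mysteriously gains" observation follows from cgroup v2 delegation rules: a controller appears in a child's cgroup.controllers only when the parent lists it in its cgroup.subtree_control, and systemd presumably enables cpuset along the machine.slice path once a machine scope that needs it is created. A small sketch (written for this report, not part of libvirt) that walks from a given cgroup up to the root and prints both files at each level makes that delegation chain visible:

    /* Sketch (not libvirt code): show why a controller "appears" in a slice.
     * In cgroup v2 a controller is listed in a child's cgroup.controllers
     * only if the parent lists it in cgroup.subtree_control. Walk from the
     * given cgroup (argv[1], e.g. "/machine.slice") up to the root and print
     * both files at each level. Build with: cc -o cgwalk cgwalk.c */
    #include <stdio.h>
    #include <string.h>

    static void print_file(const char *dir, const char *name)
    {
        char path[512];
        char buf[512] = "";
        FILE *fp;

        snprintf(path, sizeof(path), "/sys/fs/cgroup%s/%s", dir, name);
        if ((fp = fopen(path, "r"))) {
            if (!fgets(buf, sizeof(buf), fp))
                buf[0] = '\0';
            buf[strcspn(buf, "\n")] = '\0';
            fclose(fp);
        }
        printf("  %-24s %s\n", name, buf);
    }

    int main(int argc, char **argv)
    {
        char cg[256];

        snprintf(cg, sizeof(cg), "%s", argc > 1 ? argv[1] : "/machine.slice");

        for (;;) {
            char *slash;

            printf("/sys/fs/cgroup%s:\n", cg);
            print_file(cg, "cgroup.controllers");
            print_file(cg, "cgroup.subtree_control");

            if (cg[0] == '\0')        /* just printed the root; done */
                break;
            slash = strrchr(cg, '/'); /* step up one level */
            *slash = '\0';
        }
        return 0;
    }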
(In reply to Michal Privoznik from comment #4)
...
> Unfortunately, this won't be possible because priv->cgroup is NULL at this
> point. I mean, CGroup detection happens only after the hook is done. And
> even if it didn't, it surely can't happen before virCommandRun() (there's
> no process to detect CGroups for), and as soon as the child forks off, any
> modifications to priv->cgroup done in the parent (libvirtd) are invisible
> to the child (qemu).

Right, I thought there was a reason not to use it in the first place but I was too lazy to investigate it :). Looking into the code, it should be safe to change virCgroupControllerAvailable() to check whether the root cgroup has the controller available: with cgroups v1 there should be no difference, and with cgroups v2 it should fix this case. It will affect similar code for LXC as well, virLXCControllerSetupResourceLimits, which will also be fixed.

Alright, I think we've run into a combination of problems. The first one was that libvirt used the numactl APIs to set NUMA affinity instead of CGroups. However, after I fixed that bug I can see cpuset.mems being updated, but the memory is not migrated between the nodes. I can reproduce this issue even without libvirt, therefore I'll clone this bug over to kernel for further investigation.

Anyway, I've posted a patch for the libvirt part of the problem:

    https://listman.redhat.com/archives/libvir-list/2021-July/msg00170.html

Merged upstream as:

    c159db4cc0 vircgroup: Improve virCgroupControllerAvailable wrt to CGroupsV2
    v7.5.0-33-gc159db4cc0

The scratch build can be found here:

    https://kojihub.stream.rdu2.redhat.com/koji/taskinfo?taskID=426781

If that expires, I've saved the RPMs here:

    https://mprivozn.fedorapeople.org/numatune/

I tried with the scratch build, but it still failed to migrate the memory with it. Is it caused by the dependent bug 1980430, which is still in New status? @Michal

(In reply to Jing Qi from comment #10)
> I tried with the scratch build, but it still failed to migrate the memory
> with it. Is it caused by the dependent bug 1980430 which is still in New
> status? @Michal

Yes, that's exactly why. The change in libvirt is subtle - previously, libvirt would use numa_set_membind() (which is just a wrapper over the set_mempolicy syscall), but after the fix it relies on CGroups (which, as we found, don't work either). The problem with using numa_set_membind() is that a process can change its NUMA preferences only for itself, not for others. Therefore, libvirt can call numa_set_membind() in the forked process just before exec()-ing QEMU, but after that it would be QEMU who would need to call numa_set_membind() (e.g. as a result of a monitor command). There was never even an intention to implement that in QEMU; what was implemented instead was CGroups, which allow libvirt to change the NUMA location without any modification on the QEMU side.

Long story short, with that scratch build you should see the following debug message:

    VIR_DEBUG("Relying on CGroups for memory binding");

while with the broken build you shouldn't see it. And as soon as the kernel bug is fixed, the whole problem is fixed.
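The numa_set_membind() point can be made concrete with a short libnuma sketch (illustrative only, not libvirt code): the call changes the memory policy of the calling process, the policy survives fork()/exec(), and nothing outside the process can change it afterwards, which is exactly why the cgroup cpuset.mems route is needed for "virsh numatune" on a running domain.

    /* Illustration (not libvirt code) of the numa_set_membind() approach and
     * its limitation: the policy applies to the *calling* process and is kept
     * across exec(), but no other process can change it later. Requires
     * libnuma; build with: cc -o membind membind.c -lnuma */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <numa.h>

    int main(int argc, char **argv)
    {
        struct bitmask *nodes;
        int node = argc > 1 ? atoi(argv[1]) : 0;   /* node to bind to */

        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        /* Build a nodemask containing just the requested node and apply it.
         * This is roughly the kind of pre-exec binding discussed above. */
        nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, node);
        numa_set_membind(nodes);
        numa_bitmask_free(nodes);

        printf("memory now bound to node %d; an exec()-ed child keeps it,\n"
               "but only this process (or its exec()-ed image) can undo it\n",
               node);

        /* Optionally exec a command given on the command line; the binding
         * survives the exec, just like it would for the exec()-ed QEMU. */
        if (argc > 2) {
            execvp(argv[2], &argv[2]);
            perror("execvp");
            return 1;
        }
        return 0;
    }

For example, `./membind 1 numactl --show` should report a policy restricted to node 1 for the exec()-ed numactl.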
(In reply to Michal Privoznik from comment #11)
...
> Long story short, with that scratch build you should see the following debug
> message:
>
> VIR_DEBUG("Relying on CGroups for memory binding");
>
> While with the broken build, you shouldn't see it. And as soon as Kernel bug
> is fixed the whole problem is fixed.

I found the "Relying on CGroups for memory binding" message in the domain log. So, let's wait for the kernel bug to be fixed.

Verified with libvirt-7.6.0-2.el9.x86_64 & qemu-kvm-6.1.0-1.el9.x86_64.

Kernel version:

    # uname -a
    Linux ** 5.14.0-1.2.1.el9.x86_64 #1 SMP Fri Sep 17 04:16:18 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

1. Start a VM with numatune set to node 0:

    <maxMemory slots='16' unit='KiB'>20971520</maxMemory>
    <memory unit='KiB'>2097152</memory>
    <currentMemory unit='KiB'>2097152</currentMemory>
    <vcpu placement='static'>2</vcpu>
    <numatune>
      <memory mode='strict' nodeset='0'/>
    </numatune>
    ...
    <cpu mode='host-model' check='partial'>
      <feature policy='disable' name='vmx'/>
      <numa>
        <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/>
      </numa>
    </cpu>

    # virsh start vm1

2. Check that the memory is allocated on node 0:

    # numastat -p `pidof qemu-kvm`

    Per-node process memory usage (in MBs) for PID 226966 (qemu-kvm)
                               Node 0          Node 1           Total
                      --------------- --------------- ---------------
    Huge                         0.00            0.00            0.00
    Heap                        27.64            0.00           27.64
    Stack                        0.02            0.00            0.02
    Private                    761.22            0.00          761.22
    ----------------  --------------- --------------- ---------------
    Total                      788.87            0.00          788.87

3. Change the binding:

    # virsh numatune avocado-vt-vm1 0 1

4. The memory is re-allocated to node 1:

    # numastat -p `pidof qemu-kvm`

    Per-node process memory usage (in MBs) for PID 226966 (qemu-kvm)
                               Node 0          Node 1           Total
                      --------------- --------------- ---------------
    Huge                         0.00            0.00            0.00
    Heap                         0.00           27.64           27.64
    Stack                        0.00            0.02            0.02
    Private                      2.02          759.21          761.23
    ----------------  --------------- --------------- ---------------
    Total                        2.02          786.86          788.88
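Since the fix makes libvirt rely on the cpuset controller, a complementary check during verification is to look at the effective cpuset memory nodes of the QEMU process's cgroup before and after the `virsh numatune` call. A small sketch (written for this report, not part of libvirt; pass it whatever PID `pidof qemu-kvm` returns):

    /* Sketch for verification (not libvirt code): print cpuset.mems.effective
     * for the cgroup that a given PID lives in, e.g. the QEMU process. After
     * the fix, "virsh numatune" changes this constraint and the kernel then
     * migrates the guest memory. Build with: cc -o showmems showmems.c */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char path[512];
        char line[512];
        char cg[256] = "";
        FILE *fp;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        /* Resolve the PID's unified-hierarchy cgroup ("0::<path>"). */
        snprintf(path, sizeof(path), "/proc/%s/cgroup", argv[1]);
        if (!(fp = fopen(path, "r"))) {
            perror(path);
            return 1;
        }
        while (fgets(line, sizeof(line), fp)) {
            if (strncmp(line, "0::", 3) == 0) {
                line[strcspn(line, "\n")] = '\0';
                snprintf(cg, sizeof(cg), "%s", line + 3);
                break;
            }
        }
        fclose(fp);

        /* The file exists only where the cpuset controller is enabled;
         * the "effective" variant reflects restrictions set on ancestors. */
        snprintf(path, sizeof(path), "/sys/fs/cgroup%s/cpuset.mems.effective", cg);
        if (!(fp = fopen(path, "r"))) {
            fprintf(stderr, "%s: not readable (cpuset not enabled here?)\n", path);
            return 1;
        }
        if (fgets(line, sizeof(line), fp))
            printf("%s: %s", path, line);
        fclose(fp);
        return 0;
    }

Depending on where libvirt places the emulator threads, the restriction itself may be written to a parent cgroup; the effective value reflects the net result either way.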