Bug 1976690
Summary: | virsh numatune cmd does not work well in rhel9 | |
---|---|---|---
Product: | Red Hat Enterprise Linux 9 | Reporter: | Jing Qi <jinqi>
Component: | libvirt | Assignee: | Michal Privoznik <mprivozn>
libvirt sub component: | General | QA Contact: | Jing Qi <jinqi>
Status: | CLOSED CURRENTRELEASE | Docs Contact: |
Severity: | high | |
Priority: | unspecified | CC: | jdenemar, jsuchane, mprivozn, phrdina, virt-maint, xuzhang
Version: | 9.0 | Keywords: | Regression, Upstream
Target Milestone: | beta | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | libvirt-7.6.0-1.el9 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
: | 1980430 (view as bug list) | |
Last Closed: | 2021-12-07 21:57:54 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: | 7.6.0
Embargoed: | | |
Bug Depends On: | 1980430 | |
Bug Blocks: | | |
Description
Jing Qi
2021-06-28 04:09:28 UTC
Michal, please have a look. Thanks.

The problem is that libvirt doesn't see the cpuset controller. I mean, when starting a VM we have this qemuProcessHook() function which does the following:

    if (virDomainNumatuneGetMode(h->vm->def->numa, -1, &mode) == 0) {
        if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT &&
            h->cfg->cgroupControllers & (1 << VIR_CGROUP_CONTROLLER_CPUSET) &&
            virCgroupControllerAvailable(VIR_CGROUP_CONTROLLER_CPUSET)) {
            /* Use virNuma* API iff necessary. Once set and child is exec()-ed,
             * there's no way for us to change it. Rely on cgroups (if available
             * and enabled in the config) rather than virNuma*. */
            VIR_DEBUG("Relying on CGroups for memory binding");
        } else {
            nodeset = virDomainNumatuneGetNodeset(h->vm->def->numa,
                                                  priv->autoNodeset, -1);

            if (virNumaSetupMemoryPolicy(mode, nodeset) < 0)
                goto cleanup;
        }
    }

And the problem here is that we take the else branch, simply because virCgroupControllerAvailable() returns false. Why? Because it looks into /sys/fs/cgroup/user.slice/user-1000.slice/session-*.scope/cgroup.controllers, where only a subset of controllers is available (in my case that's memory and pids). However, the machine slice has a broader set of controllers available:

    # cat /sys/fs/cgroup/machine.slice/machine-qemu*.scope/cgroup.controllers
    cpuset cpu io memory pids

Let me see if I can fix this.

The problem with taking the else branch is that the process will use the numa_* APIs to set NUMA affinity (before exec()-ing qemu), and those work on a different level than the cpuset controller. Thus, no matter how hard we try to set cpuset.mems, we won't change the NUMA affinity.

The assessment above is not completely correct. The libvirtd process will look into /sys/fs/cgroup/system.slice/libvirtd.service/cgroup.controllers, but that's not important as it will still see only the memory and pids controllers.

The code above has access to the VM's qemuDomainObjPrivate, so I'm wondering if we could use

    virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET)

instead of

    virCgroupControllerAvailable(VIR_CGROUP_CONTROLLER_CPUSET)

as it would look into the VM cgroup, where the cpuset controller should be enabled.
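As background for the check being discussed: on a cgroup-v2-only host, a controller can only be used from a cgroup if it is listed in that cgroup's cgroup.controllers file. The following is a minimal standalone sketch, written for this report and not part of libvirt, that contrasts what the calling process's own cgroup advertises with what the cgroup root advertises; that difference is what makes virCgroupControllerAvailable() return false for cpuset here.

    /* Standalone illustration (not libvirt code): compare the controllers
     * advertised by the calling process's own cgroup with those advertised
     * by the cgroup v2 root. On a systemd host, a service such as
     * libvirtd.service typically sees only "memory pids", while the root
     * lists cpuset as well. Build with: cc -o cgcheck cgcheck.c */
    #include <stdio.h>
    #include <stdbool.h>
    #include <string.h>

    /* Extract the unified-hierarchy path ("0::<path>") of the current process. */
    static bool own_cgroup_path(char *buf, size_t buflen)
    {
        FILE *fp = fopen("/proc/self/cgroup", "r");
        char line[256];
        bool found = false;

        if (!fp)
            return false;
        while (fgets(line, sizeof(line), fp)) {
            if (strncmp(line, "0::", 3) == 0) {
                line[strcspn(line, "\n")] = '\0';
                snprintf(buf, buflen, "%s", line + 3);
                found = true;
                break;
            }
        }
        fclose(fp);
        return found;
    }

    /* Naive check: is @ctrl listed in <cgroup>/cgroup.controllers?
     * (@cgpath of "" means the root cgroup.) */
    static bool has_controller(const char *cgpath, const char *ctrl)
    {
        char path[512];
        char list[512] = "";
        FILE *fp;

        snprintf(path, sizeof(path), "/sys/fs/cgroup%s/cgroup.controllers", cgpath);
        if (!(fp = fopen(path, "r")))
            return false;
        if (!fgets(list, sizeof(list), fp))
            list[0] = '\0';
        fclose(fp);
        return strstr(list, ctrl) != NULL;  /* substring match is enough for "cpuset" */
    }

    int main(void)
    {
        char own[256];

        if (own_cgroup_path(own, sizeof(own)))
            printf("own cgroup %s: cpuset %slisted\n",
                   own, has_controller(own, "cpuset") ? "" : "NOT ");
        printf("root cgroup: cpuset %slisted\n",
               has_controller("", "cpuset") ? "" : "NOT ");
        return 0;
    }

Run from inside a systemd service or user session, the first line typically reports cpuset as not listed, matching the behaviour described above.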
(In reply to Pavel Hrdina from comment #3)
> The assessment above is not completely correct. Process libvirtd will look
> into /sys/fs/cgroup/system.slice/libvirtd.service/cgroup.controllers but
> that's not important as it will still see only memory and pids controllers.

Oh, my bad. I had started libvirtd from the command line, thus the misleading path. Anyway, I started it via systemctl, attached strace, and this is the path I got:

    [pid 1399] openat(AT_FDCWD, "/proc/self/cgroup", O_RDONLY) = 3
    [pid 1399] newfstatat(3, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
    [pid 1399] openat(AT_FDCWD, "/sys/fs/cgroup/system.slice/cgroup.controllers", O_RDONLY) = 3

But you're correct that this is the libvirtd cgroup:

    # cat /proc/$(pgrep libvirtd)/cgroup
    0::/system.slice/libvirtd.service

Anyway, both files have only the 'memory pids' controllers:

    # tail /sys/fs/cgroup/system.slice/libvirtd.service/cgroup.controllers /sys/fs/cgroup/system.slice/cgroup.controllers
    ==> /sys/fs/cgroup/system.slice/libvirtd.service/cgroup.controllers <==
    memory pids

    ==> /sys/fs/cgroup/system.slice/cgroup.controllers <==
    memory pids

> The code above has access to VM qemuDomainObjPrivate so I'm wondering if we
> could use:
>
> virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET)
>
> instead of
>
> virCgroupControllerAvailable(VIR_CGROUP_CONTROLLER_CPUSET)
>
> as it would look into the VM cgroup where cpuset controller should be
> enabled.

Unfortunately, this won't be possible because priv->cgroup is NULL at this point. I mean, CGroup detection happens only after the hook is done. And even if it didn't, it surely can't happen before virCommandRun() (there's no process to detect CGroups for), and as soon as the child forks off, any modifications to priv->cgroup done in the parent (libvirtd) are invisible to the child (qemu).

But it looks like /sys/fs/cgroup/cgroup.controllers is the only file that contains cpuset:

    # tail $(find /sys/fs/cgroup/ -name cgroup.controllers)

    ==> /sys/fs/cgroup/cgroup.controllers <==
    cpuset cpu io memory hugetlb pids rdma misc

Of course, if there's a machine running then machine.slice mysteriously gains the cpuset controller too:

    ==> /sys/fs/cgroup/machine.slice/cgroup.controllers <==
    cpuset cpu io memory pids

and so do many other slices.
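The "mysteriously gains" observation follows from cgroup v2 delegation rules: a controller appears in a child's cgroup.controllers only when the parent lists it in its cgroup.subtree_control, and systemd presumably enables cpuset along the machine.slice path once a machine scope that needs it is created. A small sketch (written for this report, not part of libvirt) that walks from a given cgroup up to the root and prints both files at each level makes that delegation chain visible:

    /* Sketch (not libvirt code): show why a controller "appears" in a slice.
     * In cgroup v2 a controller is listed in a child's cgroup.controllers
     * only if the parent lists it in cgroup.subtree_control. Walk from the
     * given cgroup (argv[1], e.g. "/machine.slice") up to the root and print
     * both files at each level. Build with: cc -o cgwalk cgwalk.c */
    #include <stdio.h>
    #include <string.h>

    static void print_file(const char *dir, const char *name)
    {
        char path[512];
        char buf[512] = "";
        FILE *fp;

        snprintf(path, sizeof(path), "/sys/fs/cgroup%s/%s", dir, name);
        if ((fp = fopen(path, "r"))) {
            if (!fgets(buf, sizeof(buf), fp))
                buf[0] = '\0';
            buf[strcspn(buf, "\n")] = '\0';
            fclose(fp);
        }
        printf("  %-24s %s\n", name, buf);
    }

    int main(int argc, char **argv)
    {
        char cg[256];

        snprintf(cg, sizeof(cg), "%s", argc > 1 ? argv[1] : "/machine.slice");

        for (;;) {
            char *slash;

            printf("/sys/fs/cgroup%s:\n", cg);
            print_file(cg, "cgroup.controllers");
            print_file(cg, "cgroup.subtree_control");

            if (cg[0] == '\0')        /* just printed the root; done */
                break;
            slash = strrchr(cg, '/'); /* step up one level */
            *slash = '\0';
        }
        return 0;
    }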
(In reply to Michal Privoznik from comment #4)
...
> Unfortunately, this won't be possible because priv->cgroup is NULL at this
> point. I mean, CGroup detection happens only after the hook is done. And
> even if it didn't, it surely can't happen before virCommandRun() (there's
> no process to detect CGroups for), and as soon as the child forks off, any
> modifications to priv->cgroup done in the parent (libvirtd) are invisible
> to the child (qemu).

Right, I thought there was a reason not to use it in the first place but I was too lazy to investigate it :). Looking into the code, it should be safe to change virCgroupControllerAvailable() to check whether the root cgroup has the controller available: with cgroups v1 there should be no difference, and with cgroups v2 it should fix this case. It will affect similar code for LXC as well, virLXCControllerSetupResourceLimits, which will also be fixed.

Alright, I think we've run into a combination of problems. The first one was that libvirt used the numactl APIs to set NUMA affinity instead of CGroups. However, after I fixed that bug I can see cpuset.mems being updated, but the memory is not migrated between the nodes. I can reproduce this issue even without libvirt, therefore I'll clone this bug over to kernel for further investigation.

Anyway, I've posted a patch for the libvirt part of the problem:

    https://listman.redhat.com/archives/libvir-list/2021-July/msg00170.html

Merged upstream as:

    c159db4cc0 vircgroup: Improve virCgroupControllerAvailable wrt to CGroupsV2
    v7.5.0-33-gc159db4cc0

The scratch build can be found here:

    https://kojihub.stream.rdu2.redhat.com/koji/taskinfo?taskID=426781

If that expires, I've saved the RPMs here:

    https://mprivozn.fedorapeople.org/numatune/

I tried with the scratch build, but it still failed to migrate the memory with it. Is it caused by the dependent bug 1980430, which is still in New status? @Michal

(In reply to Jing Qi from comment #10)
> I tried with the scratch build, but it still failed to migrate the memory
> with it. Is it caused by the dependent bug 1980430 which is still in New
> status? @Michal

Yes, that's exactly why. The change in libvirt is subtle - previously, libvirt would use numa_set_membind() (which is just a wrapper over the set_mempolicy syscall), but after the fix it relies on CGroups (which, as we found, don't work either). The problem with using numa_set_membind() is that a process can change its NUMA preferences only for itself, not for others. Therefore, libvirt can call numa_set_membind() in the forked process just before exec()-ing QEMU, but after that it would be QEMU who would need to call numa_set_membind() (e.g. as a result of a monitor command). There was never even an intention to implement that in QEMU; what was implemented instead was CGroups, which allow libvirt to change the NUMA location without any modification on the QEMU side.

Long story short, with that scratch build you should see the following debug message:

    VIR_DEBUG("Relying on CGroups for memory binding");

while with the broken build you shouldn't see it. And as soon as the kernel bug is fixed, the whole problem is fixed.
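The numa_set_membind() point can be made concrete with a short libnuma sketch (illustrative only, not libvirt code): the call changes the memory policy of the calling process, the policy survives fork()/exec(), and nothing outside the process can change it afterwards, which is exactly why the cgroup cpuset.mems route is needed for "virsh numatune" on a running domain.

    /* Illustration (not libvirt code) of the numa_set_membind() approach and
     * its limitation: the policy applies to the *calling* process and is kept
     * across exec(), but no other process can change it later. Requires
     * libnuma; build with: cc -o membind membind.c -lnuma */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <numa.h>

    int main(int argc, char **argv)
    {
        struct bitmask *nodes;
        int node = argc > 1 ? atoi(argv[1]) : 0;   /* node to bind to */

        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        /* Build a nodemask containing just the requested node and apply it.
         * This is roughly the kind of pre-exec binding discussed above. */
        nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, node);
        numa_set_membind(nodes);
        numa_bitmask_free(nodes);

        printf("memory now bound to node %d; an exec()-ed child keeps it,\n"
               "but only this process (or its exec()-ed image) can undo it\n",
               node);

        /* Optionally exec a command given on the command line; the binding
         * survives the exec, just like it would for the exec()-ed QEMU. */
        if (argc > 2) {
            execvp(argv[2], &argv[2]);
            perror("execvp");
            return 1;
        }
        return 0;
    }

For example, `./membind 1 numactl --show` should report a policy restricted to node 1 for the exec()-ed numactl.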
(In reply to Michal Privoznik from comment #11)
...
> Long story short, with that scratch build you should see the following debug
> message:
>
> VIR_DEBUG("Relying on CGroups for memory binding");
>
> While with the broken build, you shouldn't see it. And as soon as Kernel bug
> is fixed the whole problem is fixed.

I found the "Relying on CGroups for memory binding" message in the domain log. So, let's wait for the kernel bug to be fixed.

Verified with libvirt-7.6.0-2.el9.x86_64 & qemu-kvm-6.1.0-1.el9.x86_64.

Kernel version:

    # uname -a
    Linux ** 5.14.0-1.2.1.el9.x86_64 #1 SMP Fri Sep 17 04:16:18 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux

1. Start a VM with numatune set to node 0:

    <maxMemory slots='16' unit='KiB'>20971520</maxMemory>
    <memory unit='KiB'>2097152</memory>
    <currentMemory unit='KiB'>2097152</currentMemory>
    <vcpu placement='static'>2</vcpu>
    <numatune>
      <memory mode='strict' nodeset='0'/>
    </numatune>
    ...
    <cpu mode='host-model' check='partial'>
      <feature policy='disable' name='vmx'/>
      <numa>
        <cell id='0' cpus='0-1' memory='2097152' unit='KiB'/>
      </numa>
    </cpu>

    # virsh start vm1

2. Check that the memory is allocated on node 0:

    # numastat -p `pidof qemu-kvm`

    Per-node process memory usage (in MBs) for PID 226966 (qemu-kvm)
                               Node 0          Node 1           Total
                      --------------- --------------- ---------------
    Huge                         0.00            0.00            0.00
    Heap                        27.64            0.00           27.64
    Stack                        0.02            0.00            0.02
    Private                    761.22            0.00          761.22
    ----------------  --------------- --------------- ---------------
    Total                      788.87            0.00          788.87

3. Change the binding:

    # virsh numatune avocado-vt-vm1 0 1

4. The memory is re-allocated to node 1:

    # numastat -p `pidof qemu-kvm`

    Per-node process memory usage (in MBs) for PID 226966 (qemu-kvm)
                               Node 0          Node 1           Total
                      --------------- --------------- ---------------
    Huge                         0.00            0.00            0.00
    Heap                         0.00           27.64           27.64
    Stack                        0.00            0.02            0.02
    Private                      2.02          759.21          761.23
    ----------------  --------------- --------------- ---------------
    Total                        2.02          786.86          788.88
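Since the fix makes libvirt rely on the cpuset controller, a complementary check during verification is to look at the effective cpuset memory nodes of the QEMU process's cgroup before and after the `virsh numatune` call. A small sketch (written for this report, not part of libvirt; pass it whatever PID `pidof qemu-kvm` returns):

    /* Sketch for verification (not libvirt code): print cpuset.mems.effective
     * for the cgroup that a given PID lives in, e.g. the QEMU process. After
     * the fix, "virsh numatune" changes this constraint and the kernel then
     * migrates the guest memory. Build with: cc -o showmems showmems.c */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char path[512];
        char line[512];
        char cg[256] = "";
        FILE *fp;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        /* Resolve the PID's unified-hierarchy cgroup ("0::<path>"). */
        snprintf(path, sizeof(path), "/proc/%s/cgroup", argv[1]);
        if (!(fp = fopen(path, "r"))) {
            perror(path);
            return 1;
        }
        while (fgets(line, sizeof(line), fp)) {
            if (strncmp(line, "0::", 3) == 0) {
                line[strcspn(line, "\n")] = '\0';
                snprintf(cg, sizeof(cg), "%s", line + 3);
                break;
            }
        }
        fclose(fp);

        /* The file exists only where the cpuset controller is enabled;
         * the "effective" variant reflects restrictions set on ancestors. */
        snprintf(path, sizeof(path), "/sys/fs/cgroup%s/cpuset.mems.effective", cg);
        if (!(fp = fopen(path, "r"))) {
            fprintf(stderr, "%s: not readable (cpuset not enabled here?)\n", path);
            return 1;
        }
        if (fgets(line, sizeof(line), fp))
            printf("%s: %s", path, line);
        fclose(fp);
        return 0;
    }

Depending on where libvirt places the emulator threads, the restriction itself may be written to a parent cgroup; the effective value reflects the net result either way.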