Description of problem:
In OVS-DPDK networking, VM memory has to be backed by huge pages. With RHOSP 10, we can push a single huge page size through OSPd's first-boot.yaml:
ComputeKernelArgs: "intel_iommu=on default_hugepagesz=1GB hugepagesz=1G hugepages=12"
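(For reference, a minimal sketch of how this could be fed to the overcloud deploy via the usual -e environment-file mechanism; the file name is a made-up placeholder and the parameter value is the one quoted above:)
$ cat > hugepages-env.yaml <<'EOF'
parameter_defaults:
  ComputeKernelArgs: "intel_iommu=on default_hugepagesz=1GB hugepagesz=1G hugepages=12"
EOF
$ openstack overcloud deploy --templates -e hugepages-env.yaml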
To use the compute host's memory efficiently, we need a capability in OSPd to specify the following:
1. The huge page size (2M or 1G)
2. For each huge page size, the number of huge pages per NUMA node
Franck, did the NFV QE team look at this type of scenario at all? Is this currently feasible from a kernel POV? If it is, I would have thought the direct nature of the current kernel argument passthrough would allow this.
Looking at the /sys structure, it seems like it should be:
$ tree /sys/devices/system/node/node0/hugepages/
/sys/devices/system/node/node0/hugepages/
├── hugepages-1048576kB
│   ├── free_hugepages
│   ├── nr_hugepages
│   └── surplus_hugepages
└── hugepages-2048kB
    ├── free_hugepages
    ├── nr_hugepages
    └── surplus_hugepages

2 directories, 6 files
I'm just not sure there is a way to do this from the kernel arguments for *both* sizes you want (versus runtime allocation, which you can do by directly manipulating nr_hugepages, albeit that is not likely to pan out for 1G pages).
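(For what it's worth, a one-liner to dump the current per-node counts from that tree, just reading the nr_hugepages files shown above; the exact paths will vary by machine:)
# grep . /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages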
Yes, it can't be done through kernel arguments alone. We typically follow these steps; some of them need to be configured through Ansible roles and require a reboot of the node.
A) Prepare the Compute host first and then reboot it:
i) Set up the mount points:
# mkdir -p /mnt/hugepages_2M
# mkdir -p /mnt/hugepages_1G
ii) Add the following to /etc/fstab:
hugetlbfs /mnt/hugepages_2M hugetlbfs pagesize=2M 0 0
hugetlbfs /mnt/hugepages_1G hugetlbfs pagesize=1GB 0 0
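(Optionally, to pick up the new fstab entries without waiting for the reboot in step iv and confirm both mounts:)
# mount -a
# mount | grep hugetlbfs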
iii) Add the following to the kernel command line (grub config):
Note: This could also be a place where you can set up other kernel cmdline arguments like isolcpus etc. (based on CRM inputs).
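(As an illustration only, since the exact arguments depend on CRM inputs, on RHEL the cmdline can be edited with grubby; the argument values below are made-up placeholders:)
# grubby --update-kernel=ALL --args="default_hugepagesz=1G hugepagesz=1G hugepagesz=2M isolcpus=2-11"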
iv) Reboot the Compute host.
B) After the host OS reboots, one can set up huge pages on each NUMA node. (Note: this is the step which needs to be repeated each time the Compute host reboots, based on what was previously set up by HOS-HLM via CRM inputs; see the boot-script sketch after the sysfs commands below.)
(In this example: 100 2M pages on NUMA node 0, 125 2M pages on NUMA node 1, 16 1G pages on NUMA node 0, and 8 1G pages on NUMA node 1.)
Using virsh:
# virsh allocpages 2048KiB 100 --cellno 0
# virsh allocpages 2048KiB 125 --cellno 1
# virsh allocpages 1048576KiB 16 --cellno 0
# virsh allocpages 1048576KiB 8 --cellno 1
If you prefer not to use virsh, you can write to sysfs directly:
echo 100 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 125 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 8 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
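(Since, as noted in B), this allocation has to be repeated on every reboot, here is a sketch of a script that could be hooked into boot, e.g. via rc.local or a systemd unit; the path and mechanism are just an example:)
# cat > /usr/local/sbin/allocate-hugepages.sh <<'EOF'
#!/bin/bash
# Re-apply the per-NUMA-node hugepage allocation from above on each boot.
echo 100 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 125 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 8 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
EOF
# chmod +x /usr/local/sbin/allocate-hugepages.sh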
Just for verification:
# virsh freepages --all
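(Illustrative output for the allocation above, following virsh's per-node listing; the 4KiB counts are placeholders and will differ on a real host:)
Node 0:
4KiB: 30867968
2048KiB: 100
1048576KiB: 16

Node 1:
4KiB: 31457280
2048KiB: 125
1048576KiB: 8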
Right, unfortunately I do not believe we will be able to achieve this in the Pike timeframe; I would like to re-evaluate for the Queens release.
Moving target to Rocky/RHOSP 14.
Multiple hugepage sizes can be allocated via the kernel cmdline, e.g. default_hugepagesz=1G hugepagesz=1G hugepages=1 hugepagesz=2M hugepages=10:
# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152       10       10       10
1073741824        1        1        1        *
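(A quick cross-check: /proc/meminfo reports the pool counters for the default hugepage size, e.g.:)
# grep -i huge /proc/meminfo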
However, AFAIK it's not possible to control the NUMA placement of these hugepages.
Updated the bugzilla summary, as both 2M and 1G hugepages can already be allocated at boot time. NUMA affinity, however, cannot be controlled via the kernel boot cmdline.
I think it's about time this was closed. While this is definitely a useful feature, it's something that really ought to be supported in the kernel and sysctl, so that we could simply consume that support. Given that this isn't the case and probably won't be for some time, I think it's time to close this out.