Bug 1411971

Summary: HPE [RFE] [Director] Allow configuring numa affinity of hugepages
Product: Red Hat OpenStack
Component: rhosp-director
Version: 12.0 (Pike)
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: medium
Reporter: hrushi <hrushikesh.gangur>
Assignee: Angus Thomas <athomas>
QA Contact: Amit Ugol <augol>
CC: chegu_vinod, dbecker, fbaudin, hrushikesh.gangur, jcoufal, lyarwood, mburns, morazi, owalsh, rhel-osp-director-maint, sgordon, skramaja, stephenfin
Keywords: FutureFeature, Triaged
Type: Feature Request
Last Closed: 2019-10-01 16:36:59 UTC
Bug Blocks: 1341176, 1476900, 1521118

Description hrushi 2017-01-10 21:38:09 UTC
Description of problem:

In OVS-DPDK networking, VM memory has to be backed by huge pages. With RHOSP 10, we can push a single huge page size through OSPd's first-boot.yaml:

ComputeKernelArgs: "intel_iommu=on default_hugepagesz=1GB hugepagesz=1G hugepages=12"
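
For context, a minimal sketch of the kind of script body such a first-boot.yaml wraps to apply ComputeKernelArgs (an assumption on my part: the Heat template substitutes the parameter value into a script run as node user-data; the exact RHOSP 10 template may differ):

#!/bin/bash
# KERNEL_ARGS stands in for the ComputeKernelArgs value, which the real
# first-boot Heat template substitutes into the script.
KERNEL_ARGS='intel_iommu=on default_hugepagesz=1GB hugepagesz=1G hugepages=12'
# Append the args to the kernel command line, regenerate the grub config,
# and reboot once so the huge pages are actually allocated.
sed -i "s/^\(GRUB_CMDLINE_LINUX=\".*\)\"/\1 ${KERNEL_ARGS}\"/" /etc/default/grub
grub2-mkconfig -o /boot/grub2/grub.cfg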

To use the compute host's memory efficiently, we need the capability in OSPd to configure the following:
1. The huge page size (2M or 1G)
2. For each huge page size, the number of huge pages per NUMA node

Comment 2 Stephen Gordon 2017-02-02 21:43:15 UTC
Franck, did the NFV QE team look at this type of scenario at all? Is this currently feasible from a kernel POV? If it is, I would have thought the direct nature of the kernel argument passthrough at the moment would allow this.

Comment 3 Stephen Gordon 2017-02-02 21:45:49 UTC
Looking at the /sys structure it seems like it should be:

$ tree /sys/devices/system/node/node0/hugepages/
/sys/devices/system/node/node0/hugepages/
├── hugepages-1048576kB
│   ├── free_hugepages
│   ├── nr_hugepages
│   └── surplus_hugepages
└── hugepages-2048kB
    ├── free_hugepages
    ├── nr_hugepages
    └── surplus_hugepages

2 directories, 6 files

I'm just not sure there is a way to do this from the kernel arguments for *both* sizes you want (versus runtime allocation, which you can do by directly manipulating nr_hugepages, although that is not likely to pan out for 1G pages).

Comment 4 hrushi 2017-02-02 21:56:53 UTC
Yes, it can't be done just through kernel arguments. We typically follow the steps below; some of them need to be configured through Ansible roles and require a reboot of the node.

--
A)	Prepare the Compute host first and then reboot it:

i)	Set up mount points:

# mkdir -p /mnt/hugepages_2M
# mkdir -p /mnt/hugepages_1G

ii)	Add the following to /etc/fstab

hugetlbfs      /mnt/hugepages_2M    hugetlbfs  defaults 0 0
hugetlbfs      /mnt/hugepages_1G    hugetlbfs  pagesize=1GB 0 0

iii)	Add the following to the kernel command line (grub config):

 hugepagesz=1G hugepages=4 

Note: This could also be a place where you can set up other kernel cmdline arguments like isolcpus etc. (based on CRM inputs). One way to apply these arguments with grubby is sketched after step A below.

iv)	Reboot the Compute host.
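
For step iii) above, one way to apply the change on a RHEL 7 host is with grubby (a sketch; adjust the arguments to the page sizes and counts you actually need):

# Append the huge page arguments to the command line of every installed kernel.
grubby --update-kernel=ALL --args="hugepagesz=1G hugepages=4"
# Verify the resulting command lines, then reboot as in step iv).
grubby --info=ALL | grep args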



B)	After the Host OS reboots, one can set up huge pages on each NUMA node. (Note: this is the step that needs to be repeated each time the Compute host reboots, based on what was previously set up by HOS-HLM via CRM inputs; a sketch of one way to automate this at boot follows the procedure below.)

(In this example: setting up 100 2MB pages on NUMA node 0 and 125 2MB pages on NUMA node 1, plus 16 1G huge pages on NUMA node 0 and 8 1G huge pages on NUMA node 1.)

Using virsh:

  # virsh allocpages 2048KiB 100 --cellno 0
  # virsh allocpages 2048KiB 125 --cellno 1

  # virsh allocpages  1048576KiB 16 --cellno 0
  # virsh allocpages  1048576KiB 8   --cellno 1

If you prefer not to use virsh, you can do the following:

echo 100 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 125 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
echo 16 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 8 >  /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages

Just for verification:

# virsh freepages --all

Node 0:
4KiB: 3896664
2048KiB: 100
1048576KiB: 16

Node 1:
4KiB: 6024538
2048KiB: 125
1048576KiB: 8

--
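
Since the per-node allocation in step B has to be reapplied after every reboot, one way to automate it (a sketch; the script path and boot hook are illustrative, and an Ansible task or a systemd oneshot unit ordered before libvirtd would serve the same purpose) is to replay the sysfs writes from a boot-time script:

#!/bin/sh
# Hypothetical /usr/local/sbin/allocate-numa-hugepages.sh, run early at boot so
# the 1G allocations succeed before memory fragments. Values from the example above.
echo 100 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 125 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
echo 16  > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
echo 8   > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages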

Comment 5 Stephen Gordon 2017-02-03 23:31:25 UTC
Right, unfortunately I do not believe we will be able to achieve this in the Pike timeframe; I would like to re-evaluate for the Queens release.

Thanks,

Steve

Comment 7 Stephen Gordon 2017-08-30 15:48:59 UTC
Moving target to Rocky/RHOSP 14.

Comment 9 Ollie Walsh 2018-01-16 19:45:17 UTC
Multiple hugepage sizes can be allocated via the kernel cmdline, e.g. default_hugepagesz=1G hugepagesz=1G hugepages=1 hugepagesz=2M hugepages=10:

# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152       10       10       10         
1073741824        1        1        1        *

However AFAIK it's not possible to control the numa placement of these hugepages.
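
For reference, the per-node split the kernel chose for these boot-time pools can be inspected through the sysfs paths from comment 3 (a quick sketch; the kernel typically distributes the pages across the online NUMA nodes on its own, which is precisely the part that cannot be steered from the cmdline):

# Show how many pages of each size ended up on each NUMA node.
grep . /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages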

Comment 10 Ollie Walsh 2018-03-14 15:05:25 UTC
Updated the bugzilla summary, as both 2M and 1G hugepages can already be allocated at boot time. NUMA affinity cannot, however, be controlled via the kernel boot cmdline.

Comment 12 Stephen Finucane 2019-10-01 16:36:59 UTC
I think it's about time this was closed. While this is definitely a useful feature, it is something we really ought to have kernel and sysctl support for, so that we could simply consume that. Given this isn't the case and probably won't be for some time, I think it's time to close this out.