Bug 1217537

Summary: Unable to create a NUMA node with CPUs and 0 MB of RAM
Product: Red Hat Enterprise Linux 7
Reporter: Daniel Berrangé <berrange>
Component: qemu-kvm-rhev
Assignee: Eduardo Habkost <ehabkost>
Status: CLOSED NOTABUG
QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 7.2
CC: ehabkost, huding, juzhang, kchamart, virt-maint, xfu
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-07-29 18:27:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1217144, 1662586
Attachments:
  Screenshot of what seems to be a guest kernel bug or limitation (flags: none)

Description Daniel Berrangé 2015-04-30 15:13:35 UTC
Description of problem:
In order to reproduce some customer problems running OpenStack on a host with an unusual NUMA topology, QE have been attempting to create a guest with a NUMA node that has CPUs but no RAM.

To this end, we created a guest with 8 GB of RAM and 8 CPUs:

  <memory unit='KiB'>8192000</memory>
  <currentMemory unit='KiB'>8192000</currentMemory>
  <vcpu placement='static'>8</vcpu>

And configured it to have 3 NUMA nodes: 4 CPUs and 4 GB RAM in the first node, 2 CPUs and 4 GB RAM in the second node, and 2 CPUs and 0 MB RAM in the third node:

  <cpu mode='host-passthrough'>
    <numa>
      <cell id='0' cpus='0-3' memory='4096000' unit='KiB'/>
      <cell id='1' cpus='4-5' memory='4096000' unit='KiB'/>
      <cell id='2' cpus='6-7' memory='0' unit='KiB'/>
    </numa>
  </cpu>

Libvirt turns that config into the following QEMU command line parameters:

 -smp 8,sockets=8,cores=1,threads=1 -numa node,nodeid=0,cpus=0-3,mem=4000 -numa node,nodeid=1,cpus=4-5,mem=4000 -numa node,nodeid=2,cpus=6-7,mem=0 


When the guest boots up, though, and we look at the NUMA topology inside it, KVM has only created 2 nodes. The CPUs from the 3rd NUMA node we requested have been placed into the 1st NUMA node.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 6 7
node 0 size: 3856 MB
node 0 free: 3392 MB
node 1 cpus: 4 5
node 1 size: 3937 MB
node 1 free: 3736 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

So it looks as if KVM is incorrectly configuring the NUMA tables when memory=0.


Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.1.2-23.el7.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Launch a guest with multiple NUMA nodes, where one of the nodes has 0 MB of RAM (a minimal command-line sketch follows below)
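
A minimal direct qemu-kvm invocation roughly equivalent to the libvirt-generated command line above should reproduce it as well (the disk image path and any options not shown in the description are just illustrative placeholders):

  /usr/libexec/qemu-kvm \
      -m 8000 \
      -smp 8,sockets=8,cores=1,threads=1 \
      -numa node,nodeid=0,cpus=0-3,mem=4000 \
      -numa node,nodeid=1,cpus=4-5,mem=4000 \
      -numa node,nodeid=2,cpus=6-7,mem=0 \
      -drive file=/path/to/guest.qcow2,if=virtio

Note that -m 8000 matches the sum of the per-node mem= values, which QEMU expects.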

Actual results:
Node with 0 MB RAM is not created, and its CPUs are silently merged into another node.

Expected results:
All requested NUMA nodes are created (or an error is raised at startup explaining why it is forbidden)

Additional info:

Comment 3 Eduardo Habkost 2015-07-28 20:42:24 UTC
Created attachment 1057120 [details]
Screenshot of what seems to be a guest kernel bug or limitation

Comment 4 Eduardo Habkost 2015-07-28 20:44:56 UTC
As this was never supported before, marking as FutureFeature. Is this kind of NUMA topology really supported by Linux and numactl?

The guest kernel is ignoring the CPU affinity entries in the SRAT table for the no-RAM nodes. I need to compare this with real hardware to find out whether this is really a guest kernel bug, a QEMU bug, or something that was never supported by Linux. Do we have any physical hosts with NUMA nodes that have no RAM in our labs?
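
In case it helps with checking this, the SRAT affinity entries the guest kernel parsed can be seen from inside the guest (assuming the messages are still in the kernel ring buffer), and the raw ACPI tables can be dumped with acpidump from acpica-tools:

  # show the SRAT CPU/memory affinity entries the kernel parsed at boot
  $ dmesg | grep -i srat
  # dump the raw ACPI tables (including the SRAT) for closer inspection
  $ acpidump > acpi-tables.txt

If QEMU emits CPU affinity entries for the empty node and the guest kernel still drops them, that would point at the guest kernel rather than QEMU.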

Comment 5 Daniel Berrangé 2015-07-29 09:48:58 UTC
Hmm, so perhaps this might be architecture dependent.

This RFE came about as a result of a bug filed against OpenStack for not considering the possibility that NUMA nodes can have CPUs without any RAM, so this was on real physical hardware:

 https://bugs.launchpad.net/nova/+bug/1418187

From the libvirt capabilities attached to that bug report you can see the real hardware config:

  https://launchpadlibrarian.net/196616667/capabilities.txt

In particular, though, notice that this is powerpc64 host hardware, not x86.

So I guess it is conceivable that this might not be supported on x86 kernels. I'm not really sure whether x86 takes a different codepath than ppc64 when doing NUMA setup.

Comment 6 Eduardo Habkost 2015-07-29 17:25:05 UTC
(In reply to Daniel Berrange from comment #5)
> So I guess it is conceivable that this might not be supported on x86
> kernels. I'm not really sure whether x86 takes a different codepath than
> ppc64 when doing NUMA setup.

The mapping of CPUs to NUMA nodes is arch-specific code inside arch/{x86,powerpc}, so this behavior is likely to be arch-specific.

Comment 7 Eduardo Habkost 2015-07-29 18:27:32 UTC
Just confirmed that there's x86-specific code in Linux that ignores nodes without enough RAM and then reassigns their CPUs to the nearest online node, which matches the behavior seen in the description:

At arch/x86/mm/numa.c:numa_register_memblks():
        for_each_node_mask(nid, node_possible_map) {
                /* [...] */
                /*
                 * Don't confuse VM with a node that doesn't have the
                 * minimum amount of memory:
                 */
                if (end && (end - start) < NODE_MIN_SIZE)
                        continue;

                /* alloc_node_data() will call node_set_online(nid) */
                alloc_node_data(nid);
        }

At arch/x86/mm/numa.c:init_cpu_to_node():
        for_each_possible_cpu(cpu) {
                int node = numa_cpu_node(cpu);

                if (node == NUMA_NO_NODE)
                        continue;
                if (!node_online(node))
                        node = find_near_online_node(node);
                numa_set_node(cpu, node);
        }