Bug 1344450

Summary: [RFE] SLIT table in KVM differs from Host SLIT table
Product: Red Hat Enterprise Linux 7
Reporter: djdumas
Component: qemu-kvm-rhev
Assignee: Igor Mammedov <imammedo>
Status: CLOSED ERRATA
QA Contact: Yumei Huang <yuhuang>
Severity: medium
Priority: medium
Version: 7.2
CC: bgray, chayang, djdumas, drjones, imammedo, jinzhao, juzhang, knoel, mkletzan, mrezanin, mst, mtessun, peter.engel, srao, trees, yuhuang
Target Milestone: rc
Keywords: FutureFeature
Hardware: x86_64
OS: Linux
Fixed In Version: qemu-kvm-rhev-2.12.0-1.el7
Doc Type: Enhancement
Clones: 1344494
Last Closed: 2018-11-01 11:01:10 UTC
Type: Bug
Bug Blocks: 1344494, 1344497, 1609081    

Description djdumas 2016-06-09 17:46:56 UTC
Description of problem:
Partner uses information in the SLIT table to make NUMA placement decisions in their application. Without accurate information, performance can be impacted.

Passing the correct SLIT table information from the host to the guest makes sense as long as the CPUs are pinned in a 1-to-1 fashion between the host and guest. If the guest consumes all the NUMA nodes, the whole table could be passed; if the guest uses only a subset of nodes, the appropriate section (or sections) of the host SLIT table could be used (see the sketch after the tables below).

The following example is from an 8-socket SGI system (this is not unique to this system, however):

Host> numactl -H
…

node   0   1   2   3   4   5   6   7
  0:  10  16  19  16  50  50  50  50
  1:  16  10  16  19  50  50  50  50
  2:  19  16  10  16  50  50  50  50
  3:  16  19  16  10  50  50  50  50
  4:  50  50  50  50  10  16  19  16
  5:  50  50  50  50  16  10  16  19
  6:  50  50  50  50  19  16  10  16
  7:  50  50  50  50  16  19  16  10
 
-------------------------------------------------


Guest> numactl -H
…

node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10
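
For the pinned 1-to-1 case described above, the guest table should ideally match the host table. As a sketch of how that could be automated (hypothetical, relying on the -numa dist option discussed in comment 12 below), the host's numactl output can be converted into the corresponding QEMU arguments:

# numactl -H | awk '/^ *[0-9]+:/ {
    src = $1; sub(":", "", src)          # row label, e.g. "4:" -> 4
    for (i = 2; i <= NF; i++) {
        dst = i - 2                      # column position maps to node ID
        if (dst != src)                  # skip the local (diagonal) entries
            printf "-numa dist,src=%d,dst=%d,val=%s ", src, dst, $i
    }
}'

For a guest that spans only host nodes 4-7, the same idea applies to just that 4x4 block of the matrix, with the host node IDs remapped to guest nodes 0-3.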


Version-Release number of selected component (if applicable):
Host and guest - RHEL 7.2
qemu-kvm-rhev  10:2.3.0-31

How reproducible:


Steps to Reproduce:
1. Create a guest, pin host and guest CPUs on a 1-to-1 basis
2. Run numactl -H command on host and guest and compare


Actual results:
SLIT tables are different between host and guest

Expected results:
The whole SLIT table, or the appropriate section of it, is the same in the guest as on the host

Additional info:

Comment 2 Karen Noel 2016-06-09 19:25:14 UTC
The NUMA topology is set up by libvirt, while the SLIT table lives in the guest firmware.

Should the SLIT table change after live migration if the destination host is different? I think so, but guests may not expect a dynamic SLIT table. Applications may have to be modified to take migration into account.

Comment 6 Igor Mammedov 2016-06-10 06:58:51 UTC
Currently QEMU doesn't create a SLIT table at all and doesn't have any interface to define one. So what numactl -H shows in the guest is autogenerated locality info: with no SLIT present, the kernel falls back to default distances (10 for local, 20 for every remote node), whereas on the host the real distances differ.
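
As a quick cross-check (a sketch; run as root in the guest), the absence of a SLIT can be confirmed by listing the ACPI tables the firmware exposes via sysfs:

# ls /sys/firmware/acpi/tables/ | grep -i slit

If QEMU generated no SLIT, the command prints nothing.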

Comment 7 Igor Mammedov 2016-08-09 14:09:18 UTC
Too late for 7.3, moving to 7.4.

Comment 11 Martin Kletzander 2017-08-09 17:03:57 UTC
Can't this be specified with the -acpitable parameter? I understand that one would have to provide binary data, but that could be dumped from the host somehow, I guess, right?

Comment 12 Igor Mammedov 2017-08-14 07:49:22 UTC
(In reply to Martin Kletzander from comment #11)
> Can't this be specified with the -acpitable parameter? I understand that one
> would have to provide binary data, but that could be dumped from the host
> somehow, I guess, right?

Instead of that, since qemu-2.10 there is a dedicated CLI option for specifying the mapping;
see for example https://bugzilla.redhat.com/show_bug.cgi?id=1395339#c14
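
For reference, the option is -numa dist, taking a source node, a destination node, and a distance value (a minimal sketch with illustrative IDs and values; full working command lines appear in comments 17 and 19 below):

# /usr/libexec/qemu-kvm ... \
-numa node,nodeid=0 \
-numa node,nodeid=1 \
-numa dist,src=0,dst=1,val=21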

Comment 13 Igor Mammedov 2017-10-11 11:20:27 UTC
upstream commit
  0f203430d numa: Allow setting NUMA distance for different NUMA nodes

so it's coming to 7.5 with the rebase

Comment 17 Yumei Huang 2018-05-09 09:27:16 UTC
IIUC, this bug aims to enable configuration of NUMA distances in QEMU so that the guest SLIT table can match the host's. Please correct me if I'm wrong.

I did the following steps to verify.  

Versions:
qemu-kvm-rhev-2.12.0-1.el7
kernel-3.10.0-862.el7.x86_64

Steps:
1. Check host SLIT table

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 8095 MB
node 0 free: 6545 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 16384 MB
node 1 free: 14584 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 


2. Boot guest 

# /usr/libexec/qemu-kvm -m 24G -smp 32 \
-numa node,mem=8G,nodeid=0 \
-numa node,mem=16G,nodeid=1 \
-numa dist,src=0,dst=1,val=21 \
rhel76-64-virtio-scsi.qcow2 \
-netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
-vnc :1 -monitor stdio


3. Check guest SLIT table

# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 8191 MB
node 0 free: 7099 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 16384 MB
node 1 free: 15449 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

The node distances in the guest are the same as on the host.
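
The raw SLIT can also be compared directly rather than via numactl (a sketch, assuming the acpica-tools package is installed; run in both host and guest):

# cp /sys/firmware/acpi/tables/SLIT slit.dat
# iasl -d slit.dat        # disassembles to slit.dsl, which lists the locality matrix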

Igor, could you please check whether the above verification is valid? Thanks!

Comment 18 Igor Mammedov 2018-05-14 09:07:41 UTC
Sorry for the late reply,

your test is valid, but it's limited to the symmetric use case;
you should probably also test the asymmetric case and the error paths.
For details, look at the function validate_numa_distance().

Comment 19 Yumei Huang 2018-05-15 03:33:08 UTC
Thanks Igor. 

I did more tests to verify this bug with qemu-kvm-rhev-2.12.0-1.el7.

1. Asymmetric case

Boot a guest with an asymmetric NUMA distance configuration and check the guest SLIT table.

QEMU cmdline:
# /usr/libexec/qemu-kvm -m 24G -smp 16 rhel76-64-virtio-scsi.qcow2 \
-netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
-vnc :1 -monitor stdio \
-numa node,nodeid=0,cpus=0-3 \
-numa node,nodeid=1,cpus=4-7 \
-numa node,nodeid=2,cpus=8-11 \
-numa node,nodeid=3,cpus=12-15 \
-numa dist,src=0,dst=1,val=20 \
-numa dist,src=0,dst=2,val=30 \
-numa dist,src=0,dst=3,val=40 \
-numa dist,src=1,dst=0,val=50 \
-numa dist,src=1,dst=2,val=60 \
-numa dist,src=1,dst=3,val=70 \
-numa dist,src=2,dst=0,val=80 \
-numa dist,src=2,dst=1,val=90 \
-numa dist,src=2,dst=3,val=100 \
-numa dist,src=3,dst=0,val=110 \
-numa dist,src=3,dst=1,val=120 \
-numa dist,src=3,dst=2,val=130

Guest SLIT table:
# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 6143 MB
node 0 free: 5164 MB
node 1 cpus: 4 5 6 7
node 1 size: 6144 MB
node 1 free: 5859 MB
node 2 cpus: 8 9 10 11
node 2 size: 6144 MB
node 2 free: 5503 MB
node 3 cpus: 12 13 14 15
node 3 size: 6144 MB
node 3 free: 6000 MB
node distances:
node   0   1   2   3 
  0:  10  20  30  40 
  1:  50  10  60  70 
  2:  80  90  10  100 
  3:  110  120  130  10 


2. Error path cases

Boot the guest with an invalid configuration; QEMU quits with an error message.

Case a,
# /usr/libexec/qemu-kvm -m 24G -smp 16 rhel76-64-virtio-scsi.qcow2 \
-netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
-vnc :1 -monitor stdio \
-numa node,nodeid=0,cpus=0-3 \
-numa node,nodeid=1,cpus=4-7 \
-numa node,nodeid=2,cpus=8-11 \
-numa node,nodeid=3,cpus=12-15 \
-numa dist,src=0,dst=1,val=20 \
-numa dist,src=0,dst=2,val=30 \
-numa dist,src=0,dst=3,val=40
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) qemu-kvm: The distance between node 1 and 2 is missing, at least one distance value between each nodes should be provided.

Case b,
# /usr/libexec/qemu-kvm -m 24G -smp 16 rhel76-64-virtio-scsi.qcow2 \
-netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
-vnc :1 -monitor stdio \
-numa node,nodeid=0,cpus=0-3 \
-numa node,nodeid=1,cpus=4-7 \
-numa node,nodeid=2,cpus=8-11 \
-numa node,nodeid=3,cpus=12-15 \
-numa dist,src=0,dst=1,val=20 \
-numa dist,src=0,dst=2,val=30 \
-numa dist,src=0,dst=3,val=40 \
-numa dist,src=1,dst=0,val=50 \
-numa dist,src=1,dst=2,val=60 \
-numa dist,src=1,dst=3,val=70 \
-numa dist,src=2,dst=0,val=80 \
-numa dist,src=2,dst=1,val=90 \
-numa dist,src=2,dst=3,val=100 \
-numa dist,src=3,dst=0,val=110 
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) qemu-kvm: At least one asymmetrical pair of distances is given, please provide distances for both directions of all node pairs.

Case c,
# /usr/libexec/qemu-kvm -m 24G -smp 16 rhel76-64-virtio-scsi.qcow2 \
-netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
-vnc :1 -monitor stdio \
-numa node,nodeid=0,cpus=0-3 \
-numa node,nodeid=1,cpus=4-7 \
-numa node,nodeid=2,cpus=8-11 \
-numa node,nodeid=3,cpus=12-15 \
-numa dist,src=0,dst=0,val=20
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) qemu-kvm: -numa dist,src=0,dst=0,val=20: Local distance of node 0 should be 10.

Case d,
# /usr/libexec/qemu-kvm -m 24G -smp 16 rhel76-64-virtio-scsi.qcow2 \
-netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
-vnc :1 -monitor stdio \
-numa node,nodeid=0,cpus=0-3 \
-numa node,nodeid=1,cpus=4-7 \
-numa node,nodeid=2,cpus=8-11 \
-numa node,nodeid=3,cpus=12-15 \
-numa dist,src=0,dst=5,val=20
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) qemu-kvm: -numa dist,src=0,dst=5,val=20: Source/Destination NUMA node is missing. Please use '-numa node' option to declare it first.

Case e,
# /usr/libexec/qemu-kvm -m 24G -smp 16 rhel76-64-virtio-scsi.qcow2 \
-netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
-vnc :1 -monitor stdio \
-numa node,nodeid=0 \
-numa node,nodeid=1 \
-numa dist,src=0,dst=1,val=1
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) qemu-kvm: -numa dist,src=0,dst=1,val=1: NUMA distance (1) is invalid, it shouldn't be less than 10.

Case f,
# /usr/libexec/qemu-kvm -m 24G -smp 16 rhel76-64-virtio-scsi.qcow2 \
-netdev tap,id=tap0 -device virtio-net-pci,id=net0,netdev=tap0 \
-vnc :1 -monitor stdio \
-numa node,nodeid=0 \
-numa node,nodeid=1 \
-numa dist,src=128,dst=1,val=20
QEMU 2.12.0 monitor - type 'help' for more information
(qemu) qemu-kvm: -numa dist,src=128,dst=1,val=20: Invalid node 128, max possible could be 128


All the above cases give the expected results, except that in case f the error message is somewhat self-contradictory: it says the max could be 128 while rejecting 128 as invalid.

Maybe I should file a new bug for case f and verify this one. What do you think, Igor?

Comment 20 Igor Mammedov 2018-05-15 10:26:15 UTC
OK, file a bug for case f.
The thing is that src/dst is a node index in the range [0..127], so there are 128 nodes in total.
I'll post a patch upstream to clarify the error message.

Comment 21 Yumei Huang 2018-05-15 12:48:42 UTC
Thanks Igor. 

A new bug has been filed.

Bug 1578381 - Error message need update when specify numa distance with node index >=128

Comment 24 errata-xmlrpc 2018-11-01 11:01:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443