Bug 1248406

Summary: NUMA node assignment info is inconsistent between HMP and the guest
Product: Red Hat Enterprise Linux 7 Reporter: Zhengtong <zhengtli>
Component: qemu-kvm-rhev Assignee: David Gibson <dgibson>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.2 CC: ehabkost, hannsj_uhl, jen, knoel, lvivier, michen, mrezanin, qzhang, thuth, virt-maint, zhengtli
Target Milestone: rc   
Target Release: ---   
Hardware: ppc64le   
OS: Linux   
Whiteboard:
Fixed In Version: qemu-2.4 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-07 20:31:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1359843    

Description Zhengtong 2015-07-30 08:45:24 UTC
Description of problem:

When I boot a guest with two NUMA nodes and "-smp 4,sockets=1,cores=2,threads=2", the NUMA information shown by the HMP CLI differs from what the guest reports.

Version-Release number of selected component (if applicable):


How reproducible:
4/4

Steps to Reproduce:
1. Boot up the guest with the command below:
/usr/libexec/qemu-kvm -name liuzt-RHEL-7.2_LE -machine pseries,accel=kvm,usb=off \
-m 4G,slots=4,maxmem=8G -numa node,nodeid=0 -numa node,nodeid=1 \
-realtime mlock=off \
-smp 4,sockets=1,cores=2,threads=2 \
-rtc base=localtime,clock=host,driftfix=slew \
-monitor stdio \
-monitor unix:/tmp/monitor3,server,nowait \
-no-shutdown \
-boot strict=on \
-device usb-ehci,id=usb,bus=pci.0,addr=0x2 \
-device pci-ohci,id=usb1,bus=pci.0,addr=0x1 \
-drive file=/root/test_home/liuzt/vdisk/rhel_le_memory_hot.img,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none \
-device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x6 \
-drive file=/root/test_home/liuzt/vdisk/RHEL-7.2-20150708.1-Server-ppc64le-dvd1.iso,format=raw,id=drive_iso,if=none \
-device scsi-cd,drive=drive_iso,bus=scsi0.0,id=sr0,bootindex=2 \
-serial pty \
-device usb-kbd,id=input0 \
-device usb-mouse,id=input1 \
-device usb-tablet,id=input2 \
-vnc 0:16 -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x4 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \
-msg timestamp=on \
-netdev tap,id=hostnet0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown \
-device spapr-vlan,netdev=hostnet0,id=net0,mac=52:54:00:c4:e7:73,reg=0x2000 \
-qmp tcp:0:4444,server,nowait

2. Check the NUMA info in the HMP CLI:

(qemu) info numa
2 nodes
node 0 cpus: 0 2
node 0 size: 2048 MB
node 1 cpus: 1 3
node 1 size: 2048 MB

3. Log into the guest and check the NUMA info with the "numactl" tool:
[root@dhcp71-14 ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 2048 MB
node 0 free: 360 MB
node 1 cpus:
node 1 size: 2048 MB
node 1 free: 2027 MB
node distances:
node   0   1 
  0:  10  10 
  1:  10  10 



Actual results:
The CPUs assigned to node 0 and node 1 differ between HMP and the guest.

Expected results:
The results from HMP and from inside the guest should be consistent.
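
One way to check this mechanically is to compare the "node N cpus:" lines from both outputs. The following is only a minimal sketch: the helper name parse_node_cpus is hypothetical, and the two sample strings are copied from steps 2 and 3 above rather than captured live from a monitor or guest.

```python
import re

def parse_node_cpus(text):
    """Return {node_id: set of cpu ids} from 'node N cpus: ...' lines."""
    nodes = {}
    for line in text.splitlines():
        m = re.match(r"\s*node\s+(\d+)\s+cpus:\s*(.*)$", line)
        if m:
            nodes[int(m.group(1))] = {int(c) for c in m.group(2).split()}
    return nodes

# Output of "(qemu) info numa" from step 2.
HMP_OUTPUT = """\
2 nodes
node 0 cpus: 0 2
node 0 size: 2048 MB
node 1 cpus: 1 3
node 1 size: 2048 MB
"""

# Output of "numactl -H" from step 3 (distance table omitted).
GUEST_OUTPUT = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 2048 MB
node 0 free: 360 MB
node 1 cpus:
node 1 size: 2048 MB
node 1 free: 2027 MB
"""

hmp = parse_node_cpus(HMP_OUTPUT)
guest = parse_node_cpus(GUEST_OUTPUT)

# A node is mismatched when the two views disagree on its CPU set.
mismatched = sorted(n for n in hmp.keys() | guest.keys()
                    if hmp.get(n) != guest.get(n))
print("mismatched nodes:", mismatched)
```

With the outputs from this report, both nodes are reported as mismatched. An empty list would mean the two views agree.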

Additional info:

This may be related to the -smp parameter. If I set -smp 4,sockets=2,cores=2,threads=1, the problem does not occur.

Comment 2 David Gibson 2015-07-31 04:23:46 UTC
qemu seems to be doing something really strange here - it is describing the vcpus as alternating between nodes.  Since vcpus 0 & 1 are different threads on the same guest core, it makes no sense for them to be in different nodes.  There's not even a way to describe them as being in different nodes to the guest, which is why they show up  in the same node in the guest.

OK, I see the code that does this in numa.c.  It looks like we need to implement a cpu_index_to_socket_id callback for Power.  Frankly, though, the qemu default behaviour is very odd.
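
To illustrate why the alternating assignment cannot be represented in the guest, here is a small Python sketch (not the qemu code itself). It assumes the default is a plain round-robin of vcpu index over the nodes, which matches the "info numa" output in the description, and that vcpu i belongs to guest core i // threads. The function names are hypothetical.

```python
def round_robin_nodes(num_cpus, num_nodes):
    """Assign vcpu i to node i % num_nodes (assumed default behaviour)."""
    nodes = {n: [] for n in range(num_nodes)}
    for cpu in range(num_cpus):
        nodes[cpu % num_nodes].append(cpu)
    return nodes

def split_cores(num_cpus, num_nodes, threads):
    """Return the cores whose threads end up on more than one node."""
    nodes_per_core = {}
    for cpu in range(num_cpus):
        core = cpu // threads  # assumption: threads of a core are contiguous
        nodes_per_core.setdefault(core, set()).add(cpu % num_nodes)
    return sorted(c for c, ns in nodes_per_core.items() if len(ns) > 1)

# -smp 4,sockets=1,cores=2,threads=2 with two NUMA nodes
print(round_robin_nodes(4, 2))       # node 0 gets 0 2, node 1 gets 1 3
print(split_cores(4, 2, threads=2))  # both cores straddle two nodes

# -smp 4,sockets=2,cores=2,threads=1: one thread per core, nothing is split
print(split_cores(4, 2, threads=1))
```

Under these assumptions, every core in the threads=2 case has its threads spread over both nodes, while the threads=1 case from the description has no split cores.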

Comment 3 David Gibson 2015-07-31 06:58:13 UTC
I've sent a draft upstream patch for discussion with the NUMA maintainer.  Also adding Eduardo on CC for context.

Comment 4 David Gibson 2015-07-31 07:15:09 UTC
A final fix needs some discussion with upstream maintainers.  However, to check my reasoning, can you try the draft fix which I've built at:

https://brewweb.devel.redhat.com/taskinfo?taskID=9621483

Comment 5 Zhengtong 2015-08-01 03:19:45 UTC
(In reply to David Gibson from comment #4)
> A final fix needs some discussion with upstream maintainers.  However, to
> check my reasoning, can you try the draft fix which I've built at:
> 
> https://brewweb.devel.redhat.com/taskinfo?taskID=9621483

Hi  David,
I tried with the test package and here is the result:


/usr/libexec/qemu-kvm \
...
-smp 4,sockets=1,cores=2,threads=2 \
...
(qemu) info numa
2 nodes
node 0 cpus: 0 1 2 3
node 0 size: 2048 MB
node 1 cpus:
node 1 size: 2048 MB

[root@dhcp71-14 ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 2048 MB
node 0 free: 361 MB
node 1 cpus:
node 1 size: 2048 MB
node 1 free: 2027 MB
node distances:
node   0   1 
  0:  10  10 
  1:  10  10 


And another result:

/usr/libexec/qemu-kvm \
...
-smp 16,sockets=1,cores=2,threads=8 \
...
(qemu) info numa
2 nodes
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 2048 MB
node 1 cpus:
node 1 size: 2048 MB

[root@dhcp71-14 ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 2048 MB
node 0 free: 259 MB
node 1 cpus:
node 1 size: 2048 MB
node 1 free: 2023 MB
node distances:
node   0   1 
  0:  10  10 
  1:  10  10 




It seems the inconsistency is resolved, but now all the CPUs are assigned to node 0. Is this expected, or has another bug been introduced?

Comment 6 David Gibson 2015-08-03 01:21:38 UTC
My tentative patch allocates vcpus to numa nodes at "socket" granularity.  Because you've only declared one socket, all the vcpus will be in one numa node, so the behaviour you see is expected.
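
As a rough illustration of socket-granularity assignment (a Python sketch, not the actual patch): each vcpu's socket is derived from its index, and the whole socket goes to one node. Wrapping sockets onto nodes with a modulo is an assumption made for this example, and the function name is hypothetical.

```python
def socket_granularity_nodes(sockets, cores, threads, num_nodes):
    """Place every vcpu of a socket on the same NUMA node."""
    cpus_per_socket = cores * threads
    nodes = {n: [] for n in range(num_nodes)}
    for cpu in range(sockets * cpus_per_socket):
        socket_id = cpu // cpus_per_socket
        # Assumption: sockets are spread over nodes with a simple modulo.
        nodes[socket_id % num_nodes].append(cpu)
    return nodes

# -smp 4,sockets=1,cores=2,threads=2: one socket, so one populated node
print(socket_granularity_nodes(1, 2, 2, num_nodes=2))

# -smp 4,sockets=2,cores=2,threads=1: one socket per node
print(socket_granularity_nodes(2, 2, 1, num_nodes=2))
```

With a single socket, all four vcpus land on node 0 and node 1 stays empty, which is the result seen in comment 5. Declaring two sockets gives each node two vcpus.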

Comment 8 David Gibson 2015-11-05 02:58:00 UTC
The fix for this is now upstream and will be in qemu 2.4.  Therefore, we'll pick it up for RHEL7.3 on the rebase.

Comment 10 Mike McCune 2016-03-28 22:27:32 UTC
This bug was accidentally moved from POST to MODIFIED via an error in automation. Please see mmccune with any questions.

Comment 12 Zhengtong 2016-06-06 03:38:38 UTC
Tried with the latest qemu version: qemu-kvm-rhev-2.6.0-4.el7. The test passed.


1. Boot up guest with cmd:
/usr/libexec/qemu-kvm ...
-smp 4,sockets=1,cores=2,threads=2 \
-numa node,nodeid=0 -numa node,nodeid=1 \
...

2. After the guest boots up, check the NUMA info from HMP and in the guest:
-------------------------------
(qemu) info numa
2 nodes
node 0 cpus: 0 1 2 3
node 0 size: 4096 MB
node 1 cpus:
node 1 size: 4096 MB
(qemu) 
-------------------------------
[root@dhcp70-245 ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 4096 MB
node 0 free: 2072 MB
node 1 cpus:
node 1 size: 4096 MB
node 1 free: 4060 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 
--------------------------------
The results are consistent.

3. Tried again with this command:
/usr/libexec/qemu-kvm ...
-smp 4,sockets=2,cores=2,threads=1 \
-numa node,nodeid=0 -numa node,nodeid=1 \
...

and got the following result:
--------------------------------
(qemu) info numa
2 nodes
node 0 cpus: 0 1
node 0 size: 4096 MB
node 1 cpus: 2 3
node 1 size: 4096 MB
(qemu) 
---------------------------------
[root@dhcp70-245 ~]# numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1
node 0 size: 4096 MB
node 0 free: 2633 MB
node 1 cpus: 2 3
node 1 size: 4096 MB
node 1 free: 3482 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 
--------------------------------

That is good.

Comment 14 errata-xmlrpc 2016-11-07 20:31:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2673.html