Bug 1373075

Summary: [CGROUP]STEAL time doesn't work on POWER
Product: Red Hat Enterprise Linux 7 Reporter: Min Deng <mdeng>
Component: qemu-kvm-rhev Assignee: David Gibson <dgibson>
Status: CLOSED NOTABUG QA Contact: Virtualization Bugs <virt-bugs>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.3 CC: knoel, qzhang, virt-maint
Target Milestone: rc   
Target Release: ---   
Hardware: ppc64le   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-18 02:54:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Min Deng 2016-09-05 06:03:10 UTC
Description of problem:
STEAL time doesn't work on POWER
Version-Release number of selected component (if applicable):
kernel-3.10.0-495.el7.ppc64le
qemu-kvm-rhev-2.6.0-22.el7.ppc64le
RHEL7.3 - with kernel-3.10.0-495.el7.ppc64le
How reproducible:
5/5
Steps to Reproduce:
Settings
#mount -t cgroup -o cpuset cpuset /cgroup
#cd /cgroup
1. Create cgroups
# mkdir cpuset1
2. set cpus/mems
# echo 0 > cpuset1/cpuset.cpus  [0 means the host cpu 0]
# echo 0 > cpuset1/cpuset.mems  [0 means the host numa node 0]
or
# echo 120 > cpuset1/cpuset.cpus    
# echo 1 > cpuset1/cpuset.mems
My hosts
available: 2 nodes (0-1)
node 0 cpus: 0 8 16 24 32 40 48 56
node 0 size: 131072 MB
node 0 free: 121412 MB
node 1 cpus: 64 72 80 88 96 104 112 120
node 1 size: 131072 MB
node 1 free: 126452 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 
3. Boot two guests
/usr/libexec/qemu-kvm -smp 1...
root       6252 41.2  0.6 2446592 1845056 pts/2 SLl+ 01:18   5:06 /usr/libexec/qemu-kvm -name virt-tests-vm1 -sandbox off -machine pseries -nodefaults -vga std -serial unix:/tmp/socket-mazhang,server,nowait -qmp tcp:0:2221,server,nowait -m 2G -smp 1 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=06,disable-legacy=off,disable-modern=on -drive id=drive_disk1,if=none,snapshot=off,aio=threads,file=/home/RHEL2.qcow2 -device scsi-hd,id=disk1,drive=drive_disk1,bootindex=0 -vnc :0 -rtc base=utc,clock=host -boot menu=on -enable-kvm -monitor stdio -device virtio-mouse-pci,id=mouse0 -device virtio-keyboard-pci,id=kbd0 -chardev pty,id=pty0
root       6274 43.1  0.6 2406528 1842496 pts/0 SLl+ 01:18   5:13 /usr/libexec/qemu-kvm -name virt-tests-vm1 -sandbox off -machine pseries -nodefaults -vga std -serial unix:/tmp/socket-mazhang,server,nowait -qmp tcp:0:6661,server,nowait -m 2G -smp 1 -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=06,disable-legacy=off,disable-modern=on -drive id=drive_disk1,if=none,snapshot=off,aio=threads,file=/home/RHEL.qcow2 -device scsi-hd,id=disk1,drive=drive_disk1,bootindex=0 -vnc :1 -rtc base=utc,clock=host -boot menu=on -enable-kvm -monitor stdio -device virtio-mouse-pci,id=mouse0 -device virtio-keyboard-pci,id=kbd0 -chardev pty,id=pty0
4. Echo the two guests' PIDs to tasks
#echo xxx > cpuset1/tasks (including the guests' threads)
5. Run stress in both guests.
e.g.: for((;;));do x=1;done
6. Check the steal time inside both guests
#top
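
For the step that writes the guest PIDs to cpuset1/tasks: with the cgroup v1 tasks file, each write moves a single thread, so every thread of a qemu-kvm process has to be written separately. A minimal sketch of collecting those thread IDs follows. The helper name list_tids and its optional second argument (an alternative proc root, present only so the helper can be tried against an ordinary directory) are hypothetical and not part of the original report.

```shell
# list_tids PID [PROC_ROOT]
# Prints one thread ID per line for the given process by listing
# PROC_ROOT/PID/task/. PROC_ROOT defaults to /proc.
list_tids() {
    for d in "${2:-/proc}/$1"/task/*; do
        # Skip the unexpanded glob pattern when the process has gone away.
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
}
```

It could then be used as: for tid in $(list_tids 6252); do echo "$tid" > /cgroup/cpuset1/tasks; done (6252 being one of the qemu-kvm PIDs from the process listing above).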

Actual results:
st is 0% in both guests.

Expected results:
Both guests should show an st (steal) time of about 50%.
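
To check the st value without reading top interactively, the steal share can be estimated from two samples of the aggregate "cpu" line in the guest's /proc/stat, where steal is the eighth counter after the label. This is only a sketch: the function name steal_pct is hypothetical, and it approximates rather than reproduces whatever top computes internally.

```shell
# steal_pct BEFORE AFTER
# BEFORE and AFTER are two samples of the aggregate "cpu ..." line from
# /proc/stat. Prints the integer percentage of elapsed CPU time that was
# accounted as steal between the two samples.
steal_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2-9: user nice system idle iowait irq softirq steal.
            # The guest columns are skipped because the kernel already
            # counts that time in user/nice.
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            if (NR == 1) { t0 = total; s0 = $9 }
            else         { dt = total - t0; ds = $9 - s0 }
        }
        END {
            if (dt > 0) printf "%d\n", (100 * ds) / dt
            else print 0
        }'
}
```

Inside a guest it might be run as: before=$(grep '^cpu ' /proc/stat); sleep 5; after=$(grep '^cpu ' /proc/stat); steal_pct "$before" "$after"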

Additional info:
I checked with the x86 team; they cannot reproduce it on the x86 platform. Please let me know if there are any questions.

Comment 3 David Gibson 2016-11-18 02:54:56 UTC
I've discussed this with Paul Mackerras at IBM, and I believe we've determined the cause.  This is a side-effect of the way that in normal operation a POWER host can run more guest threads than host threads - that's because hardware-level threads can be used in the guest, but not in the host, due to restrictions of the virtualization hardware.

More specifically, the dynamic multithreading code we include means that although both VMs are bound to the same host thread, they could actually run on different "subcores" of the host core.  When this is the case, it won't get accounted as stolen time (the two VMs may still affect each other's performance, but how much depends on whether the workloads on each are using the same functional units in the CPU, so it can't be measured as an amount of time).

The test case will need to be adjusted for Power. There are two obvious ways to do this:

1) Disable dynamic multi-threading:

    echo 0 >/sys/module/kvm_hv/parameters/dynamic_mt_modes

With this executed before performing the test, the stolen time results should be as expected.

2) Increase each VM to 8 threads, and run 8 stress threads on each VM

This ensures that each VM occupies a whole host core, so they can't be packed onto the same core at the same time.
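
For option 1, a test harness could confirm that dynamic multi-threading really is off before it trusts the steal figures. A minimal sketch, assuming the parameter reads back as 0 after the echo above. The function name dynamic_mt_disabled and its optional path argument are hypothetical; the argument exists only so the check can be exercised against an ordinary file.

```shell
# dynamic_mt_disabled [PARAM_FILE]
# Succeeds when the kvm_hv dynamic_mt_modes parameter reads 0.
# Returns 2 when the parameter file cannot be read (for example when
# kvm_hv is not loaded), and 1 when it holds any other value.
dynamic_mt_disabled() {
    set -- "${1:-/sys/module/kvm_hv/parameters/dynamic_mt_modes}"
    [ -r "$1" ] || return 2
    [ "$(cat "$1")" = "0" ]
}
```

On the host it might be used as: dynamic_mt_disabled || echo "dynamic_mt_modes is not 0; steal time may read as zero"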