Bug 1657403 - CNV guest using all cores with all threads busy shows performance issues compared to KVM (single node)
Summary: CNV guest using all cores with all threads busy shows performance issues compared to KVM (single node)
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Seth Jennings
QA Contact: Jianwei Hou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-07 22:13 UTC by Jenifer Abrams
Modified: 2019-02-19 20:49 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-19 20:49:27 UTC
Target Upstream Version:
Embargoed:



Description Jenifer Abrams 2018-12-07 22:13:34 UTC
Description of problem:
Opening this bug to track an investigation into a performance issue seen with CNV for a "maxed out" VM. 

Running the same cpu-bound benchmark (blackscholes3) on RHEL KVM produces significantly better performance than on CNV for nearly all testcases when many vcpus are in use. For a single 32vcpu guest on a 32c Skylake host (HTon, 64 total cpus), CNV performance is similar to KVM when 1-8 threads are used for the testcase, but when 16 or 32 threads are used, CNV starts to fall behind: ~10% worse for 16 threads, and ~50-70% worse for 32 threads. 

There seem to be significant differences in host scheduling behavior..
Looking at the 32thread testcase on a Skylake host node w/ the following topology:
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63
( where HT siblings are: 0 & 32, 1 & 33, 2 & 34 and so on.. )
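(A quick way to double-check those sibling pairs on the host, if anyone wants to reproduce the layout -- this is just standard sysfs, nothing CNV-specific:
  # cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
  0,32
)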

Both the KVM and CNV guests use the same vcpu topology:
    <topology sockets='1' cores='32' threads='1'/>

On CNV the guest procs (vcpus) tend either to run on every even cpu (i.e. mostly on node0, using HT siblings) or to have utilization spread among all 64 logical cpus. The guest also shows some unexpected steal time, up to 10-15% on a couple of cpus. 
   guest mpstat: http://pbench.perf.lab.eng.bos.redhat.com/users/jhopper/perf146/gs/tuned_rhel7.5_CNV1.1_OCP3.10/VH-RETPboth-cpuPT-BIOS-flushguest-freshboot/1VM_32vcpu/CNV-default-7.5/cores/pbench-user-benchmark_blackscholes3_64-32Cores-defaultkern_HTon_1VM_32vcpu_100GB-CNV1.1_localhost_GSconfigs_BIOSoscntrlPerfEEP-VirtualHost_cpuPT-RHEL75z-noksm_blackscholes64_2018-12-07_1421_2018.12.07T20.21.43/1/reference-result/tools-default/vm1/mpstat/mpstat-stdout.txt
   host mpstat:  http://pbench.perf.lab.eng.bos.redhat.com/users/jhopper/perf146/gs/tuned_rhel7.5_CNV1.1_OCP3.10/VH-RETPboth-cpuPT-BIOS-flushguest-freshboot/1VM_32vcpu/CNV-default-7.5/cores/pbench-user-benchmark_blackscholes3_64-32Cores-defaultkern_HTon_1VM_32vcpu_100GB-CNV1.1_localhost_GSconfigs_BIOSoscntrlPerfEEP-VirtualHost_cpuPT-RHEL75z-noksm_blackscholes64_2018-12-07_1421_2018.12.07T20.21.43/1/reference-result/tools-default/perf146/mpstat/mpstat-stdout.txt

On KVM the guest procs (vcpus) tend to consolidate onto cores better, spanning both sockets more often and not using all HyperThreads. The guest shows no steal time. 
   guest mpstat: http://pbench.perf.lab.eng.bos.redhat.com/users/jhopper/perf146/gs/tuned_rhel7.5_CNV1.1_OCP3.10/VH-RETPboth-cpuPT-BIOS-flushguest-freshboot/1VM_32vcpu/KVM-default-7.5/cores/pbench-user-benchmark_blackscholes3_64-32Cores-defaultkern_HTon_1VM_32vcpu_100GB-KVM_localhost_GSconfigs_BIOSoscntrlPerfEEP-VirtualHostTuned_cpuPT-RHEL75z-noksm_blackscholes3_64-32Cores-defaultkern_2018-12-07_0953_2018.12.07T15.53.21/1/reference-result/tools-default/dhcp31-172/mpstat/mpstat-stdout.txt
   host mpstat: http://pbench.perf.lab.eng.bos.redhat.com/users/jhopper/perf146/gs/tuned_rhel7.5_CNV1.1_OCP3.10/VH-RETPboth-cpuPT-BIOS-flushguest-freshboot/1VM_32vcpu/KVM-default-7.5/cores/pbench-user-benchmark_blackscholes3_64-32Cores-defaultkern_HTon_1VM_32vcpu_100GB-KVM_localhost_GSconfigs_BIOSoscntrlPerfEEP-VirtualHostTuned_cpuPT-RHEL75z-noksm_blackscholes3_64-32Cores-defaultkern_2018-12-07_0953_2018.12.07T15.53.21/1/reference-result/tools-default/perf146/mpstat/mpstat-stdout.txt

Will get more scheduler data.. 

I am not sure if this behavior is related to the default millicores setting (i.e., could the OCP CPU Manager be configured to reserve a full core with "cpu=1000m"?), however I am not ready to upgrade OCP/CNV to try since we are currently staying on v1.1 for a customer eval.

I am also not sure whether the difference in parent PID plays any role in scheduling decisions, since on CNV the PPID of the qemu-kvm (vcpu) processes is virt-launcher while on RHEL KVM the PPID is init. 
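(A quick way to see that difference on each host, using only standard ps options:
  # ps -o pid,ppid,comm -C qemu-kvm
On the CNV node the PPID column points at the virt-launcher process; on the RHEL KVM host it is 1 / init.)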

KVM host:
3.10.0-862.14.3.el7.x86_64
KVM guest:
3.10.0-862.14.1.el7.x86_64

CNV 1.1 / OCP 3.10 
CNV host node:
3.10.0-862.14.4.el7.x86_64
CNV guest:
3.10.0-862.14.1.el7.x86_64

In all cases, using retpoline (on skylake).

Comment 1 Jenifer Abrams 2018-12-07 22:39:10 UTC
Actually,  HToff data shows this might not really be about "scheduler placement" (i.e. hyperthread usage and consolidation on a node).. 

CNV cores=32, 32c host HToff, 32thread testcase:

the guest shows even more steal & idle time:
03:59:10 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:59:12 PM  all   50.40    0.00    0.00    0.00    0.00    0.00   25.99    0.00    0.00   23.61

and the host shows idle time:
03:58:32 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
03:58:34 PM  all    0.16    0.00    0.11    0.00    0.00    0.00    0.00   69.01    0.00   30.72

--
while the same test on KVM shows no steal time, 100% guest as expected:

guest:
04:35:35 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:35:37 PM  all   99.94    0.00    0.00    0.00    0.00    0.00    0.02    0.00    0.00    0.05

host:
04:35:07 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:35:09 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.98    0.00    0.03

Will look into possible cgroup scheduler differences..
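(One way to check the cgroup placement of the qemu task, assuming a single qemu-kvm process on the host:
  # pid=$(pgrep -f qemu-kvm | head -1)
  # grep cpu /proc/$pid/cgroup
  # cat /sys/fs/cgroup/cpuacct/<path printed above>/cpu.shares
)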

Comment 2 Fabian Deutsch 2018-12-10 10:02:10 UTC
Good catch.

Adding Martin and targeting it to 1.3.1 due to potential customer impact. Let's revisit the target once we know the cause for this issue.

Comment 3 Jenifer Abrams 2018-12-11 00:10:27 UTC
This appears to be related to the division of cpu.shares among the kubevirt pods. I would like to understand the current cpu.shares logic and discuss how it might be improved. 
Also I am still on the older CNV 1.1 version so it would be good to confirm if CNV 1.3 still has the same behavior.

I am running CNV on a single baremetal 32core Skylake node and creating a single VM with cores=32.

By default my cpu.shares distribution looks like this for each kubepod related slice:

/sys/fs/cgroup/cpuacct/kubepods.slice/cpu.shares 
65536
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-besteffort.slice/cpu.shares
2
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares 
204
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod19572175_d196_11e8_a77f_b499ba08b6ce.slice/cpu.shares
(openvswitch_ovs-6w2ns_openshift-sdn)
102
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1962b9e1_d196_11e8_a77f_b499ba08b6ce.slice/cpu.shares 
(sdn-sqsn6_openshift-sdn)
102
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcf7662c2_fcc3_11e8_ac8f_b499ba08b6ce.slice/cpu.shares 
(compute_virt-launcher-vm1-v75wr_kube-system)
2


A few questions:
  - Does OCP or Kubevirt determine these cpu.shares values? Is it dynamic based on the pods that are running or static? 
  - I think adding the OCP CPU Manager feature may allow you to increase the overall number of shares, but does the current default distribution make sense for most use cases? 
  - Why is there a large gap between the kubepods.slice shares (65536) and its children kubepods-besteffort.slice and kubepods-burstable.slice in my case (total of 206)? Do those shares go to some other kubevirt component that doesn't have a child slice?
  - What determines the distribution of shares among the burstable-pods? OVS and SDN get many more shares than the VM Compute pod.

As a simple test without too much thought into the actual values, I tried to increase the shares for the compute virt-launcher pod with: 
# echo 1024 >  /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares 
  ^^^ which takes, although appears to get overwritten a bit later: 
   # cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares 
   204
echo 820 > /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcf7662c2_fcc3_11e8_ac8f_b499ba08b6ce.slice/cpu.shares
echo 800 > /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcf7662c2_fcc3_11e8_ac8f_b499ba08b6ce.slice/docker-93ce6df3fb9c87681c80ebab3d0e003c044cb56c03ef9d474463cc807e52e949.scope/cpu.shares 
      ^^ this one contains the qemu task


Even though the parent "kubepods-burstable.slice" value doesn't actually stick, assigning more shares to the container with qemu does improve performance to close to on par w/ KVM. I think this is because it now has a larger value than the OVS and SDN containers, so it gets a bigger slice of the parent.. which I think is still actually limited to 204 based on its own parent(?).
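For reference, a small sketch that prints each burstable pod slice's shares next to the sibling total (with the values above, the VM pod gets only 2 of 206 under contention):
  # cd /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice
  # total=$(cat kubepods-burstable-pod*.slice/cpu.shares | awk '{s+=$1} END {print s}')
  # for d in kubepods-burstable-pod*.slice; do echo "$d: $(cat $d/cpu.shares) of $total"; done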

Comment 4 Jenifer Abrams 2018-12-11 00:20:38 UTC
I should also note, with the increased cpu.shares test the steal time in the guest goes away, although there is still a bit more idle time in the guest compared to KVM.

Comment 5 Jenifer Abrams 2018-12-17 17:19:35 UTC
This behavior does still happen with CNV1.3, I will be trying CPU manager next to see if that helps.

I suspect many of the default cpu.shares values are coming from OpenShift, but my main concern is whether a value of 2 is a good default for the virt-launcher pod the VM runs in, considering the other non-VM pods under kubepods-burstable get 102 shares each.


Just a note of the workaround I have been using to increase cpu.shares for VM pods:
# cd  /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/

Check virt-launcher shares:   (should all show "2" by default. note: this 'cut' cmd will work as long as there is no "_" in oc vm name)
# for i in `docker ps | grep compute_virt-launcher | cut -f 5 -d "_"`; do cat kubepods-burstable-pod`echo $i | sed -e "s/-/_/g"`.slice/cpu.shares; cat kubepods-burstable-pod`echo $i | sed -e "s/-/_/g"`.slice/docker-`docker inspect --format="{{.Id}}" \`docker ps | grep $i | grep compute | cut -f 1 -d " "\``.scope/cpu.shares; done

Set to 998/1000:
# for i in `docker ps | grep compute_virt-launcher | cut -f 5 -d "_"`; do echo 1000 > kubepods-burstable-pod`echo $i | sed -e "s/-/_/g"`.slice/cpu.shares; echo 998 > kubepods-burstable-pod`echo $i | sed -e "s/-/_/g"`.slice/docker-`docker inspect --format="{{.Id}}" \`docker ps | grep $i | grep compute | cut -f 1 -d " "\``.scope/cpu.shares; done
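The same workaround written out more readably, in case it is easier to adapt (same assumptions as above: docker runtime, no "_" in the VM name, and the cgroup layout shown earlier):

cd /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/
for poduid in $(docker ps | grep compute_virt-launcher | cut -f 5 -d "_"); do
    slice=kubepods-burstable-pod$(echo $poduid | sed -e 's/-/_/g').slice
    cid=$(docker inspect --format '{{.Id}}' $(docker ps | grep $poduid | grep compute | cut -f 1 -d " "))
    echo 1000 > $slice/cpu.shares                      # pod slice
    echo 998 > $slice/docker-$cid.scope/cpu.shares     # container scope holding the qemu task
done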

Comment 6 Jenifer Abrams 2019-01-04 22:39:38 UTC
Enabling CPUManager uses the Guaranteed QoS class which gives us a much better cpu.shares value than the default Burstable class. For example, in my case I asked to reserve a full core with "kube-reserved: cpu=1000m":

-- Burstable class (default behavior) --

The VM pod only gets 2 of the parent kubepods-burstable 204 shares, no matter how many cores the VM requests:

# cat /sys/fs/cgroup/cpuacct/kubepods.slice/cpu.shares 
64512
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares 
204
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<compute_virt-launcher-vm>.slice/cpu.shares 
2
note: the parent's shares are also divided among these children, which get much larger values than the VM's shares (this is what hurts large VM performance):
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<openvswitch_ovs-sdn>.slice/cpu.shares
102
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<sdn_openshift-sdn>.slice/cpu.shares 
102


-- Guaranteed class (CPUManager) --

The VM pod gets a number of shares equivalent to its dedicated cpu request, allocated directly out of the top-level kubepods parent shares.
EX:
        cores: 32
        dedicatedCpuPlacement: true

# cat /sys/fs/cgroup/cpuacct/kubepods.slice/cpu.shares 
64512
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<compute_virt-launcher-vm>.slice/cpu.shares 
32768
note, parent shares are also divided among these children:
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-besteffort.slice/cpu.shares 
2
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares 
204
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-pod<sdn_openshift-sdn>.slice/cpu.shares 
102


The default value of 2 shares for the VM may just be default OpenShift behavior, and maybe that is acceptable since it is burstable QoS, but I also wonder what OpenShift logic sets the other burstable children (SDN, OVS) to a higher value, since that is what causes the large imbalance in total shares. I would like to understand whether certain types of burstable pods are designed to get more shares so they have a better chance to run.
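One thing that could be checked (sketch only; pod names are taken from the listing earlier in this bug) is whether those 102-share values simply mirror each pod's own cpu request, since 100m converts to 102 shares at 1024 shares per core:
  # oc get pod sdn-sqsn6 -n openshift-sdn -o jsonpath='{.spec.containers[*].resources.requests.cpu}'
  # oc get pod ovs-6w2ns -n openshift-sdn -o jsonpath='{.spec.containers[*].resources.requests.cpu}'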

Comment 7 Jenifer Abrams 2019-01-09 22:27:11 UTC
I should note, all of the above behavior is without any CFS bandwidth control (i.e. cpu.cfs_quota_us=-1), so the regression is due solely to the low default number of Burstable shares for the VM pod relative to the total.

Comment 8 Jenifer Abrams 2019-01-10 21:53:38 UTC
Moving to the openshift pod component for input since this is about the default number of shares assigned to the VM pods limiting large VM/pod performance.

Comment 9 Seth Jennings 2019-01-24 22:07:59 UTC
Ah yes, cpu requests on the Pod mapping to cpu.shares.  I'll try to make this short.

This has to do with Pod QoS tiers in Kubernetes.
https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

There are 3:
    Guaranteed (limits set with no requests, or requests = limits)
    Burstable (requests but no limits)
    BestEffort (no requests or limits)

kubelet equates 1024 shares to 1 core (or 1000 millicores)

A pod cpu request of 500m is a cpu.share of 512

kubepods.slice is the top level slice for pods.  Its cpu.shares are allocatable (allocatable = capacity - kube-reserved - system-reserved) cores converted to shares.  In the case of a 4 core machine with a 500m system-reserved setting on the kubelet, the cpu.shares will be 4096-512=3584.

Guaranteed pods run in a slice under kubepods.slice and have cpu.shares converted from the pod cpu request.  If the Guaranteed pod requests 800m of cpu, the cpu shares is 800 * (1024/1000) = 819.
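The two conversions above as quick shell arithmetic (same integer rounding the kubelet uses):
  # request_m=800; echo $(( request_m * 1024 / 1000 ))                            # -> 819
  # cores=4; reserved_m=500; echo $(( cores * 1024 - reserved_m * 1024 / 1000 ))  # -> 3584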

Burstable pods run in a slice under kubepods.slice/kubepods-burstable.slice.  The burstable slice has a cpu.shares = sum of all burstable pod requests (roughly due to rounding errors)

kubepods-burstable.slice]$ cat */cpu.shares | grep -v ^2$   # exclude the 2 shares per sandbox container
153
10
102
10
10
20
20
307
10
102
10
102
102 (sums to 958)
kubepods-burstable.slice]$ cat cpu.shares
962

BestEffort pods run in a slice under kubepods.slice/kubepods-besteffort.slice.  The besteffort slice has a cpu.shares = 2 (you request nothing, you get nothing under contention).  Each pod slice also has a cpu.shares = 2, which just means they get an equal share of nothing under contention.

If you aren't confused yet: because the system-reserved and kube-reserved settings lower the cpu.shares for kubepods.slice, at no QoS level do you actually get the cpu you request in actual cpu time. It is all relative, yet it is expressed in the Pod spec as an absolute value (cores). Your cpu request is effectively scaled down by the ratio node-allocatable-cpu / node-capacity-cpu (allocatable = capacity - kube-reserved - system-reserved).
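As a concrete sketch of that scaling on the node in this bug (64 logical cpus, kubepods.slice at 64512 shares per comment 6, and a 32-core request):
  # capacity=$(( 64 * 1024 )); allocatable=64512; request=$(( 32 * 1024 ))
  # echo $(( request * allocatable / capacity ))   # -> 32256, i.e. the 32-core request competes like ~31.5 cores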

Also keep in mind that cpu.shares is not for cpu capping.  If the machine is idle and a Burstable or BestEffort pod wants to use all the cpu in the box, it can.  Guaranteed pods are capped by CFS quota even if there is idle cpu time available.

Comment 11 Jenifer Abrams 2019-02-19 20:49:27 UTC
Sorry I haven't updated this bug in a while. 

Now I see that the cpu.shares are properly adjusted to the desired amount when setting requests = limits in the VM yaml; however, it is odd that the virt-launcher compute pod is kept in the Burstable QoS class in that case. This does not happen for a non-kubevirt pod. Remaining in Burstable does not impact performance, since the pod "is guaranteed to get the minimum amount of CPU requested"; it was just a bit unexpected to me.

For example for a regular pod, if I ask for:

    resources:
      requests:
        cpu: 32
        memory: "1G"
      limits:
        cpu: 32
        memory: "1G"

the shares are correct and in the parent kubepods.slice directory:
[root@perf146 kubepods.slice]# cat kubepods-podd5815ebd_346c_11e9_8da3_b499ba08b6ce.slice/cpu.shares 
32768
[root@perf146 kubepods.slice]# cat kubepods-podd5815ebd_346c_11e9_8da3_b499ba08b6ce.slice/crio-e632c53a102628bdc195bfa3452aca8a8da24477c3818bb13ad92e3d5aa01ac9.scope/cpu.shares 
32768


When I do something similar for a VM:

     spec:
      domain:
       cpu:
        sockets: 2
        cores: 16
[...]
       resources:
          overcommitGuestOverhead: true
          requests:
            cpu: 32
            memory: 100Gi
          limits:
            cpu: 32
            memory: 100Gi

the shares are correct, although the pod is still under the burstable slice:
[root@perf146 kubepods.slice]# cat kubepods-burstable.slice/kubepods-burstable-pod5bfe19ad_3476_11e9_8da3_b499ba08b6ce.slice/cpu.shares 
32768
[root@perf146 kubepods.slice]# cat kubepods-burstable.slice/kubepods-burstable-pod5bfe19ad_3476_11e9_8da3_b499ba08b6ce.slice/crio-5f5e84b9d4a16f33f996c028a0252fef497388b4b5006ec6fbddbc91b0884c7b.scope/cpu.shares
32768


So I will close this as NOTABUG now, since the performance issue for large VMs can be resolved by setting requests = limits (as designed), assuming it is otherwise fine that the pod remains in Burstable.

