Description of problem:

Opening this bug to track an investigation into a performance issue seen with CNV for a "maxed out" VM. Running the same CPU-bound benchmark (blackscholes3) on RHEL KVM produces significantly better performance than on CNV for nearly all testcases when many vCPUs are in use.

For a single 32-vCPU guest on a 32-core Skylake host (HT on, 64 logical CPUs total), CNV performance is similar to KVM when 1-8 threads are used for the testcase, but when 16 or 32 threads are used, CNV starts to fall behind: ~10% worse for 16 threads and ~50-70% worse for 32 threads. There seem to be significant differences in host scheduling behavior.

Looking at the 32-thread testcase on a Skylake host node with the following topology:

node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63

(where HT siblings are 0 & 32, 1 & 33, 2 & 34, and so on)

Both the KVM and CNV guests use the same vCPU topology:

<topology sockets='1' cores='32' threads='1'/>

On CNV the guest processes (vCPUs) tend to either run on every even CPU (i.e. mostly on node0, using HT siblings) or spread their utilization among all 64 logical CPUs. The guest also shows some unexpected steal time, up to 10-15% on a couple of CPUs.
guest mpstat: http://pbench.perf.lab.eng.bos.redhat.com/users/jhopper/perf146/gs/tuned_rhel7.5_CNV1.1_OCP3.10/VH-RETPboth-cpuPT-BIOS-flushguest-freshboot/1VM_32vcpu/CNV-default-7.5/cores/pbench-user-benchmark_blackscholes3_64-32Cores-defaultkern_HTon_1VM_32vcpu_100GB-CNV1.1_localhost_GSconfigs_BIOSoscntrlPerfEEP-VirtualHost_cpuPT-RHEL75z-noksm_blackscholes64_2018-12-07_1421_2018.12.07T20.21.43/1/reference-result/tools-default/vm1/mpstat/mpstat-stdout.txt

host mpstat: http://pbench.perf.lab.eng.bos.redhat.com/users/jhopper/perf146/gs/tuned_rhel7.5_CNV1.1_OCP3.10/VH-RETPboth-cpuPT-BIOS-flushguest-freshboot/1VM_32vcpu/CNV-default-7.5/cores/pbench-user-benchmark_blackscholes3_64-32Cores-defaultkern_HTon_1VM_32vcpu_100GB-CNV1.1_localhost_GSconfigs_BIOSoscntrlPerfEEP-VirtualHost_cpuPT-RHEL75z-noksm_blackscholes64_2018-12-07_1421_2018.12.07T20.21.43/1/reference-result/tools-default/perf146/mpstat/mpstat-stdout.txt

On KVM the guest processes (vCPUs) tend to consolidate on cores better, spanning both sockets more often without using all hyperthread siblings. The guest shows no steal time.
guest mpstat: http://pbench.perf.lab.eng.bos.redhat.com/users/jhopper/perf146/gs/tuned_rhel7.5_CNV1.1_OCP3.10/VH-RETPboth-cpuPT-BIOS-flushguest-freshboot/1VM_32vcpu/KVM-default-7.5/cores/pbench-user-benchmark_blackscholes3_64-32Cores-defaultkern_HTon_1VM_32vcpu_100GB-KVM_localhost_GSconfigs_BIOSoscntrlPerfEEP-VirtualHostTuned_cpuPT-RHEL75z-noksm_blackscholes3_64-32Cores-defaultkern_2018-12-07_0953_2018.12.07T15.53.21/1/reference-result/tools-default/dhcp31-172/mpstat/mpstat-stdout.txt

host mpstat: http://pbench.perf.lab.eng.bos.redhat.com/users/jhopper/perf146/gs/tuned_rhel7.5_CNV1.1_OCP3.10/VH-RETPboth-cpuPT-BIOS-flushguest-freshboot/1VM_32vcpu/KVM-default-7.5/cores/pbench-user-benchmark_blackscholes3_64-32Cores-defaultkern_HTon_1VM_32vcpu_100GB-KVM_localhost_GSconfigs_BIOSoscntrlPerfEEP-VirtualHostTuned_cpuPT-RHEL75z-noksm_blackscholes3_64-32Cores-defaultkern_2018-12-07_0953_2018.12.07T15.53.21/1/reference-result/tools-default/perf146/mpstat/mpstat-stdout.txt

Will get more scheduler data. I am not sure if this behavior is related to the default millicores setting (i.e. could configuring OCP CPU Manager to a full core with "cpu=1000m" help?), however I am not ready to upgrade OCP/CNV to try since we are currently staying on v1.1 for a customer eval. Also not sure if the difference in parent PID plays any role in scheduling decisions, since on CNV the PPID of the qemu-kvm (vCPU) processes is virt-launcher while on RHEL KVM the PPID is init.

KVM host: 3.10.0-862.14.3.el7.x86_64
KVM guest: 3.10.0-862.14.1.el7.x86_64

CNV 1.1 / OCP 3.10
CNV host node: 3.10.0-862.14.4.el7.x86_64
CNV guest: 3.10.0-862.14.1.el7.x86_64

In all cases, using retpoline (on Skylake).
Actually, HT-off data shows this might not really be about "scheduler placement" (i.e. hyperthread usage and consolidation on a node).

CNV, cores=32, 32-core host, HT off, 32-thread testcase: the guest shows even more steal and idle time:

03:59:10 PM  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
03:59:12 PM  all  50.40   0.00  0.00     0.00  0.00   0.00   25.99    0.00    0.00  23.61

and the host shows idle time:

03:58:32 PM  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
03:58:34 PM  all   0.16   0.00  0.11     0.00  0.00   0.00    0.00   69.01    0.00  30.72

while the same test on KVM shows no steal time and ~100% guest time, as expected:

guest:
04:35:35 PM  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
04:35:37 PM  all  99.94   0.00  0.00     0.00  0.00   0.00    0.02    0.00    0.00   0.05

host:
04:35:07 PM  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
04:35:09 PM  all   0.00   0.00  0.00     0.00  0.00   0.00    0.00   99.98    0.00   0.03

Will look into possible cgroup scheduler differences.
Good catch. Adding Martin and targeting it to 1.3.1 due to potential customer impact. Let's revisit the target once we know the cause for this issue.
This appears to be related to the division of cpu.shares among the kubevirt pods. I would like to understand the current cpu.shares logic and discuss how it might be improved. I am also still on the older CNV 1.1, so it would be good to confirm whether CNV 1.3 still shows the same behavior.

I am running CNV on a single baremetal 32-core Skylake node and creating a single VM with cores=32. By default my cpu.shares distribution looks like this for each kubepod-related slice:

/sys/fs/cgroup/cpuacct/kubepods.slice/cpu.shares
65536
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-besteffort.slice/cpu.shares
2
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares
204
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod19572175_d196_11e8_a77f_b499ba08b6ce.slice/cpu.shares (openvswitch_ovs-6w2ns_openshift-sdn)
102
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1962b9e1_d196_11e8_a77f_b499ba08b6ce.slice/cpu.shares (sdn-sqsn6_openshift-sdn)
102
/sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcf7662c2_fcc3_11e8_ac8f_b499ba08b6ce.slice/cpu.shares (compute_virt-launcher-vm1-v75wr_kube-system)
2

A few questions:
- Does OCP or KubeVirt determine these cpu.shares values? Are they dynamic based on the pods that are running, or static?
- I think adding the OCP CPU Manager feature may allow increasing the overall number of shares, but does the current default distribution make sense for most use cases?
- Why is there a large gap between the kubepods.slice shares (65536) and its children kubepods-besteffort.slice and kubepods-burstable.slice (a total of 206 in my case)? Do those shares go to some other kubevirt component that doesn't have a child slice?
- What determines the distribution of shares among the burstable pods? OVS and SDN get many more shares than the VM compute pod.
As a simple test, without too much thought into the actual values, I tried to increase the shares for the compute virt-launcher pod with:

# echo 1024 > /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares

^^^ which takes, although it appears to get overwritten a bit later:

# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares
204

# echo 820 > /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcf7662c2_fcc3_11e8_ac8f_b499ba08b6ce.slice/cpu.shares
# echo 800 > /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podcf7662c2_fcc3_11e8_ac8f_b499ba08b6ce.slice/docker-93ce6df3fb9c87681c80ebab3d0e003c044cb56c03ef9d474463cc807e52e949.scope/cpu.shares

^^ this one contains the qemu task

Even though the parent kubepods-burstable.slice value doesn't actually stick, assigning more shares to the container with qemu improves performance to close to on par with KVM. I think this is because it now has a larger value than the OVS and SDN containers, so it gets a bigger slice of the parent, which I believe is still limited to 204 shares by its parent(?).
I should also note, with the increased cpu.shares test the steal time in the guest goes away, although there is still a bit more idle time in the guest compared to KVM.
This behavior does still happen with CNV 1.3; I will be trying CPU Manager next to see if that helps. I suspect many of the default cpu.shares values are coming from OpenShift, but my main concern is whether a value of 2 is a good default for the virt-launcher pod the VM runs in, considering the other non-VM pods under kubepods-burstable get 102 shares each.

Just a note of the workaround I have been using to increase cpu.shares for VM pods:

# cd /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/

Check virt-launcher shares (should all show "2" by default; note: this 'cut' cmd will work as long as there is no "_" in the oc VM name):

# for i in `docker ps | grep compute_virt-launcher | cut -f 5 -d "_"`; do \
    cat kubepods-burstable-pod`echo $i | sed -e "s/-/_/g"`.slice/cpu.shares; \
    cat kubepods-burstable-pod`echo $i | sed -e "s/-/_/g"`.slice/docker-`docker inspect --format="{{.Id}}" \`docker ps | grep $i | grep compute | cut -f 1 -d " "\``.scope/cpu.shares; \
  done

Set to 998/1000:

# for i in `docker ps | grep compute_virt-launcher | cut -f 5 -d "_"`; do \
    echo 1000 > kubepods-burstable-pod`echo $i | sed -e "s/-/_/g"`.slice/cpu.shares; \
    echo 998 > kubepods-burstable-pod`echo $i | sed -e "s/-/_/g"`.slice/docker-`docker inspect --format="{{.Id}}" \`docker ps | grep $i | grep compute | cut -f 1 -d " "\``.scope/cpu.shares; \
  done
Enabling CPUManager uses the Guaranteed QoS class, which gives us a much better cpu.shares value than the default Burstable class. For example, in my case I asked to reserve a full core with "kube-reserved: cpu=1000m".

-- Burstable class (default behavior) --

The VM pod only gets 2 of the parent kubepods-burstable slice's 204 shares, no matter how many cores the VM requests:

# cat /sys/fs/cgroup/cpuacct/kubepods.slice/cpu.shares
64512
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares
204
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<compute_virt-launcher-vm>.slice/cpu.shares
2

Note that the parent's shares are also divided among these children (much larger than the VM's shares, which impacts large-VM performance):

# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<openvswitch_ovs-sdn>.slice/cpu.shares
102
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<sdn_openshift-sdn>.slice/cpu.shares
102

-- Guaranteed class (CPUManager) --

The VM pod gets a number of shares equivalent to its dedicated CPU request, counted directly against the top-level kubepods parent shares.
EX:

  cores: 32
  dedicatedCpuPlacement: true

# cat /sys/fs/cgroup/cpuacct/kubepods.slice/cpu.shares
64512
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<compute_virt-launcher-vm>.slice/cpu.shares
32768

Note that the parent's shares are also divided among these children:

# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-besteffort.slice/cpu.shares
2
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/cpu.shares
204
# cat /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-pod<sdn_openshift-sdn>.slice/cpu.shares
102

The default value of 2 shares for the VM may just be default OpenShift behavior, and maybe that is acceptable since it is the Burstable QoS class, but I also wonder what OpenShift logic sets the other burstable children (SDN, OVS) to a higher value, since that is what causes the large imbalance in total shares. I would like to understand whether certain types of burstable pods are designed to get more shares for a better chance to run.
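To put the imbalance above in rough numbers — this is just back-of-the-envelope shell arithmetic using the share values quoted in this comment, not output from a node:

```shell
# Default Burstable split: the VM pod holds 2 of the shares divided
# among its burstable siblings (2 + 102 + 102 = 206), so under
# contention inside the slice it is entitled to roughly 1% of it:
awk 'BEGIN { printf "%.1f%%\n", 100 * 2 / (2 + 102 + 102) }'

# Guaranteed class (CPUManager): a 32-core dedicated request converts
# to 32 * 1024 = 32768 shares, competing directly under kubepods.slice:
echo $((32 * 1024))
```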
I should note, all the above behavior is without any CFS bandwidth control (i.e. cpu.cfs_quota_us=-1), so the regression is due only to the low default number of Burstable shares the VM pod gets out of the total.
Moving to the openshift pod component for input since this is about the default number of shares assigned to the VM pods limiting large VM/pod performance.
Ah yes, cpu requests on the Pod mapping to cpu.shares. I'll try to make this short. This has to do with Pod QoS tiers in Kubernetes: https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/

There are 3:
Guaranteed (limits but no requests, or requests = limits)
Burstable (requests but no limits)
BestEffort (no requests or limits)

kubelet equates 1024 shares to 1 core (or 1000 millicores), so a pod cpu request of 500m is a cpu.shares of 512.

kubepods.slice is the top-level slice for pods. Its cpu.shares is the allocatable (allocatable = capacity - kube-reserved - system-reserved) cores converted to shares. In the case of a 4-core machine with a 500m system-reserved setting on the kubelet, the cpu.shares will be 4096 - 512 = 3584.

Guaranteed pods run in a slice directly under kubepods.slice and have cpu.shares converted from the pod cpu request. If a Guaranteed pod requests 800m of cpu, its cpu.shares is 800 * (1024/1000) = 819.

Burstable pods run in a slice under kubepods.slice/kubepods-burstable.slice. The burstable slice has cpu.shares = sum of all burstable pod requests (roughly, due to rounding errors):

kubepods-burstable.slice]$ cat */cpu.shares | grep -v ^2$   (exclude 2 shares per sandbox container)
153
10
102
10
10
20
20
307
10
102
10
102
102
(sums to 958)
kubepods-burstable.slice]$ cat cpu.shares
962

BestEffort pods run in a slice under kubepods.slice/kubepods-besteffort.slice. The besteffort slice has cpu.shares = 2 (you request nothing, you get nothing under contention). Each pod slice also has cpu.shares = 2, which just means they get an equal share of nothing under contention.

If you aren't confused yet: because the system-reserved and kube-reserved settings drop the cpu.shares for kubepods.slice, at no QoS level do you actually get the CPU you request in actual CPU time. It is all relative, yet expressed in the Pod spec as an absolute value (cores).
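The conversions described above can be sanity-checked with plain shell arithmetic (the numbers are just the examples from this comment):

```shell
# kubelet converts millicores to cpu.shares at 1024 shares per 1000m,
# so a 500m request is 512 shares and an 800m request is 819 shares:
echo $((500 * 1024 / 1000))   # 512
echo $((800 * 1024 / 1000))   # 819

# kubepods.slice gets the allocatable cores converted to shares: a
# 4-core machine with system-reserved cpu=500m ends up with 4096 - 512:
echo $((4 * 1024 - 512))      # 3584
```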
Your effective cpu share is scaled down by the ratio node-allocatable-cpu / node-capacity-cpu (allocatable = capacity - kube-reserved - system-reserved). Also keep in mind that cpu.shares is not a CPU cap: if the machine is idle and a Burstable or BestEffort pod wants to use all the CPU in the box, it can. Guaranteed pods, on the other hand, are capped by CFS quota even if there is idle CPU time available.
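As a sketch of the quota cap mentioned above — assuming the default 100000us CFS period, and using the formula kubelet uses to derive cfs_quota from a millicore limit (quota = limit_millicores * period / 1000); the 32-core limit here is just a hypothetical example:

```shell
# A Guaranteed pod with a 32-core (32000m) cpu limit and the default
# 100000us CFS period gets a per-period runtime quota of
# 32000 * 100000 / 1000 = 3200000us, i.e. 32 CPUs' worth of time.
period_us=100000
limit_millicores=32000
echo $(( limit_millicores * period_us / 1000 ))   # 3200000
```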
Sorry I haven't updated this bug in a while. I now see that the cpu.shares are properly adjusted to the desired amount when setting requests = limits in the VM yaml; however, it is odd that the virt-launcher compute pod is kept in the Burstable QoS class in that case. This does not happen for a non-kubevirt pod. Remaining in Burstable does not impact performance, since the pod "is guaranteed to get the minimum amount of CPU requested"; it was just a bit unexpected to me.

For example, for a regular pod, if I ask for:

  resources:
    requests:
      cpu: 32
      memory: "1G"
    limits:
      cpu: 32
      memory: "1G"

the shares are correct and in the parent kubepods.slice directory:

[root@perf146 kubepods.slice]# cat kubepods-podd5815ebd_346c_11e9_8da3_b499ba08b6ce.slice/cpu.shares
32768
[root@perf146 kubepods.slice]# cat kubepods-podd5815ebd_346c_11e9_8da3_b499ba08b6ce.slice/crio-e632c53a102628bdc195bfa3452aca8a8da24477c3818bb13ad92e3d5aa01ac9.scope/cpu.shares
32768

When I do something similar for a VM:

  spec:
    domain:
      cpu:
        sockets: 2
        cores: 16
      [...]
      resources:
        overcommitGuestOverhead: true
        requests:
          cpu: 32
          memory: 100Gi
        limits:
          cpu: 32
          memory: 100Gi

the shares are correct, although the pod is still under the burstable slice:

[root@perf146 kubepods.slice]# cat kubepods-burstable.slice/kubepods-burstable-pod5bfe19ad_3476_11e9_8da3_b499ba08b6ce.slice/cpu.shares
32768
[root@perf146 kubepods.slice]# cat kubepods-burstable.slice/kubepods-burstable-pod5bfe19ad_3476_11e9_8da3_b499ba08b6ce.slice/crio-5f5e84b9d4a16f33f996c028a0252fef497388b4b5006ec6fbddbc91b0884c7b.scope/cpu.shares
32768

So I will close this as NOTABUG, since the performance issue for large VMs can be resolved by setting requests = limits (as designed), assuming it is otherwise fine that the pod remains in Burstable.