Bug 1393576
| Summary: | libvirt CPU scheduler scheduling most vCPUs onto first CPU | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Andreas Karis <akaris> |
| Component: | libvirt | Assignee: | Libvirt Maintainers <libvirt-maint> |
| Status: | CLOSED NOTABUG | QA Contact: | Virtualization Bugs <virt-bugs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 7.2 | CC: | berrange, dyuan, lhuang, rbalakri, xuzhang |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-30 15:18:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
The ask: We do have a (bad) mitigation for this scheduling issue, but we need an explanation and a permanent fix before Black Friday. The customer is afraid that this issue may reappear at any time. Also, we absolutely do not understand why the mitigation works. Currently, the issue persists on a few of the hypervisors so that we can analyze it.

(In reply to Andreas Karis from comment #0)
> Description of problem:
> The customer is observing high CPU steal values on instances on specific
> (most of) his hypervisors. After analysis of all hypervisors, it seems that
> most vCPUs get mostly scheduled on CPU 0 of the hypervisors. These
> hypervisors were configured with the isolcpus kernel command line parameter.
> The customer is aware that this is a bad configuration and that they will
> have to remove it eventually.
>
> Theory:
> - some bug in the scheduler (possibly triggered by isolcpus) puts most
> vCPUs on CPU 0 and thus creates high contention for that CPU and high steal
> values within the VMs

This is not a bug - it is normal behaviour of the isolcpus setting. When isolcpus is set, the kernel will *never* move processes between pCPUs - each process will stay on whichever pCPU it first launched on. As such, isolcpus should only ever be used on hosts where your VMs are set to use exclusive CPU pinning of vCPUs <-> pCPUs - in fact, more than that - isolcpus should *only* be used when running realtime guests with CPU pinning.

If you run VMs with floating CPUs on a host using isolcpus, their vCPUs will never float, so you can end up with far too many vCPUs on the same pCPU. If running VMs with floating CPUs, the nova.conf setting should be used instead of isolcpus if you want to reserve a subset of pCPUs for non-VM tasks.

Hi Daniel,

When you are saying

~~~
If running VMs with floating CPUs, the nova.conf setting should be used
if you want to reserve a subset of pCPUs for non-VM tasks, instead of
isolcpus.
~~~

do you mean this setting?
~~~
# Defines which pcpus that instance vcpus can use. For example, "4-12,^8,15"
# (string value)
#vcpu_pin_set=<None>
~~~

Thanks,

Andreas

Yes, vcpu_pin_set controls which host CPUs VMs are allowed to roam across.

Hi Daniel,

I know that this is getting a bit out of scope here, but the customer wants to reserve a few resources for the OS in case oversubscription gets too high. We are now setting (2 CPUs on each NUMA node):

~~~
vcpu_pin_set=2-9,12-39
reserved_host_memory_mb=512
~~~

I guess that the above will ensure that libvirt does not touch these resources, and thus the kernel / user space services other than libvirt would always have these resources reserved.

Please let me know if this makes sense,

Thanks,

Andreas

Yes, that is fine, though 512 MB is pretty low to be honest - host OS services plus the overhead of QEMU itself will easily consume that and more. 1 GB is probably a more realistic starting point.

Hi,

I'd also like to clarify:

~~~
the kernel will *never* move processes between pCPUs - each process will
stay on whichever pCPU it first launched on.
~~~

Does that apply to *all* CPUs or only to the CPU set within *isolcpus*? Because I think we saw that the other CPUs were still roaming; we only saw very high counts of vcpu<->pcpu mappings on the subset of pCPUs in the isolcpus parameter.

Regards,

Andreas

I was referring to the isolated cpus.

Also, and this is the last question (promised): does it make sense to run numad on these hypervisors and let it handle the pinning (I think that it would override vcpu_pin_set, though), or does numad have negative performance impacts? Thanks!

Nova has built-in support for NUMA placement, so you should not use numad; instead, enable Nova's NUMA features. This is required so that the Nova scheduler can intelligently place guests on compute nodes with sufficient space on their NUMA nodes.

Daniel, thank you very much for all of the great help!
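The distinction discussed here - which pCPUs a task is *allowed* to run on versus where it is *actually* running - can be checked straight from /proc, without virsh, since every qemu-kvm vCPU is just a Linux thread. A minimal sketch, assuming a Linux /proc filesystem; the current shell is used as a stand-in for a vCPU thread (on a hypervisor you would substitute a qemu-kvm thread ID):

```shell
# Allowed CPU set - this is what `virsh vcpupin` / vcpu_pin_set control:
grep Cpus_allowed_list /proc/self/status

# CPU the task last ran on (field 39 of /proc/<pid>/stat). With isolcpus,
# a task launched on an isolated CPU never leaves it, so this value
# stays constant even when the allowed list is wide:
awk '{print "last ran on CPU", $39}' /proc/self/stat
```

This is why an affinity listing can look healthy (a wide `yyyy...` mask) while the actual placement is badly skewed: affinity says where a task *may* run, not where the scheduler ever moves it.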
I am keeping this open for the time being, but the customer is currently testing this, and your explanations really filled the knowledge gaps that I had and helped us move this forward. Thanks a lot!!!

Thanks for the help!
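For reference, the host-reservation settings agreed on in this thread amount to a nova.conf fragment like the following. This is a sketch, not a verbatim customer config: the values come from the comments above, except that the memory figure uses Daniel's suggested 1 GB floor rather than the original 512 MB, and exact option placement may vary by OpenStack release.

```ini
[DEFAULT]
# Keep pCPUs 0-1 and 10-11 for the host OS and other non-VM tasks;
# VM vCPUs float across the remaining pCPUs. This replaces isolcpus
# for hosts running floating-CPU guests.
vcpu_pin_set=2-9,12-39

# Host OS services plus per-QEMU overhead easily exceed 512 MB,
# so reserve 1 GB as a more realistic starting point.
reserved_host_memory_mb=1024
```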
Description of problem:
The customer is observing high CPU steal values on instances on specific (most of) his hypervisors. After analysis of all hypervisors, it seems that most vCPUs get mostly scheduled on CPU 0 of the hypervisors. These hypervisors were configured with the isolcpus kernel command line parameter. The customer is aware that this is a bad configuration and that they will have to remove it eventually.

Theory:
- some bug in the scheduler (possibly triggered by isolcpus) puts most vCPUs on CPU 0 and thus creates high contention for that CPU and high steal values within the VMs

Version-Release number of selected component (if applicable):
~~~
libvirt-client-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-interface-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-network-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-nodedev-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-qemu-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-secret-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-storage-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-kvm-1.2.17-13.el7_2.4.x86_64
libvirt-python-1.2.17-2.el7.x86_64
~~~

How reproducible:
The issue can be observed on most hypervisors on which we have not yet taken action:
- run `virsh list | awk '{print $2}' | xargs -I {} virsh vcpuinfo {} | egrep '^CPU\:' | awk '{print $NF}' | sort | uniq -c | sort -nr` on all hypervisors to find the top scheduled CPUs at a given moment

~~~
d4-ucos-nova4 | SUCCESS | rc=0 >>
 39 0
  9 8
  9 6
(...)

d4-ucos-nova7 | SUCCESS | rc=0 >>
 55 0
  4 21
  4 13
  3 22
  3 18
(...)

d4-ucos-nova8 | SUCCESS | rc=0 >>
 44 0
  5 9
  4 5
  4 4
(...)

d4-ucos-nova10 | SUCCESS | rc=0 >>
 43 0
  7 9
  6 8
  6 7
  6 5
(...)

d4-ucos-nova14 | SUCCESS | rc=0 >>
 21 0
  3 21
  2 9
  2 7
(...)
~~~

The ask: We do have a mitigation for this scheduling issue, but need an explanation and a permanent fix.

How to mitigate the issue:
- create the script `pinning.sh`.
Modify `reserve_first` to change the CPU affinity of all VMs on a given hypervisor to `${reserve_first}-$((cpu_count - 1))`:

~~~
#!/bin/bash
#################################################
# Pins all of a hypervisor's VMs to the pCPU range
# ${reserve_first}..(cpu_count - 1), optionally skipping one instance,
# and lets the vCPUs roam freely within that range.
# 2016 - Red Hat - akaris
#################################################
cpu_count=$(lscpu | egrep '^CPU\(s\)' | awk '{print $NF}')
reserve_first=0
# Optional: name of one instance to leave untouched (unset in the
# original script; pass it as the first argument).
reserve_instance="$1"

echo "Adjusting pinning for all instances"
# tail/head strip the two header lines and the trailing blank line
# of `virsh list`, so only domain names reach the loop.
virsh list | tail -n+3 | head -n-1 | awk '{print $2}' | while read instance; do
    if [ "$instance" == "$reserve_instance" ]; then
        echo "Skipping instance $instance"
        continue
    fi
    echo "Adjusting pinning for $instance"
    virsh vcpupin $instance | egrep '^\s+[0-9]' | awk -F ':' '{print $1}' | while read vcpu; do
        virsh vcpupin $instance $vcpu ${reserve_first}-$((cpu_count - 1))
    done
    virsh vcpupin $instance
done

echo ""
echo "==============================================="
echo "Verification output"
echo "==============================================="
virsh list | tail -n+3 | head -n-1 | awk '{print $2}' \
    | xargs -I {} bash -c "echo {}; virsh vcpupin {}" 2>/dev/null
~~~

- run pinning.sh with `reserve_first=5` (5 was chosen randomly.
We would need to investigate whether 4 is the first effective value; if so, this likely has something to do with `isolcpus`.)
- run pinning.sh with `reserve_first=0`
- observe that CPU 0 is not being scheduled onto any more

Example on nova2 for the mitigation procedure:

same issue on nova2:
~~~
[root@d4-ucos-nova2 ~]# ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
 78 0
 11 22
  7 5
  7 20
  5 23
  4 9
  4 8
  3 6
  3 4
  3 19
  3 15
  3 14
  2 7
  2 21
  2 18
  2 13
  2 10
  1 28
  1 27
  1 12
  1 11
~~~

~~~
[root@d4-ucos-nova2 ~]# virsh list | awk '{print $2}' | xargs -I {} virsh vcpuinfo {} | grep Aff | uniq -c
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
 145 CPU Affinity: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
~~~

ran pinning.sh with 0-39 .. did not change anything ...

~~~
[root@d4-ucos-nova2 ~]# ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
 79 0
  9 7
  8 23
  7 22
  5 21
  4 6
  4 5
  3 9
  3 8
  3 4
  3 18
  2 20
  2 19
  2 16
  2 13
  2 12
  1 39
  1 38
  1 27
  1 17
  1 14
  1 11
  1 10
~~~

changing to 1-39:
~~~
[root@d4-ucos-nova2 ~]# ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
 79 1
 10 23
  6 6
  5 4
  5 19
  5 14
  4 9
  4 22
  4 21
  4 20
  3 8
  3 7
  3 5
  3 17
  2 15
  2 10
  1 18
  1 16
  1 11
~~~

changing to 5-39:
~~~
instance-00007656
VCPU: CPU Affinity
----------------------------------
   0: 5-39
   1: 5-39

[root@d4-ucos-nova2 ~]# ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
 20 23
 18 7
 17 22
 15 5
 14 8
 11 9
 11 21
  7 6
  7 24
  7 20
  4 17
  2 33
  2 28
  2 18
  2 16
  1 27
  1 26
  1 19
  1 13
  1 12
  1 10
~~~

changing back to 0-39:
~~~
instance-00007656
VCPU: CPU Affinity
----------------------------------
   0: 0-39
   1: 0-39

[root@d4-ucos-nova2 ~]# ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
 19 9
 17 23
 17 22
 14 5
 10 21
  9 8
  9 6
  8 7
  7 20
  4 11
  3 4
  3 27
  3 25
  3 19
  3 17
  3 16
  3 15
  2 28
  2 13
  1 35
  1 33
  1 29
  1 24
  1 18
  1 10
~~~
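The core of the `vcpu_scheduling.sh` diagnostic used throughout this bug is just a count of the `CPU:` lines that `virsh vcpuinfo` prints for every domain. A minimal, self-contained sketch of that pipeline, fed with fabricated sample output so it runs without libvirt (the `printf` stands in for `virsh list | awk '{print $2}' | xargs -I {} virsh vcpuinfo {}`):

```shell
# Fabricated vcpuinfo-style sample: three vCPUs last ran on pCPU 0,
# one each on pCPUs 5 and 12.
printf 'CPU: 0\nCPU: 0\nCPU: 0\nCPU: 5\nCPU: 12\n' \
  | awk '/^CPU:/ {print $NF}' \
  | sort | uniq -c | sort -nr
# The first output line is "3 0" (modulo uniq's padding):
# three vCPUs piled onto pCPU 0.
```

Read the outputs above the same way: a top line like `78 0` means 78 vCPUs were last scheduled on pCPU 0 at the sampling instant, which is the contention the customer observed as steal time.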