Bug 1393576 - libvirt CPU scheduler scheduling most vCPUs onto first CPU
Summary: libvirt CPU scheduler scheduling most vCPUs onto first CPU
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Libvirt Maintainers
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-09 21:58 UTC by Andreas Karis
Modified: 2019-12-16 07:21 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-11-30 15:18:11 UTC
Target Upstream Version:
Embargoed:


Attachments: none

Description Andreas Karis 2016-11-09 21:58:03 UTC
Description of problem:
The customer is observing high CPU steal values on instances running on specific (in fact, most of) their hypervisors. After analyzing all hypervisors, it appears that most vCPUs are scheduled onto CPU 0 of each hypervisor. These hypervisors were configured with the isolcpus kernel command-line parameter. The customer is aware that this is a bad configuration and that they will eventually have to remove it.

Theory:
- a bug in the scheduler (possibly triggered by isolcpus) puts most vCPUs on CPU 0, creating high contention for that CPU and high steal values within the VMs

Version-Release number of selected component (if applicable):
libvirt-client-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-interface-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-network-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-nodedev-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-nwfilter-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-qemu-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-secret-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-driver-storage-1.2.17-13.el7_2.4.x86_64
libvirt-daemon-kvm-1.2.17-13.el7_2.4.x86_64
libvirt-python-1.2.17-2.el7.x86_64


How reproducible:
The issue can be observed on most hypervisors on which we have not yet taken action:
- run `virsh list | awk '{print $2}' | xargs -I {} virsh vcpuinfo {} | egrep '^CPU\:' | awk '{print $NF}' | sort | uniq -c | sort -nr` on all hypervisors to find, at a given moment, how many vCPUs are scheduled on each pCPU (first column: vCPU count, second column: pCPU number)
~~~
d4-ucos-nova4 | SUCCESS | rc=0 >>
     39 0
      9 8
      9 6
(...)
d4-ucos-nova7 | SUCCESS | rc=0 >>
     55 0
      4 21
      4 13
      3 22
      3 18
(...)
d4-ucos-nova8 | SUCCESS | rc=0 >>
     44 0
      5 9
      4 5
      4 4
(...)
d4-ucos-nova10 | SUCCESS | rc=0 >>
     43 0
      7 9
      6 8
      6 7
      6 5
(...)
d4-ucos-nova14 | SUCCESS | rc=0 >>
     21 0
      3 21
      2 9
      2 7
(...)
~~~
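
The `./vcpu_scheduling.sh` helper referenced in the outputs further below is not attached here; presumably it just wraps the one-liner above, roughly as follows (a sketch, not the exact script):
~~~
#!/bin/bash
# Presumed content of vcpu_scheduling.sh: count, per physical CPU, how many
# vCPUs of all running domains are currently running on it.
# The "error: failed to get domain 'Name'" lines in the outputs below come
# from the 'virsh list' header word "Name" being fed to 'virsh vcpuinfo'.
virsh list | awk '{print $2}' \
  | xargs -I {} virsh vcpuinfo {} \
  | egrep '^CPU\:' | awk '{print $NF}' \
  | sort | uniq -c | sort -nr
~~~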

The ask:
We have a mitigation for this scheduling issue, but we need an explanation and a permanent fix.

How to mitigate the issue:
- create the script `pinning.sh` below; modify `reserve_first` to change the CPU affinity of all VMs on a given hypervisor to `${reserve_first}-$(( cpu_count - 1 ))`
~~~
#!/bin/bash
#
#################################################
# Re-pins the vCPUs of every running VM on this hypervisor to the pCPU
# range ${reserve_first}-(cpu_count - 1), letting them roam freely within
# that range. An instance name may be passed as $1 to exclude it.
# 2016 - Red Hat - akaris
#################################################

cpu_count=$(lscpu | egrep '^CPU\(s\)' | awk '{print $NF}')
reserve_first=0
reserve_instance="$1"   # optional: instance to leave untouched

echo "Adjusting pinning for all instances"
# 'virsh list' output: skip the two header lines and the trailing blank line
virsh list | awk '{print $2}' | tail -n+3 | head -n-1 | while read instance; do
  if [ "$instance" == "$reserve_instance" ]; then
    echo "Skipping instance $instance"
    continue
  fi
  echo "Adjusting pinning for $instance"
  # Re-pin every vCPU of the instance to the allowed pCPU range
  virsh vcpupin $instance | egrep '^\s+[0-9]' | awk -F ':' '{print $1}' | while read vcpu; do
    virsh vcpupin $instance $vcpu ${reserve_first}-$(( cpu_count - 1 ))
  done
  virsh vcpupin $instance
done

echo ""
echo "==============================================="
echo "Verification output"
echo "==============================================="
virsh list | awk '{print $2}' | xargs -I {} bash -c "echo {}; virsh vcpupin {}" 2>/dev/null
~~~
- run pinning.sh with `reserve_first=5` (5 was chosen arbitrarily; we would need to investigate whether 4 is the first effective value, and if so, this likely has something to do with `isolcpus`) (see the example run after this list)
- run pinning.sh again with `reserve_first=0`
- observe that vCPUs are no longer being scheduled predominantly on CPU 0
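
For illustration, a hypothetical mitigation run on a 40-pCPU hypervisor (the `sed` edit of `reserve_first` is just one way to set it):
~~~
# Hypothetical example: exclude pCPUs 0-4, re-pin all instances, then verify
sed -i 's/^reserve_first=.*/reserve_first=5/' pinning.sh
./pinning.sh
# The per-instance verification output should then look like:
#   VCPU: CPU Affinity
#   ----------------------------------
#      0: 5-39
#      1: 5-39
~~~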

Example of the mitigation procedure on nova2 (the same issue is present on nova2):
~~~
[root@d4-ucos-nova2 ~]# ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
     78 0
     11 22
      7 5
      7 20
      5 23
      4 9
      4 8
      3 6
      3 4
      3 19
      3 15
      3 14
      2 7
      2 21
      2 18
      2 13
      2 10
      1 28
      1 27
      1 12
      1 11
~~~

~~~
[root@d4-ucos-nova2 ~]#  virsh list | awk '{print $2}' | xargs -I {} virsh vcpuinfo {} | grep Aff  | uniq -c
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
    145 CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
~~~

Ran pinning.sh with 0-39; this did not change anything:

~~~
[root@d4-ucos-nova2 ~]#  ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
     79 0
      9 7
      8 23
      7 22
      5 21
      4 6
      4 5
      3 9
      3 8
      3 4
      3 18
      2 20
      2 19
      2 16
      2 13
      2 12
      1 39
      1 38
      1 27
      1 17
      1 14
      1 11
      1 10
~~~

Changing to 1-39:
~~~
[root@d4-ucos-nova2 ~]# ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
     79 1
     10 23
      6 6
      5 4
      5 19
      5 14
      4 9
      4 22
      4 21
      4 20
      3 8
      3 7
      3 5
      3 17
      2 15
      2 10
      1 18
      1 16
      1 11
~~~

Changing to 5-39:
~~~
instance-00007656
VCPU: CPU Affinity
----------------------------------
   0: 5-39
   1: 5-39

[root@d4-ucos-nova2 ~]# ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
     20 23
     18 7
     17 22
     15 5
     14 8
     11 9
     11 21
      7 6
      7 24
      7 20
      4 17
      2 33
      2 28
      2 18
      2 16
      1 27
      1 26
      1 19
      1 13
      1 12
      1 10
~~~

Changing back to 0-39:
~~~
instance-00007656
VCPU: CPU Affinity
----------------------------------
   0: 0-39
   1: 0-39

[root@d4-ucos-nova2 ~]# ./vcpu_scheduling.sh
error: failed to get domain 'Name'
error: Domain not found: no domain with matching name 'Name'
     19 9
     17 23
     17 22
     14 5
     10 21
      9 8
      9 6
      8 7
      7 20
      4 11
      3 4
      3 27
      3 25
      3 19
      3 17
      3 16
      3 15
      2 28
      2 13
      1 35
      1 33
      1 29
      1 24
      1 18
      1 10
~~~

Comment 1 Andreas Karis 2016-11-09 22:10:12 UTC
The ask:
We do have a (bad) mitigation for this scheduling issue, but we need an explanation and a permanent fix before Black Friday. The customer is afraid that this issue may reappear at any time. Also, we absolutely do not understand why the mitigation works. Currently, the issue still persists on a few of the hypervisors so that we can analyze it.

Comment 3 Daniel Berrangé 2016-11-10 09:04:34 UTC
(In reply to Andreas Karis from comment #0)
> Description of problem:
> The customer is observing high CPU steal values on instances on specific
> (most of) his hypervisors. After analysis of all hypervisors, it seems that
> most vCPUs get mostly scheduled on CPU 0 of the hypervisors. These
> hypervisors were configured with isolcpus kernel command line parameter. The
> customer is aware that this is a bad configuration and that they will have
> to remove this eventually.
> 
> Theory:
> - some bug in the scheduler (possibly triggered due to isolcpus) puts most
> vCPUs on CPU 0 and thus creates high contention for that CPU and high steal
> values within the VMs

This is not a bug - it is normal behaviour of the isolcpus setting. When isolcpus is set, the kernel will *never* move processes between pCPUs - each process will stay on whichever pCPU it was first launched on. As such, isolcpus should only ever be used on hosts where your VMs are set up with exclusive vCPU <-> pCPU pinning - in fact, more than that - isolcpus should *only* be used when running realtime guests with CPU pinning.

If you run VMs with floating CPUs on a host using isolcpus, their vCPUs will never float, so you can end up with far too many vCPUs on the same pCPU.

If running VMs with floating CPUs, the nova.conf setting should be used if you want to reserve a subset of pCPUs for non-VM tasks, instead of isolcpus.
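
For completeness, checking whether (and which) CPUs are isolated on a given hypervisor can be done with generic commands like the following (not specific to this environment):
~~~
# Show the isolcpus= argument, if any, on the running kernel's command line
grep -o 'isolcpus=[^ ]*' /proc/cmdline
# On kernels that expose it, the isolated set is also visible via sysfs
cat /sys/devices/system/cpu/isolated 2>/dev/null
~~~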

Comment 4 Andreas Karis 2016-11-10 16:21:03 UTC
Hi Daniel,

When you are saying 

~~~
If running VMs with floating CPUs, the nova.conf setting should be used if you want to reserve a subset of pCPUs for non-VM tasks, instead of isolcpus.
~~~

do you mean this setting?

# Defines which pcpus that instance vcpus can use. For example, "4-12,^8,15"
# (string value)
#vcpu_pin_set=<None>

Thanks,

Andreas

Comment 5 Daniel Berrangé 2016-11-10 16:32:02 UTC
Yes, vcpu_pin_set controls which host CPUs VMs are allowed to roam across

Comment 6 Andreas Karis 2016-11-10 16:58:03 UTC
Hi Daniel,

I know that this is getting a bit out of scope here, but the customer wants to reserve a few resources for the OS in case oversubscription gets too high.

We are now setting the following (reserving 2 CPUs on each NUMA node):
vcpu_pin_set=2-9,12-39
reserved_host_memory_mb=512

I assume that the above will ensure that libvirt does not touch those resources, and thus the kernel and user-space services other than libvirt will always have them reserved.
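
As a sanity check after restarting nova-compute (hypothetical commands; `<instance>` is a placeholder for a guest launched after the change), the restriction should show up in the guest's CPU affinity and domain XML:
~~~
# vCPUs of new guests should only be allowed on the pCPUs from vcpu_pin_set
virsh vcpuinfo <instance> | grep 'CPU Affinity'
virsh dumpxml <instance> | grep '<vcpu'
~~~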

Please let me know if this makes sense,

Thanks,

Andreas

Comment 7 Daniel Berrangé 2016-11-10 17:01:26 UTC
Yes, that is fine, though 512 MB is pretty low to be honest - host OS services + overhead of QEMU itself will easily consume that and more. 1 GB is probably a more realistic starting point.

Comment 8 Andreas Karis 2016-11-10 17:04:10 UTC
Hi,

I'd also like to clarify:

~~~
the kernel will *never* move processes between pCPUs - each process will stay on whichever pCPU it first launched on. 
~~~

Does that apply to *all* CPUs or only to the CPU set within *isolcpus*? I ask because I think we saw that the other CPUs were still roaming; we only saw very high counts of vCPU<->pCPU mappings on the subset of pCPUs listed in the isolcpus parameter.

Regards,

Andreas

Comment 9 Daniel Berrangé 2016-11-10 17:05:47 UTC
I was referring to the isolated CPUs.

Comment 10 Andreas Karis 2016-11-10 17:06:08 UTC
Also, and this is the last question (promised): does it make sense to run numad on these hypervisors and let it handle the pinning (I think that it will overwrite the vcpu_pin_set, though), or does numad have negative performance impacts?

Thanks!

Comment 11 Daniel Berrangé 2016-11-10 17:07:21 UTC
Nova has built-in support for NUMA placement, so you should not use numad; instead, enable Nova's NUMA features. This is required so that the Nova scheduler can intelligently place guests on compute nodes with sufficient space on their NUMA nodes.
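
For example (hypothetical flavor name; assumes the NUMATopologyFilter is enabled in the scheduler filters):
~~~
# Request a single guest NUMA node for instances of this flavor
nova flavor-key m1.large set hw:numa_nodes=1
~~~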

Comment 12 Andreas Karis 2016-11-10 17:44:21 UTC
Daniel, thank you very much for all of the great help! I am keeping this open for the time being, but the customer is currently testing this and your explanations really filled the knowledge gaps that I had and helped us move this forward.

Thanks a lot!!!

Comment 13 Andreas Karis 2016-11-30 15:18:11 UTC
Thanks for the help!

