Bug 1664466 - CPU Manager policy optimization
Summary: CPU Manager policy optimization
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 1.3
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: future
Assignee: Arik
QA Contact: Israel Pinto
URL:
Whiteboard:
Duplicates: 1667854
Depends On:
Blocks: 1667854
 
Reported: 2019-01-08 21:52 UTC by Jenifer Abrams
Modified: 2019-11-07 08:29 UTC
CC: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-06 15:46:04 UTC
Target Upstream Version:
Embargoed:



Description Jenifer Abrams 2019-01-08 21:52:48 UTC
Description of problem:
Enabling the Kubernetes CPU Manager feature set to use the "static" policy can cause performance regressions for some cases due to the current topology selection decisions. This bug is to discuss possible future optimizations to the CPU Manager policies.

CPU Manager fills hyperthreads before filling cores (by design: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/) which can cause significant regressions compared to default non-pinned behavior in some cases. 

For example, Kubevirt dedicatedCPU (which uses CPUManager) performance testing shows 10-30% regressions for various CPU-bound microbenchmarks on a cores=2 guest compared to non-pinned (default) behavior. This regression is caused by CPUManager forcing the processes to share hyperthreads of a single core, where in the default case the kernel scheduler will tend to utilize 2 separate cores. 

One simple suggestion may be to add a "spread" type of option to the static policy that would spread among cores before filling hyperthreads. 
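
To make the topology difference concrete, here is a minimal Go sketch (not kubelet code; the 2-core, 2-threads-per-core topology and the function names are invented for illustration) contrasting the current sibling-first fill with the suggested "spread" behavior for a 2-CPU request:

package main

import "fmt"

// siblings maps an invented physical core ID to its hyperthread (logical CPU)
// IDs on a 2-core, 2-threads-per-core node.
var siblings = map[int][]int{
	0: {0, 2}, // core 0: logical CPUs 0 and 2
	1: {1, 3}, // core 1: logical CPUs 1 and 3
}

// packSiblings mimics the current static-policy tendency: exhaust the
// hyperthreads of one core before moving on to the next core.
func packSiblings(n int) []int {
	var picked []int
	for core := 0; core < len(siblings) && len(picked) < n; core++ {
		for _, cpu := range siblings[core] {
			if len(picked) == n {
				break
			}
			picked = append(picked, cpu)
		}
	}
	return picked
}

// spreadCores is the suggested alternative: take one thread from each core
// before touching any core's second hyperthread (assumes n <= 4 here).
func spreadCores(n int) []int {
	var picked []int
	for round := 0; round < 2 && len(picked) < n; round++ {
		for core := 0; core < len(siblings) && len(picked) < n; core++ {
			picked = append(picked, siblings[core][round])
		}
	}
	return picked
}

func main() {
	fmt.Println("pack:  ", packSiblings(2)) // [0 2] -> both CPUs are hyperthreads of core 0
	fmt.Println("spread:", spreadCores(2))  // [0 1] -> one CPU on each physical core
}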

Version-Release number of selected component (if applicable):
CNV version: 1.3
oc v3.11.16
kubernetes v1.11.0+d4cacc0
RHEL7.6

How reproducible:
Every time.

Steps to Reproduce:
1. Measure performance of a cores=2 pod using various CPU benchmarks
2. Enable CPUManager w/ static policy
3. Remeasure performance of a pinned cores=2 pod
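
(For step 3, a minimal Go sketch, using the upstream client types, of the kind of pod the static policy pins: Guaranteed QoS with equal integer CPU request and limit. The image name is a placeholder, and this assumes the kubelet is already running with the static CPU manager policy and reserved CPUs configured.)

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Requests == Limits with an integer CPU count => Guaranteed QoS,
	// which is what the static policy requires to hand out exclusive CPUs.
	guaranteed := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("2"),
		corev1.ResourceMemory: resource.MustParse("2Gi"),
	}
	pod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cpu-pinned-benchmark"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name:  "bench",
				Image: "registry.example.com/cpu-benchmark:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					Requests: guaranteed,
					Limits:   guaranteed,
				},
			}},
		},
	}
	fmt.Printf("%+v\n", pod.Spec.Containers[0].Resources)
}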

Actual results:
Enabling CPUManager with the static policy degrades performance in many cases. 

Expected results:
Enabling CPUManager with the static policy should improve performance in many cases. 


Additional info:
More details in this doc: https://docs.google.com/presentation/d/1TlYS_3EVttNWxsUn5a15Lt1GgC5jwtd3q9zqUMdKQYI/edit?usp=sharing
  - shows 1-VM data to illustrate the topology decisions; this study doesn't cover isolation, scale, etc.

Comment 1 Jenifer Abrams 2019-01-10 21:56:34 UTC
Moving the component to get input from the pod team; please advise if that is not correct.

Comment 2 Jenifer Abrams 2019-01-14 22:06:52 UTC
I see that there have been related discussions in the Resource Management Working Group, including a core spreading option:
https://github.com/kubernetes/community/pull/654/files/a21b6adf7382ccf88a0ca42b133599af39eb4dd0#r124320417

Comment 5 Seth Jennings 2019-01-23 20:14:02 UTC
This is a feature. Feature tracking is in Jira now.

Please open a Jira ticket on the Pod board if you want us to work on this upstream.
https://jira.coreos.com/secure/RapidBoard.jspa?rapidView=98

Also, it is not clear from the data that even with HT off, which is basically this "spread" policy, it beats having no CPU policy at all.

Comment 6 Fabian Deutsch 2019-01-24 08:47:23 UTC
Jenifer, could you please open an RFE in Jira? Also, what are your thoughts w.r.t. HT off?

Comment 7 Jenifer Abrams 2019-01-24 23:37:00 UTC
The CNV team has been discussing this; we want to reopen this bug and move it back to CNV for now for tracking purposes. After brainstorming, it seems there may be some CNV-specific considerations related to this; once we get further along, I can open a Jira ticket if we need help from the pod team.

Also, I updated my SMT-off page to be clearer. Before, it mixed SMT-off and SMT-on in one chart, which I think was confusing to the eye; now it shows only SMT-off performance vs. default, and you can see there are some good wins for multiple benchmarks: https://docs.google.com/presentation/d/1TlYS_3EVttNWxsUn5a15Lt1GgC5jwtd3q9zqUMdKQYI/edit#slide=id.g4ba327a2df_0_216


Some policy optimization notes:

I think part of the issue relates to CNV/KVM user expectations vs. OpenShift/K8s user expectations.
OpenShift users may be aware that requesting a "cpu" for a pod is essentially a hyperthread.
CNV users may request "cores" for a VM, and in KVM/ESX environments (if there is no contention) the host scheduler tends to spread those vCPU processes out across cores; however, with the current K8s CPUMgr static policy they are treated as hyperthreads.

For the "normal" Openshift user case, if the user knows cpu=hyperthread it may be reasonable to request 2*cpus if they want to get a whole core (assuming the host has two threads per core enabled). But for the CNV case, if we want to compare well to KVM/ESX it is probably not reasonable to have to run double the amount of "cores". 

So, we could suggest a pinning policy that gives you a "full core" when you request a core in CNV. This means CNV would need to pass some sort of topology info to CPUManager. With CNV 1.4 we can define sockets/cores/threads topology for a VM. A suggestion would be to pass a "full core" value to CPUManager based on the VM sockets*cores, and just include any available hyperthreads of the core in the cpuset regardless of what the threads=x topology is.
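
As a rough illustration of that "full core" idea (hypothetical code, not an existing CPUManager or CNV interface; the 4-core, 2-threads-per-core node topology is invented):

package main

import "fmt"

// coreSiblings maps an invented physical core index to its logical CPU IDs.
var coreSiblings = [][]int{
	{0, 4}, {1, 5}, {2, 6}, {3, 7},
}

// fullCoreCpuset reserves one whole physical core per emulated core
// (sockets * cores) and puts all of that core's hyperthreads in the cpuset,
// regardless of the guest threads=x value.
func fullCoreCpuset(sockets, cores int) ([]int, error) {
	needed := sockets * cores
	if needed > len(coreSiblings) {
		return nil, fmt.Errorf("need %d physical cores, node has %d", needed, len(coreSiblings))
	}
	var cpuset []int
	for i := 0; i < needed; i++ {
		cpuset = append(cpuset, coreSiblings[i]...) // whole core, both hyperthreads
	}
	return cpuset, nil
}

func main() {
	set, _ := fullCoreCpuset(1, 2) // sockets=1, cores=2 -> 2 whole physical cores
	fmt.Println(set)               // [0 4 1 5]
}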

Comment 8 Jenifer Abrams 2019-02-15 19:34:38 UTC
Arik started an upstream discussion here: https://github.com/kubernetes/kubernetes/issues/73954

Comment 9 Arik 2019-02-20 09:33:19 UTC
Summarizing a recent offline discussion:
There seem to be 2 possible features that would address this problem without the need to provide CPUMgr with the full emulated topology:

1. "treatThreadsAsCores": when a VM asks for X CPUs, CPUMgr will reserve a whole core for each CPU.

For example: if the node has 2 threads per core and the VM asks for 2 CPUs (regardless of the number of cores) + treatThreadsAsCores=true, then CPUMgr is expected to take 2 cores, pass thread-0 of each of them to the VM, and reserve thread-1 of each of them so they will not be assigned to any other pod.

This feature is useful for VMs that demand an extra level of isolation - preventing processes from accessing other processes' data by ensuring nothing else runs on the same node core. That is similar to disabling hyper-threading, but at the granularity of a VM rather than the whole node.

2. "dedicatedCore": when a VM asks for X CPUs, CPUMgr will reserve a whole core for each emulated core.

For example: if the node has 4 threads per core and the VM asks for 2 cores with 2 CPUs-per-core + dedicatedCore=true, then CPUMgr is expected to take 2 cores, pass thread-0 and thread-1 of each of them to the VM, and reserve thread-2 and thread-3 of each of them so they will not be assigned to any other pod.

This feature is useful for VMs that demand a lower level of isolation - preventing other pods from running on the same node cores that the VM's CPUs run on.
This enables running more workloads on the nodes.
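
A minimal sketch of both strategies, assuming an invented 4-core node with 2 threads per core; treatThreadsAsCores and dedicatedCore are the proposed knobs described above, not an existing CPUManager API:

package main

import "fmt"

// cores maps an invented physical core index to its hyperthread (logical CPU) IDs.
var cores = [][]int{
	{0, 4}, {1, 5}, {2, 6}, {3, 7},
}

// treatThreadsAsCores: one whole physical core per requested CPU; only
// thread-0 is handed to the VM, the sibling threads are reserved and left idle.
// Assumes vcpus does not exceed the number of physical cores.
func treatThreadsAsCores(vcpus int) (vmCPUs, reserved []int) {
	for i := 0; i < vcpus; i++ {
		vmCPUs = append(vmCPUs, cores[i][0])
		reserved = append(reserved, cores[i][1:]...)
	}
	return
}

// dedicatedCore: one whole physical core per emulated core; the VM gets
// threadsPerCore hyperthreads of it and the remaining siblings are reserved.
// Assumes threadsPerCore does not exceed the node's threads per core
// (the over-subscribed case is discussed in comments 11 and 12 below).
func dedicatedCore(emuCores, threadsPerCore int) (vmCPUs, reserved []int) {
	for i := 0; i < emuCores; i++ {
		vmCPUs = append(vmCPUs, cores[i][:threadsPerCore]...)
		reserved = append(reserved, cores[i][threadsPerCore:]...)
	}
	return
}

func main() {
	vm, idle := treatThreadsAsCores(2)
	fmt.Println("treatThreadsAsCores:", vm, "reserved:", idle) // [0 1] reserved: [4 5]

	vm, idle = dedicatedCore(2, 2)
	fmt.Println("dedicatedCore:      ", vm, "reserved:", idle) // [0 4 1 5] reserved: []
}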


Obviously, the reported issue won't occur with those two strategies - as CPUs that belong to different emulated cores are "spread" across the node cores.
The question is whether those are enough - the drawback of both is the number of wasted CPUs, i.e. those that are reserved but not used.
Specifically, let's say that I want "dedicated CPUs" (i.e., I want the VM's CPUs to be pinned to specific node CPUs) but neither of those two strategies is relevant to me (I don't require the isolation level they provide and I don't want to compromise the number of workloads that can run on the nodes). I think that would be the direct solution for the reported issue, but it's more complex, and the question is whether it is really needed (in other words, whether users who care about such placement wouldn't also require dedicatedCore or treatThreadsAsCores to be enabled).

Martin, does that make sense?

Comment 10 Martin Sivák 2019-02-20 17:14:11 UTC
Yeah, what you described makes sense. The last case might not be an issue: if dedicatedCpus is used without dedicatedCores, then threads will be assigned from random free cores on a best-effort basis (we can improve the best-effort algorithm to keep threads on the same core and socket if possible) and no extra threads will be wasted.

I think what you proposed is good enough for me.

Comment 11 Jenifer Abrams 2019-02-20 23:00:29 UTC
Good suggestions! One question:
>> the VM asks for 2 cores with 2 CPUs-per-core + dedicatedCore=true 
does the "CPUs-per-core" part basically mean it would translate the VM threads=x topology? 

If so I suppose it may need to handle cases with a nonsense request like asking for threads=4 when the hardware only has 2 threads. Maybe in that case it spills over to more cores to satisfy the full vcpu request?

Comment 12 Arik 2019-02-21 13:15:26 UTC
> Yeah what you described makes sense. The last case might not be an issue: if
> dedicatedCpus is used without dedicatedCores then threads will be assigned
> from random free cores on best effort basis (we can improve the best effort
> algorithm to keep threads on the same core and socket if possible) and no
> extra threads will be wasted.
> 

Right, that's what I had in mind as well.

> does the "CPUs-per-core" part basically mean it would translate the VM threads=x topology? 

Yes

>
> If so I suppose it may need to handle cases with a nonsense request like asking for threads=4 when the hardware only has 2 threads. Maybe in that case it spills over to more cores to satisfy the full vcpu request?

Yeah, that's a good point.
In oVirt, we have a similar check for ensuring the node has enough cores to run the VM [1]. There is no check for the number of threads though - I just simulated the scenario above and oVirt doesn't block it :)
But yeah, comment 9 assumes that either there are scheduling validations like you mentioned that prevent such cases, or we are willing to go with a placement that doesn't completely conform to the topology of the guest.

[1] https://github.com/oVirt/ovirt-engine/blob/master/backend/manager/modules/bll/src/main/java/org/ovirt/engine/core/bll/scheduling/policyunits/CPUPolicyUnit.java#L40
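
As a rough sketch of the validation / spill-over options discussed above (hypothetical code; not the oVirt CPUPolicyUnit or any CPUManager implementation, and the node topology is invented):

package main

import "fmt"

// Invented node topology: 4 physical cores, 2 threads per core.
const nodeCores, nodeThreadsPerCore = 4, 2

// placeVCPUs returns how many physical cores are needed for the requested
// guest topology. In strict mode a guest threads-per-core value that exceeds
// the hardware is rejected; otherwise the vCPUs spill over onto more cores.
func placeVCPUs(guestCores, guestThreads int, strict bool) (physCores int, err error) {
	if strict && guestThreads > nodeThreadsPerCore {
		return 0, fmt.Errorf("guest wants %d threads per core, node has %d", guestThreads, nodeThreadsPerCore)
	}
	vcpus := guestCores * guestThreads
	physCores = (vcpus + nodeThreadsPerCore - 1) / nodeThreadsPerCore // ceiling division
	if physCores > nodeCores {
		return 0, fmt.Errorf("need %d physical cores, node has %d", physCores, nodeCores)
	}
	return physCores, nil
}

func main() {
	// threads=4 requested on 2-thread hardware: strict mode rejects it,
	// best-effort mode spills the 8 vCPUs over 4 physical cores instead of 2.
	if _, err := placeVCPUs(2, 4, true); err != nil {
		fmt.Println("strict:", err)
	}
	n, _ := placeVCPUs(2, 4, false)
	fmt.Println("best effort: spread over", n, "physical cores")
}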

Comment 15 Israel Pinto 2019-02-26 13:38:35 UTC
*** Bug 1667854 has been marked as a duplicate of this bug. ***

Comment 18 Fabian Deutsch 2019-05-03 07:13:26 UTC
Moving this out to 2.1

Comment 21 Israel Pinto 2019-11-06 15:46:04 UTC
tracked here: https://jira.coreos.com/browse/CNV-383

