Bug 1846428

Summary: Backport upstream CPU Manager fix 90419
Product: OpenShift Container Platform
Reporter: Victor Pickard <vpickard>
Component: Node
Assignee: Artyom <alukiano>
Status: CLOSED ERRATA
QA Contact: Cameron Meadors <cmeadors>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.6
CC: alukiano, aos-bugs, cmeadors, jokerman, mifiedle, rphillips, sjenning, wabouham
Target Milestone: ---
Keywords: Reopened
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1846451 (view as bug list)
Environment:
Last Closed: 2020-10-27 16:06:58 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1846451

Description Victor Pickard 2020-06-11 14:59:12 UTC
Description of problem:

Backport upstream fix for CPU Manager init containers 

https://github.com/kubernetes/kubernetes/pull/90419 

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 3 Cameron Meadors 2020-06-22 19:29:42 UTC
I am unable to reproduce the issue as described in the upstream PR [1].  Does anyone have an actual reproducer, or can someone explain exactly what triggers the issue?

[1] https://github.com/kubernetes/kubernetes/pull/90419

Comment 4 Artyom 2020-06-23 10:10:22 UTC
I checked the code and, to be honest, I cannot come up with a reproducer.  I will need to check with Kevin, who provided the PR upstream, to see how it can be reproduced.

Comment 5 Artyom 2020-06-23 10:49:38 UTC
After a discussion with Kevin: the main problem here is that we want to reuse the CPUs allocated to init containers for the app containers, and we want to prevent the CPUs used by the init containers from going back to the shared pool.

So I believe the easiest way to reproduce this is:
1. Create a pod with the Guaranteed QoS class and an init container (make sure the init container lives long enough that you have time to check everything).
2. On the node where the pod is running, check the init container's cgroup cpuset.cpus (it should be pinned to some exclusive CPUs).
3. On the node where the pod is running, check the cpu_manager_state file (it should be under /var/lib/kubelet/cpu_*); under it you can find the defaultCpuSet field, and check that the CPUs listed there include the init container CPUs (a rough command sketch follows below).
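A rough sketch of those checks from the node (the container name, cgroup path, and crictl commands are illustrative and depend on the runtime and cgroup driver):

# CPU manager view: shared pool (defaultCpuSet) and per-container assignments
cat /var/lib/kubelet/cpu_manager_state

# Find the init container's ID, locate its cgroup, and read the pinned cpuset
crictl ps -a --name init-myservice -q
crictl inspect <container-id> | grep cgroupsPath
cat /sys/fs/cgroup/cpuset/<cgroups-path>/cpuset.cpus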

Comment 6 Cameron Meadors 2020-06-25 21:02:45 UTC
I have been trying to reproduce the issue, and I think I have results showing that cpus for init containers are not being reused by the app containers.  However, I am also seeing results that suggest the effective requests/limits are not being used.

I am testing on 4.4 with a node that has 8 cpus; cpumanager is enabled and the node is labeled with cpumanager=true.  I have a pod manifest that defines 2 init containers and 2 app containers, with a nodeSelector for cpumanager=true.  I have tested many combinations of app and init containers by setting the cpu request/limit for each of them to different values and looking at Cpus_allowed_list in /proc/self/status in the containers.  The values for each container match the request/limits in the manifest.  These values match what is in /host/var/lib/kubelet/cpu_manager_state.  The QoS tier is set to Guaranteed for the pod.  A trimmed sketch of the manifest follows below.
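For reference, the manifest looks roughly like this, trimmed to one init and one app container (names, image, and the specific request/limit values are illustrative; memory requests and limits are set equal so the pod stays in the Guaranteed QoS class):

oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cpumanager-repro
spec:
  nodeSelector:
    cpumanager: "true"
  initContainers:
  - name: init-myservice
    image: quay.io/example/busybox          # illustrative image
    command: ["sleep", "300"]               # keep the init container alive long enough to inspect
    resources:
      requests: {cpu: "2", memory: "256Mi"}
      limits: {cpu: "2", memory: "256Mi"}
  containers:
  - name: myapp-container
    image: quay.io/example/busybox          # illustrative image
    command: ["sleep", "3600"]
    resources:
      requests: {cpu: "1", memory: "256Mi"}
      limits: {cpu: "1", memory: "256Mi"}
EOF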

What I was expecting was for the reserved cpus to be the same for all containers and to match the effective request/limit as defined in the Kubernetes docs [1].  For example, if the highest init container cpu request/limit is larger than the sum of the app containers' requests/limits, then that would be the effective request/limit for all the containers.  I am not seeing that.  So either I am misunderstanding the docs, checking the wrong thing, or the behavior is wrong.  Can someone confirm how to verify the effective cpu request/limit and whether it should be the same for all containers?
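As a concrete reading of the docs: with two app containers requesting 1 cpu each and an init container requesting 3 cpus, the effective cpu request would be max(3, 1+1) = 3 cpus; with the init container requesting only 1 cpu it would be max(1, 1+1) = 2 cpus.  That is the kind of single value I was expecting to see reflected for the containers.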

I am continuing testing on 4.6 to compare behavior.  I will update this bug with the results.

[1] https://kubernetes.io/docs/concepts/workloads/pods/init-containers/#resources

Comment 7 Cameron Meadors 2020-06-26 15:12:00 UTC
4.6 has the same behavior with regard to the reuse of cpus.  The number of cpus used per container and the actual cpus used match those on 4.4.  Testing the branch now; hoping to see different behavior.

Another data point: before I had all the various files to check, I was using the scheduler to tell me what it was using for the effective request/limit by setting values that would or would not be schedulable given the available resources.  It seemed to do the right thing.  This leads me to believe that the files I am checking do not actually reflect the effective request/limit, but something else.

Comment 8 Cameron Meadors 2020-06-26 17:14:29 UTC
Here are the results from the PR branch code.

myapp-container request/limit set to 1
init-myservice request/limit set to 2


sh-4.2# cat /host/var/lib/kubelet/cpu_manager_state 
{"policyName":"static","defaultCpuSet":"0,2-3,6-7","entries":{"0ad8f84d-2321-413f-8775-201307c32252":{"init-myservice":"1,5","myapp-container":"4"}},"checksum":1986161931}sh-

sh-4.2# cat /host/sys/fs/cgroup/cpuset/cpuset.cpus
0-7
sh-4.2# cat /host/sys/fs/cgroup/cpuset/cpuset.cpu_exclusive
1
sh-4.2# cat /host/sys/fs/cgroup/cpuset/cpuset.effective_cpus
0-7

sh-4.2# cat /host/etc/kubernetes/kubelet.conf |grep cpuManager
  "cpuManagerPolicy": "static",
  "cpuManagerReconcilePeriod": "5s",

[cmeadors@localhost cmeadors-repro-bug1846428]$ oc get nodes -L cpumanager
NAME                                         STATUS   ROLES    AGE   VERSION   CPUMANAGER
ip-10-0-134-15.us-west-2.compute.internal    Ready    master   65m   v1.18.3   
ip-10-0-172-240.us-west-2.compute.internal   Ready    worker   12m   v1.18.3   true
ip-10-0-190-73.us-west-2.compute.internal    Ready    master   65m   v1.18.3   
ip-10-0-202-199.us-west-2.compute.internal   Ready    master   65m   v1.18.3   
ip-10-0-223-118.us-west-2.compute.internal   Ready    worker   55m   v1.18.3

Comment 9 Cameron Meadors 2020-06-26 17:24:18 UTC
Based on my testing, I don't see any change in behavior.  Specifically, I don't see app containers reusing the cpus from init containers.

Comment 10 Mike Fiedler 2020-06-26 17:41:40 UTC
Based on comment 2, I am not even sure why this was ON_QA with an unmerged PR.  It looks like it was manually moved to MODIFIED instead of letting the CI automation take care of it.

In any case, based on comment 9, moving back to ASSIGNED for engineering review.

Comment 11 Artyom 2020-06-28 10:04:28 UTC
The PR is still not merged into master; I am not sure why it was moved to ON_QA.  Thanks for the verification, and sorry for the wasted time :(

Comment 13 Artyom 2020-07-23 09:58:16 UTC
The fix will be available after the rebase on top of k8s 1.19; no need to have a separate bug for it.

Comment 14 Cameron Meadors 2020-07-23 13:23:51 UTC
Could we keep this bug open until the rebase is done?  That will help QE track this change so we can verify it in the rebase.

Comment 15 Artyom 2020-07-23 13:33:19 UTC
Sure, I do not have a problem with it.  Please change it to the relevant status.

Comment 16 Cameron Meadors 2020-07-23 17:00:00 UTC
On second thought, changing to ASSIGNED.  Please set to ON_QA when rebase happens.

Comment 17 Artyom 2020-08-02 09:53:12 UTC
OpenShift has been rebased on top of k8s 1.19.

Comment 18 Cameron Meadors 2020-08-12 19:44:46 UTC
I still don't see the expected behavior.  I see the same behavior as before: the init container gets two exclusive cpus, then the app container gets a different exclusive cpu.

$ oc exec myapp-pod -c init-myservice -- cat /proc/self/status | grep "Cpus_allowed_list"
Cpus_allowed_list:	1,5

$ oc exec myapp-pod -- cat /proc/self/status | grep "Cpus_allowed_list"
Cpus_allowed_list:	4


Can you point me to the rebase and the lines that include changes that affect this?  Maybe I am missing something.  For now, I have to fail this.

Comment 20 Artyom 2020-08-13 07:41:18 UTC
The main idea is that the CPUs allocated to the init containers should not be released before the init container's work is done, so in general the workflow should be:
1. Start the pod with an init container (one that lives long enough).
2. Verify that the CPUs allocated to it are not present in the shared pool (/var/lib/kubelet/cpu_manager_state).
3. Verify that the CPUs allocated to the init container are released by the CPU manager once the init container has finished its work (a command sketch follows below).
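A sketch of checks 2 and 3 (pod and container names are illustrative; the cat commands are run on the node, the oc commands from a client):

# While the init container is still running: its pinned CPUs should be absent from defaultCpuSet
cat /var/lib/kubelet/cpu_manager_state
oc exec myapp-pod -c init-myservice -- grep Cpus_allowed_list /proc/self/status

# After the init container exits: re-read the state file to see whether those CPUs went back
# to defaultCpuSet or stayed assigned to the pod's app containers
cat /var/lib/kubelet/cpu_manager_state
oc exec myapp-pod -- grep Cpus_allowed_list /proc/self/status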

You can expect reuse of the same CPUs only when the topology manager is enabled, and for that we need the PR https://github.com/kubernetes/kubernetes/pull/93189 to be merged into our fork.

With the CPU manager alone we do not have any guarantee that the init container CPUs will be reused.

Let me know if you need more details.

Comment 21 Cameron Meadors 2020-08-13 12:59:20 UTC
That may be true, but the description of the original PR https://github.com/kubernetes/kubernetes/pull/90419 that this bug was tracking specifically said that the CPU should be reused.

"this patch updates the strategy to reuse the CPUs assigned to init containers
across into app containers, while avoiding this edge case."

Can you please clarify what this bug is actually changing?  If it is not complete because of the additional PR, please wait for that work to be ready before putting this in ON_QA.  Maybe we should re-evaluate how we track these particular upstream changes.

Comment 22 Artyom 2020-08-13 13:38:07 UTC
I agree, it should, but that change introduced an additional bug, which was fixed by https://github.com/kubernetes/kubernetes/pull/93189, and I have already created a separate bug to track it: https://bugzilla.redhat.com/show_bug.cgi?id=1868634.  The problem is that we want to cherry-pick both PRs to release-4.5, and I cannot cherry-pick https://github.com/kubernetes/kubernetes/pull/93189 before I cherry-pick this one, because one builds on the other and the flow is to have a separate downstream PR for each upstream PR, so I cannot combine both into a single OpenShift PR.

Like I said, the main thing here is to check that the CPUs allocated to the init container are not released before it is done.
So the check that I provided above should be sufficient to verify this specific bug.

Once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1868634 is merged, it will be possible to test the reuse of CPUs.

Comment 23 Cameron Meadors 2020-08-13 17:16:56 UTC
OK.  Well, from all my testing, that has always worked (I started testing on 4.4).  Because the nature of the changes for this bz has changed, I am not sure what to do.  It works, but there was never a reproducer showing the reserved cpus not staying reserved for the life of the init containers.  I can't call this verified as nothing has changed.

I will use https://bugzilla.redhat.com/show_bug.cgi?id=1868634 to verify the cpu reuse.

Comment 25 Seth Jennings 2020-08-28 15:18:43 UTC
This upstream PR will come in on the 4.6 rebase against 1.19 (final) now that it has been released.

Comment 26 Seth Jennings 2020-08-28 15:21:58 UTC
And I understand that this has been reopened before for QE tracking, but this is not done for any of the other numerous PRs we inherit from upstream and the bug itself is "we need this PR", not "this is a bug".

Comment 27 Seth Jennings 2020-08-28 16:23:18 UTC
In order to follow process, we need this bug.  Not the PR though.

QE note: wait for 4.6 to rebase against 1.19 final before testing this (i.e. before moving it to ON_QA)

Comment 29 Seth Jennings 2020-09-01 15:38:31 UTC
1.19 rebase PR https://github.com/openshift/kubernetes/pull/325

Delay QE until this is merged.

Comment 30 Cameron Meadors 2020-09-22 14:40:01 UTC
Following steps in comment 20:

Pod created with an init container (pod is Guaranteed; the init container requested 2 cpus, the app container 1); 4 cpus available.

# cat /host/var/lib/kubelet/cpu_manager_state 
{"policyName":"static","defaultCpuSet":"0","entries":{"5e479e83-190b-4092-8b9b-fc472a856e02":{"init-myservice":"1,3","myapp-container":"2"}},"checksum":76317910

Same output after the init container finishes and the app container starts.

CPUs for the init container are reserved for its lifetime, but it doesn't look like they are released afterwards.  That would fail based on the comment 20 steps, but if the pod restarts, will the init container's cpus still be guaranteed if they had actually been released?  It is very unclear what the correct behavior should be.

Comment 31 Cameron Meadors 2020-09-22 17:44:17 UTC
After more consideration, I am convinced that the current behavior satisfies the use case of the original PR https://github.com/kubernetes/kubernetes/pull/90419 and this can be marked as VERIFIED.  The init container CPUs are reserved, and in the case of a pod restart the CPUs would still be available to allow the init container to succeed and the app container to start.

It is fully understood that having an init container that needs more cpu than the app is unusual, but it was the easiest single case to run to verify this fix.  Additionally, it is recognized that the reserved CPUs for the init container are not reused, but the fact that they remain reserved satisfies the use case.

This has not been tested under load or on a cluster with a large number of cpus to catch all possible performance-related issues, as that testing is very complex and time consuming.  If there are performance issues, please open new bugs for that so QE can plan accordingly.

Similarly, if the behavior is incorrect, please open a new bug.  I am happy to provide more details about the current behavior, but there is too much history in this bug, with the rebase and the splitting of changes, to continue here.

Comment 33 errata-xmlrpc 2020-10-27 16:06:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196