Description of problem:
Backport upstream fix for CPU Manager init containers
Version-Release number of selected component (if applicable):
Steps to Reproduce:
I am unable to reproduce the issue as described in the upstream PR. Does anyone have an actual reproducer, or can anyone explain exactly what triggers the issue?
I checked the code and, to be honest, I cannot come up with a reproducer. I will need to check with Kevin, who provided the upstream PR, to see how it can be reproduced.
After a discussion with Kevin: the main point here is that we want to reuse the CPUs allocated to init containers for the app containers, and we want to prevent the CPUs used by the init containers from going back to the shared pool.
So I believe the easiest way to reproduce is:
1. Create a pod with the Guaranteed QoS class and with an init container (make sure the init container lives long enough that you have time to check all the needed state).
2. On the node where the pod is running, check the init container's cgroup cpuset.cpus (it should be pinned to some exclusive CPUs).
3. On the node where the pod is running, check the CPU state file (it should be under /var/lib/kubelet/cpu_*); in it you can find the defaultCpuSet field. Check whether the CPUs listed there include the init container's CPUs.
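As a rough aid for step 3, the cpuset fields use the kernel's list format (e.g. "0,3-7"). Here is a minimal Python sketch of that check; the state and CPU values below are made-up samples, not real kubelet output:

```python
import json

def parse_cpuset(spec):
    """Parse a kernel cpuset list string like '0-3,6' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# Hypothetical contents of /var/lib/kubelet/cpu_manager_state (sample data):
# with the fix, the init container's exclusive CPUs must NOT appear in
# defaultCpuSet while the init container is still running.
state = json.loads('{"defaultCpuSet": "0,3-7"}')
init_cpus = parse_cpuset("1-2")                # from the init container's cpuset.cpus
shared = parse_cpuset(state["defaultCpuSet"])
print(init_cpus.isdisjoint(shared))            # True -> CPUs held out of the shared pool
```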
I have been trying to reproduce the issue, and I think I have results showing that CPUs for init containers are not being reused by the app containers. However, I am also seeing results suggesting that the effective requests/limits are not being used.
I am testing on 4.4 with a node that has 8 CPUs; the CPU manager is enabled and the node is labeled with cpumanager=true. I have a pod manifest that defines 2 init containers and 2 app containers, with a nodeSelector for cpumanager=true. I have tested many combinations of app and init containers by setting the CPU request/limit for each of them to different values and looking at Cpus_allowed_list in /proc/self/status in the containers. The values for each container match the requests/limits in the manifest, and they also match what is in /host/var/lib/kubelet/cpu_manager_state. The QoS tier is set to Guaranteed for the pod.
What I was expecting was that the reserved CPUs would be the same for all containers and would match the effective request/limit as defined in the Kubernetes docs. For example, if the sum of the app containers' CPU requests/limits were the largest value, then that sum would be the effective request/limit for all the containers. I am not seeing that. So either I am misunderstanding the docs, checking the wrong thing, or the behavior is wrong. Can someone confirm how to verify the effective CPU request/limit and whether it should be the same for all containers?
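For what it's worth, my reading of the Kubernetes docs is that the effective pod CPU request is the larger of the highest init container request and the sum of the app container requests (init containers run one at a time). A small sketch with hypothetical values:

```python
def effective_cpu_request(init_requests, app_requests):
    """Effective pod-level request: init containers run serially,
    so the pod needs max(largest init request, sum of app requests)."""
    return max(max(init_requests, default=0), sum(app_requests))

# Hypothetical values for a pod with two init and two app containers:
print(effective_cpu_request([2, 1], [1, 1]))  # -> 2 (largest init request wins)
print(effective_cpu_request([1, 1], [2, 2]))  # -> 4 (sum of app requests wins)
```

If that reading is right, the effective value is what the scheduler uses for node fitting, while the CPU manager pins each container according to its own request, which would explain why the per-container Cpus_allowed_list values match the manifest rather than a single pod-wide value.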
I am continuing testing on 4.6 to compare behavior. I will update this bug with the results.
4.6 has the same behavior with regard to reuse of CPUs. The number of CPUs used per container and the actual CPUs used match those on 4.4. Testing the branch now; hoping to see different behavior.
Another data point: when I was testing before I had all the various files to check, I was using the scheduler to tell me what it was using for the effective request/limit, by setting values that either would or would not be schedulable given the available resources. It seemed to do the right thing. This leads me to believe that the files I am checking do not actually reflect the effective request/limit, but something else.
Here are the results from the PR branch code.
myapp-container request/limit set to 1
init-myservice request/limit set to 2
sh-4.2# cat /host/var/lib/kubelet/cpu_manager_state
sh-4.2# cat /host/sys/fs/cgroup/cpuset/cpuset.cpus
sh-4.2# cat /host/sys/fs/cgroup/cpuset/cpuset.cpu_exclusive
sh-4.2# cat /host/sys/fs/cgroup/cpuset/cpuset.effective_cpus
sh-4.2# cat /host/etc/kubernetes/kubelet.conf |grep cpuManager
[cmeadors@localhost cmeadors-repro-bug1846428]$ oc get nodes -L cpumanager
NAME STATUS ROLES AGE VERSION CPUMANAGER
ip-10-0-134-15.us-west-2.compute.internal Ready master 65m v1.18.3
ip-10-0-172-240.us-west-2.compute.internal Ready worker 12m v1.18.3 true
ip-10-0-190-73.us-west-2.compute.internal Ready master 65m v1.18.3
ip-10-0-202-199.us-west-2.compute.internal Ready master 65m v1.18.3
ip-10-0-223-118.us-west-2.compute.internal Ready worker 55m v1.18.3
Based on my testing, I don't see any change in behavior. Specifically, I don't see app containers reusing the cpus from init containers.
Based on comment 2, I'm not even sure why this was ON_QA with an unmerged PR. It looks like it was manually moved to MODIFIED instead of letting the CI automation take care of it.
In any case, based on comment 9, moving back to ASSIGNED for engineering review.
The PR is still not merged into master; I am not sure why it was moved to ON_QA. Thanks for the verification, and sorry for your wasted time :(
The fix will be available after the rebase on top of k8s 1.19; there is no need to have a separate bug for it.
Could we keep this bug open until the rebase is done? That will help QE track this change so we can verify it in the rebase.
Sure, I do not have a problem with it. Please change it to the relevant status.
On second thought, changing to ASSIGNED. Please set to ON_QA when rebase happens.
This is coming with the OpenShift rebase on top of k8s 1.19.
I still don't see the expected behavior. I see the same behavior as before: the init container gets two exclusive CPUs, then the app container gets a different exclusive CPU.
$ oc exec myapp-pod -c init-myservice -- cat /proc/self/status | grep "Cpus_allowed_list"
$ oc exec myapp-pod -- cat /proc/self/status | grep "Cpus_allowed_list"
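To compare the two pinnings programmatically, one could parse the Cpus_allowed_list strings into sets. A sketch in Python, where the sample values are hypothetical (the actual command output is not shown above):

```python
def parse_cpu_list(spec):
    """Parse a Cpus_allowed_list value like '2-3' or '0,2-3' into a set of CPU ids."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# Hypothetical readings from the two commands above:
init_cpus = parse_cpu_list("2-3")  # init-myservice (two exclusive CPUs)
app_cpus = parse_cpu_list("1")     # myapp-container (a different exclusive CPU)

# With CPU reuse, the app container's CPUs would come out of the init
# container's set; disjoint sets mean no reuse, matching the report.
print(app_cpus <= init_cpus)  # -> False
```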
Can you point me to the rebase and the lines that include changes that affect this? Maybe I am missing something. For now, I have to fail this.
The main idea is that the CPUs allocated to the init containers should not be released before the init container's work is done, so in general the workflow should be:
1. Start the pod with the init container (one that lives long enough).
2. Verify that the CPUs allocated to it do not appear in the shared pool (/var/lib/kubelet/cpu_manager_state).
3. Verify that the CPUs allocated to the init container are released by the CPU manager once the init container has done its work.
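The checks in steps 2 and 3 boil down to comparing defaultCpuSet snapshots before and after the init container exits. A minimal sketch with made-up snapshot values (not captured from a real cluster):

```python
def parse_cpuset(spec):
    """Parse a cpuset list string like '0,3-7' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus

# Hypothetical defaultCpuSet snapshots from /var/lib/kubelet/cpu_manager_state:
while_init_running = parse_cpuset("0,3-7")  # step 2: CPUs 1-2 held out of the pool
after_init_done = parse_cpuset("0-7")       # step 3: CPUs 1-2 returned to the pool
init_cpus = parse_cpuset("1-2")

print(init_cpus.isdisjoint(while_init_running))  # True: not in the shared pool yet
print(init_cpus <= after_init_done)              # True: released after completion
```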
You can expect reuse of the same CPUs only when the topology manager is enabled, and for that we need the PR https://github.com/kubernetes/kubernetes/pull/93189 to be merged into our fork.
With only the CPU manager, we do not have any guarantee that the init container CPUs will be reused.
Let me know if you need more details.
That may be true, but the description of the original PR https://github.com/kubernetes/kubernetes/pull/90419 that this bug was tracking specifically said that the CPU should be reused.
"this patch updates the strategy to reuse the CPUs assigned to init containers
across into app containers, while avoiding this edge case."
Can you please clarify what this bug is actually changing? If it is not done because of the additional PR, please wait for that work to be ready before putting this in ON_QA. Maybe we should re-evaluate how we track these particular upstream changes.
I agree, it should, but it introduced an additional bug, which was fixed by https://github.com/kubernetes/kubernetes/pull/93189, and I already created a separate bug to follow that: https://bugzilla.redhat.com/show_bug.cgi?id=1868634. The problem is that we want to cherry-pick both PRs to release-4.5, and I cannot cherry-pick https://github.com/kubernetes/kubernetes/pull/93189 before I cherry-pick this one, because one builds on the other. The flow is to have a separate PR for each upstream PR, so I cannot combine both under a single OpenShift PR.
Like I said, the main thing here is to check that the CPUs allocated for the init container are not released before it is done.
So the check that I provided above should be sufficient to verify this specific bug.
Once the fix for https://bugzilla.redhat.com/show_bug.cgi?id=1868634 is merged, it will be possible to test the reuse of CPUs.
Ok. Well, from all my testing, that has always worked (I started testing on 4.4). Because the nature of the changes for this bz has changed, I am not sure what to do. It works, but there was never a reproducer for the reserved CPUs not staying reserved for the life of the init containers. I can't call this verified, as nothing has changed.
I will use https://bugzilla.redhat.com/show_bug.cgi?id=1868634 to verify the CPU reuse.
This upstream PR will come in on the 4.6 rebase against 1.19 (final) now that it has been released.
And I understand that this has been reopened before for QE tracking, but this is not done for any of the other numerous PRs we inherit from upstream, and the bug itself is "we need this PR", not "this is a bug".
In order to follow process, we need this bug. Not the PR though.
QE note: wait for 4.6 to rebase against 1.19 final before testing this (i.e. before moving it to ON_QA)
1.19 rebase PR https://github.com/openshift/kubernetes/pull/325
Delay QE until this is merged.
Following steps in comment 20:
Pod created with an init container (pod is Guaranteed; the init container requested 2 CPUs, the app container 1), with 4 CPUs available.
# cat /host/var/lib/kubelet/cpu_manager_state
Same output after the init container finishes and the app container starts.
The CPUs for the init container are reserved for its life, but it doesn't look like they are ever released. That would fail based on the comment 20 steps, but if the pod restarts, would the init container's CPUs still be guaranteed if they had actually been released? It is very unclear what the correct behavior should be.
After more consideration, I am convinced that the current behavior satisfies the use case of the original PR https://github.com/kubernetes/kubernetes/pull/90419, and this can be marked as VERIFIED. The init container CPUs are reserved, and in the case of a pod restart, those CPUs would still be available to allow the init container to succeed and the app container to start.
It is fully understood that having an init container needing more CPU than the app is unusual, but it was the easiest single case to run to verify this fix. Additionally, it is recognized that the reserved CPUs for the init container are not reused, but the fact that they remain reserved satisfies the use case.
This has not been tested under load or on a cluster with a large number of CPUs to catch all possible performance-related issues, as that testing is very complex and time consuming. If there are performance issues, please open new bugs for that so QE can plan accordingly.
Similarly if the behavior is incorrect please open a new bug. I am happy to provide more details around current behavior, but there is too much history in this bug with the rebase and splitting of changes to continue here.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.