Bug 1718944

Summary: CPUManager detection on OCP 4.1 fails
Product: Container Native Virtualization (CNV)
Component: Virtualization
Reporter: Kedar Bidarkar <kbidarka>
Assignee: Fabian Deutsch <fdeutsch>
QA Contact: Kedar Bidarkar <kbidarka>
Docs Contact:
Status: CLOSED ERRATA
Severity: high
Priority: high
Version: 2.0
CC: cnv-qe-bugs, fdeutsch, ipinto, ncredi, pousley, sgordon, sgott
Target Milestone: ---
Target Release: 2.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: hyperconverged-cluster-operator-container-v2.2.0-3 virt-operator-container-v2.2.0-2
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-01-30 16:27:11 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description: Logs for CPUManager test runs
  Flags: none

Description Kedar Bidarkar 2019-06-10 15:13:40 UTC
Description of problem:

1) kubevirt-config CM was already configured with CPUManager feature-gate.

[kbidarka@localhost tests]$ oc get cm kubevirt-config -n kubevirt-hyperconverged -o yaml | grep -i feature-gates
  feature-gates: DataVolumes,SRIOV,LiveMigration,CPUManager,CPUNodeDiscovery

2) Followed the OCP4.x guide to set up the "cpuManagerPolicy"  as per this link, 
https://docs.openshift.com/container-platform/4.1/scalability_and_performance/using-cpu-manager.html


After configuring as per the above guide, the "cpumanager=true" label is not set on the nodes of the OCP4 setup.


Version-Release number of selected component (if applicable):

OCP4.1 + CNV2.0


How reproducible:

Follow the OCP4.1 guide to set up the "cpuManagerPolicy"  as per this link, 
https://docs.openshift.com/container-platform/4.1/scalability_and_performance/using-cpu-manager.html


Steps to Reproduce:
1. Follow OCP4.1 guide to enable "cpuManagerPolicy"
2. Ensure the "kubevirt-config" ConfigMap includes "CPUManager" in its feature gates.

Actual results:
cpumanager=true is not set on the nodes.


Expected results:

cpumanager=true should be set on the nodes.


Additional info:

Comment 1 Kedar Bidarkar 2019-06-10 15:16:44 UTC
Vladik pointed to the following links related to this issue:

a) https://github.com/kubernetes-sigs/node-feature-discovery/issues/165#issuecomment-426530948
b) https://github.com/kubernetes/kubernetes/issues/66525

Comment 2 Fabian Deutsch 2019-06-11 09:31:05 UTC
Kedar, are you saying that the OCP 4.1 description is wrong?

Comment 4 Kedar Bidarkar 2019-06-11 11:55:52 UTC
It appears that the way the cpumanager policy is configured for the kubelet has changed. Because of this, the kubelet process no longer shows the cpumanager setting in the output of "ps -ef | grep -i cpu".

Vladik can provide more precise info around this.

He pointed out the links mentioned in comment 1.

Comment 5 Kedar Bidarkar 2019-06-11 11:56:57 UTC
The correct command I was referring to is "ps -ef | grep kubelet | grep -i cpu".
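
For illustration, here is a minimal Python sketch of the kind of check discussed above: looking for the --cpu-manager-policy flag on the kubelet command line. It is not the KubeVirt code. The function name and both sample command lines are made up for this example, and the config-file path in the second sample is an assumption. The point is only that the policy is invisible to this check when the kubelet takes its settings from a config file instead of a flag.

```python
import shlex
from typing import Optional


def policy_from_cmdline(cmdline: str) -> Optional[str]:
    """Return the value of --cpu-manager-policy from a kubelet command line, or None."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args):
        # Form 1: --cpu-manager-policy=static
        if arg.startswith("--cpu-manager-policy="):
            return arg.split("=", 1)[1]
        # Form 2: --cpu-manager-policy static
        if arg == "--cpu-manager-policy" and i + 1 < len(args):
            return args[i + 1]
    return None


# Hypothetical command lines, for illustration only.
with_flag = "/usr/bin/kubelet --cpu-manager-policy=static --kube-reserved=cpu=500m"
with_config_file = "/usr/bin/kubelet --config=/etc/kubernetes/kubelet.conf"

print(policy_from_cmdline(with_flag))         # static
print(policy_from_cmdline(with_config_file))  # None
```

With the second form of command line, a grep on the process list finds nothing even though the policy may well be set to static in the referenced file.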

Comment 6 Fabian Deutsch 2019-06-12 12:36:14 UTC
Kedar, are you saying that a) CPUManager is running but b) it can not be detected anymore using the command in comment 5?

Comment 7 Kedar Bidarkar 2019-06-12 13:07:48 UTC
Yes, CPUManager feature works fine with a workaround [1]. But using the official 4.X docs [2], the auto-labeling of the node to "cpumanager=true" no longer works.

And the workaround is not suitable as we need to configure the cpu-manager-policy on a per-node basis.

[1] - https://stackoverflow.com/questions/54227755/changing-the-cpu-manager-policy-in-kubernetes
[2] - https://docs.openshift.com/container-platform/4.1/scalability_and_performance/using-cpu-manager.html


Additional Info: Currently I am using the workaround to just test the kubevirt side stuff related to CPUManager and to automate the tests around it by using the workaround [1].

Comment 8 Israel Pinto 2019-06-13 02:48:19 UTC
(In reply to Kedar Bidarkar from comment #7)
> Yes, CPUManager feature works fine with a workaround [1]. But using the
> official 4.X docs [2], the auto-labeling of the node to "cpumanager=true" no
> longer works.
> 
> And the workaround is not suitable as we need to configure the
> cpu-manager-policy on a per-node basis.
> 
> [1] -
> https://stackoverflow.com/questions/54227755/changing-the-cpu-manager-policy-
> in-kubernetes
> [2] -
> https://docs.openshift.com/container-platform/4.1/
> scalability_and_performance/using-cpu-manager.html
> 
> 
> Additional Info: Currently I am using the workaround to just test the
> kubevirt side stuff related to CPUManager and to automate the tests around
> it by using the workaround [1].
Since it is not enabled out of the box after installation, can we apply the W/A in your code to enable it?
Or should we document it and fix it in 2.1?

Comment 9 Fabian Deutsch 2019-06-13 11:51:16 UTC
Kedar, please create a new bug on OCP4 noting that CPUManager is not working if we are following the OCP4 docs.
Please make this bug depend on the OCP4 one.

It might "just" be an issue in OCP4, and CNV is behaving correctly (not labeling the nodes).

We might need a known-issue for this.

Comment 12 Kedar Bidarkar 2019-06-19 11:56:26 UTC
Tested with an OCP4 setup 


1. cpumanager on okd4 works fine (tested with pods; CPU pinning works).
2. Auto-labeling fails to set the "cpumanager=true" label on the nodes (this label is applied by the KubeVirt side), which means that VMs that use guaranteed CPUs cannot be scheduled.

The workaround would be to

1) Remove the "CPUManager" entry from kubevirt-config ConfigMap and
2) manually label the nodes with "cpumanager=true" and then run the VMs with guaranteed CPUs on the setup.
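
As a rough sketch of step 2 of this workaround, the following Python helper builds one "oc label node" command per node. The helper name and the node names are hypothetical. The helper only constructs the argument lists and does not run anything against a cluster.

```python
from typing import List


def label_commands(nodes: List[str], label: str = "cpumanager=true") -> List[List[str]]:
    """Build one `oc label node <name> <label>` argument list per node."""
    return [["oc", "label", "node", node, label] for node in nodes]


# Hypothetical node names, for illustration only.
for cmd in label_commands(["worker-0", "worker-1"]):
    print(" ".join(cmd))
```

Each argument list could be passed to subprocess.run on a machine that is logged in to the cluster, once the "CPUManager" entry has been removed from the ConfigMap as described in step 1.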

Comment 13 Fabian Deutsch 2019-06-20 11:47:16 UTC
OK, we will go with a documentation item.

Comment 14 Fabian Deutsch 2019-06-20 11:49:54 UTC
Docs bug: bug #1722451

Comment 15 Fabian Deutsch 2019-07-16 12:10:18 UTC
For OpenShift, the status can be detected on the cluster: https://docs.openshift.com/container-platform/4.1/scalability_and_performance/using-cpu-manager.html

Comment 16 Marcin Franczyk 2019-08-27 13:07:35 UTC
I created a bugfix PR (https://github.com/kubevirt/kubevirt/pull/2639/files). Before it is merged, I need to verify that it works as it should.

Comment 17 Marcin Franczyk 2019-08-30 12:37:05 UTC
I am going to change the approach to how we check the CPU manager policy. Currently we read the kubelet process cmdline. It seems the best option is to get the policy from these files:

/var/lib/origin/openshift.local.volumes/cpu_manager_state - OpenShift 3.11
/var/lib/kubelet/cpu_manager_state - k8s and OpenShift 4

The previous PR must be updated.
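
A minimal sketch of the file-based idea, assuming the state file is the JSON checkpoint written by the kubelet and that it carries the policy in a "policyName" field. The function name and the sample content are made up for illustration. This is not the code from the PR.

```python
import json
from typing import Optional


def policy_from_state(text: str) -> Optional[str]:
    """Return the policy name from cpu_manager_state content, or None if unreadable."""
    try:
        state = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(state, dict):
        return None
    policy = state.get("policyName")
    return policy if isinstance(policy, str) else None


# Made-up sample content; real files also carry other fields.
sample = '{"policyName": "static", "defaultCpuSet": "0-3", "checksum": 1234}'

print(policy_from_state(sample))      # static
print(policy_from_state("not json"))  # None
```

On a node, the text would come from reading one of the paths listed above, which is what would require the extra hostPath mount in virt-handler.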

Comment 18 Marcin Franczyk 2019-09-10 13:21:20 UTC
After speaking with Fabian and Vladik: since we don't want to add any additional hostPath(s) to virt-handler, I am going to check the openshift-node ConfigMap or the MachineConfigPool to discover the cpu-manager policy.
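
To illustrate discovering the policy from cluster objects instead of node files, here is a hedged Python sketch that reads cpuManagerPolicy from a KubeletConfig custom resource of the kind the OCP 4.1 guide has you create. Treating the KubeletConfig resource as the source is an assumption of this example. The comment above names the openshift-node ConfigMap or the MachineConfigPool, and the merged fix may do something else. The function name is hypothetical.

```python
from typing import Any, Mapping, Optional


def policy_from_kubelet_config(resource: Mapping[str, Any]) -> Optional[str]:
    """Return spec.kubeletConfig.cpuManagerPolicy from a KubeletConfig resource, if set."""
    spec = resource.get("spec") or {}
    kubelet = spec.get("kubeletConfig") or {}
    policy = kubelet.get("cpuManagerPolicy")
    return policy if isinstance(policy, str) else None


# Sample resource modelled on the OCP 4.1 CPU manager guide, already parsed into a dict.
sample = {
    "apiVersion": "machineconfiguration.openshift.io/v1",
    "kind": "KubeletConfig",
    "metadata": {"name": "cpumanager-enabled"},
    "spec": {
        "machineConfigPoolSelector": {
            "matchLabels": {"custom-kubelet": "cpumanager-enabled"}
        },
        "kubeletConfig": {
            "cpuManagerPolicy": "static",
            "cpuManagerReconcilePeriod": "5s",
        },
    },
}

print(policy_from_kubelet_config(sample))  # static
print(policy_from_kubelet_config({}))      # None
```

A node would count as having a static CPU manager only if it belongs to a pool matched by machineConfigPoolSelector. That matching step is left out of this sketch.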

Comment 19 Marcin Franczyk 2019-09-13 12:51:51 UTC
Bugfix prepared: https://github.com/kubevirt/kubevirt/pull/2639

Comment 24 Kedar Bidarkar 2019-12-17 14:39:28 UTC
Containers:
  virt-operator:
    Container ID:  cri-o://de2efaf5648ba451103a879efdfe501e029d5665ed0e6422ab883bfa9ea2a073
    Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-virt-operator:v2.2.0-10
    Image ID:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-virt-operator@sha256:02df2ae5b35e57828f8242c47a46e51f9e7b6e6b773455384137990ea75a861b


Containers:
  hyperconverged-cluster-operator:
    Container ID:  cri-o://f99dd957a75149be026aa82709b3a7ca8d31de6e75686489f82cf9e132135a20
    Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-hyperconverged-cluster-operator:v2.2.0-9
    Image ID:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-hyperconverged-cluster-operator@sha256:0b240f7cfc1706668da80c617512ff1e09e463300615a6129401759494f041cf

Ran all the tests related to CPUManager from the kubevirt repo, https://github.com/kubevirt/kubevirt/blob/release-0.23/tests/vmi_configuration_test.go#L1573

Things look good and tests PASSED.

Will attach the logs shortly.

Comment 25 Kedar Bidarkar 2019-12-17 14:40:53 UTC
Created attachment 1645885
Logs for CPUManager test runs

Tests PASSED and things look good.

Comment 27 errata-xmlrpc 2020-01-30 16:27:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0307