Bug 1718944 - CPUManager detection on OCP 4.1 fails

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Container Native Virtualization (CNV)
Classification:	Red Hat
Component:	Virtualization
Sub Component:
Version:	2.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	2.2.0
Assignee:	Fabian Deutsch
QA Contact:	Kedar Bidarkar
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-06-10 15:13 UTC by Kedar Bidarkar
Modified:	2020-01-30 16:27 UTC
CC List:	7 users
Fixed In Version:	hyperconverged-cluster-operator-container-v2.2.0-3 virt-operator-container-v2.2.0-2
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-01-30 16:27:11 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Logs for CPUManager test runs (11.44 KB, text/plain) 2019-12-17 14:40 UTC, Kedar Bidarkar

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHEA-2020:0307	0	None	None	None	2020-01-30 16:27:21 UTC

Description Kedar Bidarkar 2019-06-10 15:13:40 UTC

Description of problem:

1) The kubevirt-config ConfigMap was already configured with the CPUManager feature gate.

[kbidarka@localhost tests]$ oc get cm kubevirt-config -n kubevirt-hyperconverged -o yaml | grep -i feature-gates
  feature-gates: DataVolumes,SRIOV,LiveMigration,CPUManager,CPUNodeDiscovery

2) Followed the OCP4.x guide to set up "cpuManagerPolicy", as described at this link:
https://docs.openshift.com/container-platform/4.1/scalability_and_performance/using-cpu-manager.html


After configuring the cluster as described in the guide above, the "cpumanager=true" label is not set on the nodes of the OCP4 setup.
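The feature-gate check in step 1 can also be done as an exact match instead of a substring grep. This is a minimal sketch, assuming the feature-gates value is a plain comma-separated string as in the output above; the helper name `has_feature_gate` is hypothetical and not part of KubeVirt.

```python
def has_feature_gate(feature_gates: str, name: str) -> bool:
    """Return True if `name` is one of the comma-separated feature gates."""
    gates = [g.strip() for g in feature_gates.split(",")]
    return name in gates


# Value copied from the `oc get cm` output in the description.
gates = "DataVolumes,SRIOV,LiveMigration,CPUManager,CPUNodeDiscovery"
print(has_feature_gate(gates, "CPUManager"))
```

Unlike `grep -i`, this does not match partial names such as "CPU".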


Version-Release number of selected component (if applicable):

OCP4.1 + CNV2.0


How reproducible:

Follow the OCP4.1 guide to set up "cpuManagerPolicy", as described at this link:
https://docs.openshift.com/container-platform/4.1/scalability_and_performance/using-cpu-manager.html


Steps to Reproduce:
1. Follow the OCP4.1 guide to enable "cpuManagerPolicy".
2. Ensure the "kubevirt-config" ConfigMap includes "CPUManager" among its feature gates.

Actual results:
cpumanager=true is not set on the nodes.


Expected results:

cpumanager=true should be set successfully on the nodes.


Additional info:

Comment 1 Kedar Bidarkar 2019-06-10 15:16:44 UTC

Vladik pointed out the following links related to this bug:

a) https://github.com/kubernetes-sigs/node-feature-discovery/issues/165#issuecomment-426530948
b) https://github.com/kubernetes/kubernetes/issues/66525

Comment 2 Fabian Deutsch 2019-06-11 09:31:05 UTC

Kedar, are you saying that the OCP 4.1 description is wrong?

Comment 4 Kedar Bidarkar 2019-06-11 11:55:52 UTC

It appears that the way the cpumanager policy is configured for the kubelet has changed, and because of this the kubelet process no longer shows the cpumanager setting in the output of "ps -ef | grep -i cpu".

Vladik can provide more precise info around this.

He mentioned the links listed in comment 1.

Comment 5 Kedar Bidarkar 2019-06-11 11:56:57 UTC

The correct command I was referring to is "ps -ef | grep kubelet | grep -i cpu".
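To illustrate why a command-line check like this can miss the setting, here is a minimal sketch that looks for the kubelet `--cpu-manager-policy` flag in a process command line. The function name and both sample command lines are hypothetical. They are not taken from a real node, and this is not the actual KubeVirt detection code.

```python
import shlex
from typing import Optional


def policy_from_cmdline(cmdline: str) -> Optional[str]:
    """Return the value of --cpu-manager-policy if it is on the command line."""
    args = shlex.split(cmdline)
    for i, arg in enumerate(args):
        if arg.startswith("--cpu-manager-policy="):
            return arg.split("=", 1)[1]
        if arg == "--cpu-manager-policy" and i + 1 < len(args):
            return args[i + 1]
    return None


# Hypothetical command lines, for illustration only.
with_flag = "kubelet --cpu-manager-policy=static --kube-reserved=cpu=500m"
with_config_file = "kubelet --config=/etc/kubernetes/kubelet.conf"

print(policy_from_cmdline(with_flag))
print(policy_from_cmdline(with_config_file))
```

When the policy is set in a kubelet configuration file rather than as a flag, nothing on the command line reveals it, so the second call finds no policy.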

Comment 6 Fabian Deutsch 2019-06-12 12:36:14 UTC

Kedar, are you saying that a) CPUManager is running but b) it cannot be detected anymore using the command in comment 5?

Comment 7 Kedar Bidarkar 2019-06-12 13:07:48 UTC

Yes, the CPUManager feature works fine with a workaround [1]. But when following the official 4.x docs [2], the auto-labeling of the node with "cpumanager=true" no longer works.

And the workaround is not suitable as we need to configure the cpu-manager-policy on a per-node basis.

[1] - https://stackoverflow.com/questions/54227755/changing-the-cpu-manager-policy-in-kubernetes
[2] - https://docs.openshift.com/container-platform/4.1/scalability_and_performance/using-cpu-manager.html


Additional Info: Currently I am using the workaround [1] only to test the KubeVirt side of CPUManager and to automate the tests around it.

Comment 8 Israel Pinto 2019-06-13 02:48:19 UTC

(In reply to Kedar Bidarkar from comment #7)
Since it is not enabled out of the box after installation, can we fix the workaround in your code to enable it?
Or should we document it and fix it in 2.1?

Comment 9 Fabian Deutsch 2019-06-13 11:51:16 UTC

Kedar, please create a new bug on OCP4 noting that CPUManager does not work when following the OCP4 docs.
Please make this bug depend on the OCP4 one.

It might "just" be an issue in OCP4, and CNV is behaving correctly (not labeling the nodes).

We might need a known-issue for this.

Comment 12 Kedar Bidarkar 2019-06-19 11:56:26 UTC

Tested with an OCP4 setup:


1. The okd4 cpumanager works fine (tested with pods; CPU pinning works).
2. Auto-labeling, which is provided by the KubeVirt side, fails to set the "cpumanager=true" label on the nodes, which means that VMs using guaranteed CPUs cannot be scheduled.

The workaround would be to:

1) Remove the "CPUManager" entry from the kubevirt-config ConfigMap, and
2) Manually label the nodes with "cpumanager=true", then run the VMs with guaranteed CPUs on the setup.
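Step 2 of the workaround could be scripted roughly as follows. This is a sketch only, assuming the `oc` client is on the PATH and logged in to the cluster. The node names and the helper functions are placeholders, and `--overwrite` is added so that re-running the script does not fail on nodes that already carry the label.

```python
import subprocess
from typing import List


def label_command(node: str) -> List[str]:
    """Build the `oc label` invocation that sets cpumanager=true on a node."""
    return ["oc", "label", "node", node, "cpumanager=true", "--overwrite"]


def label_nodes(nodes: List[str], dry_run: bool = True) -> List[List[str]]:
    """Label each node; with dry_run=True only return the commands."""
    commands = [label_command(n) for n in nodes]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


# "worker-0" and "worker-1" are placeholder node names.
for cmd in label_nodes(["worker-0", "worker-1"]):
    print(" ".join(cmd))
```

With the default dry_run=True, nothing is changed on the cluster and the commands are only printed for review.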

Comment 13 Fabian Deutsch 2019-06-20 11:47:16 UTC

OK, we will go with a documentation item.

Comment 14 Fabian Deutsch 2019-06-20 11:49:54 UTC

Docs bug: bug #1722451

Comment 15 Fabian Deutsch 2019-07-16 12:10:18 UTC

For OpenShift, the status can be detected on the cluster: https://docs.openshift.com/container-platform/4.1/scalability_and_performance/using-cpu-manager.html

Comment 16 Marcin Franczyk 2019-08-27 13:07:35 UTC

I created a bugfix PR (https://github.com/kubevirt/kubevirt/pull/2639/files). Before merging, I need to verify that it works as it should.

Comment 17 Marcin Franczyk 2019-08-30 12:37:05 UTC

I am going to change how we check the CPU manager policy. Currently we read the kubelet process command line; it seems best to read the policy from these files instead:

/var/lib/origin/openshift.local.volumes/cpu_manager_state - OpenShift 3.11
/var/lib/kubelet/cpu_manager_state - k8s and OpenShift 4

The previous PR must be updated.
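The file-based check proposed in this comment might look like the sketch below. It assumes the state file is JSON with a top-level "policyName" key, which this bug does not confirm. The function names and the sample contents are illustrative only and are not taken from the PR.

```python
import json
from typing import Optional

# Paths quoted in the comment above.
STATE_FILES = [
    "/var/lib/kubelet/cpu_manager_state",
    "/var/lib/origin/openshift.local.volumes/cpu_manager_state",
]


def parse_policy(state_text: str) -> Optional[str]:
    """Extract the policy name from the contents of a state file.

    Assumes the file is JSON with a top-level "policyName" key.
    """
    try:
        state = json.loads(state_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(state, dict):
        return None
    return state.get("policyName")


def read_policy() -> Optional[str]:
    """Return the policy from the first readable state file, if any."""
    for path in STATE_FILES:
        try:
            with open(path, encoding="utf-8") as f:
                return parse_policy(f.read())
        except OSError:
            continue
    return None


# Illustrative file contents, not captured from a real node.
sample = '{"policyName": "static", "defaultCpuSet": "0-3"}'
print(parse_policy(sample))
```

Reading these paths from inside a pod requires the host directory to be mounted into the container.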

Comment 18 Marcin Franczyk 2019-09-10 13:21:20 UTC

After speaking with Fabian and Vladik: since we don't want to add any additional hostPath(s) to virt-handler, I am going to check the openshift-node ConfigMap or the MachineConfigPool to discover the cpu-manager policy.

Comment 19 Marcin Franczyk 2019-09-13 12:51:51 UTC

Bugfix prepared: https://github.com/kubevirt/kubevirt/pull/2639

Comment 24 Kedar Bidarkar 2019-12-17 14:39:28 UTC

Containers:
  virt-operator:
    Container ID:  cri-o://de2efaf5648ba451103a879efdfe501e029d5665ed0e6422ab883bfa9ea2a073
    Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-virt-operator:v2.2.0-10
    Image ID:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-virt-operator@sha256:02df2ae5b35e57828f8242c47a46e51f9e7b6e6b773455384137990ea75a861b


Containers:
  hyperconverged-cluster-operator:
    Container ID:  cri-o://f99dd957a75149be026aa82709b3a7ca8d31de6e75686489f82cf9e132135a20
    Image:         registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-hyperconverged-cluster-operator:v2.2.0-9
    Image ID:      registry-proxy.engineering.redhat.com/rh-osbs/container-native-virtualization-hyperconverged-cluster-operator@sha256:0b240f7cfc1706668da80c617512ff1e09e463300615a6129401759494f041cf

Ran all the tests related to CPUManager from the kubevirt repo, https://github.com/kubevirt/kubevirt/blob/release-0.23/tests/vmi_configuration_test.go#L1573

Things look good and tests PASSED.

Will attach the logs shortly.

Comment 25 Kedar Bidarkar 2019-12-17 14:40:53 UTC

Created attachment 1645885
Logs for CPUManager test runs

Tests PASSED and things look good.

Comment 27 errata-xmlrpc 2020-01-30 16:27:11 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0307
