Created attachment 1769545 [details]
functests-cn-ran-du.log

Description of problem:
Verification of the CPU affinity mask, CPU reservation and CPU isolation on the worker node fails when running cnf-tests in discovery mode.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-03-092337
CNF tests image: registry-proxy.engineering.redhat.com/rh-osbs/openshift4-cnf-tests:v4.8.0-19

How reproducible:
Run containerised tests in discovery mode against existing 4.8 clusters.

Steps to Reproduce:
1. Run containerised tests in discovery mode against an existing 4.8 cluster for the performance addon operator feature:

podman run -v /root/cnf_ci/cnf-internal-deploy/_cache/:/kubeconfig:Z -v /tmp/artifacts/cn-ran-du:/reports:Z -e CLEAN_PERFORMANCE_PROFILE=false -e CNF_TESTS_IMAGE=openshift4-cnf-tests:v4.8.0-19 -e DPDK_TESTS_IMAGE=dpdk-base:v4.8.0-2 -e IMAGE_REGISTRY=registry-proxy.engineering.redhat.com/rh-osbs/ -e KUBECONFIG=/kubeconfig/kubeconfig -e SCTPTEST_HAS_NON_CNF_WORKERS=false -e DISCOVERY_MODE=true -e NODES_SELECTOR=node-role.kubernetes.io/worker-duprofile= -e ROLE_WORKER_CNF=worker-duprofile -e LATENCY_TEST_RUN=false -e LATENCY_TEST_RUNTIME=600 -e OSLAT_MAXIMUM_LATENCY=200 registry-proxy.engineering.redhat.com/rh-osbs/openshift4-cnf-tests:v4.8.0-19 /usr/bin/test-run.sh -ginkgo.focus 'performance|ptp|sriov|sctp|dpdk|ovn' -junit /reports/ -report /reports/

Actual results:

STEP: Checking the profile perf-example with cpus &v2.CPU{Reserved:(*v2.CPUSet)(0xc0003c68b0), Isolated:(*v2.CPUSet)(0xc0003c6890), BalanceIsolated:(*bool)(nil)}
STEP: Allocatable CPU should be less then capacity by 5
• Failure [0.574 seconds]
[rfe_id:27363][performance] CPU Management
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/cpu_management.go:38
Verification of configuration on the worker node
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/cpu_management.go:74
[test_id:37862][crit:high][vendor:cnf-qe][level:acceptance] Verify CPU affinity mask, CPU reservation and CPU isolation on worker node [It]
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/cpu_management.go:82

Expected
    <string>:
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  x509:
    clientCAFile: /etc/kubernetes/kubelet-ca.crt
  anonymous:
    enabled: false
cgroupDriver: systemd
cgroupRoot: /
clusterDNS:
- 172.30.0.10
clusterDomain: cluster.local
containerLogMaxSize: 50Mi
maxPods: 250
kubeAPIQPS: 50
kubeAPIBurst: 100
rotateCertificates: true
serializeImagePulls: false
staticPodPath: /etc/kubernetes/manifests
systemCgroups: /system.slice
systemReserved:
  ephemeral-storage: 1Gi
featureGates:
  APIPriorityAndFairness: true
  LegacyNodeRoleBehavior: false # Will be removed in future openshift/api update https://github.com/openshift/api/commit/c8c8f6d0f4a8ac4ff4ad7d1a84b27e1aa7ebf9b4
  RemoveSelfLink: false
  NodeDisruptionExclusion: true
  RotateKubeletServerCertificate: true
  SCTPSupport: true
  ServiceNodeExclusion: true
  SupportPodPidsLimit: true
serverTLSBootstrap: true
tlsMinVersion: VersionTLS12
tlsCipherSuites:
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
- TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
to match regular expression
    <string>: "reservedSystemCPUs": ?"0,2,4,6,8"
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/cpu_management.go:98

Expected results:
The test should pass.
@msivak Assigned to you since it's a compute test failing
Created attachment 1769560 [details] tests-artifacts
Created attachment 1770195 [details] PAO must gather
There are three worker nodes in the cluster:

cnfdd5 - with labels node-role.kubernetes.io/worker: "" and node-role.kubernetes.io/worker-cnf: ""
cnfdd6 - with labels node-role.kubernetes.io/worker: "" and node-role.kubernetes.io/worker-cnf: ""
cnfdd7 - with labels node-role.kubernetes.io/worker: "" and node-role.kubernetes.io/worker-duprofile: ""

and two performance profiles:

perf-example.yaml with nodeSelector node-role.kubernetes.io/worker-duprofile: ""
performance.yaml with nodeSelector node-role.kubernetes.io/worker-cnf: ""

Which means that cnfdd7 should be targeted and the perf-example profile should be applied to it.
*** Bug 1946591 has been marked as a duplicate of this bug. ***
In performance-perf-example.yaml I noticed the following:

status:
  conditions:
  - lastTransitionTime: "2021-04-07T14:50:58Z"
    message: 'could not get kubelet config key: error converting kubelet to int:
      strconv.Atoi: parsing "kubelet": invalid syntax'
    status: "False"
    type: Failure
*** Bug 1946589 has been marked as a duplicate of this bug. ***
1. I checked the node labels, the mcp selectors in the performance profile and the node selectors in the target MCP, and based on all that the profile should have been applied to a node -> cnfdd7.
2. I tried to recreate the issue in my environment: installed the same mcp, configured a node with the same labels, and created the performance profile, which gets applied without any issue. I ran the cnf-tests in discovery mode and the only test that failed did so because the stalld daemon was running on the host. This is a known issue and is being looked into here: https://bugzilla.redhat.com/show_bug.cgi?id=1949027
This looks like a bug under the machine config daemon; I saw the same issue on a different machine. It fails here: https://github.com/openshift/machine-config-operator/blob/0c69300057bac1ea65d544ab0e22b378690b2488/pkg/controller/kubelet-config/helpers.go#L179, so it is worth checking the relevant annotations under the kubelet CR and the generated MC.
1. Once we create the first pool with a dash in its name (worker-cnf) and PAO creates a KubeletConfig for it, all is good.
2. When we create a second pool and PAO creates a KubeletConfig for it:
a. The generated name for the first MC will be 99-worker-cnf-generated-kubelet. The code thinks that we already have a suffix (because our pool name contains a dash) - https://github.com/openshift/machine-config-operator/blob/0c69300057bac1ea65d544ab0e22b378690b2488/pkg/controller/kubelet-config/kubelet_config_controller.go#L572 - and creates an additional MC for it with the name 99-worker-cnf-generated-kubelet-kubelet ("kubelet" being treated as the suffix):

annotations:
  machineconfiguration.openshift.io/mc-name-suffix: kubelet

b. Once it tries to generate the MC for the new kubelet config, it fails under https://github.com/openshift/machine-config-operator/blob/0c69300057bac1ea65d544ab0e22b378690b2488/pkg/controller/kubelet-config/helpers.go#L179
Given the relevant code lives in the kubeletconfigcontroller, moving over to the node team to take a look
*** Bug 1946588 has been marked as a duplicate of this bug. ***
Hello, will this be addressed in the 4.8 GA? Dashes not being supported seems like a regression, as this worked in previous releases. For example, the profile name ran-du-eng1-smci00-profile0 used to work, confirmed on 4.6.
The target release is set to 4.8.0 so it will be in the 4.8 GA.
Can you confirm that statement? I tried this on the latest 4.8.0-rc.3 and it definitely did not work. Which RC will the fix be contained in?

ran-du-eng1-smci00-profile0 did not work.
ran.du.fec3.dell03.profile0 worked.
Correction:

ran-du-eng1-smci00-profile0 did not work.
ran.du.eng1.smci00.profile0 worked.

Cluster version is 4.8.0-rc.3.
@schoudha Do you know if the fix is going to be in 4.8 GA?

I verified the fix is in 4.8.0-0.ci. Created two kubeletconfigs and the suffix is as expected:

$ oc get mc
NAME                                GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
99-worker-cnf-generated-kubelet     29813c845a4a3ee8e6856713c585aca834e0bf1e   3.2.0             3m21s
99-worker-cnf-generated-kubelet-1   29813c845a4a3ee8e6856713c585aca834e0bf1e   3.2.0             4s

$ oc describe kubeletconfig.machineconfiguration.openshift.io/worker-cnf
Status:
  Conditions:
    Last Transition Time:  2021-07-09T18:01:55Z
    Message:               Success
    Status:                True
    Type:                  Success
Events:  <none>
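The fixed behaviour visible in the oc output above can be sketched in a few lines of Go. This is my own hedged illustration of the naming scheme (nextGeneratedName is a hypothetical helper, not the controller's code): the first KubeletConfig for a pool gets the bare generated name and each subsequent one appends an increasing numeric suffix, regardless of dashes in the pool name.

```go
package main

import (
	"fmt"
	"strconv"
)

// nextGeneratedName illustrates the post-fix naming scheme: base name
// for the first KubeletConfig of a pool, "-1", "-2", ... for the rest.
// Hypothetical helper for illustration only.
func nextGeneratedName(pool string, existing int) string {
	base := "99-" + pool + "-generated-kubelet"
	if existing == 0 {
		return base
	}
	return base + "-" + strconv.Itoa(existing)
}

func main() {
	fmt.Println(nextGeneratedName("worker-cnf", 0)) // 99-worker-cnf-generated-kubelet
	fmt.Println(nextGeneratedName("worker-cnf", 1)) // 99-worker-cnf-generated-kubelet-1
}
```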
@Qi Wang it will be released in 4.8 GA as the target release is 4.8
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438