Bug 1946584 - Machine-config controller fails to generate MC when a machine config pool with dashes in its name is present in the cluster
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.8.0
Assignee: Qi Wang
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1946588 1946589 1946591
Depends On:
Blocks: 2008588
Reported: 2021-04-06 12:59 UTC by Sabina Aledort
Modified: 2021-09-28 15:48 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2008588
Environment:
Last Closed: 2021-07-27 22:57:48 UTC
Target Upstream Version:
Embargoed:


Attachments
functests-cn-ran-du.log (80.48 KB, text/plain), 2021-04-06 12:59 UTC, Sabina Aledort
tests-artifacts (234.94 KB, application/x-xz), 2021-04-06 13:05 UTC, Sabina Aledort
PAO must gather (760.09 KB, application/x-xz), 2021-04-08 10:40 UTC, Sabina Aledort


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2598 0 None open Bug 1946584: Check suffix annotation is a number 2021-06-04 04:47:16 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:58:04 UTC

Internal Links: 1946589 1946591

Description Sabina Aledort 2021-04-06 12:59:46 UTC
Created attachment 1769545
functests-cn-ran-du.log

Description of problem:
The performance test that verifies the CPU affinity mask, CPU reservation and CPU isolation on a worker node fails when cnf-tests are run in discovery mode.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-03-092337
CNF tests image: registry-proxy.engineering.redhat.com/rh-osbs/openshift4-cnf-tests:v4.8.0-19

How reproducible:
Run containerised tests in discovery mode against existing 4.8 clusters.

Steps to Reproduce:
Run containerised tests in discovery mode against existing 4.8 clusters for performance addon operator feature:

podman run \
  -v /root/cnf_ci/cnf-internal-deploy/_cache/:/kubeconfig:Z \
  -v /tmp/artifacts/cn-ran-du:/reports:Z \
  -e CLEAN_PERFORMANCE_PROFILE=false \
  -e CNF_TESTS_IMAGE=openshift4-cnf-tests:v4.8.0-19 \
  -e DPDK_TESTS_IMAGE=dpdk-base:v4.8.0-2 \
  -e IMAGE_REGISTRY=registry-proxy.engineering.redhat.com/rh-osbs/ \
  -e KUBECONFIG=/kubeconfig/kubeconfig \
  -e SCTPTEST_HAS_NON_CNF_WORKERS=false \
  -e DISCOVERY_MODE=true \
  -e NODES_SELECTOR=node-role.kubernetes.io/worker-duprofile= \
  -e ROLE_WORKER_CNF=worker-duprofile \
  -e LATENCY_TEST_RUN=false \
  -e LATENCY_TEST_RUNTIME=600 \
  -e OSLAT_MAXIMUM_LATENCY=200 \
  registry-proxy.engineering.redhat.com/rh-osbs/openshift4-cnf-tests:v4.8.0-19 \
  /usr/bin/test-run.sh -ginkgo.focus 'performance|ptp|sriov|sctp|dpdk|ovn' -junit /reports/ -report /reports/

Actual results:
STEP: Checking the profile perf-example with cpus &v2.CPU{Reserved:(*v2.CPUSet)(0xc0003c68b0), Isolated:(*v2.CPUSet)(0xc0003c6890), BalanceIsolated:(*bool)(nil)}
STEP: Allocatable CPU should be less then capacity by 5
• Failure [0.574 seconds]
[rfe_id:27363][performance] CPU Management
/remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/cpu_management.go:38
  Verification of configuration on the worker node
  /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/cpu_management.go:74
    [test_id:37862][crit:high][vendor:cnf-qe][level:acceptance] Verify CPU affinity mask, CPU reservation and CPU isolation on worker node [It]
    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/cpu_management.go:82

    Expected
        <string>: kind: KubeletConfiguration
        apiVersion: kubelet.config.k8s.io/v1beta1
        authentication:
          x509:
            clientCAFile: /etc/kubernetes/kubelet-ca.crt
          anonymous:
            enabled: false
        cgroupDriver: systemd
        cgroupRoot: /
        clusterDNS:
          - 172.30.0.10
        clusterDomain: cluster.local
        containerLogMaxSize: 50Mi
        maxPods: 250
        kubeAPIQPS: 50
        kubeAPIBurst: 100
        rotateCertificates: true
        serializeImagePulls: false
        staticPodPath: /etc/kubernetes/manifests
        systemCgroups: /system.slice
        systemReserved:
          ephemeral-storage: 1Gi
        featureGates:
          APIPriorityAndFairness: true
          LegacyNodeRoleBehavior: false
          # Will be removed in future openshift/api update https://github.com/openshift/api/commit/c8c8f6d0f4a8ac4ff4ad7d1a84b27e1aa7ebf9b4
          RemoveSelfLink: false
          NodeDisruptionExclusion: true
          RotateKubeletServerCertificate: true
          SCTPSupport: true
          ServiceNodeExclusion: true
          SupportPodPidsLimit: true
        serverTLSBootstrap: true
        tlsMinVersion: VersionTLS12
        tlsCipherSuites:
          - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
          - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
          - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
          - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
          - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
          - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
    to match regular expression
        <string>: "reservedSystemCPUs": ?"0,2,4,6,8"

    /remote-source/app/vendor/github.com/openshift-kni/performance-addon-operators/functests/1_performance/cpu_management.go:98

Expected results:
The test should pass.

Comment 1 Federico Paolinelli 2021-04-06 13:04:20 UTC
@msivak Assigned to you since it's a compute test failing

Comment 2 Sabina Aledort 2021-04-06 13:05:06 UTC
Created attachment 1769560
tests-artifacts

Comment 4 Sabina Aledort 2021-04-08 10:40:10 UTC
Created attachment 1770195
PAO must gather

Comment 5 Swati Sehgal 2021-04-08 12:08:46 UTC
There are three worker nodes in the cluster:
cnfdd5 - with labels node-role.kubernetes.io/worker: "" and node-role.kubernetes.io/worker-cnf: ""
cnfdd6 - with labels node-role.kubernetes.io/worker: "" and node-role.kubernetes.io/worker-cnf: ""
cnfdd7 - with labels node-role.kubernetes.io/worker: "" and node-role.kubernetes.io/worker-duprofile: ""

There are two performance profiles:
perf-example.yaml with nodeSelector node-role.kubernetes.io/worker-duprofile: ""
performance.yaml with nodeSelector node-role.kubernetes.io/worker-cnf: ""

This means that cnfdd7 should be targeted by perf-example and that profile should be applied to it.
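
The selector reasoning above can be checked with a small Go sketch. It is only an illustration: matches, targets and clusterNodes are hypothetical helpers written for this comment, not code from PAO or the machine-config-operator, and the sketch assumes a plain "every selector pair must be present" match.

```go
package main

import (
	"fmt"
	"sort"
)

// matches reports whether every key/value pair in selector is present in labels.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// clusterNodes returns the node labels listed in this comment (hypothetical helper).
func clusterNodes() map[string]map[string]string {
	return map[string]map[string]string{
		"cnfdd5": {"node-role.kubernetes.io/worker": "", "node-role.kubernetes.io/worker-cnf": ""},
		"cnfdd6": {"node-role.kubernetes.io/worker": "", "node-role.kubernetes.io/worker-cnf": ""},
		"cnfdd7": {"node-role.kubernetes.io/worker": "", "node-role.kubernetes.io/worker-duprofile": ""},
	}
}

// targets returns the sorted names of the nodes whose labels satisfy selector.
func targets(nodes map[string]map[string]string, selector map[string]string) []string {
	var names []string
	for name, labels := range nodes {
		if matches(labels, selector) {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// nodeSelector of perf-example.yaml
	perfExample := map[string]string{"node-role.kubernetes.io/worker-duprofile": ""}
	fmt.Println(targets(clusterNodes(), perfExample))
}
```

With the labels listed above, only cnfdd7 satisfies the perf-example selector, while cnfdd5 and cnfdd6 satisfy the selector of performance.yaml.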

Comment 6 Swati Sehgal 2021-04-08 12:10:35 UTC
*** Bug 1946591 has been marked as a duplicate of this bug. ***

Comment 7 Swati Sehgal 2021-04-08 12:46:07 UTC
In performance-perf-example.yaml I noticed the following:

status:
  conditions:
  - lastTransitionTime: "2021-04-07T14:50:58Z"
    message: 'could not get kubelet config key: error converting kubelet to int: strconv.Atoi:
      parsing "kubelet": invalid syntax'
    status: "False"
    type: Failure
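
The tail of that message is the standard error text of Go's strconv.Atoi when it is given a non-numeric string. A minimal sketch that reproduces it follows; parseSuffix is a hypothetical helper, and its wrapping text is only assumed to resemble the operator's, based on the condition message above.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseSuffix is a hypothetical helper, not the operator's code: it converts a
// name suffix to an integer and wraps any conversion failure.
func parseSuffix(suffix string) (int, error) {
	n, err := strconv.Atoi(suffix)
	if err != nil {
		return 0, fmt.Errorf("error converting %s to int: %w", suffix, err)
	}
	return n, nil
}

func main() {
	// "kubelet" is the value that appears in the failure condition above.
	_, err := parseSuffix("kubelet")
	fmt.Println(err)
}
```

Running this prints: error converting kubelet to int: strconv.Atoi: parsing "kubelet": invalid syntax. That suggests the controller is trying to read the string "kubelet" as a numeric suffix.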

Comment 8 Swati Sehgal 2021-04-08 12:47:46 UTC
*** Bug 1946589 has been marked as a duplicate of this bug. ***

Comment 9 Swati Sehgal 2021-04-16 10:59:01 UTC
1. I checked the node labels, the MCP selectors in the performance profile and the node selectors in the target MCP. Based on all of that, the profile should have been applied to node cnfdd7.
2. I tried to recreate the issue in my environment. I installed the same MCP, configured a node with the same labels, and created the performance profile, which was applied without any issue. I ran the cnf-tests in discovery mode, and the only tests that failed did so because the stalld daemon was running on the host. This is a known issue and is being looked into here: https://bugzilla.redhat.com/show_bug.cgi?id=1949027

Comment 10 Artyom 2021-05-25 10:22:15 UTC
It looks like a bug in the machine config daemon; I saw the same issue on a different machine.
This is where it fails: https://github.com/openshift/machine-config-operator/blob/0c69300057bac1ea65d544ab0e22b378690b2488/pkg/controller/kubelet-config/helpers.go#L179, so it is worth checking the relevant annotations on the kubelet CR and the generated MC.

Comment 11 Artyom 2021-05-25 11:22:32 UTC
1. When we create the first pool with a dash in its name (worker-cnf) and PAO creates a KubeletConfig for it, all is good.
2. We create a second pool, and PAO creates a KubeletConfig for it.
   a. The generated name for the first MC is 99-worker-cnf-generated-kubelet. Because the pool name contains a dash, the code thinks the name carries a suffix (https://github.com/openshift/machine-config-operator/blob/0c69300057bac1ea65d544ab0e22b378690b2488/pkg/controller/kubelet-config/kubelet_config_controller.go#L572) and creates an additional MC for it named 99-worker-cnf-generated-kubelet-kubelet ("kubelet" comes from the suffix annotation
      machineconfiguration.openshift.io/mc-name-suffix: kubelet).
   b. When it tries to generate the MC for the new kubelet config, it fails at https://github.com/openshift/machine-config-operator/blob/0c69300057bac1ea65d544ab0e22b378690b2488/pkg/controller/kubelet-config/helpers.go#L179
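
The failure mode described above can be sketched in Go as follows. This is an illustration under assumptions, not the operator's code: naiveSuffix and numericSuffix are hypothetical names, and the sketch assumes the suffix is taken to be whatever follows the last dash in the MC name.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// naiveSuffix mimics the behaviour described above (an assumption about the
// logic): everything after the last dash is treated as the MC's suffix.
func naiveSuffix(mcName string) string {
	return mcName[strings.LastIndex(mcName, "-")+1:]
}

// numericSuffix accepts the trailing segment only when it parses as an integer,
// so a name that merely contains dashes is reported as having no suffix.
func numericSuffix(mcName string) (int, bool) {
	n, err := strconv.Atoi(naiveSuffix(mcName))
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	names := []string{
		"99-worker-cnf-generated-kubelet",   // first MC, no real suffix
		"99-worker-cnf-generated-kubelet-1", // second MC, suffix 1
	}
	for _, name := range names {
		n, ok := numericSuffix(name)
		fmt.Printf("%s: naive=%q numeric=%d hasSuffix=%t\n", name, naiveSuffix(name), n, ok)
	}
}
```

Checking that the trailing segment is a number before treating it as a suffix is one way to avoid misreading "kubelet" as a suffix. The linked pull request title ("Check suffix annotation is a number") points in the same direction, though its actual implementation may differ.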

Comment 12 Yu Qi Zhang 2021-05-26 18:34:46 UTC
Given the relevant code lives in the kubeletconfigcontroller, moving over to the node team to take a look

Comment 14 Artyom 2021-06-01 13:49:24 UTC
*** Bug 1946588 has been marked as a duplicate of this bug. ***

Comment 18 Dave Cain 2021-07-08 02:12:59 UTC
Hello, will this be addressed in the 4.8 GA? Not being able to use dashes seems like a regression, as this worked in previous releases.

example: this used to work: ran-du-eng1-smci00-profile0, confirmed on 4.6

Comment 19 Qi Wang 2021-07-09 14:51:17 UTC
The target release is set to 4.8.0 so it will be in the 4.8 GA.

Comment 20 Dave Cain 2021-07-09 14:57:41 UTC
Can you confirm that statement?  I tried this in the latest 4.8.0-rc.3 and it for sure did not work.  Which RC will the fix be contained in?

ran-du-eng1-smci00-profile0 did not work.
ran.du.fec3.dell03.profile0 worked.

Comment 21 Dave Cain 2021-07-09 14:58:42 UTC
Correction:

ran-du-eng1-smci00-profile0 did not work.
ran.du.eng1.smci00.profile0 worked.

Cluster version is 4.8.0-rc.3.

Comment 22 Qi Wang 2021-07-09 18:12:03 UTC
@schoudha Do you know if the fix is going to be in 4.8 GA?

I verified the fix is in 4.8.0-0.ci. I created two KubeletConfigs and the suffix is as expected.
$ oc get mc
NAME                                                   GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
99-worker-cnf-generated-kubelet                        29813c845a4a3ee8e6856713c585aca834e0bf1e   3.2.0             3m21s
99-worker-cnf-generated-kubelet-1                      29813c845a4a3ee8e6856713c585aca834e0bf1e   3.2.0             4s

$ oc describe kubeletconfig.machineconfiguration.openshift.io/worker-cnf
Status:
  Conditions:
    Last Transition Time:  2021-07-09T18:01:55Z
    Message:               Success
    Status:                True
    Type:                  Success
Events:                    <none>

Comment 23 Sunil Choudhary 2021-07-16 06:27:51 UTC
@Qi Wang it will be released in 4.8 GA as the target release is 4.8

Comment 25 errata-xmlrpc 2021-07-27 22:57:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
