Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1788061

Summary: KubeletConfig controller should respect all feature gates resources
Product: OpenShift Container Platform
Component: Node
Version: 4.4
Target Release: 4.4.0
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED NOTABUG
Type: Bug
Reporter: Artyom <alukiano>
Assignee: Ryan Phillips <rphillips>
QA Contact: Sunil Choudhary <schoudha>
CC: aos-bugs, augol, jokerman, rphillips, scuppett, william.caban
Doc Type: If docs needed, set a value
Last Closed: 2020-02-27 15:29:38 UTC
Bug Blocks: 1771572

Description Artyom 2020-01-06 09:49:54 UTC
Description of problem:

Under my cluster, I have two feature gates resources

# oc describe featuregate
Name:         cluster
Namespace:    
Labels:       <none>
Annotations:  release.openshift.io/create-only: true
API Version:  config.openshift.io/v1
Kind:         FeatureGate
Metadata:
  Creation Timestamp:  2020-01-05T15:00:37Z
  Generation:          1
  Resource Version:    1608
  Self Link:           /apis/config.openshift.io/v1/featuregates/cluster
  UID:                 f8b5d7e0-30ea-4aa4-ba28-037477430806
Spec:
Events:  <none>


Name:         latency-sensetive
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         FeatureGate
Metadata:
  Creation Timestamp:  2020-01-06T08:42:29Z
  Generation:          1
  Owner References:
    API Version:           performance.openshift.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  PerformanceProfile
    Name:                  ci
    UID:                   91e3712a-0018-4c46-b4bc-60214443c62d
  Resource Version:        334393
  Self Link:               /apis/config.openshift.io/v1/featuregates/latency-sensetive
  UID:                     c3046163-4d2b-4d68-a1c0-5ae05d6615d4
Spec:
  Feature Set:  LatencySensitive
Events:         <none>

and one KubeletConfig:

# oc describe kubeletconfig
Name:         performance-ci
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         KubeletConfig
Metadata:
  Creation Timestamp:  2020-01-06T08:42:29Z
  Finalizers:
    99-performance-ci-819adb4c-a908-45b9-a6f9-bbfd651320a9-kubelet
  Generation:  1
  Owner References:
    API Version:           performance.openshift.io/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  PerformanceProfile
    Name:                  ci
    UID:                   91e3712a-0018-4c46-b4bc-60214443c62d
  Resource Version:        338475
  Self Link:               /apis/machineconfiguration.openshift.io/v1/kubeletconfigs/performance-ci
  UID:                     2ddb3962-db6f-482a-9a1b-4bd51dadc714
Spec:
  Kubelet Config:
    Authentication:
      Anonymous:
      Webhook:
        Cache TTL:  0s
      x509:
    Authorization:
      Webhook:
        Cache Authorized TTL:             0s
        Cache Unauthorized TTL:           0s
    Cpu Manager Policy:                   static
    Cpu Manager Reconcile Period:         5s
    Eviction Pressure Transition Period:  0s
    File Check Frequency:                 0s
    Http Check Frequency:                 0s
    Image Minimum GC Age:                 0s
    Kube Reserved:
      Cpu:                              1000m
      Memory:                           500Mi
    Node Status Report Frequency:       0s
    Node Status Update Frequency:       0s
    Reserved System CP Us:              0-1
    Runtime Request Timeout:            0s
    Streaming Connection Idle Timeout:  0s
    Sync Frequency:                     0s
    System Reserved:
      Cpu:                    1000m
      Memory:                 500Mi
    Topology Manager Policy:  best-effort
    Volume Stats Agg Period:  0s
  Machine Config Pool Selector:
    Match Labels:
      machineconfigpool.openshift.io/role:  performance-ci
Status:
  Conditions:
    Last Transition Time:  2020-01-06T08:42:30Z
    Message:               Success
    Status:                True
    Type:                  Success
    Last Transition Time:  2020-01-06T08:48:35Z
    Message:               Success
    Status:                True
    Type:                  Success
Events:                    <none>

The KubeletConfig was created after the new FeatureGate resource, so the kubelet config controller rendered the machine config with the kubelet Topology Manager parameters, but without the corresponding TopologyManager feature gate.

# oc describe mcp performance-ci
...
Configuration:
    Name:  rendered-performance-ci-0779df28fd2847676ff858646dd7bf03
    Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker-chronyd-custom
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-container-runtime
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   98-performance-ci-819adb4c-a908-45b9-a6f9-bbfd651320a9-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   98-worker-6a886427-d155-4c29-9eef-f9f9c56274e2-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-performance-ci-819adb4c-a908-45b9-a6f9-bbfd651320a9-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-6a886427-d155-4c29-9eef-f9f9c56274e2-registries
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-ssh
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   performance-ci
...

# oc describe mc 98-worker-6a886427-d155-4c29-9eef-f9f9c56274e2-kubelet
...
Source: data:text/plain,...SupportPodPidsLimit%22%3Atrue%2C%22TopologyManager%22%3Atrue%7D%2C%22containerLogMaxSize
...

# oc describe mc 99-performance-ci-819adb4c-a908-45b9-a6f9-bbfd651320a9-kubelet
...
data:text/plain,...SupportPodPidsLimit%22%3Atrue%7D%2C%22containerLogMaxSize
...
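The feature-gate difference between the two rendered machine configs is easier to see after URL-decoding the `data:` sources. A minimal sketch, using the truncated fragments from the `oc describe mc` output above:

```python
from urllib.parse import unquote

# Truncated "Source" fragments from the two machine configs above
worker_mc = "SupportPodPidsLimit%22%3Atrue%2C%22TopologyManager%22%3Atrue%7D"
performance_mc = "SupportPodPidsLimit%22%3Atrue%7D"

print(unquote(worker_mc))       # SupportPodPidsLimit":true,"TopologyManager":true}
print(unquote(performance_mc))  # SupportPodPidsLimit":true}
```

The 98-worker kubelet machine config carries the TopologyManager gate; the 99-performance-ci one, rendered by the KubeletConfig controller, does not.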

Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.4.0-0.ci-2020-01-05-092524
Server Version: 4.4.0-0.ci-2020-01-05-092524
Kubernetes Version: v1.17.0

How reproducible:
Always

Steps to Reproduce:
1. Create an additional FeatureGate resource
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: latency-sensetive
spec:
  featureSet: LatencySensitive

2. Wait for the master and worker MCPs to finish updating

3. Label the worker MCP with "machineconfigpool.openshift.io/role: worker"

4. Create KubeletConfig
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: performance
spec:
  kubeletConfig:
    cpuManagerPolicy: static
    cpuManagerReconcilePeriod: 5s
    topologyManagerPolicy: best-effort
  machineConfigPoolSelector:
    matchLabels:
      machineconfigpool.openshift.io/role: worker

5. Wait for the worker MCP to update

Actual results:
The MCP never reaches the Updated condition, because the node fails to start the kubelet:

06 09:47:36 worker-0 hyperkube[4972]: F0106 09:47:36.520874    4972 server.go:215] invalid configuration: TopologyManager best-effort requires feature gate TopologyManager

Expected results:
The worker MCP should reach the Updated status and all worker nodes should be up.

Additional info:
Workaround: I did not verify it, but re-creating the feature gate will probably trigger an update of the configuration under the MCP.

The problematic part of the code: https://github.com/openshift/machine-config-operator/blob/04cd2198cae247fabcd3154669618d74f124f27f/pkg/controller/kubelet-config/kubelet_config_controller.go#L409
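For illustration, the behavior the report asks for amounts to taking the union of gates enabled by every FeatureGate resource, rather than reading only one of them. A minimal sketch, assuming a hypothetical feature-set table (the names below are illustrative; the actual controller is Go code at the link above):

```python
# Hypothetical mapping of feature sets to the gates they enable,
# loosely modeled on openshift/api feature-set definitions.
FEATURE_SETS = {
    "": {"SupportPodPidsLimit": True},  # default feature set
    "LatencySensitive": {"SupportPodPidsLimit": True, "TopologyManager": True},
}

def merge_feature_gates(feature_gate_resources):
    """Union the gates enabled by every FeatureGate resource in the cluster."""
    merged = {}
    for fg in feature_gate_resources:
        feature_set = fg.get("spec", {}).get("featureSet", "")
        merged.update(FEATURE_SETS.get(feature_set, {}))
    return merged

# The two FeatureGate resources from the description: "cluster" (empty spec)
# and "latency-sensetive" (featureSet: LatencySensitive).
gates = merge_feature_gates([
    {"metadata": {"name": "cluster"}, "spec": {}},
    {"metadata": {"name": "latency-sensetive"},
     "spec": {"featureSet": "LatencySensitive"}},
])
print(gates)  # the merged result includes "TopologyManager": True
```

With this merge, the rendered kubelet config would carry the TopologyManager gate regardless of which FeatureGate resource enabled it.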

Comment 1 Ryan Phillips 2020-01-06 16:53:21 UTC
The feature gate CR name should be 'cluster'. Please retry with this name.

Comment 2 Artyom 2020-01-07 08:22:06 UTC
That is exactly the reason why I opened the bug. I can see that https://github.com/openshift/machine-config-operator/blob/04cd2198cae247fabcd3154669618d74f124f27f/pkg/controller/kubelet-config/kubelet_config_features.go#L47 respects all feature gate resources, so why can the kubelet controller not do the same?

I was expecting the controller to check all feature gate resources and merge them into the rendered machine config.

Comment 3 Artyom 2020-01-07 15:16:50 UTC
An additional question: under the machine-config-pool I can see a number of kubelet machine configs

Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   98-performance-ci-820350b6-4c52-41ec-9d9a-0a88b34e97ef-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   98-worker-247dbe02-ec39-4ea9-b514-e258aecf0729-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-performance-ci-820350b6-4c52-41ec-9d9a-0a88b34e97ef-kubelet

I expected the machine-config-pool to render a machine config that merges all of these configs, but that does not appear to be the case.
From what I can see, the merge method at https://github.com/openshift/machine-config-operator/blob/04cd2198cae247fabcd3154669618d74f124f27f/pkg/apis/machineconfiguration.openshift.io/v1/helpers.go#L17 does not really merge; it just appends ignition configs, so in our case the last ignition config containing /etc/kubernetes/kubelet.conf overrides the whole configuration.

The same applies to the `cluster` feature gate: if we first create the KubeletConfig resource and only afterwards update the cluster FeatureGate resource, we hit the same problem, because the KubeletConfig is generated without the TopologyManager feature gate, and the kubelet config sync method is not triggered on a feature gate resource change.
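The append-rather-than-merge behavior described above can be illustrated with a small sketch: when two machine configs both carry /etc/kubernetes/kubelet.conf and the file lists are simply concatenated, the last copy wins (hypothetical data structures, not the actual MCO helper):

```python
def append_ignition_files(configs):
    """Mimic a 'merge' that just concatenates each config's file list."""
    files = []
    for cfg in configs:
        files.extend(cfg["storage"]["files"])
    return {"storage": {"files": files}}

def resolve(files, path):
    """With append-only merging, the last entry for a path wins."""
    winner = None
    for f in files:
        if f["path"] == path:
            winner = f
    return winner

mc_98 = {"storage": {"files": [
    {"path": "/etc/kubernetes/kubelet.conf",
     "contents": "with TopologyManager gate"}]}}
mc_99 = {"storage": {"files": [
    {"path": "/etc/kubernetes/kubelet.conf",
     "contents": "without TopologyManager gate"}]}}

merged = append_ignition_files([mc_98, mc_99])
winner = resolve(merged["storage"]["files"], "/etc/kubernetes/kubelet.conf")
print(winner["contents"])  # prints "without TopologyManager gate"
```

This is why the 99-performance-ci kubelet config, rendered without the gate, ends up overriding the 98-worker one that has it.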

Comment 4 Ryan Phillips 2020-01-07 15:34:07 UTC
The config objects are singletons and so the 'cluster' KubeletConfig is the only supported config.

Comment 5 Ryan Phillips 2020-02-27 15:29:38 UTC
Closing since configs are singletons.

Comment 6 Red Hat Bugzilla 2023-09-14 05:49:25 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days