Bug 2087684

Summary: KCMO should not be able to apply LowUpdateSlowReaction from Default WorkerLatencyProfile
Product: OpenShift Container Platform
Reporter: Harshal Patil <harpatil>
Component: kube-controller-manager
Assignee: Swarup Ghosh <swghosh>
Status: CLOSED ERRATA
QA Contact: Weinan Liu <weinliu>
Severity: high
Priority: high
Version: 4.11
CC: fkrepins, mfojtik, nagrawal, rphillips, weinliu
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-08-10 11:12:53 UTC
Type: Bug

Description Harshal Patil 2022-05-18 08:53:02 UTC
During testing with the latest nightly, it appears that the Kube Controller Manager Operator is able to apply the LowUpdateSlowReaction profile even though the cluster is at the Default WorkerLatencyProfile.


This violates the cluster stability analysis [1]: transitions from Default to LowUpdateSlowReaction, and vice versa, should be prohibited.


[1] https://github.com/openshift/enhancements/blob/master/enhancements/worker-latency-profile/worker-latency-profile.md#default---lowupdateslowreaction
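
For reference, the prohibited transition is requested by setting spec.workerLatencyProfile on the cluster-scoped Nodes config object. A minimal reproduction sketch follows; the exact command is not part of this report, so the patch below is only an assumption based on the enhancement and the outputs in the comments:

```
# Assumed reproduction step: request the extreme transition directly
# while the cluster is still at the Default profile.
$ oc patch nodes.config cluster --type merge \
    -p '{"spec": {"workerLatencyProfile": "LowUpdateSlowReaction"}}'
```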

Comment 1 Swarup Ghosh 2022-05-27 11:04:12 UTC
The proposed fix in the linked PR [1] rejects the update when a user tries to move from the Default profile to the LowUpdateSlowReaction profile.

Initially, while at the Default profile:

```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: Default
$ oc get KubeControllerManager -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: KubeControllerManager
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
      release.openshift.io/create-only: "true"
    creationTimestamp: "2022-05-26T07:56:51Z"
    generation: 11
    name: cluster
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 301ae517-73af-4a49-a197-28028d6c4761
    resourceVersion: "614299"
    uid: d2af356f-2223-4773-ac13-09ef6fe5e5b3
  spec:
    logLevel: Normal
    managementState: Managed
    observedConfig:
      extendedArguments:
        cloud-config:
        - /etc/kubernetes/static-pod-resources/configmaps/cloud-config/cloud.conf
        cloud-provider:
        - gce
        cluster-cidr:
        - 10.128.0.0/14
        cluster-name:
        - swghosh-20220526-361d-m9ztz
        feature-gates:
        - APIPriorityAndFairness=true
        - RotateKubeletServerCertificate=true
        - DownwardAPIHugePages=true
        - PodSecurity=true
        - CSIMigrationAWS=false
        - CSIMigrationGCE=false
        - CSIMigrationAzureFile=false
        - CSIMigrationvSphere=false
        node-monitor-grace-period:
        - 40s
        service-cluster-ip-range:
        - 172.30.0.0/16
      serviceServingCert:
        certFile: /etc/kubernetes/static-pod-resources/configmaps/service-ca/ca-bundle.crt
      servingInfo:
        cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
        minTLSVersion: VersionTLS12
    operatorLogLevel: Normal
    unsupportedConfigOverrides: null
    useMoreSecureServiceCA: true
  status:
    conditions:
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: NoUnsupportedConfigOverrides
      status: "True"
      type: UnsupportedConfigOverridesUpgradeable
    - lastTransitionTime: "2022-05-26T09:24:29Z"
      status: "False"
      type: InstallerControllerDegraded
    - lastTransitionTime: "2022-05-26T08:05:22Z"
      message: 3 nodes are active; 3 nodes are at revision 28
      status: "True"
      type: StaticPodsAvailable
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: 3 nodes are at revision 28
      reason: AllNodesAtLatestRevision
      status: "False"
      type: NodeInstallerProgressing
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      status: "False"
      type: NodeInstallerDegraded
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      message: All master nodes are ready
      reason: MasterNodesReady
      status: "False"
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      reason: ProfileUpdated
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: all static pod revision(s) have updated latency profile
      reason: ProfileUpdated
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
    - lastTransitionTime: "2022-05-27T10:43:42Z"
      status: "False"
      type: RevisionControllerDegraded
    - lastTransitionTime: "2022-05-26T08:09:52Z"
      status: "False"
      type: ConfigObservationDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodPendingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodContainerWaitingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodNetworkingDegraded
    - lastTransitionTime: "2022-05-26T08:11:52Z"
      reason: AsExpected
      status: "False"
      type: BackingResourceControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:39Z"
      status: "False"
      type: SATokenSignerDegraded
    - lastTransitionTime: "2022-05-26T08:12:00Z"
      reason: AsExpected
      status: "False"
      type: KubeControllerManagerStaticResourcesDegraded
    - lastTransitionTime: "2022-05-27T10:45:32Z"
      reason: AsExpected
      status: "False"
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T10:52:50Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"
      status: "False"
      type: CertRotation_CSRSigningCert_Degraded
    - lastTransitionTime: "2022-05-27T08:02:26Z"
      status: "False"
      type: ResourceSyncControllerDegraded
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: Upgradeable
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: CloudControllerOwner
    - lastTransitionTime: "2022-05-26T08:12:20Z"
      status: "False"
      type: TargetConfigControllerDegraded
    latestAvailableRevision: 28
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-0
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-1
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-2
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

After the user updates the profile to LowUpdateSlowReaction:
```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: LowUpdateSlowReaction
$ oc get KubeControllerManager -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: KubeControllerManager
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
      release.openshift.io/create-only: "true"
    creationTimestamp: "2022-05-26T07:56:51Z"
    generation: 11
    name: cluster
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 301ae517-73af-4a49-a197-28028d6c4761
    resourceVersion: "614798"
    uid: d2af356f-2223-4773-ac13-09ef6fe5e5b3
  spec:
    logLevel: Normal
    managementState: Managed
    observedConfig:
      extendedArguments:
        cloud-config:
        - /etc/kubernetes/static-pod-resources/configmaps/cloud-config/cloud.conf
        cloud-provider:
        - gce
        cluster-cidr:
        - 10.128.0.0/14
        cluster-name:
        - swghosh-20220526-361d-m9ztz
        feature-gates:
        - APIPriorityAndFairness=true
        - RotateKubeletServerCertificate=true
        - DownwardAPIHugePages=true
        - PodSecurity=true
        - CSIMigrationAWS=false
        - CSIMigrationGCE=false
        - CSIMigrationAzureFile=false
        - CSIMigrationvSphere=false
        node-monitor-grace-period:
        - 40s
        service-cluster-ip-range:
        - 172.30.0.0/16
      serviceServingCert:
        certFile: /etc/kubernetes/static-pod-resources/configmaps/service-ca/ca-bundle.crt
      servingInfo:
        cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
        minTLSVersion: VersionTLS12
    operatorLogLevel: Normal
    unsupportedConfigOverrides: null
    useMoreSecureServiceCA: true
  status:
    conditions:
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: NoUnsupportedConfigOverrides
      status: "True"
      type: UnsupportedConfigOverridesUpgradeable
    - lastTransitionTime: "2022-05-26T09:24:29Z"
      status: "False"
      type: InstallerControllerDegraded
    - lastTransitionTime: "2022-05-26T08:05:22Z"
      message: 3 nodes are active; 3 nodes are at revision 28
      status: "True"
      type: StaticPodsAvailable
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: 3 nodes are at revision 28
      reason: AllNodesAtLatestRevision
      status: "False"
      type: NodeInstallerProgressing
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      status: "False"
      type: NodeInstallerDegraded
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      message: All master nodes are ready
      reason: MasterNodesReady
      status: "False"
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      reason: ProfileUpdateProhibited
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: rejected update from "Default" to "LowUpdateSlowReaction" latency profile
        as extreme profile transition is unsupported
      reason: ProfileUpdateProhibited
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
    - lastTransitionTime: "2022-05-27T10:43:42Z"
      status: "False"
      type: RevisionControllerDegraded
    - lastTransitionTime: "2022-05-26T08:09:52Z"
      status: "False"
      type: ConfigObservationDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodPendingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodContainerWaitingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodNetworkingDegraded
    - lastTransitionTime: "2022-05-26T08:11:52Z"
      reason: AsExpected
      status: "False"
      type: BackingResourceControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:39Z"
      status: "False"
      type: SATokenSignerDegraded
    - lastTransitionTime: "2022-05-26T08:12:00Z"
      reason: AsExpected
      status: "False"
      type: KubeControllerManagerStaticResourcesDegraded
    - lastTransitionTime: "2022-05-27T10:45:32Z"
      reason: AsExpected
      status: "False"
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T10:52:50Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"
      status: "False"
      type: CertRotation_CSRSigningCert_Degraded
    - lastTransitionTime: "2022-05-27T08:02:26Z"
      status: "False"
      type: ResourceSyncControllerDegraded
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: Upgradeable
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: CloudControllerOwner
    - lastTransitionTime: "2022-05-26T08:12:20Z"
      status: "False"
      type: TargetConfigControllerDegraded
    latestAvailableRevision: 28
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-0
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-1
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-2
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

As evident from the status, the "WorkerLatencyProfileComplete" condition now carries the message: 'rejected update from "Default" to "LowUpdateSlowReaction" latency profile as extreme profile transition is unsupported'.
Also, observedConfig.extendedArguments.node-monitor-grace-period is still 40s, indicating that the arguments for the Default profile were retained because the update to the LowUpdateSlowReaction profile was rejected. The condition message can also be extracted directly, as shown below.
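
A minimal sketch for reading the relevant condition without dumping the whole object; the jsonpath expression is illustrative and not taken from this report:

```
# Print the message of the WorkerLatencyProfileComplete condition.
$ oc get kubecontrollermanager cluster \
    -o jsonpath='{.status.conditions[?(@.type=="WorkerLatencyProfileComplete")].message}'
```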

[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/629

Comment 2 Swarup Ghosh 2022-05-27 11:16:05 UTC
Similarly, for the reverse scenario, where the profile is updated from LowUpdateSlowReaction to Default.

While at the LowUpdateSlowReaction profile:
```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: LowUpdateSlowReaction
$ oc get KubeControllerManager -o yaml | grep -i latencyprofile -A 4 -B 4
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      reason: ProfileUpdated
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      message: all static pod revision(s) have updated latency profile
      reason: ProfileUpdated
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
--
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T05:51:26Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"
$ oc get KubeControllerManager -o yaml | grep -i node-monitor-grace-period -A 1
        node-monitor-grace-period:
        - 5m0s
```

After applying the Default profile, with the above fix [1]:
```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: Default
$ oc get KubeControllerManager -o yaml | grep -i latencyprofile -A 4 -B 4
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      reason: ProfileUpdateProhibited
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      message: rejected update from "LowUpdateSlowReaction" to "Default" latency profile
        as extreme profile transition is unsupported
      reason: ProfileUpdateProhibited
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
--
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T05:51:26Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"
$ oc get KubeControllerManager -o yaml | grep -i node-monitor-grace-period -A 1
        node-monitor-grace-period:
        - 5m0s
```

The status shows the expected outcome: 'rejected update from "LowUpdateSlowReaction" to "Default" latency profile as extreme profile transition is unsupported'. In addition, the value of node-monitor-grace-period remains unchanged at 5m0s, which corresponds to the LowUpdateSlowReaction profile, confirming that the update to Default was rejected.

[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/629
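
For completeness, the supported way to move between the two extreme profiles is to step through MediumUpdateAverageReaction, per the enhancement. A minimal sketch follows; the commands are illustrative (not taken from this report) and each step assumes the previous rollout has been allowed to complete:

```
# Step 1: Default -> MediumUpdateAverageReaction
$ oc patch nodes.config cluster --type merge \
    -p '{"spec": {"workerLatencyProfile": "MediumUpdateAverageReaction"}}'

# Wait for WorkerLatencyProfileComplete to report ProfileUpdated, then:

# Step 2: MediumUpdateAverageReaction -> LowUpdateSlowReaction
$ oc patch nodes.config cluster --type merge \
    -p '{"spec": {"workerLatencyProfile": "LowUpdateSlowReaction"}}'
```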

Comment 6 Weinan Liu 2022-06-28 11:06:59 UTC
```
$ oc edit nodes.config/cluster
error: nodes.config.openshift.io "cluster" is invalid
A copy of your changes has been stored to "/tmp/oc-edit-3314310019.yaml"
error: Edit cancelled, no valid changes were saved
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-132614   True        False         8h      Cluster version is 4.11.0-0.nightly-2022-06-25-132614
```
Verified
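
For re-verification without the interactive editor, the prohibited transition can also be requested with a patch; given the oc edit result above, the change is expected to be rejected. The command below is an assumption and is not part of the original verification:

```
# Expected to be refused on a fixed 4.11 cluster (extreme profile transition).
$ oc patch nodes.config cluster --type merge \
    -p '{"spec": {"workerLatencyProfile": "LowUpdateSlowReaction"}}'
```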

Comment 7 errata-xmlrpc 2022-08-10 11:12:53 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069