During testing with the latest nightly, it appears that the Kube Controller Manager Operator is able to apply the LowUpdateSlowReaction profile even though the cluster is at the Default WorkerLatencyProfile. This violates the cluster stability analysis [1]. Transitions from Default to LowUpdateSlowReaction, and vice versa, should be prohibited.

[1] https://github.com/openshift/enhancements/blob/master/enhancements/worker-latency-profile/worker-latency-profile.md#default---lowupdateslowreaction
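The intended rule can be sketched as a small validation helper. This is an illustrative Python sketch only (the operator itself is written in Go); the names and the transition table are hypothetical, derived from the enhancement's rule that only one-step transitions are allowed:

```python
# Hypothetical sketch of the intended worker latency profile transition rule.
# Only adjacent transitions are permitted; the "extreme" jump
# Default <-> LowUpdateSlowReaction is rejected in both directions.
ALLOWED_TRANSITIONS = {
    ("Default", "MediumUpdateAverageReaction"),
    ("MediumUpdateAverageReaction", "Default"),
    ("MediumUpdateAverageReaction", "LowUpdateSlowReaction"),
    ("LowUpdateSlowReaction", "MediumUpdateAverageReaction"),
}

def is_transition_allowed(current: str, desired: str) -> bool:
    """Return True if moving from `current` to `desired` is a supported step."""
    if current == desired:
        return True  # no-op update is always fine
    return (current, desired) in ALLOWED_TRANSITIONS

print(is_transition_allowed("Default", "LowUpdateSlowReaction"))        # False
print(is_transition_allowed("LowUpdateSlowReaction", "Default"))        # False
print(is_transition_allowed("Default", "MediumUpdateAverageReaction"))  # True
```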
The proposed fix in the linked PR [1] rejects the update when a user tries to move from the Default to the LowUpdateSlowReaction profile.

Initially, while at the Default profile:

```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: Default

$ oc get KubeControllerManager -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: KubeControllerManager
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
      release.openshift.io/create-only: "true"
    creationTimestamp: "2022-05-26T07:56:51Z"
    generation: 11
    name: cluster
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 301ae517-73af-4a49-a197-28028d6c4761
    resourceVersion: "614299"
    uid: d2af356f-2223-4773-ac13-09ef6fe5e5b3
  spec:
    logLevel: Normal
    managementState: Managed
    observedConfig:
      extendedArguments:
        cloud-config:
        - /etc/kubernetes/static-pod-resources/configmaps/cloud-config/cloud.conf
        cloud-provider:
        - gce
        cluster-cidr:
        - 10.128.0.0/14
        cluster-name:
        - swghosh-20220526-361d-m9ztz
        feature-gates:
        - APIPriorityAndFairness=true
        - RotateKubeletServerCertificate=true
        - DownwardAPIHugePages=true
        - PodSecurity=true
        - CSIMigrationAWS=false
        - CSIMigrationGCE=false
        - CSIMigrationAzureFile=false
        - CSIMigrationvSphere=false
        node-monitor-grace-period:
        - 40s
        service-cluster-ip-range:
        - 172.30.0.0/16
      serviceServingCert:
        certFile: /etc/kubernetes/static-pod-resources/configmaps/service-ca/ca-bundle.crt
      servingInfo:
        cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
        minTLSVersion: VersionTLS12
    operatorLogLevel: Normal
    unsupportedConfigOverrides: null
    useMoreSecureServiceCA: true
  status:
    conditions:
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: NoUnsupportedConfigOverrides
      status: "True"
      type: UnsupportedConfigOverridesUpgradeable
    - lastTransitionTime: "2022-05-26T09:24:29Z"
      status: "False"
      type: InstallerControllerDegraded
    - lastTransitionTime: "2022-05-26T08:05:22Z"
      message: 3 nodes are active; 3 nodes are at revision 28
      status: "True"
      type: StaticPodsAvailable
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: 3 nodes are at revision 28
      reason: AllNodesAtLatestRevision
      status: "False"
      type: NodeInstallerProgressing
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      status: "False"
      type: NodeInstallerDegraded
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      message: All master nodes are ready
      reason: MasterNodesReady
      status: "False"
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      reason: ProfileUpdated
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: all static pod revision(s) have updated latency profile
      reason: ProfileUpdated
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
    - lastTransitionTime: "2022-05-27T10:43:42Z"
      status: "False"
      type: RevisionControllerDegraded
    - lastTransitionTime: "2022-05-26T08:09:52Z"
      status: "False"
      type: ConfigObservationDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodPendingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodContainerWaitingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodNetworkingDegraded
    - lastTransitionTime: "2022-05-26T08:11:52Z"
      reason: AsExpected
      status: "False"
      type: BackingResourceControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:39Z"
      status: "False"
      type: SATokenSignerDegraded
    - lastTransitionTime: "2022-05-26T08:12:00Z"
      reason: AsExpected
      status: "False"
      type: KubeControllerManagerStaticResourcesDegraded
    - lastTransitionTime: "2022-05-27T10:45:32Z"
      reason: AsExpected
      status: "False"
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T10:52:50Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"
      status: "False"
      type: CertRotation_CSRSigningCert_Degraded
    - lastTransitionTime: "2022-05-27T08:02:26Z"
      status: "False"
      type: ResourceSyncControllerDegraded
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: Upgradeable
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: CloudControllerOwner
    - lastTransitionTime: "2022-05-26T08:12:20Z"
      status: "False"
      type: TargetConfigControllerDegraded
    latestAvailableRevision: 28
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-0
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-1
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-2
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

After the user updates the profile to LowUpdateSlowReaction:

```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: LowUpdateSlowReaction

$ oc get KubeControllerManager -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: KubeControllerManager
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
      release.openshift.io/create-only: "true"
    creationTimestamp: "2022-05-26T07:56:51Z"
    generation: 11
    name: cluster
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 301ae517-73af-4a49-a197-28028d6c4761
    resourceVersion: "614798"
    uid: d2af356f-2223-4773-ac13-09ef6fe5e5b3
  spec:
    logLevel: Normal
    managementState: Managed
    observedConfig:
      extendedArguments:
        cloud-config:
        - /etc/kubernetes/static-pod-resources/configmaps/cloud-config/cloud.conf
        cloud-provider:
        - gce
        cluster-cidr:
        - 10.128.0.0/14
        cluster-name:
        - swghosh-20220526-361d-m9ztz
        feature-gates:
        - APIPriorityAndFairness=true
        - RotateKubeletServerCertificate=true
        - DownwardAPIHugePages=true
        - PodSecurity=true
        - CSIMigrationAWS=false
        - CSIMigrationGCE=false
        - CSIMigrationAzureFile=false
        - CSIMigrationvSphere=false
        node-monitor-grace-period:
        - 40s
        service-cluster-ip-range:
        - 172.30.0.0/16
      serviceServingCert:
        certFile: /etc/kubernetes/static-pod-resources/configmaps/service-ca/ca-bundle.crt
      servingInfo:
        cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
        minTLSVersion: VersionTLS12
    operatorLogLevel: Normal
    unsupportedConfigOverrides: null
    useMoreSecureServiceCA: true
  status:
    conditions:
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: NoUnsupportedConfigOverrides
      status: "True"
      type: UnsupportedConfigOverridesUpgradeable
    - lastTransitionTime: "2022-05-26T09:24:29Z"
      status: "False"
      type: InstallerControllerDegraded
    - lastTransitionTime: "2022-05-26T08:05:22Z"
      message: 3 nodes are active; 3 nodes are at revision 28
      status: "True"
      type: StaticPodsAvailable
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: 3 nodes are at revision 28
      reason: AllNodesAtLatestRevision
      status: "False"
      type: NodeInstallerProgressing
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      status: "False"
      type: NodeInstallerDegraded
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      message: All master nodes are ready
      reason: MasterNodesReady
      status: "False"
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      reason: ProfileUpdateProhibited
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: rejected update from "Default" to "LowUpdateSlowReaction" latency profile as extreme profile transition is unsupported
      reason: ProfileUpdateProhibited
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
    - lastTransitionTime: "2022-05-27T10:43:42Z"
      status: "False"
      type: RevisionControllerDegraded
    - lastTransitionTime: "2022-05-26T08:09:52Z"
      status: "False"
      type: ConfigObservationDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodPendingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodContainerWaitingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodNetworkingDegraded
    - lastTransitionTime: "2022-05-26T08:11:52Z"
      reason: AsExpected
      status: "False"
      type: BackingResourceControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:39Z"
      status: "False"
      type: SATokenSignerDegraded
    - lastTransitionTime: "2022-05-26T08:12:00Z"
      reason: AsExpected
      status: "False"
      type: KubeControllerManagerStaticResourcesDegraded
    - lastTransitionTime: "2022-05-27T10:45:32Z"
      reason: AsExpected
      status: "False"
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T10:52:50Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"
      status: "False"
      type: CertRotation_CSRSigningCert_Degraded
    - lastTransitionTime: "2022-05-27T08:02:26Z"
      status: "False"
      type: ResourceSyncControllerDegraded
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: Upgradeable
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: CloudControllerOwner
    - lastTransitionTime: "2022-05-26T08:12:20Z"
      status: "False"
      type: TargetConfigControllerDegraded
    latestAvailableRevision: 28
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-0
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-1
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-2
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

As evident from the status, the "WorkerLatencyProfileComplete" condition carries the message 'rejected update from "Default" to "LowUpdateSlowReaction" latency profile as extreme profile transition is unsupported'. Also, observedConfig.extendedArguments.node-monitor-grace-period is still at 40s, indicating that the arguments for the Default profile were retained because the update to the LowUpdateSlowReaction profile was rejected.

[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/629
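The node-monitor-grace-period value is what ties each observation above to a profile. A minimal sketch of that mapping: the Default (40s) and LowUpdateSlowReaction (5m0s) values appear in this report's output, while the MediumUpdateAverageReaction value (2m) is taken from the enhancement and should be treated as an assumption here; the helper name is hypothetical.

```python
# kube-controller-manager node-monitor-grace-period per worker latency profile.
# "Default" and "LowUpdateSlowReaction" are observed in this bug report;
# "MediumUpdateAverageReaction" (2m) is an assumption from the enhancement doc.
NODE_MONITOR_GRACE_PERIOD = {
    "Default": "40s",
    "MediumUpdateAverageReaction": "2m",
    "LowUpdateSlowReaction": "5m0s",
}

def observed_grace_period(observed_config: dict) -> str:
    """Read node-monitor-grace-period from observedConfig.extendedArguments."""
    return observed_config["extendedArguments"]["node-monitor-grace-period"][0]

# A rejected Default -> LowUpdateSlowReaction update leaves the Default args:
observed = {"extendedArguments": {"node-monitor-grace-period": ["40s"]}}
print(observed_grace_period(observed) == NODE_MONITOR_GRACE_PERIOD["Default"])  # True
```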
Similarly, for the scenario where the profile is being updated from LowUpdateSlowReaction to Default. While at the LowUpdateSlowReaction profile:

```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: LowUpdateSlowReaction

$ oc get KubeControllerManager -o yaml | grep -i latencyprofile -A 4 -B 4
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      reason: ProfileUpdated
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      message: all static pod revision(s) have updated latency profile
      reason: ProfileUpdated
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
--
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T05:51:26Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"

$ oc get KubeControllerManager -o yaml | grep -i node-monitor-grace-period -A 1
        node-monitor-grace-period:
        - 5m0s
```

After applying Default, with the above fix [1]:

```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: Default

$ oc get KubeControllerManager -o yaml | grep -i latencyprofile -A 4 -B 4
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      reason: ProfileUpdateProhibited
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      message: rejected update from "LowUpdateSlowReaction" to "Default" latency profile as extreme profile transition is unsupported
      reason: ProfileUpdateProhibited
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
--
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T05:51:26Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"

$ oc get KubeControllerManager -o yaml | grep -i node-monitor-grace-period -A 1
        node-monitor-grace-period:
        - 5m0s
```

The status shows the expected outcome: 'rejected update from "LowUpdateSlowReaction" to "Default" latency profile as extreme profile transition is unsupported'. In addition, the value of node-monitor-grace-period remains unchanged at 5m0s, which corresponds to the LowUpdateSlowReaction profile (since the update to Default was rejected).

[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/629
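Checks like the one above can be scripted against the operator's status conditions. A minimal illustrative sketch (the helper is hypothetical, and the condition dicts are inlined here instead of being fetched via `oc`):

```python
# Minimal sketch: detect the rejection from KubeControllerManager status
# conditions, mirroring the fields shown in the output above.
def find_condition(conditions, cond_type):
    """Return the first condition dict with the given type, or None."""
    for c in conditions:
        if c["type"] == cond_type:
            return c
    return None

conditions = [
    {"type": "WorkerLatencyProfileProgressing",
     "status": "False",
     "reason": "ProfileUpdateProhibited"},
    {"type": "WorkerLatencyProfileComplete",
     "status": "True",
     "reason": "ProfileUpdateProhibited",
     "message": ('rejected update from "LowUpdateSlowReaction" to "Default" '
                 "latency profile as extreme profile transition is unsupported")},
]

c = find_condition(conditions, "WorkerLatencyProfileComplete")
rejected = c is not None and c["reason"] == "ProfileUpdateProhibited"
print(rejected)  # True
```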
```
$ oc edit nodes.config/cluster
error: nodes.config.openshift.io "cluster" is invalid
A copy of your changes has been stored to "/tmp/oc-edit-3314310019.yaml"
error: Edit cancelled, no valid changes were saved

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-132614   True        False         8h      Cluster version is 4.11.0-0.nightly-2022-06-25-132614
```

Verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069