Bug 2087684
Summary: | KCMO should not be able to apply LowUpdateSlowReaction from Default WorkerLatencyProfile | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Harshal Patil <harpatil> |
Component: | kube-controller-manager | Assignee: | Swarup Ghosh <swghosh> |
Status: | CLOSED ERRATA | QA Contact: | Weinan Liu <weinliu> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4.11 | CC: | fkrepins, mfojtik, nagrawal, rphillips, weinliu |
Target Milestone: | --- | ||
Target Release: | 4.11.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Last Closed: | 2022-08-10 11:12:53 UTC | Type: | Bug |
Description
Harshal Patil
2022-05-18 08:53:02 UTC
The proposed fix in the linked PR [1] rejects an attempt to update directly from the Default profile to the LowUpdateSlowReaction profile.

Initially, while at the Default profile:

```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: Default

$ oc get KubeControllerManager -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: KubeControllerManager
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
      release.openshift.io/create-only: "true"
    creationTimestamp: "2022-05-26T07:56:51Z"
    generation: 11
    name: cluster
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 301ae517-73af-4a49-a197-28028d6c4761
    resourceVersion: "614299"
    uid: d2af356f-2223-4773-ac13-09ef6fe5e5b3
  spec:
    logLevel: Normal
    managementState: Managed
    observedConfig:
      extendedArguments:
        cloud-config:
        - /etc/kubernetes/static-pod-resources/configmaps/cloud-config/cloud.conf
        cloud-provider:
        - gce
        cluster-cidr:
        - 10.128.0.0/14
        cluster-name:
        - swghosh-20220526-361d-m9ztz
        feature-gates:
        - APIPriorityAndFairness=true
        - RotateKubeletServerCertificate=true
        - DownwardAPIHugePages=true
        - PodSecurity=true
        - CSIMigrationAWS=false
        - CSIMigrationGCE=false
        - CSIMigrationAzureFile=false
        - CSIMigrationvSphere=false
        node-monitor-grace-period:
        - 40s
        service-cluster-ip-range:
        - 172.30.0.0/16
      serviceServingCert:
        certFile: /etc/kubernetes/static-pod-resources/configmaps/service-ca/ca-bundle.crt
      servingInfo:
        cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
        minTLSVersion: VersionTLS12
    operatorLogLevel: Normal
    unsupportedConfigOverrides: null
    useMoreSecureServiceCA: true
  status:
    conditions:
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: NoUnsupportedConfigOverrides
      status: "True"
      type: UnsupportedConfigOverridesUpgradeable
    - lastTransitionTime: "2022-05-26T09:24:29Z"
      status: "False"
      type: InstallerControllerDegraded
    - lastTransitionTime: "2022-05-26T08:05:22Z"
      message: 3 nodes are active; 3 nodes are at revision 28
      status: "True"
      type: StaticPodsAvailable
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: 3 nodes are at revision 28
      reason: AllNodesAtLatestRevision
      status: "False"
      type: NodeInstallerProgressing
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      status: "False"
      type: NodeInstallerDegraded
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      message: All master nodes are ready
      reason: MasterNodesReady
      status: "False"
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      reason: ProfileUpdated
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: all static pod revision(s) have updated latency profile
      reason: ProfileUpdated
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
    - lastTransitionTime: "2022-05-27T10:43:42Z"
      status: "False"
      type: RevisionControllerDegraded
    - lastTransitionTime: "2022-05-26T08:09:52Z"
      status: "False"
      type: ConfigObservationDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodPendingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodContainerWaitingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodNetworkingDegraded
    - lastTransitionTime: "2022-05-26T08:11:52Z"
      reason: AsExpected
      status: "False"
      type: BackingResourceControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:39Z"
      status: "False"
      type: SATokenSignerDegraded
    - lastTransitionTime: "2022-05-26T08:12:00Z"
      reason: AsExpected
      status: "False"
      type: KubeControllerManagerStaticResourcesDegraded
    - lastTransitionTime: "2022-05-27T10:45:32Z"
      reason: AsExpected
      status: "False"
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T10:52:50Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"
      status: "False"
      type: CertRotation_CSRSigningCert_Degraded
    - lastTransitionTime: "2022-05-27T08:02:26Z"
      status: "False"
      type: ResourceSyncControllerDegraded
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: Upgradeable
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: CloudControllerOwner
    - lastTransitionTime: "2022-05-26T08:12:20Z"
      status: "False"
      type: TargetConfigControllerDegraded
    latestAvailableRevision: 28
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-0
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-1
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-2
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

After the user updates the profile to LowUpdateSlowReaction:

```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: LowUpdateSlowReaction

$ oc get KubeControllerManager -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
  kind: KubeControllerManager
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
      include.release.openshift.io/single-node-developer: "true"
      release.openshift.io/create-only: "true"
    creationTimestamp: "2022-05-26T07:56:51Z"
    generation: 11
    name: cluster
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 301ae517-73af-4a49-a197-28028d6c4761
    resourceVersion: "614798"
    uid: d2af356f-2223-4773-ac13-09ef6fe5e5b3
  spec:
    logLevel: Normal
    managementState: Managed
    observedConfig:
      extendedArguments:
        cloud-config:
        - /etc/kubernetes/static-pod-resources/configmaps/cloud-config/cloud.conf
        cloud-provider:
        - gce
        cluster-cidr:
        - 10.128.0.0/14
        cluster-name:
        - swghosh-20220526-361d-m9ztz
        feature-gates:
        - APIPriorityAndFairness=true
        - RotateKubeletServerCertificate=true
        - DownwardAPIHugePages=true
        - PodSecurity=true
        - CSIMigrationAWS=false
        - CSIMigrationGCE=false
        - CSIMigrationAzureFile=false
        - CSIMigrationvSphere=false
        node-monitor-grace-period:
        - 40s
        service-cluster-ip-range:
        - 172.30.0.0/16
      serviceServingCert:
        certFile: /etc/kubernetes/static-pod-resources/configmaps/service-ca/ca-bundle.crt
      servingInfo:
        cipherSuites:
        - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        - TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
        - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
        - TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
        minTLSVersion: VersionTLS12
    operatorLogLevel: Normal
    unsupportedConfigOverrides: null
    useMoreSecureServiceCA: true
  status:
    conditions:
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: NoUnsupportedConfigOverrides
      status: "True"
      type: UnsupportedConfigOverridesUpgradeable
    - lastTransitionTime: "2022-05-26T09:24:29Z"
      status: "False"
      type: InstallerControllerDegraded
    - lastTransitionTime: "2022-05-26T08:05:22Z"
      message: 3 nodes are active; 3 nodes are at revision 28
      status: "True"
      type: StaticPodsAvailable
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: 3 nodes are at revision 28
      reason: AllNodesAtLatestRevision
      status: "False"
      type: NodeInstallerProgressing
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      status: "False"
      type: NodeInstallerDegraded
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      message: All master nodes are ready
      reason: MasterNodesReady
      status: "False"
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      reason: ProfileUpdateProhibited
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:55:16Z"
      message: rejected update from "Default" to "LowUpdateSlowReaction" latency profile as extreme profile transition is unsupported
      reason: ProfileUpdateProhibited
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
    - lastTransitionTime: "2022-05-27T10:43:42Z"
      status: "False"
      type: RevisionControllerDegraded
    - lastTransitionTime: "2022-05-26T08:09:52Z"
      status: "False"
      type: ConfigObservationDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodPendingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodContainerWaitingDegraded
    - lastTransitionTime: "2022-05-26T08:01:32Z"
      status: "False"
      type: InstallerPodNetworkingDegraded
    - lastTransitionTime: "2022-05-26T08:11:52Z"
      reason: AsExpected
      status: "False"
      type: BackingResourceControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:39Z"
      status: "False"
      type: SATokenSignerDegraded
    - lastTransitionTime: "2022-05-26T08:12:00Z"
      reason: AsExpected
      status: "False"
      type: KubeControllerManagerStaticResourcesDegraded
    - lastTransitionTime: "2022-05-27T10:45:32Z"
      reason: AsExpected
      status: "False"
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T10:52:50Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"
      status: "False"
      type: CertRotation_CSRSigningCert_Degraded
    - lastTransitionTime: "2022-05-27T08:02:26Z"
      status: "False"
      type: ResourceSyncControllerDegraded
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: Upgradeable
    - lastTransitionTime: "2022-05-26T08:02:06Z"
      status: "True"
      type: CloudControllerOwner
    - lastTransitionTime: "2022-05-26T08:12:20Z"
      status: "False"
      type: TargetConfigControllerDegraded
    latestAvailableRevision: 28
    latestAvailableRevisionReason: ""
    nodeStatuses:
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-0
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-1
    - currentRevision: 28
      nodeName: swghosh-20220526-361d-m9ztz-master-2
    readyReplicas: 0
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

As evident from the status, the "WorkerLatencyProfileComplete" condition shows the message: 'rejected update from "Default" to "LowUpdateSlowReaction" latency profile as extreme profile transition is unsupported'. Also, observedConfig.extendedArguments.node-monitor-grace-period is still at 40s, indicating that the arguments for the Default profile were retained because the update to the LowUpdateSlowReaction profile was rejected.

The same applies to the scenario where the profile is updated from LowUpdateSlowReaction to Default. While in the LowUpdateSlowReaction profile:

```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: LowUpdateSlowReaction

$ oc get KubeControllerManager -o yaml | grep -i latencyprofile -A 4 -B 4
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      reason: ProfileUpdated
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      message: all static pod revision(s) have updated latency profile
      reason: ProfileUpdated
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
--
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T05:51:26Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"

$ oc get KubeControllerManager -o yaml | grep -i node-monitor-grace-period -A 1
        node-monitor-grace-period:
        - 5m0s
```

After applying Default, with the above fix [1]:

```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
    workerLatencyProfile: Default

$ oc get KubeControllerManager -o yaml | grep -i latencyprofile -A 4 -B 4
      type: NodeControllerDegraded
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      reason: ProfileUpdateProhibited
      status: "False"
      type: WorkerLatencyProfileProgressing
    - lastTransitionTime: "2022-05-27T10:19:26Z"
      message: rejected update from "LowUpdateSlowReaction" to "Default" latency profile as extreme profile transition is unsupported
      reason: ProfileUpdateProhibited
      status: "True"
      type: WorkerLatencyProfileComplete
    - lastTransitionTime: "2022-05-26T08:01:23Z"
      reason: AsExpected
      status: "False"
      type: MissingStaticPodControllerDegraded
--
      type: GuardControllerDegraded
    - lastTransitionTime: "2022-05-26T08:01:44Z"
      reason: AsExpected
      status: "False"
      type: WorkerLatencyProfileDegraded
    - lastTransitionTime: "2022-05-27T05:51:26Z"
      status: "False"
      type: StaticPodsDegraded
    - lastTransitionTime: "2022-05-27T07:42:12Z"

$ oc get KubeControllerManager -o yaml | grep -i node-monitor-grace-period -A 1
        node-monitor-grace-period:
        - 5m0s
```

The status shows the expected outcome: 'rejected update from "LowUpdateSlowReaction" to "Default" latency profile as extreme profile transition is unsupported'. The value of node-monitor-grace-period also remains unchanged at 5m0s, which corresponds to the LowUpdateSlowReaction profile (as the update to Default was rejected).
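For anyone re-checking this behaviour, the rule and the status check described above can be sketched roughly as follows. This is only an illustration in Python, not the operator's actual code (the real change is in the PR [1]). The function names are hypothetical. The set of prohibited transitions is assumed to be exactly the two pairs shown in this report, and transitions involving MediumUpdateAverageReaction are assumed to be allowed. `kcm` is expected to be a single KubeControllerManager item, for example one entry of `items` parsed from `oc get KubeControllerManager -o json`.

```python
from typing import Optional

# Profile names as they appear in this report.
DEFAULT = "Default"
MEDIUM = "MediumUpdateAverageReaction"
LOW = "LowUpdateSlowReaction"

# Direct transitions that this report shows being rejected.
# Assumption: these two pairs are the complete prohibited set.
PROHIBITED = {(DEFAULT, LOW), (LOW, DEFAULT)}


def rejection_message(current: str, desired: str) -> Optional[str]:
    """Return the rejection message for a prohibited transition, else None."""
    if (current, desired) in PROHIBITED:
        return (
            f'rejected update from "{current}" to "{desired}" latency profile '
            "as extreme profile transition is unsupported"
        )
    return None


def find_condition(kcm: dict, cond_type: str) -> Optional[dict]:
    """Look up one entry of status.conditions by its type."""
    for cond in kcm.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


def update_was_rejected(kcm: dict) -> bool:
    """True if WorkerLatencyProfileComplete carries reason ProfileUpdateProhibited."""
    cond = find_condition(kcm, "WorkerLatencyProfileComplete")
    return cond is not None and cond.get("reason") == "ProfileUpdateProhibited"
```

Under these assumptions, `rejection_message("Default", "LowUpdateSlowReaction")` reproduces the message seen in the status above, while a step through MediumUpdateAverageReaction returns `None`.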
[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/629

$ oc edit nodes.config/cluster
error: nodes.config.openshift.io "cluster" is invalid
A copy of your changes has been stored to "/tmp/oc-edit-3314310019.yaml"
error: Edit cancelled, no valid changes were saved

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-132614   True        False         8h      Cluster version is 4.11.0-0.nightly-2022-06-25-132614

Verified

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069