Bug 2087684
| Summary: | KCMO should not be able to apply LowUpdateSlowReaction from Default WorkerLatencyProfile | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Harshal Patil <harpatil> |
| Component: | kube-controller-manager | Assignee: | Swarup Ghosh <swghosh> |
| Status: | CLOSED ERRATA | QA Contact: | Weinan Liu <weinliu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.11 | CC: | fkrepins, mfojtik, nagrawal, rphillips, weinliu |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-08-10 11:12:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Harshal Patil
2022-05-18 08:53:02 UTC
The proposed fix in the linked PR [1] rejects an attempt to update directly from the Default profile to the LowUpdateSlowReaction profile.
Initially, with the Default profile applied:
```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
workerLatencyProfile: Default
$ oc get KubeControllerManager -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
kind: KubeControllerManager
metadata:
annotations:
include.release.openshift.io/ibm-cloud-managed: "true"
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
release.openshift.io/create-only: "true"
creationTimestamp: "2022-05-26T07:56:51Z"
generation: 11
name: cluster
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 301ae517-73af-4a49-a197-28028d6c4761
resourceVersion: "614299"
uid: d2af356f-2223-4773-ac13-09ef6fe5e5b3
spec:
logLevel: Normal
managementState: Managed
observedConfig:
extendedArguments:
cloud-config:
- /etc/kubernetes/static-pod-resources/configmaps/cloud-config/cloud.conf
cloud-provider:
- gce
cluster-cidr:
- 10.128.0.0/14
cluster-name:
- swghosh-20220526-361d-m9ztz
feature-gates:
- APIPriorityAndFairness=true
- RotateKubeletServerCertificate=true
- DownwardAPIHugePages=true
- PodSecurity=true
- CSIMigrationAWS=false
- CSIMigrationGCE=false
- CSIMigrationAzureFile=false
- CSIMigrationvSphere=false
node-monitor-grace-period:
- 40s
service-cluster-ip-range:
- 172.30.0.0/16
serviceServingCert:
certFile: /etc/kubernetes/static-pod-resources/configmaps/service-ca/ca-bundle.crt
servingInfo:
cipherSuites:
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
- TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
minTLSVersion: VersionTLS12
operatorLogLevel: Normal
unsupportedConfigOverrides: null
useMoreSecureServiceCA: true
status:
conditions:
- lastTransitionTime: "2022-05-26T08:01:23Z"
reason: NoUnsupportedConfigOverrides
status: "True"
type: UnsupportedConfigOverridesUpgradeable
- lastTransitionTime: "2022-05-26T09:24:29Z"
status: "False"
type: InstallerControllerDegraded
- lastTransitionTime: "2022-05-26T08:05:22Z"
message: 3 nodes are active; 3 nodes are at revision 28
status: "True"
type: StaticPodsAvailable
- lastTransitionTime: "2022-05-27T10:55:16Z"
message: 3 nodes are at revision 28
reason: AllNodesAtLatestRevision
status: "False"
type: NodeInstallerProgressing
- lastTransitionTime: "2022-05-26T08:01:23Z"
status: "False"
type: NodeInstallerDegraded
- lastTransitionTime: "2022-05-26T08:01:23Z"
message: All master nodes are ready
reason: MasterNodesReady
status: "False"
type: NodeControllerDegraded
- lastTransitionTime: "2022-05-27T10:55:16Z"
reason: ProfileUpdated
status: "False"
type: WorkerLatencyProfileProgressing
- lastTransitionTime: "2022-05-27T10:55:16Z"
message: all static pod revision(s) have updated latency profile
reason: ProfileUpdated
status: "True"
type: WorkerLatencyProfileComplete
- lastTransitionTime: "2022-05-26T08:01:23Z"
reason: AsExpected
status: "False"
type: MissingStaticPodControllerDegraded
- lastTransitionTime: "2022-05-27T10:43:42Z"
status: "False"
type: RevisionControllerDegraded
- lastTransitionTime: "2022-05-26T08:09:52Z"
status: "False"
type: ConfigObservationDegraded
- lastTransitionTime: "2022-05-26T08:01:32Z"
status: "False"
type: InstallerPodPendingDegraded
- lastTransitionTime: "2022-05-26T08:01:32Z"
status: "False"
type: InstallerPodContainerWaitingDegraded
- lastTransitionTime: "2022-05-26T08:01:32Z"
status: "False"
type: InstallerPodNetworkingDegraded
- lastTransitionTime: "2022-05-26T08:11:52Z"
reason: AsExpected
status: "False"
type: BackingResourceControllerDegraded
- lastTransitionTime: "2022-05-26T08:01:39Z"
status: "False"
type: SATokenSignerDegraded
- lastTransitionTime: "2022-05-26T08:12:00Z"
reason: AsExpected
status: "False"
type: KubeControllerManagerStaticResourcesDegraded
- lastTransitionTime: "2022-05-27T10:45:32Z"
reason: AsExpected
status: "False"
type: GuardControllerDegraded
- lastTransitionTime: "2022-05-26T08:01:44Z"
reason: AsExpected
status: "False"
type: WorkerLatencyProfileDegraded
- lastTransitionTime: "2022-05-27T10:52:50Z"
status: "False"
type: StaticPodsDegraded
- lastTransitionTime: "2022-05-27T07:42:12Z"
status: "False"
type: CertRotation_CSRSigningCert_Degraded
- lastTransitionTime: "2022-05-27T08:02:26Z"
status: "False"
type: ResourceSyncControllerDegraded
- lastTransitionTime: "2022-05-26T08:02:06Z"
status: "True"
type: Upgradeable
- lastTransitionTime: "2022-05-26T08:02:06Z"
status: "True"
type: CloudControllerOwner
- lastTransitionTime: "2022-05-26T08:12:20Z"
status: "False"
type: TargetConfigControllerDegraded
latestAvailableRevision: 28
latestAvailableRevisionReason: ""
nodeStatuses:
- currentRevision: 28
nodeName: swghosh-20220526-361d-m9ztz-master-0
- currentRevision: 28
nodeName: swghosh-20220526-361d-m9ztz-master-1
- currentRevision: 28
nodeName: swghosh-20220526-361d-m9ztz-master-2
readyReplicas: 0
kind: List
metadata:
resourceVersion: ""
selfLink: ""
```
After the user updates the profile to LowUpdateSlowReaction:
```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
workerLatencyProfile: LowUpdateSlowReaction
$ oc get KubeControllerManager -o yaml
apiVersion: v1
items:
- apiVersion: operator.openshift.io/v1
kind: KubeControllerManager
metadata:
annotations:
include.release.openshift.io/ibm-cloud-managed: "true"
include.release.openshift.io/self-managed-high-availability: "true"
include.release.openshift.io/single-node-developer: "true"
release.openshift.io/create-only: "true"
creationTimestamp: "2022-05-26T07:56:51Z"
generation: 11
name: cluster
ownerReferences:
- apiVersion: config.openshift.io/v1
kind: ClusterVersion
name: version
uid: 301ae517-73af-4a49-a197-28028d6c4761
resourceVersion: "614798"
uid: d2af356f-2223-4773-ac13-09ef6fe5e5b3
spec:
logLevel: Normal
managementState: Managed
observedConfig:
extendedArguments:
cloud-config:
- /etc/kubernetes/static-pod-resources/configmaps/cloud-config/cloud.conf
cloud-provider:
- gce
cluster-cidr:
- 10.128.0.0/14
cluster-name:
- swghosh-20220526-361d-m9ztz
feature-gates:
- APIPriorityAndFairness=true
- RotateKubeletServerCertificate=true
- DownwardAPIHugePages=true
- PodSecurity=true
- CSIMigrationAWS=false
- CSIMigrationGCE=false
- CSIMigrationAzureFile=false
- CSIMigrationvSphere=false
node-monitor-grace-period:
- 40s
service-cluster-ip-range:
- 172.30.0.0/16
serviceServingCert:
certFile: /etc/kubernetes/static-pod-resources/configmaps/service-ca/ca-bundle.crt
servingInfo:
cipherSuites:
- TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
- TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
- TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256
- TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
minTLSVersion: VersionTLS12
operatorLogLevel: Normal
unsupportedConfigOverrides: null
useMoreSecureServiceCA: true
status:
conditions:
- lastTransitionTime: "2022-05-26T08:01:23Z"
reason: NoUnsupportedConfigOverrides
status: "True"
type: UnsupportedConfigOverridesUpgradeable
- lastTransitionTime: "2022-05-26T09:24:29Z"
status: "False"
type: InstallerControllerDegraded
- lastTransitionTime: "2022-05-26T08:05:22Z"
message: 3 nodes are active; 3 nodes are at revision 28
status: "True"
type: StaticPodsAvailable
- lastTransitionTime: "2022-05-27T10:55:16Z"
message: 3 nodes are at revision 28
reason: AllNodesAtLatestRevision
status: "False"
type: NodeInstallerProgressing
- lastTransitionTime: "2022-05-26T08:01:23Z"
status: "False"
type: NodeInstallerDegraded
- lastTransitionTime: "2022-05-26T08:01:23Z"
message: All master nodes are ready
reason: MasterNodesReady
status: "False"
type: NodeControllerDegraded
- lastTransitionTime: "2022-05-27T10:55:16Z"
reason: ProfileUpdateProhibited
status: "False"
type: WorkerLatencyProfileProgressing
- lastTransitionTime: "2022-05-27T10:55:16Z"
message: rejected update from "Default" to "LowUpdateSlowReaction" latency profile
as extreme profile transition is unsupported
reason: ProfileUpdateProhibited
status: "True"
type: WorkerLatencyProfileComplete
- lastTransitionTime: "2022-05-26T08:01:23Z"
reason: AsExpected
status: "False"
type: MissingStaticPodControllerDegraded
- lastTransitionTime: "2022-05-27T10:43:42Z"
status: "False"
type: RevisionControllerDegraded
- lastTransitionTime: "2022-05-26T08:09:52Z"
status: "False"
type: ConfigObservationDegraded
- lastTransitionTime: "2022-05-26T08:01:32Z"
status: "False"
type: InstallerPodPendingDegraded
- lastTransitionTime: "2022-05-26T08:01:32Z"
status: "False"
type: InstallerPodContainerWaitingDegraded
- lastTransitionTime: "2022-05-26T08:01:32Z"
status: "False"
type: InstallerPodNetworkingDegraded
- lastTransitionTime: "2022-05-26T08:11:52Z"
reason: AsExpected
status: "False"
type: BackingResourceControllerDegraded
- lastTransitionTime: "2022-05-26T08:01:39Z"
status: "False"
type: SATokenSignerDegraded
- lastTransitionTime: "2022-05-26T08:12:00Z"
reason: AsExpected
status: "False"
type: KubeControllerManagerStaticResourcesDegraded
- lastTransitionTime: "2022-05-27T10:45:32Z"
reason: AsExpected
status: "False"
type: GuardControllerDegraded
- lastTransitionTime: "2022-05-26T08:01:44Z"
reason: AsExpected
status: "False"
type: WorkerLatencyProfileDegraded
- lastTransitionTime: "2022-05-27T10:52:50Z"
status: "False"
type: StaticPodsDegraded
- lastTransitionTime: "2022-05-27T07:42:12Z"
status: "False"
type: CertRotation_CSRSigningCert_Degraded
- lastTransitionTime: "2022-05-27T08:02:26Z"
status: "False"
type: ResourceSyncControllerDegraded
- lastTransitionTime: "2022-05-26T08:02:06Z"
status: "True"
type: Upgradeable
- lastTransitionTime: "2022-05-26T08:02:06Z"
status: "True"
type: CloudControllerOwner
- lastTransitionTime: "2022-05-26T08:12:20Z"
status: "False"
type: TargetConfigControllerDegraded
latestAvailableRevision: 28
latestAvailableRevisionReason: ""
nodeStatuses:
- currentRevision: 28
nodeName: swghosh-20220526-361d-m9ztz-master-0
- currentRevision: 28
nodeName: swghosh-20220526-361d-m9ztz-master-1
- currentRevision: 28
nodeName: swghosh-20220526-361d-m9ztz-master-2
readyReplicas: 0
kind: List
metadata:
resourceVersion: ""
selfLink: ""
```
As the status shows, the "WorkerLatencyProfileComplete" condition carries the message: 'rejected update from "Default" to "LowUpdateSlowReaction" latency profile as extreme profile transition is unsupported'.
Also, observedConfig.extendedArguments.node-monitor-grace-period is still 40s, which indicates that the arguments for the Default profile were retained because the update to the LowUpdateSlowReaction profile was rejected.
[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/629
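For illustration, the rule exercised above can be sketched as a small shell function. This is a hypothetical sketch based only on the messages shown in this report, not the operator's actual code (that lives in the PR [1]). The function names `is_extreme_transition` and `check_transition` are made up, and treating every other transition as allowed is an assumption.

```shell
# Hypothetical sketch of the transition rule seen in this report.
# Not the operator's real implementation.

# Succeeds (exit 0) when the update goes directly between Default and
# LowUpdateSlowReaction, in either direction.
is_extreme_transition() {
  case "$1:$2" in
    Default:LowUpdateSlowReaction|LowUpdateSlowReaction:Default) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the rejection message for an extreme transition, or "allowed"
# for anything else (assumption: all other transitions are accepted).
check_transition() {
  if is_extreme_transition "$1" "$2"; then
    printf 'rejected update from "%s" to "%s" latency profile as extreme profile transition is unsupported\n' "$1" "$2"
  else
    printf 'allowed\n'
  fi
}

check_transition Default LowUpdateSlowReaction
check_transition Default MediumUpdateAverageReaction
```

Under this sketch, moving from Default to LowUpdateSlowReaction would have to go through MediumUpdateAverageReaction as an intermediate step.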
The same applies when the profile is updated from LowUpdateSlowReaction to Default.
While in the LowUpdateSlowReaction profile:
```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
workerLatencyProfile: LowUpdateSlowReaction
$ oc get KubeControllerManager -o yaml | grep -i latencyprofile -A 4 -B 4
type: NodeControllerDegraded
- lastTransitionTime: "2022-05-27T10:19:26Z"
reason: ProfileUpdated
status: "False"
type: WorkerLatencyProfileProgressing
- lastTransitionTime: "2022-05-27T10:19:26Z"
message: all static pod revision(s) have updated latency profile
reason: ProfileUpdated
status: "True"
type: WorkerLatencyProfileComplete
- lastTransitionTime: "2022-05-26T08:01:23Z"
reason: AsExpected
status: "False"
type: MissingStaticPodControllerDegraded
--
type: GuardControllerDegraded
- lastTransitionTime: "2022-05-26T08:01:44Z"
reason: AsExpected
status: "False"
type: WorkerLatencyProfileDegraded
- lastTransitionTime: "2022-05-27T05:51:26Z"
status: "False"
type: StaticPodsDegraded
- lastTransitionTime: "2022-05-27T07:42:12Z"
$ oc get KubeControllerManager -o yaml | grep -i node-monitor-grace-period -A 1
node-monitor-grace-period:
- 5m0s
```
After applying Default, with the above fix [1]:
```
$ oc get nodes.config -o yaml | grep workerLatencyProfile
workerLatencyProfile: Default
$ oc get KubeControllerManager -o yaml | grep -i latencyprofile -A 4 -B 4
type: NodeControllerDegraded
- lastTransitionTime: "2022-05-27T10:19:26Z"
reason: ProfileUpdateProhibited
status: "False"
type: WorkerLatencyProfileProgressing
- lastTransitionTime: "2022-05-27T10:19:26Z"
message: rejected update from "LowUpdateSlowReaction" to "Default" latency profile
as extreme profile transition is unsupported
reason: ProfileUpdateProhibited
status: "True"
type: WorkerLatencyProfileComplete
- lastTransitionTime: "2022-05-26T08:01:23Z"
reason: AsExpected
status: "False"
type: MissingStaticPodControllerDegraded
--
type: GuardControllerDegraded
- lastTransitionTime: "2022-05-26T08:01:44Z"
reason: AsExpected
status: "False"
type: WorkerLatencyProfileDegraded
- lastTransitionTime: "2022-05-27T05:51:26Z"
status: "False"
type: StaticPodsDegraded
- lastTransitionTime: "2022-05-27T07:42:12Z"
$ oc get KubeControllerManager -o yaml | grep -i node-monitor-grace-period -A 1
node-monitor-grace-period:
- 5m0s
```
The status shows the expected outcome: 'rejected update from "LowUpdateSlowReaction" to "Default" latency profile as extreme profile transition is unsupported'. In addition, node-monitor-grace-period remains unchanged at 5m0s, the value for the LowUpdateSlowReaction profile, because the update to Default was rejected.
[1] https://github.com/openshift/cluster-kube-controller-manager-operator/pull/629
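The grace-period check used in both scenarios could be scripted roughly as follows. This is only a sketch: the helper names `grace_period_from_yaml` and `profile_for_grace_period` are hypothetical, the 40s and 5m0s values are the ones observed in this report, and any other value is reported as unknown rather than guessed.

```shell
# Reads `oc get KubeControllerManager -o yaml` output on stdin and prints
# the first list entry under node-monitor-grace-period.
grace_period_from_yaml() {
  awk '
    found { sub(/^[ \t]*-[ \t]*/, ""); print; exit }
    /node-monitor-grace-period:/ { found = 1 }
  '
}

# Maps a grace period to the profile it indicated in this report.
profile_for_grace_period() {
  case "$1" in
    40s)  echo Default ;;
    5m0s) echo LowUpdateSlowReaction ;;
    *)    echo unknown ;;
  esac
}

# Example with a saved fragment instead of a live cluster.
sample='      node-monitor-grace-period:
      - 5m0s'
period=$(printf '%s\n' "$sample" | grace_period_from_yaml)
echo "$period -> $(profile_for_grace_period "$period")"
```

On a live cluster, the same helper would be fed with `oc get KubeControllerManager -o yaml | grace_period_from_yaml`.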
$ oc edit nodes.config/cluster
error: nodes.config.openshift.io "cluster" is invalid
A copy of your changes has been stored to "/tmp/oc-edit-3314310019.yaml"
error: Edit cancelled, no valid changes were saved
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-06-25-132614 True False 8h Cluster version is 4.11.0-0.nightly-2022-06-25-132614
Verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2022:5069