|Summary:||[Performance] Lease refresh period for cloud-credential-operator is too high, causes heavy writes to etcd at idle|
|Product:||OpenShift Container Platform||Reporter:||Clayton Coleman <ccoleman>|
|Component:||Cloud Credential Operator||Assignee:||Devan Goodwin <dgoodwin>|
|Status:||VERIFIED ---||QA Contact:||wang lin <lwan>|
|Version:||4.5||CC:||agarcial, gshereme, lwan, zhsun|
|Fixed In Version:||Doc Type:||Bug Fix|
Cause: Cloud Credential Operator leader election was using default values from controller-runtime. Consequence: Writing to etcd every 2s. Fix: Implemented custom leader election. Result: Leader election now writes only every 90s and releases the lock immediately on normal shutdown.
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
Description Clayton Coleman 2020-07-17 20:05:54 UTC
+++ This bug was initially created as a clone of Bug #1858400 +++ The cloud-credential operator is refreshing its lease more than all other components except machine-api combined (writes to their election config map by client within a window of time): 2 system:serviceaccount:openshift-network-operator:default openshift-multus cni-binary-copy-script 2 system:serviceaccount:openshift-network-operator:default openshift-network-operator applied-cluster 2 system:serviceaccount:openshift-network-operator:default openshift-network-operator openshift-service-ca 2 system:serviceaccount:openshift-network-operator:default openshift-sdn sdn-config 18 system:serviceaccount:openshift-machine-config-operator:default openshift-machine-config-operator machine-config 18 system:serviceaccount:openshift-machine-config-operator:machine-config-controller openshift-machine-config-operator machine-config-controller 27 system:serviceaccount:openshift-machine-api:cluster-autoscaler-operator openshift-machine-api cluster-autoscaler-operator-leader 53 system:kube-controller-manager openshift-kube-controller-manager cluster-policy-controller 53 system:serviceaccount:openshift-config-operator:openshift-config-operator openshift-config-operator config-operator-lock 54 system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator openshift-apiserver-operator openshift-apiserver-operator-lock 54 system:serviceaccount:openshift-controller-manager-operator:openshift-controller-manager-operator openshift-controller-manager-operator openshift-controller-manager-operator-lock 54 system:serviceaccount:openshift-etcd-operator:etcd-operator openshift-etcd-operator openshift-cluster-etcd-operator-lock 54 system:serviceaccount:openshift-image-registry:cluster-image-registry-operator openshift-image-registry openshift-master-controllers 54 system:serviceaccount:openshift-kube-apiserver-operator:kube-apiserver-operator openshift-kube-apiserver-operator kube-apiserver-operator-lock 54 system:serviceaccount:openshift-kube-apiserver:localhost-recovery-client openshift-kube-apiserver cert-regeneration-controller-lock 54 system:serviceaccount:openshift-kube-controller-manager-operator:kube-controller-manager-operator openshift-kube-controller-manager-operator kube-controller-manager-operator-lock 54 system:serviceaccount:openshift-kube-scheduler-operator:openshift-kube-scheduler-operator openshift-kube-scheduler-operator openshift-cluster-kube-scheduler-operator-lock 54 system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator openshift-kube-storage-version-migrator-operator openshift-kube-storage-version-migrator-operator-lock 54 system:serviceaccount:openshift-service-ca-operator:service-ca-operator openshift-service-ca-operator service-ca-operator-lock 179 system:kube-controller-manager kube-system kube-controller-manager 268 system:kube-scheduler openshift-kube-scheduler kube-scheduler 268 system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator openshift-cloud-credential-operator cloud-credential-operator-leader 268 system:serviceaccount:openshift-machine-api:machine-api-controllers openshift-machine-api cluster-api-provider-gcp-leader 268 system:serviceaccount:openshift-machine-api:machine-api-controllers openshift-machine-api cluster-api-provider-healthcheck-leader 268 system:serviceaccount:openshift-machine-api:machine-api-controllers openshift-machine-api cluster-api-provider-nodelink-leader The operator should set a leader interval to 90s (it is currently 15s). Since the operator already runs only a single pod, is are mainly using election to prevent administrator error (in force deleting a pod or node), vs needing to have rapid failover between multiple components. Please ensure the component listed here has leader election intervals at 90s, and that after this change the rate of configmap updates from this client (you can check audit log on a cluster) occurs no more frequently than that interval (in case there is a secondary bug).
Comment 1 Devan Goodwin 2020-07-30 12:06:46 UTC
On the list for inclusion in next sprint.
Comment 2 Devan Goodwin 2020-07-31 18:38:43 UTC
This is trickier than it looks initially. From controller-runtime we have three tunables: // LeaseDuration is the duration that non-leader candidates will // wait to force acquire leadership. This is measured against time of // last observed ack. Default is 15 seconds. LeaseDuration *time.Duration // RenewDeadline is the duration that the acting master will retry // refreshing leadership before giving up. Default is 10 seconds. RenewDeadline *time.Duration // RetryPeriod is the duration the LeaderElector clients should wait // between tries of actions. Default is 2 seconds. RetryPeriod *time.Duration Clayton's report indicates to change the leader election interval. This is not lease duration, and changing lease duration has no impact on this issue. It appears the writes are due to the retry period which defaults to 2s. I was not clear what audit log the bug report is referring to so I monitored with kubectl get configmap -w to see a line every write and indeed with CCO they are coming every 2s. But we cannot just bump this to 90s, it must be lower than both lease duration and renew deadline to allow for a couple retries. So a 90s retry period which this bug requests would mean at least 180 or 270s renew deadline so we can try a couple times, and then a very large lease duration, probably 6 or 7 minutes. Lease duration seems to affect startup time even if the old lease is stale (not sure why this is) so I think we'd be potentially adding several minutes to install.
Comment 3 Devan Goodwin 2020-07-31 18:49:57 UTC
// A client needs to wait a full LeaseDuration without observing a change to // the record before it can attempt to take over. When all clients are // shutdown and a new set of clients are started with different names against // the same leader record, they must wait the full LeaseDuration before // attempting to acquire the lease. Thus LeaseDuration should be as short as // possible (within your tolerance for clock skew rate) to avoid a possible // long waits in the scenario. So lease duration means affects pod startup time in any scenario where the configmap already exists. This would impact both us and machine-api operator when transitioning from bootstrap controlplane and I believe will add time to the install process. I would propose: LeaseDuration: 60s (up from 15) RenewDeadline: 30s (up from 10) RetryPeriod: 10s (up from 2) We're writing to etcd every 10 seconds but this is a 1/5th what it was before. We have 3 tries before a lease holder would give up, and a fairly reasonable 60s lease duration and pod startup time.
Comment 4 Devan Goodwin 2020-07-31 19:07:50 UTC
Went with 90/60/15: https://github.com/openshift/cloud-credential-operator/pull/231 So we will have a 90s pod startup delay, and write every 15s.
Comment 5 Clayton Coleman 2020-07-31 19:46:12 UTC
You should not do that - you should be using "release on cancel" which is part of the election process. Shutdown on bootstrap is a red herring - stepping down should release the lease, and that is now part of the core library. See https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/examples/leader-election/main.go#L130 for an example of how to do this correctly. That speeds up bootstrap and rolling deploy and is more correct than abandoning the lease on shutdown. You should hold the lease for a longer period.
Comment 6 Greg Sheremeta 2020-08-22 12:00:09 UTC
PR approved and trying to get through CI now
Comment 9 wang lin 2020-09-01 02:43:58 UTC
The bug has fixed, the RetryPeriod has changed to 90s test payload: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-31-194600 step: run "oc get configmap -n openshift-cloud-credential-operator -w"