Bug 1858403

Summary: [Performance] Lease refresh period for cloud-credential-operator is too high, causes heavy writes to etcd at idle
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: Cloud Credential OperatorAssignee: Devan Goodwin <dgoodwin>
Status: VERIFIED --- QA Contact: wang lin <lwan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.5CC: agarcial, gshereme, lwan, zhsun
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Cloud Credential Operator leader election was using default values from controller-runtime. Consequence: Writing to etcd every 2s. Fix: Implemented custom leader election. Result: Leader election now writes only every 90s and releases the lock immediately on normal shutdown.
Story Points: ---
Clone Of: 1858400 Environment:
Last Closed: Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Clayton Coleman 2020-07-17 20:05:54 UTC
+++ This bug was initially created as a clone of Bug #1858400 +++

The cloud-credential operator is refreshing its lease more than all other components except machine-api combined (writes to their election config map by client within a window of time):

      2 system:serviceaccount:openshift-network-operator:default	openshift-multus	cni-binary-copy-script
      2 system:serviceaccount:openshift-network-operator:default	openshift-network-operator	applied-cluster
      2 system:serviceaccount:openshift-network-operator:default	openshift-network-operator	openshift-service-ca
      2 system:serviceaccount:openshift-network-operator:default	openshift-sdn	sdn-config
     18 system:serviceaccount:openshift-machine-config-operator:default	openshift-machine-config-operator	machine-config
     18 system:serviceaccount:openshift-machine-config-operator:machine-config-controller	openshift-machine-config-operator	machine-config-controller
     27 system:serviceaccount:openshift-machine-api:cluster-autoscaler-operator	openshift-machine-api	cluster-autoscaler-operator-leader
     53 system:kube-controller-manager	openshift-kube-controller-manager	cluster-policy-controller
     53 system:serviceaccount:openshift-config-operator:openshift-config-operator	openshift-config-operator	config-operator-lock
     54 system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator	openshift-apiserver-operator	openshift-apiserver-operator-lock
     54 system:serviceaccount:openshift-controller-manager-operator:openshift-controller-manager-operator	openshift-controller-manager-operator	openshift-controller-manager-operator-lock
     54 system:serviceaccount:openshift-etcd-operator:etcd-operator	openshift-etcd-operator	openshift-cluster-etcd-operator-lock
     54 system:serviceaccount:openshift-image-registry:cluster-image-registry-operator	openshift-image-registry	openshift-master-controllers
     54 system:serviceaccount:openshift-kube-apiserver-operator:kube-apiserver-operator	openshift-kube-apiserver-operator	kube-apiserver-operator-lock
     54 system:serviceaccount:openshift-kube-apiserver:localhost-recovery-client	openshift-kube-apiserver	cert-regeneration-controller-lock
     54 system:serviceaccount:openshift-kube-controller-manager-operator:kube-controller-manager-operator	openshift-kube-controller-manager-operator	kube-controller-manager-operator-lock
     54 system:serviceaccount:openshift-kube-scheduler-operator:openshift-kube-scheduler-operator	openshift-kube-scheduler-operator	openshift-cluster-kube-scheduler-operator-lock
     54 system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator	openshift-kube-storage-version-migrator-operator	openshift-kube-storage-version-migrator-operator-lock
     54 system:serviceaccount:openshift-service-ca-operator:service-ca-operator	openshift-service-ca-operator	service-ca-operator-lock
    179 system:kube-controller-manager	kube-system	kube-controller-manager
    268 system:kube-scheduler	openshift-kube-scheduler	kube-scheduler
    268 system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator	openshift-cloud-credential-operator	cloud-credential-operator-leader
    268 system:serviceaccount:openshift-machine-api:machine-api-controllers	openshift-machine-api	cluster-api-provider-gcp-leader
    268 system:serviceaccount:openshift-machine-api:machine-api-controllers	openshift-machine-api	cluster-api-provider-healthcheck-leader
    268 system:serviceaccount:openshift-machine-api:machine-api-controllers	openshift-machine-api	cluster-api-provider-nodelink-leader

The operator should set a leader interval to 90s (it is currently 15s). Since the operator already runs only a single pod, is are mainly using election to prevent administrator error (in force deleting a pod or node), vs needing to have rapid failover between multiple components.

Please ensure the component listed here has leader election intervals at 90s, and that after this change the rate of configmap updates from this client (you can check audit log on a cluster) occurs no more frequently than that interval (in case there is a secondary bug).

Comment 1 Devan Goodwin 2020-07-30 12:06:46 UTC
On the list for inclusion in next sprint.

Comment 2 Devan Goodwin 2020-07-31 18:38:43 UTC
This is trickier than it looks initially. From controller-runtime we have three tunables:

	// LeaseDuration is the duration that non-leader candidates will
	// wait to force acquire leadership. This is measured against time of
	// last observed ack. Default is 15 seconds.
	LeaseDuration *time.Duration
	// RenewDeadline is the duration that the acting master will retry
	// refreshing leadership before giving up. Default is 10 seconds.
	RenewDeadline *time.Duration
	// RetryPeriod is the duration the LeaderElector clients should wait
	// between tries of actions. Default is 2 seconds.
	RetryPeriod *time.Duration

Clayton's report indicates to change the leader election interval. This is not lease duration, and changing lease duration has no impact on this issue. It appears the writes are due to the retry period which defaults to 2s. 

I was not clear what audit log the bug report is referring to so I monitored with kubectl get configmap -w to see a line every write and indeed with CCO they are coming every 2s.

But we cannot just bump this to 90s, it must be lower than both lease duration and renew deadline to allow for a couple retries. So a 90s retry period which this bug requests would mean at least 180 or 270s renew deadline so we can try a couple times, and then a very large lease duration, probably 6 or 7 minutes. Lease duration seems to affect startup time even if the old lease is stale (not sure why this is) so I think we'd be potentially adding several minutes to install.

Comment 3 Devan Goodwin 2020-07-31 18:49:57 UTC
// A client needs to wait a full LeaseDuration without observing a change to
// the record before it can attempt to take over. When all clients are
// shutdown and a new set of clients are started with different names against
// the same leader record, they must wait the full LeaseDuration before
// attempting to acquire the lease. Thus LeaseDuration should be as short as
// possible (within your tolerance for clock skew rate) to avoid a possible
// long waits in the scenario.

So lease duration means affects pod startup time in any scenario where the configmap already exists. This would impact both us and machine-api operator when transitioning from bootstrap controlplane and I believe will add time to the install process.

I would propose:

LeaseDuration: 60s (up from 15)
RenewDeadline: 30s (up from 10)
RetryPeriod: 10s (up from 2)

We're writing to etcd every 10 seconds but this is a 1/5th what it was before. We have 3 tries before a lease holder would give up, and a fairly reasonable 60s lease duration and pod startup time.

Comment 4 Devan Goodwin 2020-07-31 19:07:50 UTC
Went with 90/60/15: https://github.com/openshift/cloud-credential-operator/pull/231

So we will have a 90s pod startup delay, and write every 15s.

Comment 5 Clayton Coleman 2020-07-31 19:46:12 UTC
You should not do that - you should be using "release on cancel" which is part of the election process.  Shutdown on bootstrap is a red herring - stepping down should release the lease, and that is now part of the core library.

See https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/examples/leader-election/main.go#L130 for an example of how to do this correctly.

That speeds up bootstrap and rolling deploy and is more correct than abandoning the lease on shutdown.

You should hold the lease for a longer period.

Comment 6 Greg Sheremeta 2020-08-22 12:00:09 UTC
PR approved and trying to get through CI now

Comment 9 wang lin 2020-09-01 02:43:58 UTC
The bug has fixed, the RetryPeriod has changed to 90s
test payload: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-31-194600

run "oc get configmap -n openshift-cloud-credential-operator -w"