Bug 1858403

Summary:	[Performance] Lease refresh period for cloud-credential-operator is too high, causes heavy writes to etcd at idle
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Cloud Credential Operator	Assignee:	Devan Goodwin <dgoodwin>
Status:	CLOSED ERRATA	QA Contact:	wang lin <lwan>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.5	CC:	agarcial, gshereme, lwan, zhsun
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:	Cause: Cloud Credential Operator leader election was using default values from controller-runtime. Consequence: Writing to etcd every 2s. Fix: Implemented custom leader election. Result: Leader election now writes only every 90s and releases the lock immediately on normal shutdown.	Story Points:	---
Clone Of:	1858400	Environment:
Last Closed:	2020-10-27 16:15:58 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Clayton Coleman 2020-07-17 20:05:54 UTC

+++ This bug was initially created as a clone of Bug #1858400 +++

The cloud-credential operator is refreshing its lease more than all other components except machine-api combined (writes to their election config map by client within a window of time):

      2 system:serviceaccount:openshift-network-operator:default	openshift-multus	cni-binary-copy-script
      2 system:serviceaccount:openshift-network-operator:default	openshift-network-operator	applied-cluster
      2 system:serviceaccount:openshift-network-operator:default	openshift-network-operator	openshift-service-ca
      2 system:serviceaccount:openshift-network-operator:default	openshift-sdn	sdn-config
     18 system:serviceaccount:openshift-machine-config-operator:default	openshift-machine-config-operator	machine-config
     18 system:serviceaccount:openshift-machine-config-operator:machine-config-controller	openshift-machine-config-operator	machine-config-controller
     27 system:serviceaccount:openshift-machine-api:cluster-autoscaler-operator	openshift-machine-api	cluster-autoscaler-operator-leader
     53 system:kube-controller-manager	openshift-kube-controller-manager	cluster-policy-controller
     53 system:serviceaccount:openshift-config-operator:openshift-config-operator	openshift-config-operator	config-operator-lock
     54 system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator	openshift-apiserver-operator	openshift-apiserver-operator-lock
     54 system:serviceaccount:openshift-controller-manager-operator:openshift-controller-manager-operator	openshift-controller-manager-operator	openshift-controller-manager-operator-lock
     54 system:serviceaccount:openshift-etcd-operator:etcd-operator	openshift-etcd-operator	openshift-cluster-etcd-operator-lock
     54 system:serviceaccount:openshift-image-registry:cluster-image-registry-operator	openshift-image-registry	openshift-master-controllers
     54 system:serviceaccount:openshift-kube-apiserver-operator:kube-apiserver-operator	openshift-kube-apiserver-operator	kube-apiserver-operator-lock
     54 system:serviceaccount:openshift-kube-apiserver:localhost-recovery-client	openshift-kube-apiserver	cert-regeneration-controller-lock
     54 system:serviceaccount:openshift-kube-controller-manager-operator:kube-controller-manager-operator	openshift-kube-controller-manager-operator	kube-controller-manager-operator-lock
     54 system:serviceaccount:openshift-kube-scheduler-operator:openshift-kube-scheduler-operator	openshift-kube-scheduler-operator	openshift-cluster-kube-scheduler-operator-lock
     54 system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator	openshift-kube-storage-version-migrator-operator	openshift-kube-storage-version-migrator-operator-lock
     54 system:serviceaccount:openshift-service-ca-operator:service-ca-operator	openshift-service-ca-operator	service-ca-operator-lock
    179 system:kube-controller-manager	kube-system	kube-controller-manager
    268 system:kube-scheduler	openshift-kube-scheduler	kube-scheduler
    268 system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator	openshift-cloud-credential-operator	cloud-credential-operator-leader
    268 system:serviceaccount:openshift-machine-api:machine-api-controllers	openshift-machine-api	cluster-api-provider-gcp-leader
    268 system:serviceaccount:openshift-machine-api:machine-api-controllers	openshift-machine-api	cluster-api-provider-healthcheck-leader
    268 system:serviceaccount:openshift-machine-api:machine-api-controllers	openshift-machine-api	cluster-api-provider-nodelink-leader

The operator should set a leader interval to 90s (it is currently 15s). Since the operator already runs only a single pod, is are mainly using election to prevent administrator error (in force deleting a pod or node), vs needing to have rapid failover between multiple components.

Please ensure the component listed here has leader election intervals at 90s, and that after this change the rate of configmap updates from this client (you can check audit log on a cluster) occurs no more frequently than that interval (in case there is a secondary bug).

Comment 1 Devan Goodwin 2020-07-30 12:06:46 UTC

On the list for inclusion in next sprint.

Comment 2 Devan Goodwin 2020-07-31 18:38:43 UTC

This is trickier than it looks initially. From controller-runtime we have three tunables:

// LeaseDuration is the duration that non-leader candidates will
// wait to force acquire leadership. This is measured against time of
// last observed ack. Default is 15 seconds.
LeaseDuration *time.Duration
// RenewDeadline is the duration that the acting master will retry
// refreshing leadership before giving up. Default is 10 seconds.
RenewDeadline *time.Duration
// RetryPeriod is the duration the LeaderElector clients should wait
// between tries of actions. Default is 2 seconds.
RetryPeriod *time.Duration

Clayton's report indicates to change the leader election interval. This is not lease duration, and changing lease duration has no impact on this issue. It appears the writes are due to the retry period which defaults to 2s.

I was not clear what audit log the bug report is referring to so I monitored with kubectl get configmap -w to see a line every write and indeed with CCO they are coming every 2s.

But we cannot just bump this to 90s, it must be lower than both lease duration and renew deadline to allow for a couple retries. So a 90s retry period which this bug requests would mean at least 180 or 270s renew deadline so we can try a couple times, and then a very large lease duration, probably 6 or 7 minutes. Lease duration seems to affect startup time even if the old lease is stale (not sure why this is) so I think we'd be potentially adding several minutes to install.

Comment 3 Devan Goodwin 2020-07-31 18:49:57 UTC

// A client needs to wait a full LeaseDuration without observing a change to
// the record before it can attempt to take over. When all clients are
// shutdown and a new set of clients are started with different names against
// the same leader record, they must wait the full LeaseDuration before
// attempting to acquire the lease. Thus LeaseDuration should be as short as
// possible (within your tolerance for clock skew rate) to avoid a possible
// long waits in the scenario.

So lease duration means affects pod startup time in any scenario where the configmap already exists. This would impact both us and machine-api operator when transitioning from bootstrap controlplane and I believe will add time to the install process.

I would propose:

LeaseDuration: 60s (up from 15)
RenewDeadline: 30s (up from 10)
RetryPeriod: 10s (up from 2)

We're writing to etcd every 10 seconds but this is a 1/5th what it was before. We have 3 tries before a lease holder would give up, and a fairly reasonable 60s lease duration and pod startup time.

Comment 4 Devan Goodwin 2020-07-31 19:07:50 UTC

Went with 90/60/15: https://github.com/openshift/cloud-credential-operator/pull/231

So we will have a 90s pod startup delay, and write every 15s.

Comment 5 Clayton Coleman 2020-07-31 19:46:12 UTC

You should not do that - you should be using "release on cancel" which is part of the election process.  Shutdown on bootstrap is a red herring - stepping down should release the lease, and that is now part of the core library.

See https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/examples/leader-election/main.go#L130 for an example of how to do this correctly.

That speeds up bootstrap and rolling deploy and is more correct than abandoning the lease on shutdown.

You should hold the lease for a longer period.

Comment 6 Greg Sheremeta 2020-08-22 12:00:09 UTC

PR approved and trying to get through CI now

Comment 9 wang lin 2020-09-01 02:43:58 UTC

The bug has fixed, the RetryPeriod has changed to 90s
test payload: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-31-194600

step:
run "oc get configmap -n openshift-cloud-credential-operator -w"

Comment 11 errata-xmlrpc 2020-10-27 16:15:58 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196