+++ This bug was initially created as a clone of Bug #1858400 +++
The cloud-credential operator is refreshing its lease more often than all other components except machine-api combined (counts of writes to each election configmap, by client, within a window of time):
2 system:serviceaccount:openshift-network-operator:default openshift-multus cni-binary-copy-script
2 system:serviceaccount:openshift-network-operator:default openshift-network-operator applied-cluster
2 system:serviceaccount:openshift-network-operator:default openshift-network-operator openshift-service-ca
2 system:serviceaccount:openshift-network-operator:default openshift-sdn sdn-config
18 system:serviceaccount:openshift-machine-config-operator:default openshift-machine-config-operator machine-config
18 system:serviceaccount:openshift-machine-config-operator:machine-config-controller openshift-machine-config-operator machine-config-controller
27 system:serviceaccount:openshift-machine-api:cluster-autoscaler-operator openshift-machine-api cluster-autoscaler-operator-leader
53 system:kube-controller-manager openshift-kube-controller-manager cluster-policy-controller
53 system:serviceaccount:openshift-config-operator:openshift-config-operator openshift-config-operator config-operator-lock
54 system:serviceaccount:openshift-apiserver-operator:openshift-apiserver-operator openshift-apiserver-operator openshift-apiserver-operator-lock
54 system:serviceaccount:openshift-controller-manager-operator:openshift-controller-manager-operator openshift-controller-manager-operator openshift-controller-manager-operator-lock
54 system:serviceaccount:openshift-etcd-operator:etcd-operator openshift-etcd-operator openshift-cluster-etcd-operator-lock
54 system:serviceaccount:openshift-image-registry:cluster-image-registry-operator openshift-image-registry openshift-master-controllers
54 system:serviceaccount:openshift-kube-apiserver-operator:kube-apiserver-operator openshift-kube-apiserver-operator kube-apiserver-operator-lock
54 system:serviceaccount:openshift-kube-apiserver:localhost-recovery-client openshift-kube-apiserver cert-regeneration-controller-lock
54 system:serviceaccount:openshift-kube-controller-manager-operator:kube-controller-manager-operator openshift-kube-controller-manager-operator kube-controller-manager-operator-lock
54 system:serviceaccount:openshift-kube-scheduler-operator:openshift-kube-scheduler-operator openshift-kube-scheduler-operator openshift-cluster-kube-scheduler-operator-lock
54 system:serviceaccount:openshift-kube-storage-version-migrator-operator:kube-storage-version-migrator-operator openshift-kube-storage-version-migrator-operator openshift-kube-storage-version-migrator-operator-lock
54 system:serviceaccount:openshift-service-ca-operator:service-ca-operator openshift-service-ca-operator service-ca-operator-lock
179 system:kube-controller-manager kube-system kube-controller-manager
268 system:kube-scheduler openshift-kube-scheduler kube-scheduler
268 system:serviceaccount:openshift-cloud-credential-operator:cloud-credential-operator openshift-cloud-credential-operator cloud-credential-operator-leader
268 system:serviceaccount:openshift-machine-api:machine-api-controllers openshift-machine-api cluster-api-provider-gcp-leader
268 system:serviceaccount:openshift-machine-api:machine-api-controllers openshift-machine-api cluster-api-provider-healthcheck-leader
268 system:serviceaccount:openshift-machine-api:machine-api-controllers openshift-machine-api cluster-api-provider-nodelink-leader
The operator should set its leader election interval to 90s (it is currently 15s). Since the operator already runs only a single pod, it is mainly using election to prevent administrator error (e.g. force deleting a pod or node), rather than needing rapid failover between multiple components.
Please ensure the component listed here has its leader election interval set to 90s, and that after this change configmap updates from this client (you can check the audit log on a cluster) occur no more frequently than that interval (in case there is a secondary bug).
On the list for inclusion in next sprint.
This is trickier than it looks initially. From controller-runtime we have three tunables:
// LeaseDuration is the duration that non-leader candidates will
// wait to force acquire leadership. This is measured against time of
// last observed ack. Default is 15 seconds.
// RenewDeadline is the duration that the acting master will retry
// refreshing leadership before giving up. Default is 10 seconds.
// RetryPeriod is the duration the LeaderElector clients should wait
// between tries of actions. Default is 2 seconds.
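For context, here is roughly where those three tunables plug in; a minimal sketch assuming the operator wires leader election through the controller-runtime manager (field names taken from sigs.k8s.io/controller-runtime's manager.Options, durations are the defaults quoted above, and the election ID/namespace are just the configmap name and namespace from this bug):

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

func newManager() (ctrl.Manager, error) {
	// Defaults quoted above: 15s / 10s / 2s. The RetryPeriod is what drives
	// one configmap update per renew attempt while the operator is leading.
	leaseDuration := 15 * time.Second
	renewDeadline := 10 * time.Second
	retryPeriod := 2 * time.Second

	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionNamespace: "openshift-cloud-credential-operator",
		LeaderElectionID:        "cloud-credential-operator-leader",
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
}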
Clayton's report asks for the leader election interval to be changed. This is not the lease duration, and changing the lease duration has no impact on this issue. The writes appear to be driven by the retry period, which defaults to 2s.
I was not clear which audit log the bug report refers to, so I monitored with "kubectl get configmap -w", which prints a line on every write; with CCO the writes are indeed coming every 2s.
But we cannot just bump this to 90s: the retry period must be lower than both the lease duration and the renew deadline to allow for a couple of retries. So the 90s retry period this bug requests would mean at least a 180s or 270s renew deadline so we can retry a couple of times, and then a very large lease duration, probably 6 or 7 minutes. Lease duration seems to affect startup time even when the old lease is stale (not sure why), so I think we'd potentially be adding several minutes to the install.
// A client needs to wait a full LeaseDuration without observing a change to
// the record before it can attempt to take over. When all clients are
// shutdown and a new set of clients are started with different names against
// the same leader record, they must wait the full LeaseDuration before
// attempting to acquire the lease. Thus LeaseDuration should be as short as
// possible (within your tolerance for clock skew rate) to avoid a possible
// long waits in the scenario.
So lease duration affects pod startup time in any scenario where the configmap already exists. This would impact both us and the machine-api operator when transitioning from the bootstrap control plane, and I believe it will add time to the install process.
I would propose:
LeaseDuration: 60s (up from 15)
RenewDeadline: 30s (up from 10)
RetryPeriod: 10s (up from 2)
We'd be writing to etcd every 10 seconds, but that is 1/5th of what it was before. The leader gets 3 renew attempts before giving up, and a fairly reasonable 60s lease duration and pod startup time.
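Spelled out as a quick sanity check (illustrative only; as far as I recall client-go rejects configs unless lease duration > renew deadline > retry period, with some jitter headroom on the retry period):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Proposed values from this comment (not what eventually merged).
	leaseDuration := 60 * time.Second
	renewDeadline := 30 * time.Second
	retryPeriod := 10 * time.Second

	// Ordering client-go expects; bumping the retry period drags the other two up.
	fmt.Println(leaseDuration > renewDeadline && renewDeadline > retryPeriod) // true
	// Roughly how many renew attempts fit before the leader gives up.
	fmt.Println(int64(renewDeadline / retryPeriod)) // 3
}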
Went with 90/60/15: https://github.com/openshift/cloud-credential-operator/pull/231
So we will have a 90s pod startup delay, and write every 15s.
You should not do that - you should be using "release on cancel" which is part of the election process. Shutdown on bootstrap is a red herring - stepping down should release the lease, and that is now part of the core library.
See https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/examples/leader-election/main.go#L130 for an example of how to do this correctly.
That speeds up bootstrap and rolling deploy and is more correct than abandoning the lease on shutdown.
You should hold the lease for a longer period.
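For reference, a minimal sketch of the pattern the linked client-go example uses (the lock name, namespace, and identity here are illustrative, pulled from this bug; the durations are the 90/60/15 from the PR; error handling trimmed):

package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, _ := rest.InClusterConfig()
	client := kubernetes.NewForConfigOrDie(cfg)

	// Cancelling this context on shutdown is what triggers the lease release.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "cloud-credential-operator-leader",
			Namespace: "openshift-cloud-credential-operator",
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: os.Getenv("POD_NAME")},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true, // step down and release the lease when ctx is cancelled
		LeaseDuration:   90 * time.Second,
		RenewDeadline:   60 * time.Second,
		RetryPeriod:     15 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start controllers here */ },
			OnStoppedLeading: func() { os.Exit(0) },
		},
	})
}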
PR approved and trying to get through CI now
The bug has been fixed; the leader election lease duration has been changed to 90s.
test payload: registry.svc.ci.openshift.org/ocp/release:4.6.0-0.nightly-2020-08-31-194600
run "oc get configmap -n openshift-cloud-credential-operator -w"