Bug 1703232

Summary: High read and write rate from cluster-autoscaler-operator
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Cloud Compute
Assignee: Vikas Choudhary <vichoudh>
Status: CLOSED ERRATA
QA Contact: Mike Fiedler <mifiedle>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: agarcial, mgugino, mifiedle, vichoudh
Target Milestone: ---
Keywords: BetaBlocker
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2019-04-25 20:39:24 UTC
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="GET"}	0.5
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="PUT"}	0.5

This looks like an overly aggressive leader-election interval (2 seconds?). Operators should use 10-20s refresh intervals (with 90s-120s timeouts), because fast handoff is not important.

Please correct your tuning before GA because this drives write load to the cluster.
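For concreteness, here is a minimal sketch of client-go leader election tuned to those intervals, using the ConfigMap-based lock that matches the configmaps GET/PUT traffic above. The namespace, lock name, and identity are illustrative placeholders, not values taken from the operator; this assumes a client-go version of that era where ConfigMapLock is still available.

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection wraps run() in a ConfigMap-based leader election loop
// using the relaxed timings suggested above.
func runWithLeaderElection(cfg *rest.Config, run func(ctx context.Context)) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// ConfigMap lock; namespace, name, and identity are illustrative placeholders.
	lock := &resourcelock.ConfigMapLock{
		ConfigMapMeta: metav1.ObjectMeta{
			Namespace: "openshift-machine-api",
			Name:      "cluster-autoscaler-operator-leader",
		},
		Client:     client.CoreV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "example-identity"},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 120 * time.Second, // how long the lease stays valid before another candidate may take over
		RenewDeadline: 90 * time.Second,  // the leader must renew within this window
		RetryPeriod:   20 * time.Second,  // refresh interval; this is what drives the GET/PUT rate
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {},
		},
	})
	return nil
}

The RetryPeriod is the knob that sets the steady-state request rate against the apiserver, so relaxing it from ~2s to 10-20s cuts the configmap read/write traffic roughly tenfold.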

Comment 1 Vikas Choudhary 2019-04-29 13:22:35 UTC
https://github.com/kubernetes/kubernetes/pull/77204
https://github.com/kubernetes-sigs/controller-runtime/pull/412

Currently the controller-runtime repo runs leader election with hard-coded, very aggressive values. With the above PRs, the leader-election timings become configurable. We would then pass longer durations from cluster-autoscaler-operator using the options those PRs add.
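As an illustration, this is roughly how the operator could pass those durations once the controller-runtime options land. The field names follow what the PRs propose (pointer durations on manager.Options); the election namespace and ID are placeholders, not the operator's actual values.

package main

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManager builds a controller-runtime manager with relaxed leader-election
// timings, assuming the *time.Duration options proposed in the PRs above.
func newManager() (manager.Manager, error) {
	leaseDuration := 120 * time.Second
	renewDeadline := 90 * time.Second
	retryPeriod := 20 * time.Second

	return manager.New(config.GetConfigOrDie(), manager.Options{
		LeaderElection:          true,
		LeaderElectionNamespace: "openshift-machine-api",              // illustrative
		LeaderElectionID:        "cluster-autoscaler-operator-leader", // illustrative
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
}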

Comment 2 Vikas Choudhary 2019-05-01 04:13:02 UTC
Until the upstream controller-runtime PR merges, this is a stop-gap workaround: https://github.com/openshift/cluster-autoscaler-operator/pull/96

Comment 3 Michael Gugino 2019-05-01 19:22:19 UTC
Locally patched in vendor.

Comment 5 Michael Gugino 2019-05-01 21:35:06 UTC
Query from Clayton: topk(20, sum without (instance) (rate(apiserver_request_count[5m])))

Comment 6 Mike Fiedler 2019-05-03 15:00:30 UTC
Verified on 4.1.0-0.nightly-2019-05-03-093152. The request rate is roughly 10x lower:

{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="GET"}	0.05185185185185186
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="PUT"}	0.05185185185185186

Comment 7 Mike Fiedler 2019-05-03 15:00:58 UTC
rate(apiserver_request_count{client=~"cluster-autoscaler-operator.*"}[5m])

Comment 9 errata-xmlrpc 2019-06-04 10:48:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758