Bug 1703232 - High read and write rate from cluster-autoscaler-operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Vikas Choudhary
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-25 20:39 UTC by Clayton Coleman
Modified: 2019-06-04 10:48 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:02 UTC
Target Upstream Version:




Links:
Red Hat Product Errata RHBA-2019:0758, last updated 2019-06-04 10:48:12 UTC

Description Clayton Coleman 2019-04-25 20:39:24 UTC
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="GET"}	0.5
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="PUT"}	0.5

This looks like you have an overly aggressive leader election (2 seconds?).  Operators should be on 10-20s refresh intervals (with 90s-120s timeouts) because handoff is not important.

Please correct your tuning before GA because this drives write load to the cluster.
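The reported rates line up with the renew period: a leader-election loop performs one GET and one PUT of the lock ConfigMap per renew, so each verb's request rate is roughly 1/period. A minimal, illustrative sketch of that arithmetic (not the operator's code):

```go
package main

import "fmt"

// requestRate returns the per-verb (GET or PUT) apiserver request rate,
// in requests per second, produced by a leader-election loop that renews
// its lease once every renewPeriodSeconds.
func requestRate(renewPeriodSeconds float64) float64 {
	return 1.0 / renewPeriodSeconds
}

func main() {
	fmt.Println(requestRate(2))  // 2s renew period -> 0.5 req/s per verb, as observed
	fmt.Println(requestRate(20)) // 20s renew period -> 0.05 req/s per verb
}
```

With a 2-second renew period this gives exactly the 0.5 req/s seen in the metrics above; a 20-second period would cut that to 0.05 req/s.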

Comment 1 Vikas Choudhary 2019-04-29 13:22:35 UTC
https://github.com/kubernetes/kubernetes/pull/77204
https://github.com/kubernetes-sigs/controller-runtime/pull/412

Currently the controller-runtime repo runs leader election with hard-coded, very aggressive values. With the above PRs, the leader-election timing becomes configurable; cluster-autoscaler-operator can then pass longer durations through the options those PRs add.
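Once those options land, the operator's manager setup could pass relaxed durations along these lines. This is a hedged configuration sketch only: the field names assume the API proposed in controller-runtime PR #412, and the namespace and lock ID shown here are hypothetical placeholders.

```go
// Sketch: relaxed leader-election timing via the proposed
// controller-runtime manager options (field names per PR #412).
leaseDuration := 120 * time.Second // lock validity; matches the suggested 90s-120s timeout
renewDeadline := 110 * time.Second // leader must renew before this elapses (< leaseDuration)
retryPeriod := 20 * time.Second    // candidates poll at this interval (the 10-20s refresh)

mgr, err := manager.New(cfg, manager.Options{
	LeaderElection:          true,
	LeaderElectionNamespace: "openshift-machine-api",            // placeholder
	LeaderElectionID:        "cluster-autoscaler-operator-lock", // placeholder
	LeaseDuration:           &leaseDuration,
	RenewDeadline:           &renewDeadline,
	RetryPeriod:             &retryPeriod,
})
```

The ordering constraint is retryPeriod < renewDeadline < leaseDuration; the values above satisfy it while keeping per-verb lock traffic around 0.05 req/s.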

Comment 2 Vikas Choudhary 2019-05-01 04:13:02 UTC
Until the upstream controller-runtime PR merges, this stop-gap fix works around the issue: https://github.com/openshift/cluster-autoscaler-operator/pull/96

Comment 3 Michael Gugino 2019-05-01 19:22:19 UTC
Locally patched in vendor.

Comment 5 Michael Gugino 2019-05-01 21:35:06 UTC
Query from Clayton: topk(20, sum without (instance) (rate(apiserver_request_count[5m])))

Comment 6 Mike Fiedler 2019-05-03 15:00:30 UTC
Verified on 4.1.0-0.nightly-2019-05-03-093152. The request rate is lower by 10x:

{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="GET"}	0.05185185185185186
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="PUT"}	0.05185185185185186

Comment 7 Mike Fiedler 2019-05-03 15:00:58 UTC
rate(apiserver_request_count{client=~"cluster-autoscaler-operator.*"}[5m])

Comment 9 errata-xmlrpc 2019-06-04 10:48:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

