Bug 1703232 - High read and write rate from cluster-autoscaler-operator
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Vikas Choudhary
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-25 20:39 UTC by Clayton Coleman
Modified: 2019-06-04 10:48 UTC
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:02 UTC
Target Upstream Version:




Links:
Red Hat Product Errata RHBA-2019:0758, last updated 2019-06-04 10:48:12 UTC

Description Clayton Coleman 2019-04-25 20:39:24 UTC
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="GET"}	0.5
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="PUT"}	0.5

This looks like you have an overly aggressive leader election (2 seconds?).  Operators should be on 10-20s refresh intervals (with 90s-120s timeouts) because handoff is not important.

Please correct your tuning before GA because this drives write load to the cluster.
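The reported rates line up with the renew period: a leader-election loop performs one GET and one PUT of the lock ConfigMap per renew, so each verb's request rate is roughly 1/period. A minimal, illustrative sketch of that arithmetic (not the operator's code):

```go
package main

import "fmt"

// requestRate returns the per-verb (GET or PUT) apiserver request rate,
// in requests per second, produced by a leader-election loop that renews
// its lease once every renewPeriodSeconds.
func requestRate(renewPeriodSeconds float64) float64 {
	return 1.0 / renewPeriodSeconds
}

func main() {
	fmt.Println(requestRate(2))  // 2s renew period -> 0.5 req/s per verb, as observed
	fmt.Println(requestRate(20)) // 20s renew period -> 0.05 req/s per verb
}
```

With a 2-second renew period this gives exactly the 0.5 req/s seen in the metrics above; a 20-second period would cut that to 0.05 req/s.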

Comment 1 Vikas Choudhary 2019-04-29 13:22:35 UTC
https://github.com/kubernetes/kubernetes/pull/77204
https://github.com/kubernetes-sigs/controller-runtime/pull/412

Currently the controller-runtime repo runs leader election with hard-coded, very aggressive values. With the above PRs, the leader-election timing becomes configurable; cluster-autoscaler-operator can then pass longer durations through the options those PRs add.
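Once those options land, the operator's manager setup could pass relaxed durations along these lines. This is a hedged configuration sketch only: the field names assume the API proposed in controller-runtime PR #412, and the namespace and lock ID shown here are hypothetical placeholders.

```go
// Sketch: relaxed leader-election timing via the proposed
// controller-runtime manager options (field names per PR #412).
leaseDuration := 120 * time.Second // lock validity; matches the suggested 90s-120s timeout
renewDeadline := 110 * time.Second // leader must renew before this elapses (< leaseDuration)
retryPeriod := 20 * time.Second    // candidates poll at this interval (the 10-20s refresh)

mgr, err := manager.New(cfg, manager.Options{
	LeaderElection:          true,
	LeaderElectionNamespace: "openshift-machine-api",            // placeholder
	LeaderElectionID:        "cluster-autoscaler-operator-lock", // placeholder
	LeaseDuration:           &leaseDuration,
	RenewDeadline:           &renewDeadline,
	RetryPeriod:             &retryPeriod,
})
```

The ordering constraint is retryPeriod < renewDeadline < leaseDuration; the values above satisfy it while keeping per-verb lock traffic around 0.05 req/s.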

Comment 2 Vikas Choudhary 2019-05-01 04:13:02 UTC
Until the upstream controller-runtime PR merges, this stop-gap fix works around the issue: https://github.com/openshift/cluster-autoscaler-operator/pull/96

Comment 3 Michael Gugino 2019-05-01 19:22:19 UTC
Locally patched in vendor.

Comment 5 Michael Gugino 2019-05-01 21:35:06 UTC
Query from Clayton: topk(20, sum without (instance) (rate(apiserver_request_count[5m])))

Comment 6 Mike Fiedler 2019-05-03 15:00:30 UTC
Verified on 4.1.0-0.nightly-2019-05-03-093152. The request rate is lower by 10x:

{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="GET"}	0.05185185185185186
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="PUT"}	0.05185185185185186

Comment 7 Mike Fiedler 2019-05-03 15:00:58 UTC
rate(apiserver_request_count{client=~"cluster-autoscaler-operator.*"}[5m])

Comment 9 errata-xmlrpc 2019-06-04 10:48:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

