Bug 1703232

Summary: High read and write rate from cluster-autoscaler-operator
Product: OpenShift Container Platform
Reporter: Clayton Coleman <ccoleman>
Component: Cloud Compute
Assignee: Vikas Choudhary <vichoudh>
Status: CLOSED ERRATA
QA Contact: Mike Fiedler <mifiedle>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.1.0
CC: agarcial, mgugino, mifiedle, vichoudh
Target Milestone: ---
Keywords: BetaBlocker
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:48:02 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Clayton Coleman 2019-04-25 20:39:24 UTC
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="GET"}	0.5
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="PUT"}	0.5

This looks like an overly aggressive leader-election interval (2 seconds?). Operators should use 10-20s refresh intervals (with 90s-120s timeouts), because fast handoff is not important.

Please correct your tuning before GA because this drives write load to the cluster.
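For concreteness, here is a minimal sketch of client-go leader election tuned to those intervals, using the ConfigMap-based lock that matches the configmaps GET/PUT traffic above. The namespace, lock name, and identity are illustrative placeholders, not values taken from the operator; this assumes a client-go version of that era where ConfigMapLock is still available.

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection wraps run() in a ConfigMap-based leader election loop
// using the relaxed timings suggested above.
func runWithLeaderElection(cfg *rest.Config, run func(ctx context.Context)) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// ConfigMap lock; namespace, name, and identity are illustrative placeholders.
	lock := &resourcelock.ConfigMapLock{
		ConfigMapMeta: metav1.ObjectMeta{
			Namespace: "openshift-machine-api",
			Name:      "cluster-autoscaler-operator-leader",
		},
		Client:     client.CoreV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "example-identity"},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 120 * time.Second, // how long the lease stays valid before another candidate may take over
		RenewDeadline: 90 * time.Second,  // the leader must renew within this window
		RetryPeriod:   20 * time.Second,  // refresh interval; this is what drives the GET/PUT rate
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {},
		},
	})
	return nil
}

The RetryPeriod is the knob that sets the steady-state request rate against the apiserver, so relaxing it from ~2s to 10-20s cuts the configmap read/write traffic roughly tenfold.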

Comment 1 Vikas Choudhary 2019-04-29 13:22:35 UTC
https://github.com/kubernetes/kubernetes/pull/77204
https://github.com/kubernetes-sigs/controller-runtime/pull/412

Currently the controller-runtime repo runs leader election with hard-coded, very aggressive values. With the above PRs, the leader-election timings become configurable. We would then pass longer durations from cluster-autoscaler-operator using the options those PRs add.
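As an illustration, this is roughly how the operator could pass those durations once the controller-runtime options land. The field names follow what the PRs propose (pointer durations on manager.Options); the election namespace and ID are placeholders, not the operator's actual values.

package main

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// newManager builds a controller-runtime manager with relaxed leader-election
// timings, assuming the *time.Duration options proposed in the PRs above.
func newManager() (manager.Manager, error) {
	leaseDuration := 120 * time.Second
	renewDeadline := 90 * time.Second
	retryPeriod := 20 * time.Second

	return manager.New(config.GetConfigOrDie(), manager.Options{
		LeaderElection:          true,
		LeaderElectionNamespace: "openshift-machine-api",              // illustrative
		LeaderElectionID:        "cluster-autoscaler-operator-leader", // illustrative
		LeaseDuration:           &leaseDuration,
		RenewDeadline:           &renewDeadline,
		RetryPeriod:             &retryPeriod,
	})
}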

Comment 2 Vikas Choudhary 2019-05-01 04:13:02 UTC
Until the upstream controller-runtime PR merges, this is a stop-gap workaround: https://github.com/openshift/cluster-autoscaler-operator/pull/96

Comment 3 Michael Gugino 2019-05-01 19:22:19 UTC
Locally patched in vendor.

Comment 5 Michael Gugino 2019-05-01 21:35:06 UTC
Query from Clayton: topk(20, sum without (instance) (rate(apiserver_request_count[5m])))

Comment 6 Mike Fiedler 2019-05-03 15:00:30 UTC
Verified on 4.1.0-0.nightly-2019-05-03-093152. The request rate is roughly 10x lower:

{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="GET"}	0.05185185185185186
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="PUT"}	0.05185185185185186

Comment 7 Mike Fiedler 2019-05-03 15:00:58 UTC
rate(apiserver_request_count{client=~"cluster-autoscaler-operator.*"}[5m])

Comment 9 errata-xmlrpc 2019-06-04 10:48:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758