Bug 1703232
| Summary: | High read and write rate from cluster-autoscaler-operator | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Cloud Compute | Assignee: | Vikas Choudhary <vichoudh> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Fiedler <mifiedle> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | agarcial, mgugino, mifiedle, vichoudh |
| Target Milestone: | --- | Keywords: | BetaBlocker |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:48:02 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
https://github.com/kubernetes/kubernetes/pull/77204
https://github.com/kubernetes-sigs/controller-runtime/pull/412

Currently the controller-runtime repo runs leader election with hard-coded, very aggressive values. With the above PRs, the leader-election configuration will become configurable; we would then pass longer durations from cluster-autoscaler-operator using the options those PRs add.

Meanwhile, until the upstream PR merges in controller-runtime, this is a stop-gap/workaround fix: https://github.com/openshift/cluster-autoscaler-operator/pull/96 (locally patched in vendor).

Query from Clayton:

topk(20, sum without (instance) (rate(apiserver_request_count[5m])))

Verified on 4.1.0-0.nightly-2019-05-03-093152. Lower by 10x:
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="GET"} 0.05185185185185186
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",code="200",contentType="application/json",endpoint="https",instance="172.31.128.217:6443",job="apiserver",namespace="default",resource="configmaps",scope="namespace",service="kubernetes",verb="PUT"} 0.05185185185185186
rate(apiserver_request_count{client=~"cluster-autoscaler-operator.*"}[5m])
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="GET"} 0.5
{client="cluster-autoscaler-operator/v0.0.0 (linux/amd64) kubernetes/$Format",resource="configmaps",scope="namespace",verb="PUT"} 0.5

This looks like you have an overly aggressive leader election (2 seconds?). Operators should be on 10-20s refresh intervals (with 90s-120s timeouts) because handoff is not important. Please correct your tuning before GA, because this drives write load to the cluster.