Bug 1777189
Summary: | e2e: promQL query: openshift_build_total{phase="Complete"} >= 0 had reported incorrect results: model.Vector{} | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
Component: | Build | Assignee: | W. Trevor King <wking> |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 4.4 | CC: | alegrand, anpicker, aos-bugs, erooth, kakkoyun, lcosic, mloibl, pkrupa, spasquie, surbania, wzheng |
Target Milestone: | --- | ||
Target Release: | 4.4.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Last Closed: | 2020-05-04 11:17:42 UTC | Type: | Bug |
Description
W. Trevor King
2019-11-27 05:15:42 UTC
Looks like this affects 4.4 [1], but not 4.3 [2].

[1]: https://search.svc.ci.openshift.org/chart?name=%5erelease-openshift-ocp-.*4.4$&search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector
[2]: https://search.svc.ci.openshift.org/chart?name=%5erelease-openshift-ocp-.*4.3$&search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector

It might be related to https://github.com/openshift/origin/pull/24117

In particular, runQueries() used to retry failed queries a significant number of times (e.g. 240) [1], but now it only retries five times [2]. Given that it takes time for a metric to be collected by Prometheus (e.g. up to 30 seconds), I think that queries should be retried with back-off instead of immediately.

[1] https://github.com/openshift/origin/pull/24117/files#diff-65caacaafc03301b4bd2a5a01a96bb65L104
[2] https://github.com/openshift/origin/pull/24117/files#diff-68b78becedd34dfeef78260ab7c23952R37

The build controller has its own custom metric; it was recently refactored to work with k8s 1.16 metrics [1]. Per Simon, it seems that the new retry logic limit is too short.

[1] https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/metrics/prometheus/metrics.go

Looks good. From [1], the last match was [2], launched before the PR landed.

[1]: https://search.svc.ci.openshift.org/chart?search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24226/pull-ci-openshift-origin-master-e2e-aws-fips/259

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581
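
For illustration, the suggestion made earlier in this thread (retry the query with back-off rather than immediately) could look roughly like the Go sketch below. It is not the code from the origin repository or from the fix that landed. The names retryWithBackoff, queryFunc, succeedAfter and errNoSamples are hypothetical, and the real runQueries() helper may be structured quite differently. The sketch only shows the idea: sleep between attempts and double the delay up to a cap, so that a metric which Prometheus has not scraped yet has time to appear.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNoSamples marks a query that ran but came back empty, like the
// model.Vector{} in the failure message. (Hypothetical name.)
var errNoSamples = errors.New("query returned no samples")

// queryFunc is a hypothetical stand-in for the code that sends one PromQL
// query and reports whether the result met the test's expectation.
type queryFunc func(query string) (ok bool, err error)

// retryWithBackoff calls run until it succeeds or the attempt budget is
// spent. It sleeps between attempts, doubling the delay each time up to
// maxDelay. It returns the number of attempts made.
func retryWithBackoff(query string, run queryFunc, attempts int, initial, maxDelay time.Duration) (int, error) {
	if attempts < 1 {
		return 0, errors.New("attempts must be at least 1")
	}
	delay := initial
	var lastErr error
	for i := 1; i <= attempts; i++ {
		ok, err := run(query)
		if err == nil && ok {
			return i, nil
		}
		lastErr = err
		if err == nil {
			lastErr = errNoSamples
		}
		if i < attempts {
			// Wait before the next attempt instead of retrying immediately.
			time.Sleep(delay)
			delay *= 2
			if delay > maxDelay {
				delay = maxDelay
			}
		}
	}
	return attempts, fmt.Errorf("query %q failed after %d attempts: %w", query, attempts, lastErr)
}

// succeedAfter returns a fake queryFunc whose result stays empty until the
// n-th call, imitating a metric that shows up only after a scrape.
func succeedAfter(n int) queryFunc {
	calls := 0
	return func(string) (bool, error) {
		calls++
		return calls >= n, nil
	}
}
```

With an assumed initial delay of 1 second and a cap of 16 seconds, five attempts sleep for 1 + 2 + 4 + 8 = 15 seconds in total, which is still shorter than the 30-second collection window mentioned above. Eight attempts sleep for 1 + 2 + 4 + 8 + 16 + 16 + 16 = 63 seconds. These values are examples only and are not taken from the actual fix.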