Bug 1777189

Summary: e2e: promQL query: openshift_build_total{phase="Complete"} >= 0 had reported incorrect results: model.Vector{}
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Build
Assignee: W. Trevor King <wking>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: unspecified
Version: 4.4
CC: alegrand, anpicker, aos-bugs, erooth, kakkoyun, lcosic, mloibl, pkrupa, spasquie, surbania, wzheng
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-04 11:17:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description W. Trevor King 2019-11-27 05:15:42 UTC
Errors like [1]:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:156]: Expected
    <map[string]error | len:1>: {
        "openshift_build_total{phase=\"Complete\"} >= 0": {
            s: "promQL query: openshift_build_total{phase=\"Complete\"} >= 0 had reported incorrect results: model.Vector{}",
        },
    }
to be empty
...
failed: (1m4s) 2019-11-26T23:05:24 "[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]"

This error occurred in 76 of our e2e runs in CI today (8% of all failed jobs) [2], which is a pretty high flake rate.

I'm not sure if this is a Monitoring thing because of the PromQL or a Build thing because of [Feature:Builds].  Going with Monitoring, because Berlin wakes up early, but obviously feel free to reassign if I'm guessing wrong ;). 

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/73
[2]: https://search.svc.ci.openshift.org/chart?search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector

Comment 3 Simon Pasquier 2019-11-27 09:09:44 UTC
It might be related to https://github.com/openshift/origin/pull/24117

In particular, runQueries() used to retry failed queries a significant number of times (e.g. 240) [1], but now it retries only five times [2]. Given that it takes time for Prometheus to collect a metric (e.g. up to 30 seconds), I think that queries should be retried with back-off instead of immediately.

[1] https://github.com/openshift/origin/pull/24117/files#diff-65caacaafc03301b4bd2a5a01a96bb65L104
[2] https://github.com/openshift/origin/pull/24117/files#diff-68b78becedd34dfeef78260ab7c23952R37
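
For illustration, retrying with back-off could look roughly like the sketch below. This is not the actual runQueries() code from openshift/origin; the helper name retryWithBackoff, the errEmptyVector sentinel, and the attempt count and delays are all assumptions made up for the example. The idea is only that the delay between attempts doubles (up to a cap), so a handful of retries can span the time Prometheus needs to collect the metric.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errEmptyVector is a hypothetical sentinel standing in for
// "the query returned model.Vector{}".
var errEmptyVector = errors.New("empty result vector")

// retryWithBackoff (hypothetical helper) calls query until it succeeds or
// the attempts run out. It sleeps between attempts and doubles the delay
// each time, capped at maxDelay.
func retryWithBackoff(attempts int, initial, maxDelay time.Duration, query func() error) error {
	var err error
	delay := initial
	for i := 0; i < attempts; i++ {
		if err = query(); err == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point sleeping after the last attempt
		}
		time.Sleep(delay)
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("query failed after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated query: the metric "shows up" on the third call.
	err := retryWithBackoff(5, 10*time.Millisecond, 80*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errEmptyVector
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

With real values (say, an initial delay of a few seconds), five attempts would cover well over 30 seconds, whereas five immediate retries cover almost none.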

Comment 5 Adam Kaplan 2019-12-02 14:01:02 UTC
The build controller has its own custom metric - it was recently refactored to work with k8s 1.16 metrics [1].
Per Simon, it seems that the new retry limit is too low.


[1] https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/metrics/prometheus/metrics.go

Comment 10 errata-xmlrpc 2020-05-04 11:17:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581