Bug 1777189 - e2e: promQL query: openshift_build_total{phase="Complete"} >= 0 had reported incorrect results: model.Vector{}
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.4.0
Assignee: W. Trevor King
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-11-27 05:15 UTC by W. Trevor King
Modified: 2020-05-04 11:18 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:17:42 UTC
Target Upstream Version:




Links
System ID Priority Status Summary Last Updated
Github openshift origin pull 24248 'None' closed Bug 1777189: test/extended/prometheus/prometheus_builds: Wait up to 40s 2020-04-21 13:23:24 UTC
Red Hat Product Errata RHBA-2020:0581 None None None 2020-05-04 11:18:08 UTC

Description W. Trevor King 2019-11-27 05:15:42 UTC
Errors like [1]:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:156]: Expected
    <map[string]error | len:1>: {
        "openshift_build_total{phase=\"Complete\"} >= 0": {
            s: "promQL query: openshift_build_total{phase=\"Complete\"} >= 0 had reported incorrect results: model.Vector{}",
        },
    }
to be empty
...
failed: (1m4s) 2019-11-26T23:05:24 "[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]"

This error occurred in 76 of our e2e runs in CI today (8% of all failed jobs) [2], which is a pretty high flake rate.

I'm not sure if this is a Monitoring thing because of the PromQL or a Build thing because of [Feature:Builds].  Going with Monitoring, because Berlin wakes up early, but obviously feel free to reassign if I'm guessing wrong ;). 

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/73
[2]: https://search.svc.ci.openshift.org/chart?search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector

Comment 3 Simon Pasquier 2019-11-27 09:09:44 UTC
It might be related to https://github.com/openshift/origin/pull/24117

In particular, runQueries() used to retry failed queries a significant number of times (e.g. 240) [1], but now it retries only five times [2]. Given that it takes time for Prometheus to collect a metric (e.g. up to 30 seconds), I think that queries should be retried with back-off instead of immediately.

[1] https://github.com/openshift/origin/pull/24117/files#diff-65caacaafc03301b4bd2a5a01a96bb65L104
[2] https://github.com/openshift/origin/pull/24117/files#diff-68b78becedd34dfeef78260ab7c23952R37
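
To illustrate the back-off suggestion above, here is a minimal sketch in Go. It is not the code from the linked pull requests: queryFunc, backoff, defaultBackoff, noSleep and retryWithBackoff are hypothetical names invented for this example, and the query is reduced to a function returning the number of samples in the result vector. The default values are an assumption, chosen so that the total wait (1+2+4+8+16 = 31 seconds) covers the roughly 30 seconds a metric may need to be collected.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errEmptyResult stands in for the "incorrect results: model.Vector{}" failure.
var errEmptyResult = errors.New("query returned an empty vector")

// queryFunc is a hypothetical stand-in for one PromQL query attempt.
// It returns the number of samples in the result vector.
type queryFunc func() (int, error)

// backoff describes the retry schedule (all fields are illustrative).
type backoff struct {
	initial  time.Duration // delay after the first failed attempt
	max      time.Duration // upper bound for a single delay
	attempts int           // total number of query attempts
}

// defaultBackoff sleeps 1s, 2s, 4s, 8s and 16s between six attempts.
var defaultBackoff = backoff{initial: time.Second, max: 16 * time.Second, attempts: 6}

// noSleep lets callers run the schedule without actually waiting.
func noSleep(time.Duration) {}

// retryWithBackoff calls query until it returns at least one sample or the
// attempts are exhausted. Between attempts it sleeps, doubling the delay each
// time up to cfg.max. It returns the sample count and the total time slept.
func retryWithBackoff(query queryFunc, cfg backoff, sleep func(time.Duration)) (int, time.Duration, error) {
	var waited time.Duration
	delay := cfg.initial
	lastErr := errEmptyResult
	for attempt := 1; attempt <= cfg.attempts; attempt++ {
		n, err := query()
		if err == nil && n > 0 {
			return n, waited, nil
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = errEmptyResult
		}
		if attempt == cfg.attempts {
			break // no point in sleeping after the final attempt
		}
		sleep(delay)
		waited += delay
		delay *= 2
		if delay > cfg.max {
			delay = cfg.max
		}
	}
	return 0, waited, fmt.Errorf("gave up after %d attempts: %w", cfg.attempts, lastErr)
}
```

A caller would pass time.Sleep as the last argument in a real test run and noSleep when only checking the schedule. Passing the sleep function as a parameter is a design choice for this sketch, so that the retry logic can be exercised without waiting.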

Comment 5 Adam Kaplan 2019-12-02 14:01:02 UTC
The build controller has its own custom metric - it was recently refactored to work with k8s 1.16 metrics [1].
Per Simon, it seems that the retry limit in the new logic is too low.


[1] https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/metrics/prometheus/metrics.go

Comment 10 errata-xmlrpc 2020-05-04 11:17:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

