Bug 1777189

Summary:	e2e: promQL query: openshift_build_total{phase="Complete"} >= 0 had reported incorrect results: model.Vector{}
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Build	Assignee:	W. Trevor King <wking>
Status:	CLOSED ERRATA	QA Contact:	Junqi Zhao <juzhao>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.4	CC:	alegrand, anpicker, aos-bugs, erooth, kakkoyun, lcosic, mloibl, pkrupa, spasquie, surbania, wzheng
Target Milestone:	---
Target Release:	4.4.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-05-04 11:17:42 UTC	Type:	Bug

Description W. Trevor King 2019-11-27 05:15:42 UTC

Errors like [1]:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:156]: Expected
    <map[string]error | len:1>: {
        "openshift_build_total{phase=\"Complete\"} >= 0": {
            s: "promQL query: openshift_build_total{phase=\"Complete\"} >= 0 had reported incorrect results: model.Vector{}",
        },
    }
to be empty
...
failed: (1m4s) 2019-11-26T23:05:24 "[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]"

This error occurred in 76 of our e2e runs in CI today (8% of all failed jobs) [2], which is a pretty high flake rate.

I'm not sure if this is a Monitoring thing because of the PromQL or a Build thing because of [Feature:Builds].  Going with Monitoring, because Berlin wakes up early, but obviously feel free to reassign if I'm guessing wrong ;). 

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/73
[2]: https://search.svc.ci.openshift.org/chart?search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector

Comment 1 W. Trevor King 2019-11-27 05:23:52 UTC

Looks like this affects 4.4 [1], but not 4.3 [2].

[1]: https://search.svc.ci.openshift.org/chart?name=%5erelease-openshift-ocp-.*4.4$&search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector
[2]: https://search.svc.ci.openshift.org/chart?name=%5erelease-openshift-ocp-.*4.3$&search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector

Comment 3 Simon Pasquier 2019-11-27 09:09:44 UTC

It might be related to https://github.com/openshift/origin/pull/24117

In particular, runQueries() used to retry failed queries a large number of times (e.g. 240) [1], but now it only retries five times [2]. Since Prometheus can take a while to collect a metric (e.g. up to 30 seconds), I think queries should be retried with back-off rather than immediately.

[1] https://github.com/openshift/origin/pull/24117/files#diff-65caacaafc03301b4bd2a5a01a96bb65L104
[2] https://github.com/openshift/origin/pull/24117/files#diff-68b78becedd34dfeef78260ab7c23952R37

Comment 5 Adam Kaplan 2019-12-02 14:01:02 UTC

The build controller has its own custom metric - it was recently refactored to work with k8s 1.16 metrics [1].
Per Simon, it seems that the new retry logic limit is too short.


[1] https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/metrics/prometheus/metrics.go

Comment 6 W. Trevor King 2019-12-03 14:56:01 UTC

Looks good.  From [1], the last match was [2], launched before the PR landed.

[1]: https://search.svc.ci.openshift.org/chart?search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector
[2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24226/pull-ci-openshift-origin-master-e2e-aws-fips/259

Comment 10 errata-xmlrpc 2020-05-04 11:17:42 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581