Errors like [1]: fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:156]: Expected <map[string]error | len:1>: { "openshift_build_total{phase=\"Complete\"} >= 0": { s: "promQL query: openshift_build_total{phase=\"Complete\"} >= 0 had reported incorrect results: model.Vector{}", }, } to be empty ... failed: (1m4s) 2019-11-26T23:05:24 "[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]" This error occurred in 76 of our e2e runs in CI today (8% of all failed jobs) [2], which is a pretty high flake rate. I'm not sure whether this is a Monitoring issue because of the PromQL or a Build issue because of [Feature:Builds]. Going with Monitoring, because Berlin wakes up early, but obviously feel free to reassign if I'm guessing wrong ;). [1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/73 [2]: https://search.svc.ci.openshift.org/chart?search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector
Looks like this affects 4.4 [1], but not 4.3 [2]. [1]: https://search.svc.ci.openshift.org/chart?name=%5erelease-openshift-ocp-.*4.4$&search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector [2]: https://search.svc.ci.openshift.org/chart?name=%5erelease-openshift-ocp-.*4.3$&search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector
It might be related to https://github.com/openshift/origin/pull/24117. In particular, runQueries() used to retry failed queries a large number of times (e.g. 240) [1], but now it retries only five times [2]. Given that it takes time for a metric to be collected by Prometheus (e.g. up to 30 seconds), I think that queries should be retried with back-off rather than immediately, so that the retries span at least one collection cycle. [1] https://github.com/openshift/origin/pull/24117/files#diff-65caacaafc03301b4bd2a5a01a96bb65L104 [2] https://github.com/openshift/origin/pull/24117/files#diff-68b78becedd34dfeef78260ab7c23952R37
The build controller has its own custom metric, which was recently refactored to work with k8s 1.16 metrics [1]. Per Simon's comment, it seems that the new retry limit is too low. [1] https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/metrics/prometheus/metrics.go
Looks good. From [1], the last match was [2], launched before the PR landed. [1]: https://search.svc.ci.openshift.org/chart?search=promQL%20query:%20openshift_build_total.*Complete.*reported%20incorrect%20results.*Vector [2]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/24226/pull-ci-openshift-origin-master-e2e-aws-fips/259
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581