Bug 1977095 - telemetry count test failing in release-openshift-origin-installer-old-rhcos-e2e-aws-4.7
Summary: telemetry count test failing in release-openshift-origin-installer-old-rhcos-...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.z
Assignee: Prashant Balachandran
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-06-28 21:41 UTC by Ben Parees
Modified: 2021-07-28 05:51 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
job=release-openshift-origin-installer-old-rhcos-e2e-aws-4.7=all
Last Closed: 2021-07-28 05:51:20 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2021:2767 0 None None None 2021-07-28 05:51:35 UTC

Description Ben Parees 2021-06-28 21:41:06 UTC
The job
release-openshift-origin-installer-old-rhcos-e2e-aws-4.7

is always failing in CI; see the testgrid results:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-origin-installer-old-rhcos-e2e-aws-4.7

One of the consistently failing tests is:


[sig-instrumentation][Late] Alerts shouldn't exceed the 500 series limit of total series sent via telemetry from each cluster [Suite:openshift/conformance/parallel]

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "max_over_time(cluster:telemetry_selected_series:count[2h]) >= 500": {
            s: "promQL query: max_over_time(cluster:telemetry_selected_series:count[2h]) >= 500 had reported incorrect results:\n[{\"metric\":{},\"value\":[1624748035.001,\"514\"]}]",
        },
    }
to be empty


sample job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.7/1408892345029496832

Either the test needs to raise the limit or we need to reduce our metric time series count. It would also be useful to understand why this fails with old rhcos but presumably not with current rhcos.
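For reference, the failure above comes down to a threshold comparison on the value in the query result. The following is a minimal sketch of that comparison, not the code in openshift/origin: the `sample` type and the `seriesOverLimit` helper are hypothetical names introduced here, and the input format is assumed to match the JSON fragment quoted in the failure message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// sample mirrors one element of the result as printed in the failure message:
// {"metric":{},"value":[<unix timestamp>,"<value as string>"]}.
// Hypothetical type, for illustration only.
type sample struct {
	Metric map[string]string  `json:"metric"`
	Value  [2]json.RawMessage `json:"value"`
}

// seriesOverLimit (hypothetical helper) decodes the JSON result and returns
// every sample value that is at or above limit, matching the ">= limit"
// comparison used in the query.
func seriesOverLimit(raw string, limit float64) ([]float64, error) {
	var samples []sample
	if err := json.Unmarshal([]byte(raw), &samples); err != nil {
		return nil, fmt.Errorf("decoding result: %w", err)
	}
	var over []float64
	for _, s := range samples {
		// The second element of "value" is the sample value, encoded as a string.
		var text string
		if err := json.Unmarshal(s.Value[1], &text); err != nil {
			return nil, fmt.Errorf("decoding sample value: %w", err)
		}
		v, err := strconv.ParseFloat(text, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing sample value %q: %w", text, err)
		}
		if v >= limit {
			over = append(over, v)
		}
	}
	return over, nil
}
```

Calling `seriesOverLimit` with the result quoted above and a limit of 500 returns the single value 514, which is why the test reports a non-empty error map.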

Comment 1 Simon Pasquier 2021-06-29 08:10:05 UTC
Hmm, this is weird, because the limit has been increased to 600 series in 4.7 [1] while 4.6 still has the 500 limit [2]. Does that mean the release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 job uses the release-4.6 branch of openshift/origin?

Either way, we need to fix the title of the test, because "... the 500 series limit ..." isn't accurate.

[1] https://github.com/openshift/origin/blob/5013124a4cb27df4b199aabb5812ec0fc1184196/test/extended/prometheus/prometheus.go#L111-L118
[2] https://github.com/openshift/origin/blob/f629c90891c0c7e49dbcc2a5fb44a177712fcfd8/test/extended/prometheus/prometheus.go#L103-L107

Comment 2 Ben Parees 2021-06-29 16:31:45 UTC
The job appears to use the 4.6 tests deliberately:

https://github.com/openshift/release/blob/f572056645f7536ac91857204edfaef8088f1766/ci-operator/jobs/openshift/release/openshift-release-release-4.7-periodics.yaml#L570

so you'll need to backport the change to 4.6 if that's appropriate.
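
To make the mismatch concrete, here is a small sketch using only the numbers quoted in this bug (514 series reported, a limit of 500 in the 4.6 tests, 600 in the 4.7 tests). The map and the `exceedsLimit` helper are hypothetical, and the sketch assumes the 4.7 tests use the same ">=" comparison as the 4.6 query shown in the description.

```go
package main

// seriesLimitByTestBranch holds the limits quoted in Comment 1.
// Hypothetical lookup table, for illustration only.
var seriesLimitByTestBranch = map[string]int{
	"release-4.6": 500,
	"release-4.7": 600,
}

// exceedsLimit (hypothetical helper) reports whether count trips the
// ">= limit" check for the given test branch. The second return value is
// false when the branch is not in the table.
func exceedsLimit(testBranch string, count int) (exceeded bool, known bool) {
	limit, ok := seriesLimitByTestBranch[testBranch]
	if !ok {
		return false, false
	}
	return count >= limit, true
}
```

With a count of 514, `exceedsLimit("release-4.6", 514)` reports an exceeded limit while `exceedsLimit("release-4.7", 514)` does not, which matches a 4.7 cluster failing only because it is checked by the 4.6 tests.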

Comment 7 errata-xmlrpc 2021-07-28 05:51:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.40 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2767

