Bug 1977095

Summary: telemetry count test failing in release-openshift-origin-installer-old-rhcos-e2e-aws-4.7
Product: OpenShift Container Platform
Component: Monitoring
Version: 4.7
Target Release: 4.6.z
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Ben Parees <bparees>
Assignee: Prashant Balachandran <pnair>
QA Contact: Junqi Zhao <juzhao>
CC: alegrand, anpicker, aos-bugs, erooth, kakkoyun, pkrupa, sippy, spasquie
Doc Type: No Doc Update
Environment: job=release-openshift-origin-installer-old-rhcos-e2e-aws-4.7=all
Type: Bug
Last Closed: 2021-07-28 05:51:20 UTC

Description Ben Parees 2021-06-28 21:41:06 UTC
job:
release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 

is always failing in CI, see testgrid results:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#release-openshift-origin-installer-old-rhcos-e2e-aws-4.7

One of the consistently failing tests is:


[sig-instrumentation][Late] Alerts shouldn't exceed the 500 series limit of total series sent via telemetry from each cluster [Suite:openshift/conformance/parallel]

fail [github.com/openshift/origin/test/extended/util/prometheus/helpers.go:174]: Expected
    <map[string]error | len:1>: {
        "max_over_time(cluster:telemetry_selected_series:count[2h]) >= 500": {
            s: "promQL query: max_over_time(cluster:telemetry_selected_series:count[2h]) >= 500 had reported incorrect results:\n[{\"metric\":{},\"value\":[1624748035.001,\"514\"]}]",
        },
    }
to be empty
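
To spell out why the output above counts as a failure: the query itself contains the `>= 500` comparison, so Prometheus returns a sample only when the threshold is reached, and the test expects the result to be empty. The Go sketch below decodes the JSON vector quoted in the failure message and lists the values it contains. The names `sample` and `violations` are hypothetical and are not taken from openshift/origin; this is an illustration of the check, not the test's actual implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// sample mirrors one element of a Prometheus instant-vector result:
// {"metric": {...}, "value": [<unix timestamp>, "<value as string>"]}.
// The type name is hypothetical.
type sample struct {
	Metric map[string]string `json:"metric"`
	Value  [2]interface{}    `json:"value"`
}

// violations decodes a JSON vector and returns the sample values it contains.
// For a query ending in ">= 500", any returned sample means the limit was hit.
func violations(raw string) ([]float64, error) {
	var samples []sample
	if err := json.Unmarshal([]byte(raw), &samples); err != nil {
		return nil, err
	}
	out := make([]float64, 0, len(samples))
	for _, s := range samples {
		// Prometheus encodes the sample value as a JSON string.
		text, ok := s.Value[1].(string)
		if !ok {
			return nil, fmt.Errorf("unexpected value type %T", s.Value[1])
		}
		v, err := strconv.ParseFloat(text, 64)
		if err != nil {
			return nil, err
		}
		out = append(out, v)
	}
	return out, nil
}

func main() {
	// Result copied from the failure message in this report.
	raw := `[{"metric":{},"value":[1624748035.001,"514"]}]`
	got, err := violations(raw)
	if err != nil {
		panic(err)
	}
	// A non-empty result is what the test reports as "incorrect results".
	fmt.Printf("%d sample(s) returned, values %v\n", len(got), got)
}
```

An empty vector (`[]`) would yield no values, which is the outcome the test expects.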


sample job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-old-rhcos-e2e-aws-4.7/1408892345029496832

Either the test needs to raise the limit or we need to reduce our metric time series count. (It would also be useful to understand why this fails with old RHCOS but presumably not with current RHCOS.)

Comment 1 Simon Pasquier 2021-06-29 08:10:05 UTC
Hmm, this is weird because the limit was increased to 600 series in 4.7 [1], while 4.6 still has the 500 limit [2]. Would that mean the release-openshift-origin-installer-old-rhcos-e2e-aws-4.7 job uses the release-4.6 branch of openshift/origin?

Anyway, we need to fix the title of the test because "... the 500 series limit ..." isn't accurate.

[1] https://github.com/openshift/origin/blob/5013124a4cb27df4b199aabb5812ec0fc1184196/test/extended/prometheus/prometheus.go#L111-L118
[2] https://github.com/openshift/origin/blob/f629c90891c0c7e49dbcc2a5fb44a177712fcfd8/test/extended/prometheus/prometheus.go#L103-L107
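
The mismatch can be sketched in Go as follows. The limits (500 for 4.6, 600 for 4.7) and the observed count of 514 come from this report; the map, the function names, and the idea of generating the title from the limit are hypothetical and only illustrate how the query and the title could be derived from one value so they cannot drift apart.

```go
package main

import "fmt"

// seriesLimit holds the per-branch limits mentioned in this report.
// The map itself is hypothetical; the real tests hard-code their limit.
var seriesLimit = map[string]int{
	"4.6": 500,
	"4.7": 600,
}

// telemetryQuery builds the threshold query for a given limit.
func telemetryQuery(limit int) string {
	return fmt.Sprintf("max_over_time(cluster:telemetry_selected_series:count[2h]) >= %d", limit)
}

// testTitle derives the test title from the same limit, so the number in
// the title always matches the number in the query (hypothetical helper).
func testTitle(limit int) string {
	return fmt.Sprintf("Alerts shouldn't exceed the %d series limit of total series sent via telemetry from each cluster", limit)
}

// exceeds reports whether an observed series count would make the
// threshold query return a sample, i.e. fail the test.
func exceeds(observed, limit int) bool {
	return observed >= limit
}

func main() {
	const observed = 514 // value from the failure message
	for _, branch := range []string{"4.6", "4.7"} {
		limit := seriesLimit[branch]
		fmt.Printf("%s tests: %q\n  title: %s\n  fails: %t\n",
			branch, telemetryQuery(limit), testTitle(limit), exceeds(observed, limit))
	}
}
```

With these numbers, a count of 514 fails the 4.6 check and passes the 4.7 one, which matches the failure seen when 4.6 tests run against a 4.7 cluster.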

Comment 2 Ben Parees 2021-06-29 16:31:45 UTC
The job appears to use the 4.6 tests deliberately:

https://github.com/openshift/release/blob/f572056645f7536ac91857204edfaef8088f1766/ci-operator/jobs/openshift/release/openshift-release-release-4.7-periodics.yaml#L570

so you'll need to backport the change to 4.6 if that's appropriate.

Comment 7 errata-xmlrpc 2021-07-28 05:51:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.40 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2767