Bug 1683057 - [ci] [aws] haproxy should expose prometheus metrics for a route
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.3.0
Assignee: Dan Mace
QA Contact: Hongan Li
URL:
Whiteboard:
Duplicates: 1768907
Depends On:
Blocks:
Reported: 2019-02-26 07:56 UTC by Ravi Sankar
Modified: 2022-08-04 22:20 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:03:45 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24075 0 'None' closed Bug 1683057: Fix router metrics test flakes 2020-05-26 07:36:18 UTC
Red Hat Product Errata RHBA-2020:0062 0 None None None 2020-01-23 11:03:59 UTC

Description Ravi Sankar 2019-02-26 07:56:28 UTC
Description of problem:
openshift-tests [Conformance][Area:Networking][Feature:Router] The HAProxy router should expose prometheus metrics for a route [Suite:openshift/conformance/parallel/minimal] 1m28s

go run hack/e2e.go -v -test --test_args='--ginkgo.focus=openshift\-tests\s\[Conformance\]\[Area\:Networking\]\[Feature\:Router\]\sThe\sHAProxy\srouter\sshould\sexpose\sprometheus\smetrics\sfor\sa\sroute\s\[Suite\:openshift\/conformance\/parallel\/minimal\]$'

fail [github.com/openshift/origin/test/extended/router/metrics.go:205]: Expected
    <float64>: 7
to be >=
    <float64>: 13

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/4541/

https://github.com/openshift/origin/pull/22140

Comment 3 Dan Mace 2019-08-08 19:57:15 UTC
In v4, there are two router replicas by default. The routers have their own uncoordinated metrics state. The test was refactored to speak to the metrics endpoint via a Service in front of the routers. This means the metrics requests are getting load balanced to different router Service endpoints. The test was originally designed in v3 to assume communication with a single router endpoint and makes assertions about the metrics state of the single router. By abstracting communication through the Service, the test no longer knows which router it's talking to, and so sometimes gets the answers it expects, and sometimes not, resulting in flakiness.
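
The effect described above can be reproduced in miniature. The following is a hypothetical simulation, not the origin test code: `Router`, `RoundRobinService`, `observed_delta`, `run_scenario`, and the starting counter values are all invented for illustration, and round-robin balancing is assumed only to keep the outcome deterministic.

```python
from itertools import cycle


class Router:
    """Hypothetical router replica with its own, uncoordinated request counter."""

    def __init__(self, name):
        self.name = name
        self.requests_total = 0

    def handle_request(self):
        self.requests_total += 1

    def scrape_metrics(self):
        return self.requests_total


class RoundRobinService:
    """Hypothetical Service that rotates calls across replicas.

    Round-robin is an assumption made to keep the example deterministic.
    """

    def __init__(self, routers):
        self._next = cycle(routers)

    def pick(self):
        return next(self._next)


def observed_delta(pick_for_scrape, service, traffic):
    """Scrape, send `traffic` requests through the service, then scrape again."""
    before = pick_for_scrape().scrape_metrics()
    for _ in range(traffic):
        service.pick().handle_request()
    after = pick_for_scrape().scrape_metrics()
    return after - before


def run_scenario(pin_scrapes):
    """Return the counter delta a test would see, with or without pinned scrapes."""
    a, b = Router("a"), Router("b")
    a.requests_total, b.requests_total = 10, 5  # arbitrary prior traffic
    service = RoundRobinService([a, b])
    # Pinned: always scrape router "a". Unpinned: scrape whichever replica
    # the Service happens to select.
    pick = (lambda: a) if pin_scrapes else service.pick
    return observed_delta(pick, service, traffic=4)


if __name__ == "__main__":
    print("delta via Service:", run_scenario(pin_scrapes=False))
    print("delta when pinned:", run_scenario(pin_scrapes=True))
```

With these starting values, the Service-routed run reports a negative delta because the second scrape lands on the other replica, while the pinned run reports a non-negative one. Note that the pinned scrape still counts only the share of traffic that reached that replica.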

Comment 5 Dan Mace 2019-08-14 19:17:46 UTC
The effort to refactor the test is greater than expected. During the course of debugging the test, I think we gained reasonable confidence that:

1. The feature itself is working well, but the test is flawed
2. Lack of the test hasn't caused us any serious problems yet
3. The router code is stable enough that this particular test isn't critical to the release

For now I'm going to reprioritize the bug to low and if we don't get back to this soon I'll move it out to 4.3.

Comment 7 Gabe Montero 2019-11-05 14:14:07 UTC
*** Bug 1768907 has been marked as a duplicate of this bug. ***

Comment 9 Dan Mace 2019-11-06 13:25:55 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/236 is not the same root cause. In this case there's something going on with the CI tooling that's causing old code to get executed during the test.

Commit with fix:

https://github.com/openshift/origin/blob/9b16f97a13a88695f114dbea8be98563a4444374/test/extended/router/metrics.go#L43

Prior commit (without fix):

https://github.com/openshift/origin/blob/ea01775e609f075f0f755396cd57f9daaafdeadc/test/extended/router/metrics.go#L43

Notice the failing line:

fail [github.com/openshift/origin/test/extended/router/metrics.go:140]: Unexpected error:
    <*errors.errorString | 0xc000290100>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred


The `metrics.go:140` reference refers to the old commit. Clayton and Steve K. are aware of this issue and are investigating. I'm going to open a new bug against test infrastructure to help clarify. If you see any more of these failures, please take a look at the `metrics.go` references and compare them with the linked commits. If the line numbers don't match up to executable code in the fixed commit, the problem is something in CI and not anything to do with the patch.
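
The comparison described above could be scripted roughly as follows. This is a sketch under stated assumptions, not part of the CI tooling: `failure_location` and `line_is_executable` are hypothetical helpers, the regular expression assumes the `fail [path:line]` shape seen in this report, and the "executable" check is only a crude heuristic that ignores `/* ... */` block comments.

```python
import re

# Assumed log shape: "fail [<import path>/<file>.go:<line>]: ..."
FAIL_LOCATION = re.compile(r"fail \[(?P<path>[^\]\s]+):(?P<line>\d+)\]")


def failure_location(log_line):
    """Return (path, line) from a 'fail [path:line]: ...' log line, or None."""
    match = FAIL_LOCATION.search(log_line)
    if match is None:
        return None
    return match.group("path"), int(match.group("line"))


def line_is_executable(source_text, line_number):
    """Rough heuristic: the line exists and is not blank or a '//' comment."""
    lines = source_text.splitlines()
    if not 1 <= line_number <= len(lines):
        return False
    stripped = lines[line_number - 1].strip()
    return bool(stripped) and not stripped.startswith("//")


if __name__ == "__main__":
    log = ("fail [github.com/openshift/origin/test/extended/router/"
           "metrics.go:140]: Unexpected error:")
    print(failure_location(log))
```

To use it, one would fetch `metrics.go` at the fixed commit by whatever means is convenient and pass its text, together with the parsed line number, to `line_is_executable`. A `False` result suggests the failure came from older code.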

Sorry for the confusion!

Comment 10 Dan Mace 2019-11-06 13:30:30 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/236 actually executed on the same day the fix merged, so I'm not sure it's going to be useful to Clayton and Steve's investigation. I suspect they're more interested in reoccurrences days after the merge of the fix. I'll bring it to their attention before filing a new Test Infra bug.

Please do make sure to closely inspect any perceived additional failures to make sure the problem isn't the test infra issue.

Thank you!

Comment 13 errata-xmlrpc 2020-01-23 11:03:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

Comment 14 Devan Goodwin 2020-03-19 17:21:42 UTC
I think we're seeing this again, rarely:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/1149

[Conformance][Area:Networking][Feature:Router] The HAProxy router should expose prometheus metrics for a route [Suite:openshift/conformance/parallel/minimal] 1m9s
fail [github.com/openshift/origin/test/extended/router/metrics.go:210]: Expected
    <float64>: 1073
to be >=
    <float64>: 1242




And again back on March 1: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.3/1058

[Conformance][Area:Networking][Feature:Router] The HAProxy router should expose prometheus metrics for a route [Suite:openshift/conformance/parallel/minimal] 1m1s
fail [github.com/openshift/origin/test/extended/router/metrics.go:210]: Expected
    <float64>: 672
to be >=
    <float64>: 824



Should this be reopened?

Comment 15 Dan Mace 2020-03-19 17:44:41 UTC
(In reply to Devan Goodwin from comment #14)
> Should this be reopened?

Maybe, but if it happens only rarely and only in 4.3, I think it should be opened against a future release at low priority.

