Bug 1835371

Summary:	[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route
Product:	OpenShift Container Platform	Reporter:	Ben Parees <bparees>
Component:	Networking	Assignee:	aos-network-edge-staff <aos-network-edge-staff>
Networking sub component:	router	QA Contact:	Arvind iyengar <aiyengar>
Status:	CLOSED WORKSFORME	Docs Contact:
Severity:	medium
Priority:	medium	CC:	amcdermo, aos-bugs, bbennett, bleanhar, bperkins, cholman, dgoodwin, gspence, hongli, mmasters, stbenjam, wking
Version:	4.5
Target Milestone:	---
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:
Clones:	1857409		Environment:	[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route
Last Closed:	2022-08-02 00:57:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1846507, 1857409

Description Ben Parees 2020-05-13 17:16:33 UTC

test:
[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route 

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route


https://sippy-bparees.svc.ci.openshift.org/?release=4.5#TopFailingTests

It is only passing 86% of the time.

Comment 1 Andrew McDermott 2020-05-19 15:22:56 UTC

Moving to 4.6.

Comment 2 Clayton Coleman 2020-05-21 15:56:15 UTC

This is now flaking heavily.  This started failing 5/11.  I think this is a regression in the product.  Moving back to 4.5 and bumping severity.

Comment 3 Andrew McDermott 2020-05-22 17:18:29 UTC

I looked at this today but was unable to reproduce it on GCP. I suspect I need to run more than just the single test, though Steve Greene mentioned he was able to reproduce it about 1 in 10 times on AWS yesterday.

The following run hints at a timing issue or race, because one of the endpoints had at least 1 session recorded:

  https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/558/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws/2554


fail [github.com/openshift/origin/test/extended/router/metrics.go:189]: Expected
    <[]float64 | len:2, cap:2>: [1, 0]
to consist of
    <[]interface {} | len:2, cap:2>: [
        {Comparator: ">", CompareTo: [0]},
        {Comparator: ">", CompareTo: [0]},
    ]
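
Read literally, the failure above says the test collected two values, [1, 0], and required each of them to be greater than 0. A minimal Python sketch of that check follows. It assumes the two values are per-endpoint counts, the function name is hypothetical, and it is not the Go/Gomega code in metrics.go:

```python
def all_greater_than_zero(values, expected_count=2):
    """Return True only if exactly `expected_count` values were collected
    and every one of them is strictly positive."""
    return len(values) == expected_count and all(v > 0 for v in values)


# Values from the failing run: one endpoint recorded a session, the other did not.
observed = [1, 0]
print(all_greater_than_zero(observed))  # False, so the assertion fails
```

A single zero is enough to fail the check, which is why a value sampled slightly too early (or reset just before sampling) shows up as a flake.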

Comment 4 Andrew McDermott 2020-05-27 16:12:21 UTC

Moving to 4.6 as this is not a release blocker or an upgrade blocker. It is a CI blocker, because the flake rate is high, and should also be considered for a backport.

Comment 5 Andrew McDermott 2020-05-27 17:09:43 UTC

I was able to periodically reproduce this today by running the single test (on AWS) while simultaneously starting the full e2e suite. It would typically reproduce within a few minutes. It looks like an associated router reload resets the metric values. Continuing to investigate.
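
If a reload does reset the values, a counter that should only ever grow will appear to drop between two successive samples. The sketch below shows one way such a drop could be spotted. The function name and the sample values are made up for illustration and are not taken from CI:

```python
def find_resets(samples):
    """Return the indexes at which a counter that should only increase
    was observed to decrease compared with the previous sample."""
    return [i for i in range(1, len(samples)) if samples[i] < samples[i - 1]]


# Hypothetical samples of one counter taken at regular intervals.
samples = [0, 1, 3, 0, 2]
print(find_resets(samples))  # [3]
```

An assertion that samples the counter just after such a drop would see 0 even though traffic had already been served.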

Comment 6 Andrew McDermott 2020-05-29 14:52:24 UTC

PR - https://github.com/openshift/origin/pull/25045

Comment 7 Andrew McDermott 2020-05-29 16:14:28 UTC

Moving this back to 4.5 as I posted a patch in comment #6 and it is a CI blocker given the flake rate.

Comment 8 Andrew McDermott 2020-06-04 16:28:44 UTC

Moved this to 4.6 but issued a cherry-pick for 4.5:

  https://github.com/openshift/origin/pull/25045#issuecomment-638957396

Comment 11 Hongan Li 2020-06-28 06:33:11 UTC

Can still find some failures in recent 4.6 CI; see search results:

https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route

Comment 12 Andrew McDermott 2020-07-09 12:08:25 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 15 Andrew McDermott 2020-07-30 10:03:27 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 16 mfisher 2020-08-18 20:01:38 UTC

Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started.  Will be considered for earlier release versions when diagnosed and resolved.

Comment 17 Andrew McDermott 2020-09-07 16:38:22 UTC

(In reply to Hongan Li from comment #11)
> can still find some failure in recent 4.6 CI, see search result:
> 
> https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route

I started to experiment with changing the timing on the part that flakes:

  https://github.com/openshift/origin/pull/25484

I don't really know if this will make things better and it will need to
run through CI many times to draw any conclusion.
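
For readers unfamiliar with this kind of timing change: a common pattern is to poll until every value is non-zero or a deadline passes, instead of sampling once. The sketch below shows only that general pattern. It is an assumption for illustration, the helper name and parameters are hypothetical, and it is not the change made in the PR above:

```python
import time


def poll_until(condition, timeout, interval, clock=time.monotonic, sleep=time.sleep):
    """Call condition() repeatedly, waiting `interval` seconds between calls,
    until it returns True or `timeout` seconds have elapsed.
    Returns True on success and False on timeout."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Hypothetical successive readings: the second endpoint only shows traffic
# on the third sample.
readings = iter([[1, 0], [1, 0], [2, 1]])
ok = poll_until(lambda: all(v > 0 for v in next(readings)), timeout=5, interval=0.01)
print(ok)  # True
```

A longer timeout reduces flakes caused by slow metric propagation, but it cannot help if the values are reset again before they are read, so the effect can only be judged over many CI runs.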

Comment 18 Andrew McDermott 2020-09-10 11:47:42 UTC

I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 19 Andrew McDermott 2020-09-15 16:28:16 UTC

Taking this one step at a time:

- merging https://github.com/openshift/router/pull/179

to see the impact on the CI flake rate.

Then will take another look at:

 https://github.com/openshift/origin/pull/25484

Comment 20 Andrew McDermott 2020-10-02 16:54:51 UTC

Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 22 Andrew McDermott 2020-10-23 16:02:04 UTC

Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 23 Andrew McDermott 2020-11-16 08:23:42 UTC

Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 24 Andrew McDermott 2020-12-04 16:50:48 UTC

Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 25 Miciah Dashiel Butler Masters 2021-05-26 16:36:20 UTC

https://search.ci.openshift.org/?search=E+e2e-test%2F%22%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job shows the following:

    Found in 0.09% of runs (0.33% of failures) across 170361 total runs and 5597 jobs (28.49% failed)

Given the low flake rate at present and other priorities, I'm lowering the severity of this Bugzilla report.
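
For scale, the percentages above can be turned into approximate run counts. This sketch assumes the 28.49% figure is the share of all runs that failed; the published percentages are rounded, so the results are approximate:

```python
total_runs = 170361

# 0.09% of all runs hit this test failure.
runs_with_flake = total_runs * 0.09 / 100

# Assumption: 28.49% of all runs failed for any reason.
failed_runs = total_runs * 28.49 / 100

# 0.33% of the failed runs, which should roughly match runs_with_flake.
flake_share_of_failures = failed_runs * 0.33 / 100

print(round(runs_with_flake))          # 153
print(round(failed_runs))              # 48536
print(round(flake_share_of_failures))  # 160
```

The two estimates (about 153 and about 160 runs) agree to within the rounding of the reported percentages.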

Comment 31 Ben Parees 2022-08-02 00:57:10 UTC

https://sippy.dptools.openshift.org/sippy-ng/tests/4.12?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522current_runs%2522%252C%2522operatorValue%2522%253A%2522%253E%253D%2522%252C%2522value%2522%253A%25227%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522never-stable%2522%257D%252C%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522The%2520HAProxy%2520router%2520should%2520expose%2520prometheus%2520metrics%2520for%2520a%2520route%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=current_working_percentage

indicates this test is now passing 99.7% of the time in 4.12 and similarly good in 4.11 and older (in fact it's currently at 100% on 4.6).

So I'm going to close this out as resolved.