test: [sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route is failing frequently in CI; see search results: https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route https://sippy-bparees.svc.ci.openshift.org/?release=4.5#TopFailingTests It is only passing 86% of the time.
Moving to 4.6.
This is now flaking heavily. This started failing 5/11. I think this is a regression in the product. Moving back to 4.5 and bumping severity.
I looked at this today but was unable to reproduce on GCP - I suspect I need to run more than just the single test, though Steve Greene mentioned he was able to reproduce it 1 in 10 times on AWS yesterday. This run hints at a timing/race, as in this run one of the endpoints had at least 1 session measured/recorded: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/558/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws/2554

fail [github.com/openshift/origin/test/extended/router/metrics.go:189]: Expected
    <[]float64 | len:2, cap:2>: [1, 0]
to consist of
    <[]interface {} | len:2, cap:2>: [
        {Comparator: ">", CompareTo: [0]},
        {Comparator: ">", CompareTo: [0]},
    ]
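The failure above can be read as follows: the test scraped a per-endpoint metric for two endpoints and expected both values to be strictly greater than zero, but got [1, 0]. A minimal sketch of that expectation in plain Go (checkAllPositive is a hypothetical helper; the real test at metrics.go:189 uses Gomega's ConsistOf matcher):

```go
package main

import "fmt"

// checkAllPositive mirrors the test's expectation: every endpoint's
// metric value must be strictly greater than zero.
// (Hypothetical helper; the real assertion is Gomega-based.)
func checkAllPositive(values []float64) bool {
	for _, v := range values {
		if v <= 0 {
			return false
		}
	}
	return true
}

func main() {
	// The failing run observed [1, 0]: one endpoint recorded a session,
	// the other recorded none -- hence the assertion failure.
	fmt.Println(checkAllPositive([]float64{1, 0}))
	fmt.Println(checkAllPositive([]float64{1, 2}))
}
```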
Moving to 4.6 as not a release blocker or an upgrade blocker. This is a CI blocker as the flake rate is high and should also be considered for a backport.
I was able to periodically reproduce this today by running the single test (on AWS) while simultaneously starting the full e2e suite. It would typically reproduce within a few minutes. It looks like an associated router reload resets the metric values. Continuing to investigate.
PR - https://github.com/openshift/origin/pull/25045
Moving this back to 4.5 as I posted a patch in comment #6 and it is a CI blocker given the flake rate.
Moved this to 4.6 but issued a cherry-pick for 4.5: https://github.com/openshift/origin/pull/25045#issuecomment-638957396
Can still find some failures in recent 4.6 CI; see search result: https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
(In reply to Hongan Li from comment #11)
> can still find some failure in recent 4.6 CI, see search result:
> https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route

I started to experiment with changing the timing on the part that flakes: https://github.com/openshift/origin/pull/25484 I don't really know if this will make things better, and it will need to run through CI many times before drawing any conclusion.
Taking this one step at a time: merging https://github.com/openshift/router/pull/179 to see the impact on the CI flake rate. Then I will take another look at: https://github.com/openshift/origin/pull/25484
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
https://search.ci.openshift.org/?search=E+e2e-test%2F%22%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job shows the following: Found in 0.09% of runs (0.33% of failures) across 170361 total runs and 5597 jobs (28.49% failed) Given the low flake rate at present and other priorities, I'm lowering the severity of this Bugzilla report.
https://sippy.dptools.openshift.org/sippy-ng/tests/4.12?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522current_runs%2522%252C%2522operatorValue%2522%253A%2522%253E%253D%2522%252C%2522value%2522%253A%25227%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522never-stable%2522%257D%252C%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522The%2520HAProxy%2520router%2520should%2520expose%2520prometheus%2520metrics%2520for%2520a%2520route%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=current_working_percentage indicates this test is now passing 99.7% of the time in 4.12 and is similarly healthy in 4.11 and older (in fact, it is currently at 100% on 4.6). So I'm going to close this out as resolved.