Bug 1835371 - [sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route
Summary: [sig-network][Feature:Router] The HAProxy router should expose prometheus met...
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: aos-network-edge-staff
QA Contact: Arvind iyengar
URL:
Whiteboard:
Depends On:
Blocks: 1846507 1857409
Reported: 2020-05-13 17:16 UTC by Ben Parees
Modified: 2022-08-04 22:27 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1857409 (view as bug list)
Environment:
[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route
Last Closed: 2022-08-02 00:57:10 UTC
Target Upstream Version:
Embargoed:


Links
System: Github
ID: openshift origin pull 25045
Private: 0
Priority: None
Status: closed
Summary: Bug 1835371: router/metrics: Fix haproxy_server_max_sessions test
Last Updated: 2021-02-15 15:06:29 UTC

Description Ben Parees 2020-05-13 17:16:33 UTC
The test
[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route

is failing frequently in CI; see the search results:
https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route


https://sippy-bparees.svc.ci.openshift.org/?release=4.5#TopFailingTests

It is passing only 86% of the time.

Comment 1 Andrew McDermott 2020-05-19 15:22:56 UTC
Moving to 4.6.

Comment 2 Clayton Coleman 2020-05-21 15:56:15 UTC
This is now flaking heavily.  It started failing on 5/11.  I think this is a regression in the product.  Moving back to 4.5 and bumping severity.

Comment 3 Andrew McDermott 2020-05-22 17:18:29 UTC
I looked at this today but was unable to reproduce it on GCP. I suspect I need to run more than just the single test, although Steve Greene mentioned he was able to reproduce it 1 in 10 times on AWS yesterday.

The following run hints at a timing issue or race: one of the endpoints had at least 1 session recorded, while the other had none:

  https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/558/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws/2554


fail [github.com/openshift/origin/test/extended/router/metrics.go:189]: Expected
    <[]float64 | len:2, cap:2>: [1, 0]
to consist of
    <[]interface {} | len:2, cap:2>: [
        {Comparator: ">", CompareTo: [0]},
        {Comparator: ">", CompareTo: [0]},
    ]
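
For readers unfamiliar with the matcher output above: as far as the message shows, the check requires every sampled value to be strictly greater than zero, and the observed [1, 0] fails because of the second value. A minimal Python sketch of that semantics follows. The function name `all_greater_than` is hypothetical, and this is not the Go code from metrics.go.

```python
def all_greater_than(samples, threshold=0.0):
    """Return True only if every sample is strictly greater than threshold.

    Note: an empty list returns True (vacuous truth), so a real check
    would also verify the expected number of samples.
    """
    return all(value > threshold for value in samples)


observed = [1.0, 0.0]  # values reported in the failure above
print(all_greater_than(observed))  # False: one endpoint recorded no sessions
```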

Comment 4 Andrew McDermott 2020-05-27 16:12:21 UTC
Moving to 4.6 as this is neither a release blocker nor an upgrade blocker. It is a CI blocker, however, as the flake rate is high, and it should also be considered for a backport.

Comment 5 Andrew McDermott 2020-05-27 17:09:43 UTC
I was able to periodically reproduce this today by running the single test (on AWS) while simultaneously starting the full e2e suite. It would typically reproduce within a few minutes. It looks like an associated router reload resets the metric values. Continuing to investigate.
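
If a reload can reset the counters between generating traffic and reading the metrics, a single read may observe zeros. One general way to tolerate that is to poll until every value is positive or a retry budget runs out. The sketch below illustrates the idea only. This report does not describe how the linked PR addresses the flake, and `wait_for_positive`, `scrape`, and the snapshot values are hypothetical.

```python
import time


def wait_for_positive(scrape, attempts=5, delay=0.0):
    """Call scrape() up to `attempts` times.

    Returns the first non-empty snapshot in which every value is > 0,
    or None if no snapshot qualifies within the retry budget.
    """
    for attempt in range(attempts):
        values = scrape()
        if values and all(v > 0 for v in values):
            return values
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before the next read
    return None


# Hypothetical scrape results: a reload zeroes the counters, then traffic resumes.
snapshots = iter([[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]])
result = wait_for_positive(lambda: next(snapshots))
print(result)  # [1.0, 1.0]
```

A bounded retry count keeps a genuinely broken metric from hanging the check; in that case the function returns None and the caller can fail with the last observed state.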

Comment 6 Andrew McDermott 2020-05-29 14:52:24 UTC
PR - https://github.com/openshift/origin/pull/25045

Comment 7 Andrew McDermott 2020-05-29 16:14:28 UTC
Moving this back to 4.5 as I posted a patch in comment #6 and it is a CI blocker given the flake rate.

Comment 8 Andrew McDermott 2020-06-04 16:28:44 UTC
Moved this to 4.6 but issued a cherry-pick for 4.5:

  https://github.com/openshift/origin/pull/25045#issuecomment-638957396

Comment 12 Andrew McDermott 2020-07-09 12:08:25 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 15 Andrew McDermott 2020-07-30 10:03:27 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 16 mfisher 2020-08-18 20:01:38 UTC
Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started.  Will be considered for earlier release versions when diagnosed and resolved.

Comment 17 Andrew McDermott 2020-09-07 16:38:22 UTC
(In reply to Hongan Li from comment #11)
> can still find some failure in recent 4.6 CI, see search result:
> 
> https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route

I started to experiment with changing the timing on the part that flakes:

  https://github.com/openshift/origin/pull/25484

I don't really know if this will make things better, and it will need to
run through CI many times before we can draw any conclusion.

Comment 18 Andrew McDermott 2020-09-10 11:47:42 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with
higher priority/severity, developing new features with higher
priority, or developing new features to improve stability at a macro
level. I will revisit this bug next sprint.

Comment 19 Andrew McDermott 2020-09-15 16:28:16 UTC
Taking this one step at a time:

- merging https://github.com/openshift/router/pull/179

to see the impact on the CI flake rate.

Then will take another look at:

 https://github.com/openshift/origin/pull/25484

Comment 20 Andrew McDermott 2020-10-02 16:54:51 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 22 Andrew McDermott 2020-10-23 16:02:04 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 23 Andrew McDermott 2020-11-16 08:23:42 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 24 Andrew McDermott 2020-12-04 16:50:48 UTC
Tagging with UpcomingSprint while investigation is either ongoing or
pending. Will be considered for earlier release versions when
diagnosed and resolved.

Comment 25 Miciah Dashiel Butler Masters 2021-05-26 16:36:20 UTC
https://search.ci.openshift.org/?search=E+e2e-test%2F%22%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job shows the following:

    Found in 0.09% of runs (0.33% of failures) across 170361 total runs and 5597 jobs (28.49% failed)

Given the low flake rate at present and other priorities, I'm lowering the severity of this Bugzilla report.
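
As a rough sanity check on the percentages quoted above, they can be converted back into approximate run counts. This sketch assumes that "28.49% failed" refers to runs rather than jobs, which the summary line does not state explicitly. The inputs are rounded percentages, so the results are estimates only.

```python
total_runs = 170361             # from the search summary above
pct_runs_with_flake = 0.09      # percent of all runs that hit this failure
pct_runs_failed = 28.49         # percent of runs that failed (assumed to mean runs)
pct_failures_with_flake = 0.33  # percent of failed runs that hit this failure

flaky_runs = total_runs * pct_runs_with_flake / 100
failed_runs = total_runs * pct_runs_failed / 100
flaky_from_failures = failed_runs * pct_failures_with_flake / 100

print(round(flaky_runs), round(failed_runs), round(flaky_from_failures))
# 153 48536 160
```

The two estimates of affected runs (about 153 and about 160) agree to within the rounding of the published percentages, which is consistent with reading "28.49% failed" as a share of runs.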

