Bug 1835371
Summary: | [sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> | |
Component: | Networking | Assignee: | aos-network-edge-staff <aos-network-edge-staff> | |
Networking sub component: | router | QA Contact: | Arvind iyengar <aiyengar> | |
Status: | CLOSED WORKSFORME | Docs Contact: | ||
Severity: | medium | |||
Priority: | medium | CC: | amcdermo, aos-bugs, bbennett, bleanhar, bperkins, cholman, dgoodwin, gspence, hongli, mmasters, stbenjam, wking | |
Version: | 4.5 | |||
Target Milestone: | --- | |||
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1857409 | Environment: |
[sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route
|
|
Last Closed: | 2022-08-02 00:57:10 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1846507, 1857409 |
Description
Ben Parees
2020-05-13 17:16:33 UTC
Moving to 4.6.

This is now flaking heavily. This started failing 5/11. I think this is a regression in the product. Moving back to 4.5 and bumping severity.

I looked at this today but was unable to reproduce when using GCP. I suspect I need to run more than just the single test, though Steve Greene mentioned he was able to reproduce 1/10 times using AWS yesterday.

This run hints at a timing issue or race, because in this run one of the endpoints had at least 1 session measured/recorded:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/558/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws/2554

fail [github.com/openshift/origin/test/extended/router/metrics.go:189]: Expected
    <[]float64 | len:2, cap:2>: [1, 0]
to consist of
    <[]interface {} | len:2, cap:2>: [
        {Comparator: ">", CompareTo: [0]},
        {Comparator: ">", CompareTo: [0]},
    ]

Moving to 4.6 as not a release blocker or an upgrade blocker. This is a CI blocker as the flake rate is high and should also be considered for a backport.

I was able to periodically reproduce this today by running the single test (on AWS) and simultaneously starting the full e2e suite. It would typically reproduce within a few minutes. It looks like an associated router reload resets the metric values. Continuing to investigate.

Moving this back to 4.5 as I posted a patch in comment #6 and it is a CI blocker given the flake rate.
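To make the suspected failure mode concrete, here is a minimal Go sketch of it. It is not the code in test/extended/router/metrics.go; the `endpointCounters` type, its methods, and the endpoint names "a" and "b" are hypothetical stand-ins. It only models the hypothesis stated above: a reload zeroes the per-endpoint session counters, so a sample taken shortly afterwards can read [1, 0] and fail a check that requires every value to be greater than 0.

```go
package main

import "fmt"

// allPositive reports whether every sample is strictly greater than zero.
// It mirrors the two {Comparator: ">", CompareTo: [0]} matchers in the
// failure message; it is not the matcher the real test uses.
func allPositive(samples []float64) bool {
	for _, s := range samples {
		if s <= 0 {
			return false
		}
	}
	return true
}

// endpointCounters is a hypothetical stand-in for per-endpoint session
// counters as scraped from the router metrics.
type endpointCounters map[string]float64

// record counts one session against an endpoint.
func (c endpointCounters) record(endpoint string) { c[endpoint]++ }

// reload models the suspected behaviour: counters restart at zero.
func (c endpointCounters) reload() {
	for k := range c {
		c[k] = 0
	}
}

// snapshot returns the current values in the order requested.
func (c endpointCounters) snapshot(endpoints ...string) []float64 {
	out := make([]float64, 0, len(endpoints))
	for _, e := range endpoints {
		out = append(out, c[e])
	}
	return out
}

func main() {
	c := endpointCounters{}
	c.record("a")
	c.record("b")
	c.reload()    // assumed: a reload lands between traffic and the scrape
	c.record("a") // only one endpoint has been hit again so far
	got := c.snapshot("a", "b")
	fmt.Println(got, allPositive(got)) // prints: [1 0] false
}
```

If this model is right, the check is only reliable when no reload happens between generating traffic and scraping the metrics.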
Moved this to 4.6 but issued a cherry-pick for 4.5: https://github.com/openshift/origin/pull/25045#issuecomment-638957396

Can still find some failures in recent 4.6 CI; see search result:

https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.

(In reply to Hongan Li from comment #11)
> can still find some failure in recent 4.6 CI, see search result:
>
> https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route

I started to experiment with changing the timing on the part that flakes: https://github.com/openshift/origin/pull/25484

I don't really know if this will make things better, and it will need to run through CI many times to draw any conclusion.

I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Taking this one step at a time:
- merging https://github.com/openshift/router/pull/179 to see the impact to CI flake rate.

Then will take another look at: https://github.com/openshift/origin/pull/25484

Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.

https://search.ci.openshift.org/?search=E+e2e-test%2F%22%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

shows the following:

Found in 0.09% of runs (0.33% of failures) across 170361 total runs and 5597 jobs (28.49% failed)

Given the low flake rate at present and other priorities, I'm lowering the severity of this Bugzilla report.
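As a rough sanity check on the search summary above, the percentages can be turned back into approximate run counts. The Go sketch below uses only the numbers quoted in the summary; the helper names are hypothetical, and since the inputs are already rounded percentages the results are approximate.

```go
package main

import "fmt"

// approxRuns converts a percentage of the total into an approximate run count.
func approxRuns(totalRuns int, pct float64) float64 {
	return float64(totalRuns) * pct / 100
}

// shareOfFailures converts "percent of all runs" into "percent of failed
// runs", given the percentage of runs that failed for any reason.
func shareOfFailures(pctOfRuns, pctFailed float64) float64 {
	return pctOfRuns / pctFailed * 100
}

func main() {
	const totalRuns = 170361 // from the search summary
	fmt.Printf("runs hitting this test failure: ~%.0f\n", approxRuns(totalRuns, 0.09))
	fmt.Printf("runs failing for any reason:    ~%.0f\n", approxRuns(totalRuns, 28.49))
	fmt.Printf("share of failures:              ~%.2f%%\n", shareOfFailures(0.09, 28.49))
}
```

That works out to roughly 150 affected runs out of about 48,500 failed runs over the two-week window, or about 0.3% of failures, which is in line with the reported 0.33% once rounding of the 0.09% figure is taken into account.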
https://sippy.dptools.openshift.org/sippy-ng/tests/4.12?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522current_runs%2522%252C%2522operatorValue%2522%253A%2522%253E%253D%2522%252C%2522value%2522%253A%25227%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522never-stable%2522%257D%252C%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522The%2520HAProxy%2520router%2520should%2520expose%2520prometheus%2520metrics%2520for%2520a%2520route%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=current_working_percentage

indicates this test is now passing 99.7% of the time in 4.12 and similarly good in 4.11 and older (in fact it's currently at 100% on 4.6). So I'm going to close this out as resolved.