test: [sig-network][Feature:Router] The HAProxy router should expose prometheus metrics for a route is failing frequently in CI; see search results: https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route https://sippy-bparees.svc.ci.openshift.org/?release=4.5#TopFailingTests It is only passing 86% of the time.
Moving to 4.6.
This is now flaking heavily. This started failing 5/11. I think this is a regression in the product. Moving back to 4.5 and bumping severity.
I looked at this today but was unable to reproduce on GCP - I suspect I need to run more than just the single test, though Steve Greene mentioned he was able to reproduce it 1 in 10 times on AWS yesterday. This run hints at a timing/race, as in this run one of the endpoints had at least 1 session measured/recorded: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/558/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws/2554

fail [github.com/openshift/origin/test/extended/router/metrics.go:189]: Expected
    <[]float64 | len:2, cap:2>: [1, 0]
to consist of
    <[]interface {} | len:2, cap:2>: [
        {Comparator: ">", CompareTo: [0]},
        {Comparator: ">", CompareTo: [0]},
    ]
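The failure above can be read as follows: the test scraped a per-endpoint metric for two endpoints and expected both values to be strictly greater than zero, but got [1, 0]. A minimal sketch of that expectation in plain Go (checkAllPositive is a hypothetical helper; the real test at metrics.go:189 uses Gomega's ConsistOf matcher):

```go
package main

import "fmt"

// checkAllPositive mirrors the test's expectation: every endpoint's
// metric value must be strictly greater than zero.
// (Hypothetical helper; the real assertion is Gomega-based.)
func checkAllPositive(values []float64) bool {
	for _, v := range values {
		if v <= 0 {
			return false
		}
	}
	return true
}

func main() {
	// The failing run observed [1, 0]: one endpoint recorded a session,
	// the other recorded none -- hence the assertion failure.
	fmt.Println(checkAllPositive([]float64{1, 0}))
	fmt.Println(checkAllPositive([]float64{1, 2}))
}
```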
Moving to 4.6 as not a release blocker or an upgrade blocker. This is a CI blocker as the flake rate is high and should also be considered for a backport.
I was able to periodically reproduce this today by running the single test (on AWS) while simultaneously starting the full e2e suite. It would typically reproduce within a few minutes. It looks like an associated router reload resets the metric values. Continuing to investigate.
PR - https://github.com/openshift/origin/pull/25045
Moving this back to 4.5 as I posted a patch in comment #6 and it is a CI blocker given the flake rate.
Moved this to 4.6 but issued a cherry-pick for 4.5: https://github.com/openshift/origin/pull/25045#issuecomment-638957396
Can still find some failures in recent 4.6 CI; see search result: https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Target reset from 4.6 to 4.7 while investigation is either ongoing or not yet started. Will be considered for earlier release versions when diagnosed and resolved.
(In reply to Hongan Li from comment #11)
> can still find some failure in recent 4.6 CI, see search result:
> https://search.svc.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route

I started to experiment with changing the timing on the part that flakes: https://github.com/openshift/origin/pull/25484 I don't really know if this will make things better, and it will need to run through CI many times before drawing any conclusion.
Taking this one step at a time: merging https://github.com/openshift/router/pull/179 to see the impact on the CI flake rate. Then I will take another look at: https://github.com/openshift/origin/pull/25484
Tagging with UpcomingSprint while investigation is either ongoing or pending. Will be considered for earlier release versions when diagnosed and resolved.
https://search.ci.openshift.org/?search=E+e2e-test%2F%22%5C%5Bsig-network%5C%5D%5C%5BFeature%3ARouter%5C%5D+The+HAProxy+router+should+expose+prometheus+metrics+for+a+route&maxAge=336h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job shows the following: Found in 0.09% of runs (0.33% of failures) across 170361 total runs and 5597 jobs (28.49% failed) Given the low flake rate at present and other priorities, I'm lowering the severity of this Bugzilla report.
https://sippy.dptools.openshift.org/sippy-ng/tests/4.12?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522current_runs%2522%252C%2522operatorValue%2522%253A%2522%253E%253D%2522%252C%2522value%2522%253A%25227%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522never-stable%2522%257D%252C%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522The%2520HAProxy%2520router%2520should%2520expose%2520prometheus%2520metrics%2520for%2520a%2520route%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&sort=asc&sortField=current_working_percentage indicates this test is now passing 99.7% of the time in 4.12 and is similarly healthy in 4.11 and older (in fact, it is currently at 100% on 4.6). So I'm going to close this out as resolved.