Bug 1857192 - High flakiness of e2e-aws-operator tests
Summary: High flakiness of e2e-aws-operator tests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-15 11:55 UTC by Pawel Krupa
Modified: 2020-10-27 16:14 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:14:38 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-monitoring-operator pull 899 (open): Bug 1857192: test/e2e: don't count central grpc-tls (last updated 2020-08-03 16:04:31 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:14:57 UTC)

Description Pawel Krupa 2020-07-15 11:55:37 UTC
Description of problem:

e2e-aws-operator tests in CMO (cluster-monitoring-operator) are failing randomly, even on PRs without code changes.

Version-Release number of selected component (if applicable):
4.6+

How reproducible:
Often

Steps to Reproduce:
1. Create a PR to CMO repo
2. Observe e2e-aws-operator CI job

Actual results:
The job fails randomly.


Expected results:
Tests are more resilient and do not fail on PRs that don't change code.


Additional info:

Test flakiness can be observed via a CI search query[1] or in specific PRs[2][3]. An especially good example is PR[4], which only adds a new bash script and does not affect operator code.

[1]: https://search.ci.openshift.org/?search=error+getting+first+value+from+prometheus+response&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://github.com/openshift/cluster-monitoring-operator/pull/838
[3]: https://github.com/openshift/cluster-monitoring-operator/pull/725
[4]: https://github.com/openshift/cluster-monitoring-operator/pull/849

Comment 2 Lili Cosic 2020-07-29 14:40:00 UTC
Another one I found was:

A test flake I see across multiple runs: “TestUserWorkloadMonitoring/assert_grpc_tls_rotation”
https://github.com/openshift/cluster-monitoring-operator/pull/883, for example, should never fail as it only changes docs.
https://search.ci.openshift.org/?search=TestUserWorkloadMonitoring%2Fassert_grpc_tls_rotation+&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 4 Sergiusz Urbaniak 2020-08-03 15:48:29 UTC
The reason for flakiness is simple:

We expect a constant number of 5 secrets as per https://github.com/openshift/cluster-monitoring-operator/blob/f9567c5be644a867e41e2312f93c7183d5aa9c10/test/e2e/user_workload_monitoring_test.go#L894

However, the underlying `countGRPCSecrets` helper only lists secrets that have the `monitoring.openshift.io/hash` label set, which excludes `grpc-tls`.

The test is flaky because, during GRPC TLS rotation, there are indeed sometimes 5 secrets carrying the above label (pre- and post-rotation secrets).

The fix is to expect only 4 secrets, which is the count the rotation has to converge to.
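For illustration, here is a minimal Go sketch of the label-filtered count and the adjusted assertion described above. This is not the actual CMO test code; the helper names, the namespace, and the polling-free assertion are assumptions made for the example.

```go
// Sketch only: not the real cluster-monitoring-operator e2e code.
package e2e

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// countGRPCSecrets counts only secrets carrying the rotation hash label,
// which excludes the central grpc-tls secret.
func countGRPCSecrets(ctx context.Context, kc kubernetes.Interface, ns string) (int, error) {
	secrets, err := kc.CoreV1().Secrets(ns).List(ctx, metav1.ListOptions{
		// Label-existence selector: only hashed (rotated) secrets match.
		LabelSelector: "monitoring.openshift.io/hash",
	})
	if err != nil {
		return 0, err
	}
	return len(secrets.Items), nil
}

// assertGRPCTLSRotation checks the converged, post-rotation count of 4
// instead of 5, since a 5th hashed secret exists only transiently while
// pre- and post-rotation secrets coexist. The real test may poll until
// convergence; namespace and constant are assumptions.
func assertGRPCTLSRotation(ctx context.Context, t *testing.T, kc kubernetes.Interface) {
	const expected = 4
	got, err := countGRPCSecrets(ctx, kc, "openshift-monitoring")
	if err != nil {
		t.Fatal(err)
	}
	if got != expected {
		t.Fatalf("expected %d GRPC TLS secrets, got %d", expected, got)
	}
}
```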

Comment 10 errata-xmlrpc 2020-10-27 16:14:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

