Description of problem:
The e2e-aws-operator tests in CMO fail randomly, even on PRs without code changes.

Version-Release number of selected component (if applicable):
4.6+

How reproducible:
Often

Steps to Reproduce:
1. Create a PR against the CMO repo
2. Observe the e2e-aws-operator CI job

Actual results:
The job fails randomly.

Expected results:
Tests are more resilient and do not fail on PRs that do not change code.

Additional info:
Test flakiness can be observed with a CI search query [1] or in specific PRs [2][3]. An especially good example is PR [4], which adds a new bash script and does not affect operator code.

[1]: https://search.ci.openshift.org/?search=error+getting+first+value+from+prometheus+response&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://github.com/openshift/cluster-monitoring-operator/pull/838
[3]: https://github.com/openshift/cluster-monitoring-operator/pull/725
[4]: https://github.com/openshift/cluster-monitoring-operator/pull/849
Another flake I found, which I see on multiple test runs: “TestUserWorkloadMonitoring/assert_grpc_tls_rotation”.

For example, https://github.com/openshift/cluster-monitoring-operator/pull/883 should never fail, as it is a docs-only change.

https://search.ci.openshift.org/?search=TestUserWorkloadMonitoring%2Fassert_grpc_tls_rotation+&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
The reason for the flakiness is simple: we expect a constant number of 5 secrets, as per https://github.com/openshift/cluster-monitoring-operator/blob/f9567c5be644a867e41e2312f93c7183d5aa9c10/test/e2e/user_workload_monitoring_test.go#L894

However, the underlying `countGRPCSecrets` method only lists secrets that have the `monitoring.openshift.io/hash` label set, which excludes `grpc-tls`. The test is flaky because, during GRPC TLS rotation, there are indeed sometimes 5 secrets with that label (pre- and post-rotation secrets). The fix is to expect only 4 secrets, which is what the rotation has to converge to.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196