Description of problem:
The e2e-aws-operator tests in CMO fail randomly, even on PRs without code changes.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Create a PR against the CMO repo
2. Observe the e2e-aws-operator CI job
Actual results:
The job fails randomly.
Expected results:
Tests are resilient and do not fail on PRs that do not change code.
Test flakiness can be observed with a CI search query or in specific PRs. A particularly good example is a PR that adds a new bash script, which doesn't affect operator code.
Another one I found was:
Test flake I see on multiple tests: “TestUserWorkloadMonitoring/assert_grpc_tls_rotation”
https://github.com/openshift/cluster-monitoring-operator/pull/883, for example, should never fail, as it's just docs.
The reason for flakiness is simple:
We expect a constant number of 5 secrets as per https://github.com/openshift/cluster-monitoring-operator/blob/f9567c5be644a867e41e2312f93c7183d5aa9c10/test/e2e/user_workload_monitoring_test.go#L894
However, the underlying `countGRPCSecrets` method only lists secrets that have the `monitoring.openshift.io/hash` label set, which excludes `grpc-tls`.
The test is flaky because during GRPC TLS rotation there are indeed sometimes 5 secrets with the above hash label (pre- and post-rotation secrets).
The fix is to expect only 4 secrets, which is what the rotation has to converge to.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.