Bug 1857192

Summary: High flakiness of e2e-aws-operator tests
Product: OpenShift Container Platform
Reporter: Pawel Krupa <pkrupa>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: medium
Version: 4.6
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, surbania
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2020-10-27 16:14:38 UTC
Type: Bug

Description Pawel Krupa 2020-07-15 11:55:37 UTC
Description of problem:

e2e-aws-operator tests in the cluster-monitoring-operator (CMO) repository fail randomly, even on PRs that do not change operator code.

Version-Release number of selected component (if applicable):
4.6+

How reproducible:
Often

Steps to Reproduce:
1. Open a pull request against the CMO repository.
2. Observe the e2e-aws-operator CI job.

Actual results:
The job fails randomly.


Expected results:
Tests are resilient and do not fail on PRs that do not change operator code.


Additional info:

Test flakiness can be observed with a CI search query[1] or in specific PRs[2][3]. A particularly good example is a PR[4] that only adds a new bash script and does not touch operator code.

[1]: https://search.ci.openshift.org/?search=error+getting+first+value+from+prometheus+response&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://github.com/openshift/cluster-monitoring-operator/pull/838
[3]: https://github.com/openshift/cluster-monitoring-operator/pull/725
[4]: https://github.com/openshift/cluster-monitoring-operator/pull/849

Comment 2 Lili Cosic 2020-07-29 14:40:00 UTC
Another flake I found:

A flake I see across multiple tests: “TestUserWorkloadMonitoring/assert_grpc_tls_rotation”
https://github.com/openshift/cluster-monitoring-operator/pull/883, for example, should never fail, as it only changes documentation.
https://search.ci.openshift.org/?search=TestUserWorkloadMonitoring%2Fassert_grpc_tls_rotation+&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 4 Sergiusz Urbaniak 2020-08-03 15:48:29 UTC
The reason for the flakiness is simple:

The test expects a constant count of 5 secrets, as per https://github.com/openshift/cluster-monitoring-operator/blob/f9567c5be644a867e41e2312f93c7183d5aa9c10/test/e2e/user_workload_monitoring_test.go#L894

However, the underlying `countGRPCSecrets` method only lists secrets that have the `monitoring.openshift.io/hash` label set, which excludes `grpc-tls`.

It is flaky because, during GRPC TLS rotation, there are indeed sometimes 5 secrets with the above hash label set (pre- and post-rotation secrets), but that state is only transient.

The fix is to expect only 4 secrets, which is the count the rotation has to converge to.
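
For illustration only, here is a minimal sketch (not the actual CMO test helper) of what counting the rotation-managed secrets and waiting for convergence could look like with client-go. The function name, namespace parameter, and polling intervals are assumptions; the `monitoring.openshift.io/hash` label selector and the expected count of 4 come from the explanation above.

// Sketch only: poll until exactly 4 hash-labeled GRPC secrets remain,
// i.e. the state the rotation converges to. Names other than the label
// selector are illustrative, not the real CMO helpers.
package e2e

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// Assumed constant: 4 is the steady-state count; 5 appears only transiently
// while pre- and post-rotation secrets coexist.
const expectedGRPCSecretCount = 4

func waitForGRPCSecretConvergence(ctx context.Context, c kubernetes.Interface, ns string) error {
	return wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		secrets, err := c.CoreV1().Secrets(ns).List(ctx, metav1.ListOptions{
			// Only rotation-managed secrets carry this label, so the base
			// `grpc-tls` secret is excluded from the count.
			LabelSelector: "monitoring.openshift.io/hash",
		})
		if err != nil {
			return false, err
		}
		return len(secrets.Items) == expectedGRPCSecretCount, nil
	})
}

Polling for the converged count of 4, instead of asserting the transient count of 5, avoids the race described above.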

Comment 10 errata-xmlrpc 2020-10-27 16:14:38 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196