Bug 1857192 - High flakiness of e2e-aws-operator tests
Summary: High flakiness of e2e-aws-operator tests
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-07-15 11:55 UTC by Pawel Krupa
Modified: 2020-10-27 16:14 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:14:38 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift cluster-monitoring-operator pull 899 (open): Bug 1857192: test/e2e: don't count central grpc-tls (last updated 2020-08-03 16:04:31 UTC)
Red Hat Product Errata RHBA-2020:4196 (last updated 2020-10-27 16:14:57 UTC)

Description Pawel Krupa 2020-07-15 11:55:37 UTC
Description of problem:

e2e-aws-operator tests in CMO (cluster-monitoring-operator) are failing randomly, even on PRs without code changes.

Version-Release number of selected component (if applicable):
4.6+

How reproducible:
Often

Steps to Reproduce:
1. Create a PR to CMO repo
2. Observe e2e-aws-operator CI job

Actual results:
The job fails randomly.


Expected results:
Tests are more resilient and do not fail on PRs that don't change code.


Additional info:

Test flakiness can be observed via a CI search query[1] or in specific PRs[2][3]. An especially good example is PR[4], which only adds a new bash script and does not affect operator code.

[1]: https://search.ci.openshift.org/?search=error+getting+first+value+from+prometheus+response&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://github.com/openshift/cluster-monitoring-operator/pull/838
[3]: https://github.com/openshift/cluster-monitoring-operator/pull/725
[4]: https://github.com/openshift/cluster-monitoring-operator/pull/849

Comment 2 Lili Cosic 2020-07-29 14:40:00 UTC
Another one I found was:

A test flake I see across multiple runs: “TestUserWorkloadMonitoring/assert_grpc_tls_rotation”
https://github.com/openshift/cluster-monitoring-operator/pull/883, for example, should never fail as it only changes docs.
https://search.ci.openshift.org/?search=TestUserWorkloadMonitoring%2Fassert_grpc_tls_rotation+&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 4 Sergiusz Urbaniak 2020-08-03 15:48:29 UTC
The reason for flakiness is simple:

We expect a constant number of 5 secrets as per https://github.com/openshift/cluster-monitoring-operator/blob/f9567c5be644a867e41e2312f93c7183d5aa9c10/test/e2e/user_workload_monitoring_test.go#L894

However, the underlying `countGRPCSecrets` helper only lists secrets that have the `monitoring.openshift.io/hash` label set, which excludes `grpc-tls`.

The test is flaky because, during GRPC TLS rotation, there are indeed sometimes 5 secrets carrying the above label (pre- and post-rotation secrets).

The fix is to expect only 4 secrets, which is the count the rotation has to converge to.
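For illustration, here is a minimal Go sketch of the label-filtered count and the adjusted assertion described above. This is not the actual CMO test code; the helper names, the namespace, and the polling-free assertion are assumptions made for the example.

```go
// Sketch only: not the real cluster-monitoring-operator e2e code.
package e2e

import (
	"context"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// countGRPCSecrets counts only secrets carrying the rotation hash label,
// which excludes the central grpc-tls secret.
func countGRPCSecrets(ctx context.Context, kc kubernetes.Interface, ns string) (int, error) {
	secrets, err := kc.CoreV1().Secrets(ns).List(ctx, metav1.ListOptions{
		// Label-existence selector: only hashed (rotated) secrets match.
		LabelSelector: "monitoring.openshift.io/hash",
	})
	if err != nil {
		return 0, err
	}
	return len(secrets.Items), nil
}

// assertGRPCTLSRotation checks the converged, post-rotation count of 4
// instead of 5, since a 5th hashed secret exists only transiently while
// pre- and post-rotation secrets coexist. The real test may poll until
// convergence; namespace and constant are assumptions.
func assertGRPCTLSRotation(ctx context.Context, t *testing.T, kc kubernetes.Interface) {
	const expected = 4
	got, err := countGRPCSecrets(ctx, kc, "openshift-monitoring")
	if err != nil {
		t.Fatal(err)
	}
	if got != expected {
		t.Fatalf("expected %d GRPC TLS secrets, got %d", expected, got)
	}
}
```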

Comment 10 errata-xmlrpc 2020-10-27 16:14:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

