Bug 1835992 - [ci] Flaky test: e2e-aws-operator frequently fails because the TestSATokenSignerControllerSyncCerts test times out
Summary: [ci] Flaky test: e2e-aws-operator frequently fails because the TestSATokenSignerControllerSyncCerts test times out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Mike Dame
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-14 20:55 UTC by Miciah Dashiel Butler Masters
Modified: 2020-10-27 15:59 UTC
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 15:59:18 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift/cluster-kube-controller-manager-operator pull 425 (closed): "Bug 1835992: Raise operator e2e test timeout" - last updated 2020-10-08 14:06:56 UTC
Red Hat Product Errata RHBA-2020:4196 - last updated 2020-10-27 15:59:36 UTC

Description Miciah Dashiel Butler Masters 2020-05-14 20:55:21 UTC
Description of problem:

The e2e-aws-operator CI job frequently fails because the TestSATokenSignerControllerSyncCerts test times out.


How reproducible:

"Across 69 runs and 1 jobs (91.30% failed), matched 49.21% of failing runs and 100.00% of jobs".

https://search.apps.build01.ci.devcluster.openshift.com/?search=FAIL%3A+.*+TestSATokenSignerControllerSyncCerts&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-operator&maxMatches=5&maxBytes=20971520&groupBy=job


Additional info:

I investigated one failure:  https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-controller-manager-operator/415/pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-operator/738

In this instance, TestSATokenSignerControllerSyncCerts times out at or shortly before 17:47:11:
    
    FAIL: github.com/openshift/cluster-kube-controller-manager-operator/test/e2e TestSATokenSignerControllerSyncCerts 10m0.063s
    make: *** [test-unit] Error 1
    2020/05/14 17:47:11 Container test in pod e2e-aws-operator failed, exit code 2, reason Error

The test is timing out waiting for the clusteroperator to reach level:
    
    github.com/openshift/cluster-kube-controller-manager-operator/test/e2e.TestSATokenSignerControllerSyncCerts(0xc00043f400)
    	/go/src/github.com/openshift/cluster-kube-controller-manager-operator/test/e2e/satokensigner_test.go:45 +0x68c

It is failing on this line, waiting for Available=True, Progressing=False, Degraded=False:
    
    	test.WaitForKubeControllerManagerClusterOperator(t, ctx, configClient, configv1.ConditionTrue, configv1.ConditionFalse, configv1.ConditionFalse)
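
That helper polls the kube-controller-manager clusteroperator's status conditions until all three match. For orientation, here is a minimal sketch of what such a wait looks like; it is not the repository's actual test/e2e helper, the names (waitForKCMOperatorLevel, conditionIs) and the 10-second poll interval are illustrative, and the exact client signatures depend on the vendored client-go version:

    package e2e_sketch

    import (
        "context"
        "time"

        configv1 "github.com/openshift/api/config/v1"
        configclient "github.com/openshift/client-go/config/clientset/versioned"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
    )

    // conditionIs reports whether the clusteroperator has the given condition
    // type with the given status.
    func conditionIs(co *configv1.ClusterOperator, condType configv1.ClusterStatusConditionType, status configv1.ConditionStatus) bool {
        for _, c := range co.Status.Conditions {
            if c.Type == condType {
                return c.Status == status
            }
        }
        return false
    }

    // waitForKCMOperatorLevel polls until the kube-controller-manager
    // clusteroperator reports the expected Available/Progressing/Degraded
    // statuses, or the timeout expires.
    func waitForKCMOperatorLevel(ctx context.Context, client configclient.Interface, available, progressing, degraded configv1.ConditionStatus, timeout time.Duration) error {
        return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
            co, err := client.ConfigV1().ClusterOperators().Get(ctx, "kube-controller-manager", metav1.GetOptions{})
            if err != nil {
                return false, nil // tolerate transient API errors and keep polling
            }
            return conditionIs(co, configv1.OperatorAvailable, available) &&
                conditionIs(co, configv1.OperatorProgressing, progressing) &&
                conditionIs(co, configv1.OperatorDegraded, degraded), nil
        })
    }

Whatever the real helper looks like, the key point is that its poll deadline, plus everything else the test does, has to fit inside the overall go test timeout for the e2e package.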

The operator's logs show that it is reporting Progressing=True shortly before the timeout:
    
    I0514 17:45:54.565824       1 status_controller.go:172] clusteroperator/kube-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-05-14T17:25:39Z","message":"NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal container \"cluster-policy-controller\" is not ready: CrashLoopBackOff: back-off 20s restarting failed container=cluster-policy-controller pod=kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal_openshift-kube-controller-manager(6c1d45a7d2eb9c29222aa7c80fb9e663)\nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal container \"cluster-policy-controller\" is waiting: CrashLoopBackOff: back-off 20s restarting failed container=cluster-policy-controller pod=kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal_openshift-kube-controller-manager(6c1d45a7d2eb9c29222aa7c80fb9e663)","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-05-14T17:43:31Z","message":"NodeInstallerProgressing: 2 nodes are at revision 9; 1 nodes are at revision 10","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-05-14T17:22:19Z","message":"StaticPodsAvailable: 3 nodes are active; 2 nodes are at revision 9; 1 nodes are at revision 10","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-05-14T17:18:24Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

The operator reaches level about a minute later, per the CI artifacts:
    
                        {
                            "lastTransitionTime": "2020-05-14T17:25:39Z",
                            "message": "NodeControllerDegraded: All master nodes are ready",
                            "reason": "AsExpected",
                            "status": "False",
                            "type": "Degraded"
                        },
                        {
                            "lastTransitionTime": "2020-05-14T17:48:06Z",
                            "message": "NodeInstallerProgressing: 3 nodes are at revision 10",
                            "reason": "AsExpected",
                            "status": "False",
                            "type": "Progressing"
                        },
                        {
                            "lastTransitionTime": "2020-05-14T17:22:19Z",
                            "message": "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 10",
                            "reason": "AsExpected",
                            "status": "True",
                            "type": "Available"
                        },

I think the test needs a longer timeout.
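
The linked PR title ("Bug 1835992: Raise operator e2e test timeout") points the same way. The 10m0.063s in the FAIL line is consistent with go test's default 10-minute per-binary timeout, and in this run the rollout settled only about a minute past that deadline. Continuing the purely illustrative sketch above (same hypothetical file; it would additionally import "testing" and "k8s.io/client-go/rest"), the shape of the change is simply a more generous deadline; the concrete value and mechanism in the merged PR may differ, for example it could instead raise the package-level go test -timeout:

    // Raised well past the ~10 minutes the failing run was allowed. The value
    // here is illustrative, not taken from the merged PR.
    const operatorSettleTimeout = 20 * time.Minute

    func TestSATokenSignerControllerSyncCertsSketch(t *testing.T) {
        ctx := context.Background()

        // Stand-in for however the e2e suite actually builds its *rest.Config.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            t.Fatalf("failed to build client config: %v", err)
        }
        client := configclient.NewForConfigOrDie(cfg)

        // Same level the real test waits for:
        // Available=True, Progressing=False, Degraded=False.
        err = waitForKCMOperatorLevel(ctx, client,
            configv1.ConditionTrue, configv1.ConditionFalse, configv1.ConditionFalse,
            operatorSettleTimeout)
        if err != nil {
            t.Fatalf("kube-controller-manager clusteroperator did not reach level: %v", err)
        }
    }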

Comment 1 Maciej Szulik 2020-05-20 09:45:00 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 2 Maciej Szulik 2020-05-26 14:22:33 UTC
I don't think I'll be able to look into this sooner than next week, so I'm moving it off the 4.5 list, since it's not a blocker.

Comment 3 Maciej Szulik 2020-06-18 10:11:05 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 9 errata-xmlrpc 2020-10-27 15:59:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

