Bug 1835992

Summary: [ci] Flaky test: e2e-aws-operator frequently fails because the TestSATokenSignerControllerSyncCerts test times out
Product: OpenShift Container Platform
Reporter: Miciah Dashiel Butler Masters <mmasters>
Component: kube-controller-manager
Assignee: Mike Dame <mdame>
Status: CLOSED ERRATA
QA Contact: zhou ying <yinzhou>
Severity: medium
Priority: medium
Version: 4.5
CC: aos-bugs, mfojtik
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Story Points: ---
Last Closed: 2020-10-27 15:59:18 UTC
Type: Bug
Regression: ---

Description Miciah Dashiel Butler Masters 2020-05-14 20:55:21 UTC
Description of problem:

The e2e-aws-operator CI job frequently fails because the TestSATokenSignerControllerSyncCerts test times out.


How reproducible:

"Across 69 runs and 1 jobs (91.30% failed), matched 49.21% of failing runs and 100.00% of jobs".

https://search.apps.build01.ci.devcluster.openshift.com/?search=FAIL%3A+.*+TestSATokenSignerControllerSyncCerts&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-operator&maxMatches=5&maxBytes=20971520&groupBy=job


Additional info:

I investigated one failure:  https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-controller-manager-operator/415/pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-operator/738

In this instance, TestSATokenSignerControllerSyncCerts times out at or shortly before 17:47:11:
    
    FAIL: github.com/openshift/cluster-kube-controller-manager-operator/test/e2e TestSATokenSignerControllerSyncCerts 10m0.063s
    make: *** [test-unit] Error 1
    2020/05/14 17:47:11 Container test in pod e2e-aws-operator failed, exit code 2, reason Error

The test is timing out waiting for the clusteroperator to reach level:
    
    github.com/openshift/cluster-kube-controller-manager-operator/test/e2e.TestSATokenSignerControllerSyncCerts(0xc00043f400)
    	/go/src/github.com/openshift/cluster-kube-controller-manager-operator/test/e2e/satokensigner_test.go:45 +0x68c

It is failing on this line, waiting for Available=True, Progressing=False, Degraded=False:
    
    	test.WaitForKubeControllerManagerClusterOperator(t, ctx, configClient, configv1.ConditionTrue, configv1.ConditionFalse, configv1.ConditionFalse)
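
For context, here is a minimal sketch of the kind of polling a wait helper like this typically performs, assuming the openshift/client-go config clientset and the k8s.io/apimachinery wait utilities; the function name, the 10-second poll interval, and the explicit timeout parameter are illustrative, not the actual helper in the operator's test/library:

    package e2e

    import (
        "context"
        "time"

        configv1 "github.com/openshift/api/config/v1"
        configclient "github.com/openshift/client-go/config/clientset/versioned"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
    )

    // waitForKCMClusterOperator polls the kube-controller-manager clusteroperator until
    // the Available, Progressing, and Degraded conditions all match the expected statuses,
    // or the timeout expires. Illustrative only; not the real test/library helper.
    func waitForKCMClusterOperator(ctx context.Context, client configclient.Interface,
        available, progressing, degraded configv1.ConditionStatus, timeout time.Duration) error {
        expected := map[configv1.ClusterStatusConditionType]configv1.ConditionStatus{
            configv1.OperatorAvailable:   available,
            configv1.OperatorProgressing: progressing,
            configv1.OperatorDegraded:    degraded,
        }
        return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
            co, err := client.ConfigV1().ClusterOperators().Get(ctx, "kube-controller-manager", metav1.GetOptions{})
            if err != nil {
                return false, nil // tolerate transient API errors and keep polling
            }
            // Every expected condition must be present with the expected status.
            for condType, want := range expected {
                matched := false
                for _, cond := range co.Status.Conditions {
                    if cond.Type == condType {
                        matched = cond.Status == want
                        break
                    }
                }
                if !matched {
                    return false, nil
                }
            }
            return true, nil
        })
    }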

The operator's logs show that it is reporting Progressing=True shortly before the timeout:
    
    I0514 17:45:54.565824       1 status_controller.go:172] clusteroperator/kube-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-05-14T17:25:39Z","message":"NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal container \"cluster-policy-controller\" is not ready: CrashLoopBackOff: back-off 20s restarting failed container=cluster-policy-controller pod=kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal_openshift-kube-controller-manager(6c1d45a7d2eb9c29222aa7c80fb9e663)\nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal container \"cluster-policy-controller\" is waiting: CrashLoopBackOff: back-off 20s restarting failed container=cluster-policy-controller pod=kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal_openshift-kube-controller-manager(6c1d45a7d2eb9c29222aa7c80fb9e663)","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-05-14T17:43:31Z","message":"NodeInstallerProgressing: 2 nodes are at revision 9; 1 nodes are at revision 10","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-05-14T17:22:19Z","message":"StaticPodsAvailable: 3 nodes are active; 2 nodes are at revision 9; 1 nodes are at revision 10","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-05-14T17:18:24Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

The operator reaches level about a minute later, per the CI artifacts:
    
                        {
                            "lastTransitionTime": "2020-05-14T17:25:39Z",
                            "message": "NodeControllerDegraded: All master nodes are ready",
                            "reason": "AsExpected",
                            "status": "False",
                            "type": "Degraded"
                        },
                        {
                            "lastTransitionTime": "2020-05-14T17:48:06Z",
                            "message": "NodeInstallerProgressing: 3 nodes are at revision 10",
                            "reason": "AsExpected",
                            "status": "False",
                            "type": "Progressing"
                        },
                        {
                            "lastTransitionTime": "2020-05-14T17:22:19Z",
                            "message": "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 10",
                            "reason": "AsExpected",
                            "status": "True",
                            "type": "Available"
                        },

I think the test needs a longer timeout: the test gave up at 17:47:11 after its 10-minute limit, and the operator's Progressing condition transitioned back to False at 17:48:06, so the rollout needed only about one more minute to complete.
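
If the wait helper accepted an explicit ceiling, as in the sketch above, the fix could be as small as passing a value comfortably above the ten minutes observed here; the 15-minute figure below is purely illustrative, not the value chosen for the actual fix:

    // Hypothetical call site with a longer ceiling than the 10-minute test timeout,
    // so a slow static-pod revision rollout (as in the failure above) can complete.
    if err := waitForKCMClusterOperator(ctx, configClient,
        configv1.ConditionTrue, configv1.ConditionFalse, configv1.ConditionFalse,
        15*time.Minute); err != nil {
        t.Fatalf("kube-controller-manager clusteroperator did not reach the expected state: %v", err)
    }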

Comment 1 Maciej Szulik 2020-05-20 09:45:00 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 2 Maciej Szulik 2020-05-26 14:22:33 UTC
I don't think I'll be able to look into this sooner than next week, so I'm moving this out of the 4.5 list, since it is not a blocker.

Comment 3 Maciej Szulik 2020-06-18 10:11:05 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 9 errata-xmlrpc 2020-10-27 15:59:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196