Bug 1835992 - [ci] Flaky test: e2e-aws-operator frequently fails because the TestSATokenSignerControllerSyncCerts test times out
Summary: [ci] Flaky test: e2e-aws-operator frequently fails because the TestSATokenSignerControllerSyncCerts test times out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-controller-manager
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Mike Dame
QA Contact: zhou ying
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-14 20:55 UTC by Miciah Dashiel Butler Masters
Modified: 2020-10-27 15:59 UTC
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 15:59:18 UTC
Target Upstream Version:
Embargoed:




Links:
Github openshift/cluster-kube-controller-manager-operator pull 425 (closed): "Bug 1835992: Raise operator e2e test timeout" - last updated 2020-10-08 14:06:56 UTC
Red Hat Product Errata RHBA-2020:4196 - last updated 2020-10-27 15:59:36 UTC

Description Miciah Dashiel Butler Masters 2020-05-14 20:55:21 UTC
Description of problem:

The e2e-aws-operator CI job frequently fails because the TestSATokenSignerControllerSyncCerts test times out.


How reproducible:

"Across 69 runs and 1 jobs (91.30% failed), matched 49.21% of failing runs and 100.00% of jobs".

https://search.apps.build01.ci.devcluster.openshift.com/?search=FAIL%3A+.*+TestSATokenSignerControllerSyncCerts&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-operator&maxMatches=5&maxBytes=20971520&groupBy=job


Additional info:

I investigated one failure:  https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-controller-manager-operator/415/pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-operator/738

In this instance, TestSATokenSignerControllerSyncCerts times out at or shortly before 17:47:11:
    
    FAIL: github.com/openshift/cluster-kube-controller-manager-operator/test/e2e TestSATokenSignerControllerSyncCerts 10m0.063s
    make: *** [test-unit] Error 1
    2020/05/14 17:47:11 Container test in pod e2e-aws-operator failed, exit code 2, reason Error

The test is timing out waiting for the clusteroperator to reach level:
    
    github.com/openshift/cluster-kube-controller-manager-operator/test/e2e.TestSATokenSignerControllerSyncCerts(0xc00043f400)
    	/go/src/github.com/openshift/cluster-kube-controller-manager-operator/test/e2e/satokensigner_test.go:45 +0x68c

It is failing on this line, waiting for Available=True, Progressing=False, Degraded=False:
    
    	test.WaitForKubeControllerManagerClusterOperator(t, ctx, configClient, configv1.ConditionTrue, configv1.ConditionFalse, configv1.ConditionFalse)
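
That helper polls the kube-controller-manager clusteroperator's status conditions until all three match. For orientation, here is a minimal sketch of what such a wait looks like; it is not the repository's actual test/e2e helper, the names (waitForKCMOperatorLevel, conditionIs) and the 10-second poll interval are illustrative, and the exact client signatures depend on the vendored client-go version:

    package e2e_sketch

    import (
        "context"
        "time"

        configv1 "github.com/openshift/api/config/v1"
        configclient "github.com/openshift/client-go/config/clientset/versioned"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/util/wait"
    )

    // conditionIs reports whether the clusteroperator has the given condition
    // type with the given status.
    func conditionIs(co *configv1.ClusterOperator, condType configv1.ClusterStatusConditionType, status configv1.ConditionStatus) bool {
        for _, c := range co.Status.Conditions {
            if c.Type == condType {
                return c.Status == status
            }
        }
        return false
    }

    // waitForKCMOperatorLevel polls until the kube-controller-manager
    // clusteroperator reports the expected Available/Progressing/Degraded
    // statuses, or the timeout expires.
    func waitForKCMOperatorLevel(ctx context.Context, client configclient.Interface, available, progressing, degraded configv1.ConditionStatus, timeout time.Duration) error {
        return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
            co, err := client.ConfigV1().ClusterOperators().Get(ctx, "kube-controller-manager", metav1.GetOptions{})
            if err != nil {
                return false, nil // tolerate transient API errors and keep polling
            }
            return conditionIs(co, configv1.OperatorAvailable, available) &&
                conditionIs(co, configv1.OperatorProgressing, progressing) &&
                conditionIs(co, configv1.OperatorDegraded, degraded), nil
        })
    }

Whatever the real helper looks like, the key point is that its poll deadline, plus everything else the test does, has to fit inside the overall go test timeout for the e2e package.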

The operator's logs show that it is reporting Progressing=True shortly before the timeout:
    
    I0514 17:45:54.565824       1 status_controller.go:172] clusteroperator/kube-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-05-14T17:25:39Z","message":"NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal container \"cluster-policy-controller\" is not ready: CrashLoopBackOff: back-off 20s restarting failed container=cluster-policy-controller pod=kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal_openshift-kube-controller-manager(6c1d45a7d2eb9c29222aa7c80fb9e663)\nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal container \"cluster-policy-controller\" is waiting: CrashLoopBackOff: back-off 20s restarting failed container=cluster-policy-controller pod=kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal_openshift-kube-controller-manager(6c1d45a7d2eb9c29222aa7c80fb9e663)","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-05-14T17:43:31Z","message":"NodeInstallerProgressing: 2 nodes are at revision 9; 1 nodes are at revision 10","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-05-14T17:22:19Z","message":"StaticPodsAvailable: 3 nodes are active; 2 nodes are at revision 9; 1 nodes are at revision 10","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-05-14T17:18:24Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

The operator reaches level about a minute later, per the CI artifacts:
    
                        {
                            "lastTransitionTime": "2020-05-14T17:25:39Z",
                            "message": "NodeControllerDegraded: All master nodes are ready",
                            "reason": "AsExpected",
                            "status": "False",
                            "type": "Degraded"
                        },
                        {
                            "lastTransitionTime": "2020-05-14T17:48:06Z",
                            "message": "NodeInstallerProgressing: 3 nodes are at revision 10",
                            "reason": "AsExpected",
                            "status": "False",
                            "type": "Progressing"
                        },
                        {
                            "lastTransitionTime": "2020-05-14T17:22:19Z",
                            "message": "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 10",
                            "reason": "AsExpected",
                            "status": "True",
                            "type": "Available"
                        },

I think the test needs a longer timeout.
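
The linked PR title ("Bug 1835992: Raise operator e2e test timeout") points the same way. The 10m0.063s in the FAIL line is consistent with go test's default 10-minute per-binary timeout, and in this run the rollout settled only about a minute past that deadline. Continuing the purely illustrative sketch above (same hypothetical file; it would additionally import "testing" and "k8s.io/client-go/rest"), the shape of the change is simply a more generous deadline; the concrete value and mechanism in the merged PR may differ, for example it could instead raise the package-level go test -timeout:

    // Raised well past the ~10 minutes the failing run was allowed. The value
    // here is illustrative, not taken from the merged PR.
    const operatorSettleTimeout = 20 * time.Minute

    func TestSATokenSignerControllerSyncCertsSketch(t *testing.T) {
        ctx := context.Background()

        // Stand-in for however the e2e suite actually builds its *rest.Config.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            t.Fatalf("failed to build client config: %v", err)
        }
        client := configclient.NewForConfigOrDie(cfg)

        // Same level the real test waits for:
        // Available=True, Progressing=False, Degraded=False.
        err = waitForKCMOperatorLevel(ctx, client,
            configv1.ConditionTrue, configv1.ConditionFalse, configv1.ConditionFalse,
            operatorSettleTimeout)
        if err != nil {
            t.Fatalf("kube-controller-manager clusteroperator did not reach level: %v", err)
        }
    }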

Comment 1 Maciej Szulik 2020-05-20 09:45:00 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 2 Maciej Szulik 2020-05-26 14:22:33 UTC
I don't think I'll be able to look into this sooner than next week, so I'm moving it off the 4.5 list, since it's not a blocker.

Comment 3 Maciej Szulik 2020-06-18 10:11:05 UTC
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.

Comment 9 errata-xmlrpc 2020-10-27 15:59:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

