Description of problem:

The e2e-aws-operator CI job frequently fails because the TestSATokenSignerControllerSyncCerts test times out.

How reproducible:

"Across 69 runs and 1 jobs (91.30% failed), matched 49.21% of failing runs and 100.00% of jobs"

https://search.apps.build01.ci.devcluster.openshift.com/?search=FAIL%3A+.*+TestSATokenSignerControllerSyncCerts&maxAge=336h&context=1&type=junit&name=pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-operator&maxMatches=5&maxBytes=20971520&groupBy=job

Additional info:

I investigated one failure:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-controller-manager-operator/415/pull-ci-openshift-cluster-kube-controller-manager-operator-master-e2e-aws-operator/738

In this instance, TestSATokenSignerControllerSyncCerts times out at or shortly before 17:47:11:

FAIL: github.com/openshift/cluster-kube-controller-manager-operator/test/e2e TestSATokenSignerControllerSyncCerts 10m0.063s
make: *** [test-unit] Error 1
2020/05/14 17:47:11 Container test in pod e2e-aws-operator failed, exit code 2, reason Error

The test is timing out waiting for the clusteroperator to reach level:

github.com/openshift/cluster-kube-controller-manager-operator/test/e2e.TestSATokenSignerControllerSyncCerts(0xc00043f400)
    /go/src/github.com/openshift/cluster-kube-controller-manager-operator/test/e2e/satokensigner_test.go:45 +0x68c

It is failing on this line, waiting for Available=True, Progressing=False, Degraded=False:

test.WaitForKubeControllerManagerClusterOperator(t, ctx, configClient, configv1.ConditionTrue, configv1.ConditionFalse, configv1.ConditionFalse)

The operator's logs show that it is reporting Progressing=True shortly before the timeout:

I0514 17:45:54.565824 1 status_controller.go:172] clusteroperator/kube-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2020-05-14T17:25:39Z","message":"NodeControllerDegraded: All master nodes are ready\nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal container \"cluster-policy-controller\" is not ready: CrashLoopBackOff: back-off 20s restarting failed container=cluster-policy-controller pod=kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal_openshift-kube-controller-manager(6c1d45a7d2eb9c29222aa7c80fb9e663)\nStaticPodsDegraded: pod/kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal container \"cluster-policy-controller\" is waiting: CrashLoopBackOff: back-off 20s restarting failed container=cluster-policy-controller pod=kube-controller-manager-ip-10-0-159-62.us-east-2.compute.internal_openshift-kube-controller-manager(6c1d45a7d2eb9c29222aa7c80fb9e663)","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2020-05-14T17:43:31Z","message":"NodeInstallerProgressing: 2 nodes are at revision 9; 1 nodes are at revision 10","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2020-05-14T17:22:19Z","message":"StaticPodsAvailable: 3 nodes are active; 2 nodes are at revision 9; 1 nodes are at revision 10","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2020-05-14T17:18:24Z","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}

The operator reaches level about a minute later, per the CI artifacts:

{
    "lastTransitionTime": "2020-05-14T17:25:39Z",
    "message": "NodeControllerDegraded: All master nodes are ready",
    "reason": "AsExpected",
    "status": "False",
    "type": "Degraded"
},
{
    "lastTransitionTime": "2020-05-14T17:48:06Z",
    "message": "NodeInstallerProgressing: 3 nodes are at revision 10",
    "reason": "AsExpected",
    "status": "False",
    "type": "Progressing"
},
{
    "lastTransitionTime": "2020-05-14T17:22:19Z",
    "message": "StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 10",
    "reason": "AsExpected",
    "status": "True",
    "type": "Available"
},

I think the test needs a longer timeout.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
I don't think I'll be able to look into this sooner than next week, so I'm moving it out of the 4.5 list, since it's not a blocker.
Checked with https://search.ci.openshift.org/?search=TestSATokenSignerControllerSyncCerts&maxAge=336h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job: the test failed 4 times in the past 14 days, so I will verify this issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196