The test "[sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]" is failing in many different release and operator test jobs:

https://search.apps.build01.ci.devcluster.openshift.com/?search=Some+cluster+operators+never+became+ready%3A+kube-apiserver&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=none

Examples:
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-ovn-step-registry/366
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-single/97
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi/96
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.5/1043
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1417
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/operator-framework_operator-registry/319/pull-ci-operator-framework-operator-registry-master-e2e-aws/1061
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1028
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1040
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.6/4902
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/199/pull-ci-openshift-sriov-network-operator-master-e2e-aws/552
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/cri-o_cri-o/3738/pull-ci-cri-o-cri-o-release-1.18-e2e-aws/80
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1933
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.5/1044
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/10530
- ... and many more ...
[buildcop] Still seeing this consistently as of today, e.g.: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1908
I looked into two failed runs. In both of them, the kube-apiserver rollout completed roughly 30 seconds after the timeout expired:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/345/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1425
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/345/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp/1329

In my experience, the kube-apiserver operator takes anywhere from 7 to 9 minutes to roll out, assuming no errors occur. Increasing the timeout might reduce the number of failures.
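The suggestion above (a deadline long enough to cover the observed 7-9 minute rollout) boils down to a poll-with-timeout loop. A minimal sketch, purely illustrative and not code from the actual test suite:

```python
import time

def wait_for(check, timeout_s, interval_s=1):
    """Poll check() until it returns True or timeout_s elapses.

    Returns True on success, False if the deadline passed first. A
    timeout_s comfortably above the 7-9 minute rollout window would
    avoid failing runs that were about to succeed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The key design point is that the loop re-checks the condition right up to the deadline, so a rollout that finishes ~30s late only fails if the overall budget is too small, which is exactly what the two runs above suggest.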
[buildcop] Still seeing this consistently as of today, e.g.: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1105
Looked at this run to debug: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.5/1125

CI finished setting up the cluster at:
> 2020/05/13 12:17:55 Container setup in pod e2e-aws-ovn completed successfully

The test started after setup completed:
> May 13 12:18:07.939: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable

The test failed about a minute later:
> May 13 12:19:09.575: Some cluster operators never became ready: kube-apiserver (Progressing=True NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 7)

But the kube-apiserver operator was still rolling out a new revision when the test started, and the rollout only completed some time after the failure:
> "lastTransitionTime": "2020-05-13T12:19:54Z",
> "message": "3 nodes are at revision 7",
> "reason": "AllNodesAtLatestRevision",
> "status": "False",
> "type": "NodeInstallerProgressing"

It looks like the test starts prematurely, before the installation is actually complete. I tried to find out what triggered the new kube-apiserver revision. After enough digging, it turned out the oauth-openshift route was not ready; the authentication operator updates the oauth metadata only after the route is available, and that update triggered the revision.
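The check being debugged above amounts to reading the NodeInstallerProgressing condition and waiting for it to go False. A small illustrative helper (the field names match the condition quoted in the logs; the function itself is hypothetical, not code from the operator or the test):

```python
def node_installer_settled(conditions):
    """True once NodeInstallerProgressing reports status "False",
    i.e. all nodes are at the latest revision."""
    for cond in conditions:
        if cond["type"] == "NodeInstallerProgressing":
            return cond["status"] == "False"
    # Condition absent: treat the rollout as not yet settled.
    return False

# Snapshot matching the quoted condition after the rollout finished:
settled = [{
    "lastTransitionTime": "2020-05-13T12:19:54Z",
    "message": "3 nodes are at revision 7",
    "reason": "AllNodesAtLatestRevision",
    "status": "False",
    "type": "NodeInstallerProgressing",
}]
print(node_installer_settled(settled))  # True: rollout complete
```

At 12:19:09 the same check would have seen status "True" (nodes still split between revisions 5 and 7) and correctly reported the rollout as in progress, which is why the test failed.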
[build-cop] still seeing these errors in CI:
- https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5
- https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/638/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1446
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1129

and many more: https://search.apps.build01.ci.devcluster.openshift.com/?search=Managed+cluster+should+start+all+core+operators&maxAge=12h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
This issue started popping up in the past 8 days, which clearly looks like a regression. The test is not supposed to run before the install is complete. Moving this to the installer team to take a look; ideally, the install command should not complete while cluster operators are still rolling out. https://search.apps.build01.ci.devcluster.openshift.com/?search=Some+cluster+operators+never+became+ready%3A+kube-apiserver&maxAge=336h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=none
This test always fails once on the Progressing condition called out above. But it is then retried at the end of the run and passes, so it never (or at least rarely) fails the job.
Per https://search.apps.build01.ci.devcluster.openshift.com/?search=Some+cluster+operators+never+became+ready%3A+kube-apiserver&maxAge=336h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=none, no 'Some cluster operators never became ready: kube-apiserver' errors have been seen recently.
No doc update is needed: the PRs fixed the test from flaky to passing; it was never outright failing. And users probably don't care that a flaky test was fixed because "the old tests weren't quite looking at the right thing".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409