Bug 1833387 - [sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]: Some cluster operators never became ready: kube-apiserver
Summary: [sig-arch][Early] Managed cluster should start all core operators [Suite:open...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: low
Target Milestone: ---
Target Release: 4.5.0
Assignee: W. Trevor King
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-05-08 14:53 UTC by Joe Lanford
Modified: 2020-07-13 17:36 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:36:23 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
- GitHub openshift/origin pull 24992 (closed): Bug 1833387: test/extended/operators/operators: Don't worry about Progressing ClusterOperator (last updated 2020-09-03 23:34:50 UTC)
- GitHub openshift/origin pull 24993 (closed): Bug 1833387: test/extended/operators/operators: Drop cvoWait and operatorWait (last updated 2020-09-03 23:34:49 UTC)
- Red Hat Product Errata RHBA-2020:2409 (last updated 2020-07-13 17:36:34 UTC)

Description Joe Lanford 2020-05-08 14:53:07 UTC
The test "[sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]" is failing in many different release and operator test jobs

https://search.apps.build01.ci.devcluster.openshift.com/?search=Some+cluster+operators+never+became+ready%3A+kube-apiserver&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=none

Examples:
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-ovn-step-registry/366
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-single/97
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi/96
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.5/1043
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1417
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/operator-framework_operator-registry/319/pull-ci-operator-framework-operator-registry-master-e2e-aws/1061
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1028
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1040
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.6/4902
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/199/pull-ci-openshift-sriov-network-operator-master-e2e-aws/552
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/cri-o_cri-o/3738/pull-ci-cri-o-cri-o-release-1.18-e2e-aws/80
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1933
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.5/1044
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/10530
- ... and many more ...

Comment 1 Fabiano Franz 2020-05-11 17:53:56 UTC
[buildcop] Still seeing this consistently as of today, e.g.:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1908

Comment 2 Venkata Siva Teja Areti 2020-05-12 16:20:25 UTC
I looked into two failed runs. In both of them, the kube-apiserver rollout completed roughly 30s after the test's timeout.

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/345/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1425
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/345/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp/1329

In my experience, the kube-apiserver operator takes anywhere from 7 to 9 minutes, assuming no errors are seen during the rollout. Maybe the number of failures can be reduced by increasing the timeout.
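
A minimal sketch, in standard-library Go only, of what polling with a longer timeout could look like; waitForReady, the intervals, and the stand-in check function are illustrative assumptions, not the actual origin test code:

package main

import (
	"fmt"
	"time"
)

// waitForReady polls check every interval until it reports ready or the
// timeout expires. Bumping the timeout to ~10m would cover the 7-9 minute
// kube-apiserver rollout described above.
func waitForReady(interval, timeout time.Duration, check func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		ready, err := check()
		if err != nil {
			return err
		}
		if ready {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("cluster operators never became ready within %v", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	start := time.Now()
	err := waitForReady(10*time.Second, 10*time.Minute, func() (bool, error) {
		// The real test would inspect ClusterOperator status conditions here;
		// this stand-in just flips to ready after 30s.
		return time.Since(start) > 30*time.Second, nil
	})
	fmt.Println("wait result:", err)
}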

Comment 3 Mansi Kulkarni 2020-05-12 18:40:25 UTC
[buildcop] Still seeing this consistently as of today, e.g.:
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1105

Comment 4 Venkata Siva Teja Areti 2020-05-14 17:53:40 UTC
Looked at this run to debug:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.5/1125

CI finished setting up the cluster at:

> 2020/05/13 12:17:55 Container setup in pod e2e-aws-ovn completed successfully

The test started after setup completed:

> May 13 12:18:07.939: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable

The test failed about a minute later:

> May 13 12:19:09.575: Some cluster operators never became ready: kube-apiserver (Progressing=True NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 7)

But the kube-apiserver operator was still rolling out a new revision when the test started, and the rollout only completed some time later:

> "lastTransitionTime": "2020-05-13T12:19:54Z",
> "message": "3 nodes are at revision 7"
> "reason": "AllNodesAtLatestRevision",
> "status": "False",
> "type": "NodeInstallerProgressing"

It looks like the test is started prematurely, before the installation is actually complete.

I tried to find out what triggered the new revision of kube-apiserver. After enough digging, it turned out that the oauth-openshift route was not ready. The authentication operator updates the oauth metadata only after the route becomes available, and that update triggered the new revision.
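
For context, a minimal sketch (the condition type, strictlyReady, and the sample data are illustrative assumptions, not the actual origin test code) of a readiness check that treats Progressing=True as not-ready, which is why the NodeInstallerProgressing state quoted above fails the test even though the operator is Available:

package main

import "fmt"

// condition is a trimmed-down stand-in for a ClusterOperator status condition.
type condition struct {
	Type   string
	Status string
}

// strictlyReady requires Available=True, Degraded=False, and Progressing=False,
// so an operator that is healthy but still rolling out counts as "not ready".
func strictlyReady(conds []condition) bool {
	want := map[string]string{
		"Available":   "True",
		"Degraded":    "False",
		"Progressing": "False",
	}
	for _, c := range conds {
		if expected, ok := want[c.Type]; ok && c.Status != expected {
			return false
		}
	}
	return true
}

func main() {
	// Roughly the kube-apiserver state at the time of the failure above.
	conds := []condition{
		{Type: "Available", Status: "True"},
		{Type: "Degraded", Status: "False"},
		{Type: "Progressing", Status: "True"}, // NodeInstaller rolling out revision 7
	}
	fmt.Println("ready:", strictlyReady(conds)) // prints "ready: false" -> test flake
}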

Comment 6 Venkata Siva Teja Areti 2020-05-15 16:31:12 UTC
This issue started popping up within the past 8 days, so it clearly looks like a regression.

The test is not supposed to run before the install is complete. Moving this to the installer team to have a look. Ideally, the install command should not report completion while cluster operators are still rolling out.

https://search.apps.build01.ci.devcluster.openshift.com/?search=Some+cluster+operators+never+became+ready%3A+kube-apiserver&maxAge=336h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=none

Comment 7 W. Trevor King 2020-05-16 02:51:52 UTC
This test always fails once on the Progressing condition called out above. But it is then retested at the end of the run and passes, so it never (or at least rarely) fails the job.

Comment 11 W. Trevor King 2020-06-05 14:59:13 UTC
No doc update needed, because the PRs fixed the test from flaky to passing; the job itself was never failing. And users probably don't care about a flaky-to-passing fix whose root cause was "the old tests weren't quite looking at the right thing".
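
For comparison, a sketch of the relaxed check implied by the linked PR title ("Don't worry about Progressing ClusterOperator"): only Available=True and Degraded!=True are required, and Progressing is ignored. Names and sample data are illustrative, not the actual test code:

package main

import "fmt"

// condition is a trimmed-down stand-in for a ClusterOperator status condition.
type condition struct {
	Type   string
	Status string
}

// readyIgnoringProgressing only cares about Available and Degraded; an
// operator that is mid-rollout (Progressing=True) no longer fails the check.
func readyIgnoringProgressing(conds []condition) bool {
	available, degraded := false, false
	for _, c := range conds {
		switch c.Type {
		case "Available":
			available = c.Status == "True"
		case "Degraded":
			degraded = c.Status == "True"
		}
	}
	return available && !degraded
}

func main() {
	conds := []condition{
		{Type: "Available", Status: "True"},
		{Type: "Degraded", Status: "False"},
		{Type: "Progressing", Status: "True"}, // still rolling out revision 7
	}
	fmt.Println("ready:", readyIgnoringProgressing(conds)) // prints "ready: true"
}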

Comment 12 errata-xmlrpc 2020-07-13 17:36:23 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

