The test "[sig-arch][Early] Managed cluster should start all core operators [Suite:openshift/conformance/parallel]" is failing in many different release and operator test jobs:

https://search.apps.build01.ci.devcluster.openshift.com/?search=Some+cluster+operators+never+became+ready%3A+kube-apiserver&maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=none

Examples:
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-ovn-step-registry/366
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-single/97
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-aws-sdn-multi/96
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-fips-4.5/1043
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1417
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/operator-framework_operator-registry/319/pull-ci-operator-framework-operator-registry-master-e2e-aws/1061
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/630/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1028
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1040
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.6/4902
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_sriov-network-operator/199/pull-ci-openshift-sriov-network-operator-master-e2e-aws/552
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/cri-o_cri-o/3738/pull-ci-cri-o-cri-o-release-1.18-e2e-aws/80
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/1933
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.5/1044
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/promote-release-openshift-machine-os-content-e2e-aws-4.5/10530
- ... and many more ...
[buildcop] Still seeing this consistently as of today, e.g.: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5/1908
I looked into two failed runs. In both of them, the kube-apiserver rollout completed roughly 30 seconds after the timeout expired:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/345/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws/1425
https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-etcd-operator/345/pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp/1329

In my experience, the kube-apiserver operator takes anywhere from 7 to 9 minutes to roll out, assuming no errors occur. Increasing the timeout might reduce the number of failures.
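The suggestion above (a deadline long enough to cover the observed 7-9 minute rollout) boils down to a poll-with-timeout loop. A minimal sketch, purely illustrative and not code from the actual test suite:

```python
import time

def wait_for(check, timeout_s, interval_s=1):
    """Poll check() until it returns True or timeout_s elapses.

    Returns True on success, False if the deadline passed first. A
    timeout_s comfortably above the 7-9 minute rollout window would
    avoid failing runs that were about to succeed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The key design point is that the loop re-checks the condition right up to the deadline, so a rollout that finishes ~30s late only fails if the overall budget is too small, which is exactly what the two runs above suggest.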
[buildcop] Still seeing this consistently as of today, e.g.: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1105
Looked at this run to debug: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-ovn-4.5/1125

CI finished setting up the cluster at:
> 2020/05/13 12:17:55 Container setup in pod e2e-aws-ovn completed successfully

The test started after setup completed:
> May 13 12:18:07.939: INFO: Waiting up to 30m0s for all (but 100) nodes to be schedulable

The test failed about a minute later:
> May 13 12:19:09.575: Some cluster operators never became ready: kube-apiserver (Progressing=True NodeInstaller: NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 7)

But the kube-apiserver operator was still rolling out a new revision when the test started, and the rollout only completed some time after the failure:
> "lastTransitionTime": "2020-05-13T12:19:54Z",
> "message": "3 nodes are at revision 7",
> "reason": "AllNodesAtLatestRevision",
> "status": "False",
> "type": "NodeInstallerProgressing"

It looks like the test starts prematurely, before the installation is actually complete. I tried to find out what triggered the new kube-apiserver revision. After enough digging, it turned out the oauth-openshift route was not ready; the authentication operator updates the oauth metadata only after the route is available, and that update triggered the revision.
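The check being debugged above amounts to reading the NodeInstallerProgressing condition and waiting for it to go False. A small illustrative helper (the field names match the condition quoted in the logs; the function itself is hypothetical, not code from the operator or the test):

```python
def node_installer_settled(conditions):
    """True once NodeInstallerProgressing reports status "False",
    i.e. all nodes are at the latest revision."""
    for cond in conditions:
        if cond["type"] == "NodeInstallerProgressing":
            return cond["status"] == "False"
    # Condition absent: treat the rollout as not yet settled.
    return False

# Snapshot matching the quoted condition after the rollout finished:
settled = [{
    "lastTransitionTime": "2020-05-13T12:19:54Z",
    "message": "3 nodes are at revision 7",
    "reason": "AllNodesAtLatestRevision",
    "status": "False",
    "type": "NodeInstallerProgressing",
}]
print(node_installer_settled(settled))  # True: rollout complete
```

At 12:19:09 the same check would have seen status "True" (nodes still split between revisions 5 and 7) and correctly reported the rollout as in progress, which is why the test failed.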
[build-cop] still seeing these errors in CI:
- https://prow.svc.ci.openshift.org/job-history/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.5
- https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/638/pull-ci-openshift-cluster-network-operator-master-e2e-gcp/1446
- https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-gcp-4.5/1129

and many more: https://search.apps.build01.ci.devcluster.openshift.com/?search=Managed+cluster+should+start+all+core+operators&maxAge=12h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
This issue started popping up in the past 8 days, which clearly looks like a regression. The test is not supposed to run before the install is complete. Moving this to the installer team to take a look; ideally, the install command should not complete while cluster operators are still rolling out. https://search.apps.build01.ci.devcluster.openshift.com/?search=Some+cluster+operators+never+became+ready%3A+kube-apiserver&maxAge=336h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=none
This test always fails once on the Progressing condition called out above. But it is then retried at the end of the run and passes, so it never (or at least rarely) fails the job.
Per https://search.apps.build01.ci.devcluster.openshift.com/?search=Some+cluster+operators+never+became+ready%3A+kube-apiserver&maxAge=336h&context=1&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=none, no 'Some cluster operators never became ready: kube-apiserver' errors have been seen recently.
No doc update is needed: the PRs fixed the test from flaky to passing; it was never outright failing. And users probably don't care that a flaky test was fixed because "the old tests weren't quite looking at the right thing".
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409