
Bug 1743114

Summary:	Several kinds of failures: Run template e2e-gcp - e2e-gcp container setup
Product:	OpenShift Container Platform	Reporter:	Xingxing Xia <xxia>
Component:	Installer	Assignee:	Abhinav Dahiya <adahiya>
Installer sub component:	openshift-installer	QA Contact:	Johnny Liu <jialiu>
Status:	CLOSED WORKSFORME	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	alegrand, anpicker, aos-bugs, erooth, jokerman, mloibl, pkrupa, surbania
Version:	4.2.0
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-08-20 03:37:43 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1736168

Description Xingxing Xia 2019-08-19 06:41:15 UTC

Description of problem:
One job https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/47 failed with:
level=fatal msg="failed to initialize the cluster: Multiple errors are preventing progress:\n* Cluster operator console is reporting a failure: CustomLogoDegraded: waiting on route host\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, check extension 
API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"

Three other kinds of jobs:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/46
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/45
... etc.
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/21
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/20
... etc.
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-fips-4.2/21
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-fips-4.2/20
... etc.
These failed with:
Installing from release registry.svc.ci.openshift.org/ocp/release:4.2
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: Error creating service account: googleapi: Error 429: Maximum number of service accounts on project reached., rateLimitExceeded"
level=error
level=error msg="  on ../tmp/openshift-install-026222549/iam/main.tf line 1, in resource \"google_service_account\" \"worker-node-sa\":"
level=error msg="   1: resource \"google_service_account\" \"worker-node-sa\" {"
level=error
level=error
level=error
level=error msg="Error: Error creating service account: googleapi: Error 429: Maximum number of service accounts on project reached., rateLimitExceeded"
level=error
level=error msg="  on ../tmp/openshift-install-026222549/master/main.tf line 1, in resource \"google_service_account\" \"master-node-sa\":"
level=error msg="   1: resource \"google_service_account\" \"master-node-sa\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
---
Container test exited with code 1, reason Error
---
Another process exited
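
The two failure kinds above can be told apart mechanically from the installer output. The following is a minimal sketch, not part of openshift-install: the function name and the label strings are hypothetical, and it assumes the log lines have the `level=... msg="..."` shape shown above.

```python
import re

# Hypothetical helper: matches installer lines such as
#   level=error msg="..."   or a bare   level=error
LINE_RE = re.compile(r'^level=(?P<level>\w+)(?: msg="(?P<msg>.*)")?$')


def classify_install_log(text):
    """Return a set of coarse failure labels found in installer output.

    The labels are made up for this sketch; only the matched substrings
    come from the logs quoted in this bug.
    """
    labels = set()
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group("level") not in ("error", "fatal"):
            continue
        msg = m.group("msg") or ""
        if "Error 429" in msg and "service accounts" in msg:
            labels.add("gcp-service-account-quota")
        elif "failed to initialize the cluster" in msg:
            labels.add("cluster-init-failed")
        elif "failed to apply using Terraform" in msg:
            labels.add("terraform-apply-failed")
    return labels
```

Run against the e2e-gcp-serial and e2e-gcp-fips output quoted above, this would report the quota label together with the Terraform label, while job 47 would only report the cluster initialization label.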

Comment 1 Xingxing Xia 2019-08-19 12:09:11 UTC

Also found the above error in an Azure env installation with the latest 4.2.0-0.nightly-2019-08-19-071902:
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65669/
log: https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65669/console
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65669/artifact/workdir/install-dir/auth/kubeconfig

level=info msg="Waiting up to 30m0s for the cluster at https://api....qe.azure.devcluster.openshift.com:6443 to initialize..."
level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor 
\"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-08-19-071902: 92% complete"
level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Cluster operator console is reporting a failure: CustomLogoDegraded: waiting on route host\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, 
check extension API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-08-19-071902: 92% complete"
level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Cluster operator console is reporting a failure: CustomLogoDegraded: waiting on route host\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, 
check extension API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"
level=fatal msg="failed to initialize the cluster: Multiple errors are preventing progress:\n* Cluster operator console is reporting a failure: CustomLogoDegraded: waiting on route host\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, check extension 
API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"
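
The aggregated "Multiple errors are preventing progress" message is hard to read as one line. The sketch below splits it into individual items and counts them by resource kind and reason. It assumes the message is taken from the log as printed, with literal `\n` and `\"` escapes; the function names and the regular expression are illustrative, not anything the installer provides.

```python
import re
from collections import Counter

# Illustrative pattern for items like:
#   Could not update servicemonitor "ns/name" (405 of 410): <reason>
ITEM_RE = re.compile(
    r'Could not update (?P<kind>\w+) "(?P<name>[^"]+)" '
    r'\((?P<idx>\d+) of (?P<total>\d+)\): (?P<reason>.*)'
)


def split_progress_errors(msg):
    """Split the aggregated message into its '* ' bullet items."""
    # The installer prints newlines and quotes escaped, so undo that first.
    msg = msg.replace('\\n', '\n').replace('\\"', '"')
    return [line[2:].strip() for line in msg.splitlines() if line.startswith("* ")]


def summarize(items):
    """Count items by (resource kind, reason); anything else is kept verbatim."""
    counts = Counter()
    for item in items:
        m = ITEM_RE.match(item)
        if m:
            counts[(m.group("kind"), m.group("reason"))] += 1
        else:
            counts[("other", item)] += 1
    return counts
```

On the fatal message above, this would collapse the servicemonitor items into a single (kind, reason) entry and leave the console operator failure as a separate item, which makes the missing servicemonitor resource type stand out as the common factor.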

Comment 2 Pawel Krupa 2019-08-19 14:54:06 UTC

Analyzing the CI artifacts, it looks like cluster-monitoring-operator (CMO) was never started: there are no logs from the CMO pods. This is probably because CMO runs on worker nodes, which were not started either (there are no logs from them, and workers-journal is empty: https://storage.cloud.google.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/47/artifacts/e2e-gcp/nodes/workers-journal)

Reassigning to the installer team, as it seems the installation didn't finish successfully.

Comment 3 Abhinav Dahiya 2019-08-19 16:37:49 UTC

> https://00e9e64bac72a6c5a275303d2f50c1d69c774c446078c71291-apidata.googleusercontent.com/download/storage/v1/b/origin-ci-test/o/logs%2Fcanary-openshift-ocp-installer-e2e-gcp-4.2%2F47%2Fartifacts%2Fe2e-gcp%2Fpods%2Fopenshift-cloud-credential-operator_cloud-credential-operator-cc9fd5444-6pgbd_manager.log?qk=AD5uMEuj-ON8tndmm7-17UyXQmeJ-J51t4-2GknMDQqYoKwsCx_kiIYNTYIIu7Lvs0ESj5ncfZVAz33-n9Pp4UxCKBCkAlaKvnUObYpTnl48mN21klZewhXsFPPa6q1liLKmvBvnrjyR4L0PC77H_y47MGGvpJjf0E4Nsl_TEb8_uEv-Nk8DctAAAlURmhQnrFFDp8KzBVO-mf8A33kyv9J3c3nGoKoLgTZnCewMKJ_McOmwdw72MYPDk1QvCTOaQPtxwWsbofdctvQvdr8-jzNfLKw3WQiJF-yLOCeEKdwLlDSmqV9PgZtwHcQ_D9MPNdCsB6bKP3S1gu3rvoxCP09jtxiJj2rlt907vXMM3a0jrBsvHGp0N04gC741YmTSUyMO0ew4cmML4nhVOTKJqi1uH1U7zRvvpRlnJBUL5TrELSBz-HJ84v4vLfJaRApZvo9a_v4xhnTWsuLgqxF-s7z0TIhzddT26Gv0rfSi4Uhl099jtWKI7Bg72HmiQ5_LNJwMIoivRUaMXBZcg8vKddvf2Q2IctytU1klZ4YgAxzUY_cNhi-7ijg4XscOxFrcfx0OIxrY5ky8dDTf3iWrl9E_dQbX3DlxJ9Kk5BD-Gd8-VhBSa2zRGy76Qu8Acj9v4huDyf9-mqbXYABrAuwX2yVpC9JUxZxtXUd-Agcp6ijno1hzQs8m9Gelsj5JIaOvVrQBn6FQPw3Q2xVOxliRiZfPfAzSX9aF_KdzpQUkpUNYNunhbCK3HC1cHW6DqItN1uWaniYuhQDT2aFFrH2HIg573fnpKSPB_JfL-PE8iCc43l_DjaOkz9rfmt5My2jdIL_1rPO-zH9GI6y99GrJisppe4I7xLrQf6v6yOd74Z6tpTC97O1q3e1kk3Tt0-emVraoROGYdMtGeUINlTAcvW96PNyVE1yTSeQh0KHcMngpzv7B02W6LFjSzNqIDPTH6zYzSn_STWxu23TPk0TP2pMIlqwPFjyas-CM3Ays-DCNdqlxQ4AhJQM

```
time="2019-08-18T22:42:04Z" level=error msg="error syncing creds in mint-mode" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp error="error creating service account: rpc error: code = ResourceExhausted desc = Maximum number of service accounts on project reached."
time="2019-08-18T22:42:04Z" level=error msg="error syncing credentials: error syncing creds in mint-mode: error creating service account: rpc error: code = ResourceExhausted desc = Maximum number of service accounts on project reached." controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-gcp secret=openshift-machine-api/gcp-cloud-credentials
```

Most of these are due to service account being reached.

Comment 4 Abhinav Dahiya 2019-08-19 16:48:59 UTC

> Most of these are due to service account being reached.

* service account limits being reached in the CI project.
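
A pre-flight check along these lines could flag a project that is close to its service account limit before a job starts. This is only a sketch: it assumes the `gcloud` CLI is installed and authenticated for the project, and the function names, the `limit` and `needed` values, and the injectable `run` parameter are all hypothetical. The actual limit for the CI project is not stated in this bug, so the caller has to supply it.

```python
import subprocess


def count_service_accounts(project, run=subprocess.run):
    """Count service accounts in a GCP project by listing them with gcloud."""
    result = run(
        ["gcloud", "iam", "service-accounts", "list",
         "--project", project, "--format", "value(email)"],
        check=True, capture_output=True, text=True,
    )
    # One email per non-empty output line.
    return sum(1 for line in result.stdout.splitlines() if line.strip())


def has_headroom(project, limit, needed, run=subprocess.run):
    """Return (ok, used): ok is True if `needed` more accounts fit under `limit`.

    `limit` and `needed` are assumptions supplied by the caller; neither
    value is taken from this bug.
    """
    used = count_service_accounts(project, run=run)
    return limit - used >= needed, used
```

The `run` parameter exists only so the check can be exercised without calling gcloud; in normal use the default `subprocess.run` is left in place.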

Comment 5 Xingxing Xia 2019-08-20 03:27:38 UTC

(In reply to Xingxing Xia from comment #1)
> Also found the above error in an Azure env installation with the latest
> 4.2.0-0.nightly-2019-08-19-071902:
PS: I launched AWS and Azure envs with the same payload, 4.2.0-0.nightly-2019-08-20-002921. The AWS env installation succeeds, while the Azure env installation hits the failure from comment 1.

Comment 6 Abhinav Dahiya 2019-08-20 03:37:43 UTC

The bug was about e2e-gcp failing the canary jobs.

https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-4.2

https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/53 succeeded.

So this should no longer be a test blocker.

Please open a separate issue for Azure.