Bug 1743114 - Several kinds of failures: Run template e2e-gcp - e2e-gcp container setup
Summary: Several kinds of failures: Run template e2e-gcp - e2e-gcp container setup
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.2.0
Assignee: Abhinav Dahiya
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1736168
Reported: 2019-08-19 06:41 UTC by Xingxing Xia
Modified: 2019-08-20 04:16 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-08-20 03:37:43 UTC
Target Upstream Version:
Embargoed:



Description Xingxing Xia 2019-08-19 06:41:15 UTC
Description of problem:
One job https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/47 failed with:
level=fatal msg="failed to initialize the cluster: Multiple errors are preventing progress:\n* Cluster operator console is reporting a failure: CustomLogoDegraded: waiting on route host\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, check extension 
API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"

Three other kinds of jobs:
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/46
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/45
... etc.
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/21
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-serial-4.2/20
... etc.
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-fips-4.2/21
https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-fips-4.2/20
... etc.
These failed with:
Installing from release registry.svc.ci.openshift.org/ocp/release:4.2
level=warning msg="Found override for ReleaseImage. Please be warned, this is not advised"
level=info msg="Consuming \"Install Config\" from target directory"
level=info msg="Creating infrastructure resources..."
level=error
level=error msg="Error: Error creating service account: googleapi: Error 429: Maximum number of service accounts on project reached., rateLimitExceeded"
level=error
level=error msg="  on ../tmp/openshift-install-026222549/iam/main.tf line 1, in resource \"google_service_account\" \"worker-node-sa\":"
level=error msg="   1: resource \"google_service_account\" \"worker-node-sa\" {"
level=error
level=error
level=error
level=error msg="Error: Error creating service account: googleapi: Error 429: Maximum number of service accounts on project reached., rateLimitExceeded"
level=error
level=error msg="  on ../tmp/openshift-install-026222549/master/main.tf line 1, in resource \"google_service_account\" \"master-node-sa\":"
level=error msg="   1: resource \"google_service_account\" \"master-node-sa\" {"
level=error
level=error
level=fatal msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed to apply using Terraform"
---
Container test exited with code 1, reason Error
---
Another process exited
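
To tell the quota failures apart from other install failures when scanning many job logs, the Google API error can be pulled out of the Terraform output. This is only a minimal sketch: the regular expression is an assumption based on the shape of the two error lines above, and `summarize_google_errors` is a hypothetical helper, not part of the installer or CI tooling.

```python
import re
from collections import Counter

# Assumed shape, taken from the log lines above:
#   googleapi: Error <status>: <message>, <reason>"
GOOGLEAPI_ERROR = re.compile(r'googleapi: Error (\d+): (.*), (\w+)"')


def summarize_google_errors(log_text):
    """Count (status, reason, message) triples found in installer output."""
    counts = Counter()
    for line in log_text.splitlines():
        match = GOOGLEAPI_ERROR.search(line)
        if match:
            status, message, reason = match.groups()
            counts[(int(status), reason, message)] += 1
    return counts


# Two of the error lines from the description, plus an empty level=error line.
sample = '''level=error msg="Error: Error creating service account: googleapi: Error 429: Maximum number of service accounts on project reached., rateLimitExceeded"
level=error
level=error msg="Error: Error creating service account: googleapi: Error 429: Maximum number of service accounts on project reached., rateLimitExceeded"
'''

for (status, reason, message), count in summarize_google_errors(sample).items():
    print(f"{count}x HTTP {status} ({reason}): {message}")
```

A job whose log yields only 429 `rateLimitExceeded` entries about service accounts would fall in the second group of failures described above, rather than the servicemonitor group.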

Comment 1 Xingxing Xia 2019-08-19 12:09:11 UTC
Also found above error in Azure env installation with latest 4.2.0-0.nightly-2019-08-19-071902 :
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65669/
log: https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65669/console
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/65669/artifact/workdir/install-dir/auth/kubeconfig

level=info msg="Waiting up to 30m0s for the cluster at https://api....qe.azure.devcluster.openshift.com:6443 to initialize..."
level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor 
\"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-08-19-071902: 92% complete"
level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Cluster operator console is reporting a failure: CustomLogoDegraded: waiting on route host\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, 
check extension API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"
level=debug msg="Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-08-19-071902: 92% complete"
level=debug msg="Still waiting for the cluster to initialize: Multiple errors are preventing progress:\n* Cluster operator console is reporting a failure: CustomLogoDegraded: waiting on route host\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, 
check extension API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"
level=fatal msg="failed to initialize the cluster: Multiple errors are preventing progress:\n* Cluster operator console is reporting a failure: CustomLogoDegraded: waiting on route host\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (405 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (370 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-cluster-version/cluster-version-operator\" (8 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-controller-manager-operator/openshift-controller-manager-operator\" (409 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (386 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-controller-manager-operator/kube-controller-manager-operator\" (390 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-kube-scheduler-operator/kube-scheduler-operator\" (394 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/cluster-autoscaler-operator\" (148 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): the server does not recognize this resource, check extension 
API servers\n* Could not update servicemonitor \"openshift-operator-lifecycle-manager/olm-operator\" (399 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (379 of 410): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator\" (382 of 410): the server does not recognize this resource, check extension API servers"
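
The long "Multiple errors are preventing progress" message is easier to read once it is split into its bullet items and grouped by cause. The following is a rough sketch under the assumption that items are separated by a literal `\n* ` and that the cause is the text after the last `: ` in each item; `split_errors` and `group_by_cause` are hypothetical names, and the sample is shortened to three items from the log above.

```python
from collections import defaultdict


def split_errors(raw_msg):
    """Split an escaped installer message into (header, list of items)."""
    # The log stores newlines and quotes in escaped form (\n and \").
    text = raw_msg.replace('\\n', '\n').replace('\\"', '"')
    header, *items = text.split('\n* ')
    return header, items


def group_by_cause(items):
    """Group items by the text after the last ': ' (assumed to be the cause)."""
    groups = defaultdict(list)
    for item in items:
        subject, _, cause = item.rpartition(': ')
        groups[cause].append(subject)
    return dict(groups)


# Shortened sample: three of the items from the fatal message above.
sample = (
    r'failed to initialize the cluster: Multiple errors are preventing progress:'
    r'\n* Cluster operator console is reporting a failure: CustomLogoDegraded: waiting on route host'
    r'\n* Could not update servicemonitor \"openshift-image-registry/image-registry\" (376 of 410): '
    r'the server does not recognize this resource, check extension API servers'
    r'\n* Could not update servicemonitor \"openshift-machine-api/machine-api-operator\" (396 of 410): '
    r'the server does not recognize this resource, check extension API servers'
)

header, items = split_errors(sample)
for cause, subjects in group_by_cause(items).items():
    print(f"{len(subjects)} item(s): {cause}")
```

Grouped this way, the full message reduces to two causes: the console operator waiting on a route host, and the servicemonitor resource type not being recognized by the server.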

Comment 2 Pawel Krupa 2019-08-19 14:54:06 UTC
Judging from the CI artifacts, cluster-monitoring-operator (CMO) was never started (there are no logs from CMO pods). This is probably because CMO runs on worker nodes, which were not started either (there are no logs from them, and workers-journal is empty: https://storage.cloud.google.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/47/artifacts/e2e-gcp/nodes/workers-journal).

Reassigning to installer team as it seems like installation didn't finish successfully.

Comment 3 Abhinav Dahiya 2019-08-19 16:37:49 UTC
> https://00e9e64bac72a6c5a275303d2f50c1d69c774c446078c71291-apidata.googleusercontent.com/download/storage/v1/b/origin-ci-test/o/logs%2Fcanary-openshift-ocp-installer-e2e-gcp-4.2%2F47%2Fartifacts%2Fe2e-gcp%2Fpods%2Fopenshift-cloud-credential-operator_cloud-credential-operator-cc9fd5444-6pgbd_manager.log?qk=AD5uMEuj-ON8tndmm7-17UyXQmeJ-J51t4-2GknMDQqYoKwsCx_kiIYNTYIIu7Lvs0ESj5ncfZVAz33-n9Pp4UxCKBCkAlaKvnUObYpTnl48mN21klZewhXsFPPa6q1liLKmvBvnrjyR4L0PC77H_y47MGGvpJjf0E4Nsl_TEb8_uEv-Nk8DctAAAlURmhQnrFFDp8KzBVO-mf8A33kyv9J3c3nGoKoLgTZnCewMKJ_McOmwdw72MYPDk1QvCTOaQPtxwWsbofdctvQvdr8-jzNfLKw3WQiJF-yLOCeEKdwLlDSmqV9PgZtwHcQ_D9MPNdCsB6bKP3S1gu3rvoxCP09jtxiJj2rlt907vXMM3a0jrBsvHGp0N04gC741YmTSUyMO0ew4cmML4nhVOTKJqi1uH1U7zRvvpRlnJBUL5TrELSBz-HJ84v4vLfJaRApZvo9a_v4xhnTWsuLgqxF-s7z0TIhzddT26Gv0rfSi4Uhl099jtWKI7Bg72HmiQ5_LNJwMIoivRUaMXBZcg8vKddvf2Q2IctytU1klZ4YgAxzUY_cNhi-7ijg4XscOxFrcfx0OIxrY5ky8dDTf3iWrl9E_dQbX3DlxJ9Kk5BD-Gd8-VhBSa2zRGy76Qu8Acj9v4huDyf9-mqbXYABrAuwX2yVpC9JUxZxtXUd-Agcp6ijno1hzQs8m9Gelsj5JIaOvVrQBn6FQPw3Q2xVOxliRiZfPfAzSX9aF_KdzpQUkpUNYNunhbCK3HC1cHW6DqItN1uWaniYuhQDT2aFFrH2HIg573fnpKSPB_JfL-PE8iCc43l_DjaOkz9rfmt5My2jdIL_1rPO-zH9GI6y99GrJisppe4I7xLrQf6v6yOd74Z6tpTC97O1q3e1kk3Tt0-emVraoROGYdMtGeUINlTAcvW96PNyVE1yTSeQh0KHcMngpzv7B02W6LFjSzNqIDPTH6zYzSn_STWxu23TPk0TP2pMIlqwPFjyas-CM3Ays-DCNdqlxQ4AhJQM

```
time="2019-08-18T22:42:04Z" level=error msg="error syncing creds in mint-mode" actuator=gcp cr=openshift-cloud-credential-operator/openshift-machine-api-gcp error="error creating service account: rpc error: code = ResourceExhausted desc = Maximum number of service accounts on project reached."
time="2019-08-18T22:42:04Z" level=error msg="error syncing credentials: error syncing creds in mint-mode: error creating service account: rpc error: code = ResourceExhausted desc = Maximum number of service accounts on project reached." controller=credreq cr=openshift-cloud-credential-operator/openshift-machine-api-gcp secret=openshift-machine-api/gcp-cloud-credentials
```

Most of these are due to service account being reached.
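
The two operator log lines above are `key=value` pairs with quoted values, so they can be broken into fields for filtering. This is a sketch only: it uses Python's `shlex` shell-style splitting as an approximation of that log format (an assumption, not an exact parser for it), and `parse_log_line` is a hypothetical helper.

```python
import shlex


def parse_log_line(line):
    """Split a key=value log line into a dict, honoring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition('=')
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


# First log line from the block above.
line = (
    'time="2019-08-18T22:42:04Z" level=error '
    'msg="error syncing creds in mint-mode" actuator=gcp '
    'cr=openshift-cloud-credential-operator/openshift-machine-api-gcp '
    'error="error creating service account: rpc error: code = ResourceExhausted '
    'desc = Maximum number of service accounts on project reached."'
)

fields = parse_log_line(line)
if fields.get("level") == "error" and "ResourceExhausted" in fields.get("error", ""):
    print(f"{fields['cr']}: quota exhausted ({fields['actuator']})")
```

Filtering on the `error` field rather than on `msg` keeps the match tied to the gRPC status code, which is the same in both lines even though their `msg` text differs.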

Comment 4 Abhinav Dahiya 2019-08-19 16:48:59 UTC
> Most of these are due to service account being reached.

* service account limits being reached in the CI project.

Comment 5 Xingxing Xia 2019-08-20 03:27:38 UTC
(In reply to Xingxing Xia from comment #1)
> Also found above error in Azure env installation with latest
> 4.2.0-0.nightly-2019-08-19-071902 :
PS: I launched AWS and Azure envs with the same payload, 4.2.0-0.nightly-2019-08-20-002921. The AWS installation succeeded, while the Azure installation hit the failure from comment 1.

Comment 6 Abhinav Dahiya 2019-08-20 03:37:43 UTC
The bug was about e2e-gcp failing the canary jobs.

https://testgrid.k8s.io/redhat-openshift-release-informing#redhat-canary-openshift-ocp-installer-e2e-gcp-4.2

https://prow.k8s.io/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/53 succeeded.

So this should no longer be a test blocker.

Please open a separate issue for Azure.

