Description of problem:

level=fatal msg="failed to initialize the cluster: Multiple errors are preventing progress:\n* Could not update oauthclient \"console\" (296 of 487): the server does not recognize this resource, check extension API servers\n* Could not update prometheusrule \"openshift-cloud-credential-operator/cloud-credential-operator-alerts\" (141 of 487): the server does not recognize this resource, check extension API servers\n* Could not update prometheusrule \"openshift-cluster-samples-operator/samples-operator-alerts\" (238 of 487): the server does not recognize this resource, check extension API servers\n* Could not update role \"openshift-console-operator/prometheus-k8s\" (451 of 487): resource may have been deleted\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (482 of 487): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (429 of 487): the server does not recognize this resource, check extension API servers

Version-Release number of the following components:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
> https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160#1:build-log.txt%3A48
I see that there is no openshift-monitoring namespace created and no prometheus-operator, which is what creates those CRDs. Deferring to the CVO team to investigate further, as the cluster-monitoring-operator manifests don't seem to be deployed.
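For reference, these are roughly the checks I'd run against a live cluster to confirm that (just a sketch; it assumes oc access and that the CRDs come from the monitoring.coreos.com group, and this particular cluster may already be torn down):

```
# does the monitoring namespace exist, and did its operator pods come up?
$ oc get ns openshift-monitoring
$ oc -n openshift-monitoring get deploy cluster-monitoring-operator prometheus-operator
# are the CRDs that prometheus-operator owns registered?
$ oc get crd prometheusrules.monitoring.coreos.com servicemonitors.monitoring.coreos.com
```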
@sttts you moved the bug to monitoring when I pointed to openshift-apiserver as one of the failing operators: https://bugzilla.redhat.com/show_bug.cgi?id=1771741#c1 Can you provide more context?
> I see that there is no openshift-monitoring namespace created and no prometheus-operator which creates those CRDs.

Were you interested in a particular manifest? Looks like they are getting pushed out to me:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-origin-must-gather-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-cluster-version/pods/cluster-version-operator-746477f7d9-2mcfc/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'Done syncing.*openshift-monitoring' | sed 's/.*Done syncing//' | sort | uniq -c
      8 for deployment "openshift-monitoring/cluster-monitoring-operator" (294 of 487)
      8 for namespace "openshift-monitoring" (289 of 487)
      8 for operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (450 of 487)
      8 for serviceaccount "openshift-monitoring/cluster-monitoring-operator" (292 of 487)
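The manifests called out in the install-time fatal error are a different story, though. A rough way to check them against the same log (a sketch; it assumes the CVO uses the same "Done syncing" phrasing for these resource types, so a count of 0 would mean that manifest was never successfully applied):

```
$ LOG='https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-origin-must-gather-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-cluster-version/pods/cluster-version-operator-746477f7d9-2mcfc/cluster-version-operator/cluster-version-operator/logs/current.log'
$ curl -s "$LOG" | grep -c 'Done syncing.*servicemonitor "openshift-apiserver-operator/openshift-apiserver-operator"'
$ curl -s "$LOG" | grep -c 'Done syncing.*prometheusrule "openshift-cloud-credential-operator/cloud-credential-operator-alerts"'
$ curl -s "$LOG" | grep -c 'Done syncing.*oauthclient "console"'
```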
Washing the CVO logs through my cvo-waterfall script [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-origin-must-gather-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-cluster-version/pods/cluster-version-operator-746477f7d9-2mcfc/cluster-version-operator/cluster-version-operator/logs/current.log | cvo-waterfall.py >cvo.svg
WARNING:root:not finished: prometheusrule openshift-cluster-samples-operator/samples-operator-alerts
WARNING:root:not finished: prometheusrule openshift-cloud-credential-operator/cloud-credential-operator-alerts
WARNING:root:not finished: oauthclient console
WARNING:root:not finished: servicemonitor openshift-insights/insights-operator
WARNING:root:not finished: servicemonitor openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator
WARNING:root:not finished: servicemonitor openshift-controller-manager-operator/openshift-controller-manager-operator
WARNING:root:not finished: servicemonitor openshift-kube-scheduler-operator/kube-scheduler-operator
WARNING:root:not finished: servicemonitor openshift-machine-api/machine-api-operator
WARNING:root:not finished: servicemonitor openshift-kube-controller-manager-operator/kube-controller-manager-operator
WARNING:root:not finished: servicemonitor openshift-machine-config-operator/machine-config-daemon
WARNING:root:not finished: servicemonitor openshift-cluster-version/cluster-version-operator
WARNING:root:not finished: servicemonitor openshift-authentication-operator/authentication-operator
WARNING:root:not finished: servicemonitor openshift-machine-api/cluster-autoscaler-operator
WARNING:root:not finished: servicemonitor openshift-image-registry/image-registry
WARNING:root:not finished: servicemonitor openshift-apiserver-operator/openshift-apiserver-operator
WARNING:root:not finished: servicemonitor openshift-kube-apiserver-operator/kube-apiserver-operator
WARNING:root:not finished: servicemonitor openshift-cluster-machine-approver/cluster-machine-approver
WARNING:root:not finished: servicemonitor openshift-operator-lifecycle-manager/olm-operator
WARNING:root:not finished: servicemonitor openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator
WARNING:root:not finished: clusteroperator authentication
WARNING:root:not finished: clusteroperator monitoring
WARNING:root:not finished: clusteroperator image-registry
WARNING:root:not finished: clusteroperator openshift-apiserver
WARNING:root:not finished: clusteroperator node-tuning
WARNING:root:not finished: clusteroperator service-catalog-controller-manager
WARNING:root:not finished: clusteroperator storage
WARNING:root:not finished: clusteroperator openshift-marketplace/marketplace
WARNING:root:not finished: clusteroperator service-catalog-apiserver
WARNING:root:not finished: clusteroperator ingress
WARNING:root:not finished: role openshift-console-operator/prometheus-k8s

Looks like the only not-finished monitoring manifest is the ClusterOperator, and that's something the CVO watches for, not something it pushes directly.

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/waterfall
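For anyone without the waterfall script handy, a plain-shell approximation of that "not finished" check could look roughly like this (a sketch; it assumes the CVO pairs 'Running sync for <kind> "<name>"' lines with 'Done syncing for <kind> "<name>"' lines, which is what the output above suggests):

```
LOG='https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-origin-must-gather-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-cluster-version/pods/cluster-version-operator-746477f7d9-2mcfc/cluster-version-operator/cluster-version-operator/logs/current.log'
# manifests the CVO started syncing
curl -s "$LOG" | grep -o 'Running sync for [a-z]* "[^"]*"' | sed 's/^Running sync for //' | sort -u >started.txt
# manifests the CVO finished syncing
curl -s "$LOG" | grep -o 'Done syncing for [a-z]* "[^"]*"' | sed 's/^Done syncing for //' | sort -u >finished.txt
# started but never finished
comm -23 started.txt finished.txt
```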
The openshift-apiserver appears to be unable to connect to etcd based on the logs:

```
W1112 16:18:29.086651 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd.openshift-etcd.svc:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: operation was canceled". Reconnecting...
```

The endpoints for the etcd.openshift-etcd.svc appear to be present. I would ask either the networking team or the etcd team. If it's cloud-specific, I'd try networking first.

The monitoring operator deployment doesn't have any pods because no SCCs are present. The SCCs are still created by the openshift-apiserver (congratulations to us for yet another bootstrap loop).
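For the record, roughly how one could confirm those two observations against a live cluster (a sketch; assumes oc access, and in the broken state the SCC list would come back empty):

```
# etcd service endpoints that openshift-apiserver is trying to reach via etcd.openshift-etcd.svc
$ oc -n openshift-etcd get endpoints etcd
# SecurityContextConstraints, which openshift-apiserver is responsible for serving/creating
$ oc get scc
```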
SCC = SecurityContextConstraint, e.g. [1]. You can see that in this run with logs like:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/events.json | jq -r '.items[] | select(.message | contains("no SecurityContextConstraints found in cluster")).message' | sort | uniq
Error creating: pods "authentication-operator-5478944f44-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "cluster-image-registry-operator-6b656597b6-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "cluster-monitoring-operator-74b4cdcd77-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "cluster-node-tuning-operator-5fd4c66b66-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "cluster-storage-operator-f6685bb59-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "ingress-operator-74c8b97b5-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "marketplace-operator-5db4ff9dfb-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "openshift-service-catalog-apiserver-operator-cc697748-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "openshift-service-catalog-controller-manager-operator-6fd47b54bb-" is forbidden: no SecurityContextConstraints found in cluster

although for some reason the monitoring operator namespace didn't show up in the must-gather [2]. Ah, probably because it never pushed its ClusterOperator. Seems like a must-gather hole, but I can spin that off into a separate bug.

Also, guessing on the sub-component to make this comment, because Bugzilla says it is required. Not sure how David made the initial transfer to Networking without picking a sub-component.

[1]: https://docs.okd.io/latest/admin_guide/manage_scc.html
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-origin-must-gather-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/
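If it's more useful to see which namespaces were hit rather than which pod templates, the same events.json can be grouped by namespace (a sketch; it assumes each event records its namespace in .metadata.namespace, the usual layout for dumped kube events):

```
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/events.json | jq -r '.items[] | select(.message | contains("no SecurityContextConstraints found in cluster")) | .metadata.namespace' | sort | uniq -c
```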
> Also, guessing on the sub-component to make this comment... Ignore this, it was just my browser not updating the field on refresh in a misguided attempt to preserve my previous form entries in the face of Scott's Networking assignment.
Yeah, this doesn't look good. Juan, can you take a look at this? The issue is the openshift-apiserver, which talks to etcd over a service; it seems to have failed to connect to etcd.
I'm not quite sure what is going on, but it doesn't look related to the SDN. The three kube-apiserver pods have similar issues, and this traffic doesn't go through the kube-proxy service. The three kube-apiservers are configured with:

flags.go:33] FLAG: --etcd-servers="[https://etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379,https://etcd-1.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379,https://etcd-2.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379]"

So, no SDN at all, because both kube-apiserver and etcd-member are host-network and this isn't going through kube-proxy.

I see plenty of errors like these on all three masters:

2019-11-12T15:47:15.5744355Z W1112 15:47:15.574425 1 asm_amd64.s:1337] Failed to dial etcd-1.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
2019-11-12T15:47:15.5745124Z W1112 15:47:15.574464 1 asm_amd64.s:1337] Failed to dial etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
2019-11-12T15:47:15.5785452Z I1112 15:47:15.578490 1 store.go:1342] Monitoring installplans.operators.coreos.com count at <storage-prefix>//operators.coreos.com/installplans
2019-11-12T15:47:15.6137171Z I1112 15:47:15.613665 1 log.go:172] http: TLS handshake error from 168.63.129.16:50904: EOF
2019-11-12T15:47:15.6216465Z I1112 15:47:15.621586 1 client.go:352] parsed scheme: ""
2019-11-12T15:47:15.6216465Z I1112 15:47:15.621614 1 client.go:352] scheme "" not registered, fallback to default scheme
2019-11-12T15:47:15.6216886Z I1112 15:47:15.621656 1 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 0 <nil>}]
2019-11-12T15:47:15.621751Z I1112 15:47:15.621721 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 <nil>} {etcd-1.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 <nil>} {etcd-2.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 <nil>}]
2019-11-12T15:47:15.6316643Z I1112 15:47:15.631607 1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 <nil>}]
2019-11-12T15:47:15.6316643Z I1112 15:47:15.631627 1 store.go:1342] Monitoring catalogsources.operators.coreos.com count at <storage-prefix>//operators.coreos.com/catalogsources
2019-11-12T15:47:15.6317152Z W1112 15:47:15.631694 1 asm_amd64.s:1337] Failed to dial etcd-2.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
2019-11-12T15:47:15.6317906Z W1112 15:47:15.631744 1 asm_amd64.s:1337] Failed to dial etcd-1.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.

I also see plenty of errors in the etcd pods and plenty of raft negotiations:

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/pods/openshift-etcd_etcd-member-ci-op-gy9qj7g9-6d41e-7624d-master-0_etcd-member.log

Looks like it's either an etcd issue or a networking issue outside of the SDN. Is this cluster still accessible?
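To compare the three members, the equivalent etcd-member logs for the other masters should be in the same artifacts directory (only the master-0 log is linked above; the master-1/master-2 filenames are a guess based on the same naming pattern). A rough sketch to count leadership churn per member, assuming the usual raft "elected leader" phrasing:

```
BASE='https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/pods'
for m in 0 1 2; do
  echo "master-${m}:"
  curl -s "${BASE}/openshift-etcd_etcd-member-ci-op-gy9qj7g9-6d41e-7624d-master-${m}_etcd-member.log" | grep -c 'elected leader'
done
```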
Excellent analysis, thanks Juan. Indeed, etcd seems to be in a bad place; lots of "likely overloaded". My guess is that Azure had a blip and ran out of IOPS, but I'm no expert. Assigning to the etcd team, sorry for the ping-pong. Can you please confirm that this CI failure should not be a release blocker? Further, can you see whether this indicates a product bug or is just bad luck?
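For anyone re-checking the "likely overloaded" claim, a quick way to quantify it from the etcd-member log linked above (a sketch; the extra grep phrases are the standard etcd warnings I'd expect to accompany an overloaded member, so treat them as assumptions):

```
$ ETCD_LOG='https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/pods/openshift-etcd_etcd-member-ci-op-gy9qj7g9-6d41e-7624d-master-0_etcd-member.log'
$ curl -s "$ETCD_LOG" | grep -c 'likely overloaded'
$ curl -s "$ETCD_LOG" | grep -c 'failed to send out heartbeat on time'
$ curl -s "$ETCD_LOG" | grep -c 'apply entries took too long'
```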
Adding a dependency on bug #1775878. The timeout issues noted in the logs appear to overlap. Not positive, but I wanted to draw the link.
*** This bug has been marked as a duplicate of bug 1775878 ***