Bug 1771741 - [CI][Azure] Could not update servicemonitor, openshift-apiserver-operator
Summary: [CI][Azure] Could not update servicemonitor, openshift-apiserver-operator
Keywords:
Status: CLOSED DUPLICATE of bug 1775878
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Etcd
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Sam Batschelet
QA Contact: ge liu
URL:
Whiteboard:
Depends On: 1775878
Blocks:
Reported: 2019-11-12 21:11 UTC by Vinay Kapalavai
Modified: 2019-12-20 15:15 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-12-20 15:15:01 UTC
Target Upstream Version:
Embargoed:



Description Vinay Kapalavai 2019-11-12 21:11:48 UTC
Description of problem:
level=fatal msg="failed to initialize the cluster: Multiple errors are preventing progress:\n* Could not update oauthclient \"console\" (296 of 487): the server does not recognize this resource, check extension API servers\n* Could not update prometheusrule \"openshift-cloud-credential-operator/cloud-credential-operator-alerts\" (141 of 487): the server does not recognize this resource, check extension API servers\n* Could not update prometheusrule \"openshift-cluster-samples-operator/samples-operator-alerts\" (238 of 487): the server does not recognize this resource, check extension API servers\n* Could not update role \"openshift-console-operator/prometheus-k8s\" (451 of 487): resource may have been deleted\n* Could not update servicemonitor \"openshift-apiserver-operator/openshift-apiserver-operator\" (482 of 487): the server does not recognize this resource, check extension API servers\n* Could not update servicemonitor \"openshift-authentication-operator/authentication-operator\" (429 of 487): the server does not recognize this resource, check extension API servers\n*

Version-Release number of the following components:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 2 Sergiusz Urbaniak 2019-12-09 14:58:49 UTC
I see that there is no openshift-monitoring namespace created and no prometheus-operator, which is what creates those CRDs. Deferring to the CVO team to investigate further, as the cluster-monitoring-operator manifests don't seem to be deployed.
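
For anyone re-checking this, a minimal sketch of how to confirm that, assuming the cluster were still reachable with oc (generic commands, not output from this run):

```
# Hypothetical check, assuming cluster access: these CRDs are created by the
# prometheus-operator, so they should be missing if it never ran.
oc get crd servicemonitors.monitoring.coreos.com prometheusrules.monitoring.coreos.com
oc get namespace openshift-monitoring
```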

Comment 3 Abhinav Dahiya 2019-12-09 17:08:30 UTC
@sttts you moved the bug to monitoring when I pointed to openshift-apiserver as one of the failing operators: https://bugzilla.redhat.com/show_bug.cgi?id=1771741#c1

Can you provide more context?

Comment 4 W. Trevor King 2019-12-09 19:26:22 UTC
> I see that there is no openshift-monitoring namespace created and no prometheus-operator which creates those CRDs.


Were you interested in a particular manifest?  They look to me like they are getting pushed out:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-origin-must-gather-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-cluster-version/pods/cluster-version-operator-746477f7d9-2mcfc/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'Done syncing.*openshift-monitoring' | sed 's/.*Done syncing//' | sort | uniq -c
      8  for deployment "openshift-monitoring/cluster-monitoring-operator" (294 of 487)
      8  for namespace "openshift-monitoring" (289 of 487)
      8  for operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (450 of 487)
      8  for serviceaccount "openshift-monitoring/cluster-monitoring-operator" (292 of 487)

Comment 5 W. Trevor King 2019-12-09 19:29:28 UTC
Washing the CVO logs through my cvo-waterfall script [1]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-origin-must-gather-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/openshift-cluster-version/pods/cluster-version-operator-746477f7d9-2mcfc/cluster-version-operator/cluster-version-operator/logs/current.log | cvo-waterfall.py >cvo.svg
WARNING:root:not finished: prometheusrule openshift-cluster-samples-operator/samples-operator-alerts
WARNING:root:not finished: prometheusrule openshift-cloud-credential-operator/cloud-credential-operator-alerts
WARNING:root:not finished: oauthclient console
WARNING:root:not finished: servicemonitor openshift-insights/insights-operator
WARNING:root:not finished: servicemonitor openshift-service-catalog-controller-manager-operator/openshift-service-catalog-controller-manager-operator
WARNING:root:not finished: servicemonitor openshift-controller-manager-operator/openshift-controller-manager-operator
WARNING:root:not finished: servicemonitor openshift-kube-scheduler-operator/kube-scheduler-operator
WARNING:root:not finished: servicemonitor openshift-machine-api/machine-api-operator
WARNING:root:not finished: servicemonitor openshift-kube-controller-manager-operator/kube-controller-manager-operator
WARNING:root:not finished: servicemonitor openshift-machine-config-operator/machine-config-daemon
WARNING:root:not finished: servicemonitor openshift-cluster-version/cluster-version-operator
WARNING:root:not finished: servicemonitor openshift-authentication-operator/authentication-operator
WARNING:root:not finished: servicemonitor openshift-machine-api/cluster-autoscaler-operator
WARNING:root:not finished: servicemonitor openshift-image-registry/image-registry
WARNING:root:not finished: servicemonitor openshift-apiserver-operator/openshift-apiserver-operator
WARNING:root:not finished: servicemonitor openshift-kube-apiserver-operator/kube-apiserver-operator
WARNING:root:not finished: servicemonitor openshift-cluster-machine-approver/cluster-machine-approver
WARNING:root:not finished: servicemonitor openshift-operator-lifecycle-manager/olm-operator
WARNING:root:not finished: servicemonitor openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator
WARNING:root:not finished: clusteroperator authentication
WARNING:root:not finished: clusteroperator monitoring
WARNING:root:not finished: clusteroperator image-registry
WARNING:root:not finished: clusteroperator openshift-apiserver
WARNING:root:not finished: clusteroperator node-tuning
WARNING:root:not finished: clusteroperator service-catalog-controller-manager
WARNING:root:not finished: clusteroperator storage
WARNING:root:not finished: clusteroperator openshift-marketplace/marketplace
WARNING:root:not finished: clusteroperator service-catalog-apiserver
WARNING:root:not finished: clusteroperator ingress
WARNING:root:not finished: role openshift-console-operator/prometheus-k8s

Looks like the only not-finished monitoring manifest is the ClusterOperator, and that's something the CVO watches for, not something it pushes directly.

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/waterfall

Comment 6 David Eads 2019-12-09 20:52:48 UTC
The openshift-apiserver appears to be unable to connect to etcd based on the logs: 

```
W1112 16:18:29.086651       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {https://etcd.openshift-etcd.svc:2379 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: operation was canceled". Reconnecting...
```

The endpoints for the etcd.openshift-etcd.svc service appear to be present.  I would ask either the networking team or the etcd team.  If it's cloud-specific, I'd try networking first.
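
A minimal way to double-check that, assuming the cluster were still up (hedged sketch, not output from this run):

```
# Hypothetical check, assuming cluster-admin access: the etcd service should list a
# ready address for every master on port 2379.
oc -n openshift-etcd get endpoints etcd -o wide
oc -n openshift-etcd get pods -o wide
```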

The monitoring operator deployment doesn't have any pods because no SCCs are present. The SCCs are still created by the openshift-apiserver (congratulations to us for yet another bootstrap loop).

Comment 7 W. Trevor King 2019-12-09 21:25:03 UTC
SCC = SecurityContextConstraint, e.g. [1].  You can see that in this run with logs like:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/events.json | jq -r '.items[] | select(.message | contains("no SecurityContextConstraints found in cluster")).message' | sort | uniq
Error creating: pods "authentication-operator-5478944f44-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "cluster-image-registry-operator-6b656597b6-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "cluster-monitoring-operator-74b4cdcd77-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "cluster-node-tuning-operator-5fd4c66b66-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "cluster-storage-operator-f6685bb59-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "ingress-operator-74c8b97b5-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "marketplace-operator-5db4ff9dfb-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "openshift-service-catalog-apiserver-operator-cc697748-" is forbidden: no SecurityContextConstraints found in cluster
Error creating: pods "openshift-service-catalog-controller-manager-operator-6fd47b54bb-" is forbidden: no SecurityContextConstraints found in cluster

although for some reason the monitoring operator namespace didn't show up in the must-gather [2].  Ah, probably because it never pushed its ClusterOperator.  Seems like a must-gather hole, but I can spin that off into a separate bug.

Also, guessing on the sub-component to make this comment, because Bugzilla says it is required.  Not sure how David made the initial transfer to Networking without picking a sub-component.

[1]: https://docs.okd.io/latest/admin_guide/manage_scc.html
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/must-gather/quay-io-openshift-origin-must-gather-sha256-dae1257b516a5c177237cfef5a6a3e241962b0d20cf54bcb2b66dc1671c5035e/namespaces/

Comment 8 W. Trevor King 2019-12-09 21:27:42 UTC
> Also, guessing on the sub-component to make this comment...

Ignore this; it was just my browser not updating the field on refresh, in a misguided attempt to preserve my previous form entries in the face of Scott's Networking assignment.

Comment 9 Casey Callendrello 2019-12-10 14:10:31 UTC
Yeah, this looks no good.

Juan, can you take a look at this? The issue is the openshift-apiserver, which talks to etcd over a service; it seems to have failed to connect to etcd.

Comment 10 Juan Luis de Sousa-Valadas 2019-12-10 16:24:53 UTC
I'm not quite sure what is going on; however, it doesn't look related to the SDN.
I see that the three kube-apiserver pods have similar issues, and the traffic doesn't go through the kube-proxy service. All three kube-apiservers are configured with:
flags.go:33] FLAG: --etcd-servers="[https://etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379,https://etcd-1.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379,https://etcd-2.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379]"
(So there is no SDN involvement at all, because both kube-apiserver and etcd-member are host-network and this traffic isn't going through kube-proxy.)
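
To verify the host-network claim on a live cluster, a hedged sketch (assumes oc access; not taken from this run):

```
# Hypothetical check: both kube-apiserver and etcd-member pods should report
# hostNetwork=true, so apiserver->etcd traffic bypasses the SDN and kube-proxy.
oc -n openshift-kube-apiserver get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.hostNetwork}{"\n"}{end}'
oc -n openshift-etcd get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.hostNetwork}{"\n"}{end}'
```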

I see plenty of errors like these on all three masters:
2019-11-12T15:47:15.5744355Z W1112 15:47:15.574425       1 asm_amd64.s:1337] Failed to dial etcd-1.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
2019-11-12T15:47:15.5745124Z W1112 15:47:15.574464       1 asm_amd64.s:1337] Failed to dial etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
2019-11-12T15:47:15.5785452Z I1112 15:47:15.578490       1 store.go:1342] Monitoring installplans.operators.coreos.com count at <storage-prefix>//operators.coreos.com/installplans
2019-11-12T15:47:15.6137171Z I1112 15:47:15.613665       1 log.go:172] http: TLS handshake error from 168.63.129.16:50904: EOF
2019-11-12T15:47:15.6216465Z I1112 15:47:15.621586       1 client.go:352] parsed scheme: ""
2019-11-12T15:47:15.6216465Z I1112 15:47:15.621614       1 client.go:352] scheme "" not registered, fallback to default scheme
2019-11-12T15:47:15.6216886Z I1112 15:47:15.621656       1 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 0  <nil>}]
2019-11-12T15:47:15.621751Z I1112 15:47:15.621721       1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 <nil>} {etcd-1.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 <nil>} {etcd-2.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 <nil>}]
2019-11-12T15:47:15.6316643Z I1112 15:47:15.631607       1 asm_amd64.s:1337] balancerWrapper: got update addr from Notify: [{etcd-0.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379 <nil>}]
2019-11-12T15:47:15.6316643Z I1112 15:47:15.631627       1 store.go:1342] Monitoring catalogsources.operators.coreos.com count at <storage-prefix>//operators.coreos.com/catalogsources
2019-11-12T15:47:15.6317152Z W1112 15:47:15.631694       1 asm_amd64.s:1337] Failed to dial etcd-2.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.
2019-11-12T15:47:15.6317906Z W1112 15:47:15.631744       1 asm_amd64.s:1337] Failed to dial etcd-1.ci-op-gy9qj7g9-6d41e.ci.azure.devcluster.openshift.com:2379: grpc: the connection is closing; please retry.

I also see plenty of errors in the etcd pods and plenty of raft negotiations:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/pods/openshift-etcd_etcd-member-ci-op-gy9qj7g9-6d41e-7624d-master-0_etcd-member.log

Looks like it's either an etcd issue or a networking issue outside of the SDN. Is this cluster still accessible?

Comment 11 Casey Callendrello 2019-12-11 14:22:15 UTC
Excellent analysis, thanks Juan.

Indeed, yes, etcd seems to be in a bad place: lots of "likely overloaded" messages. My guess is that Azure had a blip and ran out of IOPS, but I'm no expert.
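
For reference, those messages can be counted directly from the etcd-member log linked in comment 10 (a sketch, assuming the literal string appears in that log):

```
# Hedged sketch: count the overload warnings in the linked etcd-member log.
curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.3/160/artifacts/e2e-azure-upgrade/pods/openshift-etcd_etcd-member-ci-op-gy9qj7g9-6d41e-7624d-master-0_etcd-member.log \
  | grep -c 'likely overloaded'
```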

Assigning to the etcd team; sorry for the ping-pong. Can you please confirm that this CI failure should not be a release blocker? Further, can you see whether this indicates a product bug or is just bad luck?

Comment 12 Greg Blomquist 2019-12-11 19:20:55 UTC
Adding dependency on bug #1775878

The timeout issues noted in the logs appear to overlap.  Not positive, but I wanted to draw the link.

Comment 13 Greg Blomquist 2019-12-20 15:15:01 UTC

*** This bug has been marked as a duplicate of bug 1775878 ***

