Description of problem:
Installation fails when service-catalog is enabled, on RHEL Atomic Host, multi-zone, GCP. Installation succeeds with the same configuration and service-catalog disabled.

Version-Release number of selected component (if applicable):
openshift-ansible-3.11.119-1.git.0.c9a8ebf.el7

sh-4.2# service-catalog --version
v3.11.117;Upstream:v0.1.35

How reproducible:
1 out of 1

Steps to Reproduce:
1. Install OCP with service-catalog enabled.

Actual results:
Installation failed.

# oc logs pod/controller-manager-gl4fk -n kube-service-catalog
I0614 09:18:06.213055 1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true]
I0614 09:18:06.213353 1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true AsyncBindingOperations:true]
I0614 09:18:06.213375 1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true AsyncBindingOperations:true NamespacedServiceBroker:true]
I0614 09:18:06.213409 1 hyperkube.go:192] Service Catalog version v3.11.117;Upstream:v0.1.35 (built 2019-06-09T23:47:53Z)
I0614 09:18:06.217427 1 leaderelection.go:185] attempting to acquire leader lease kube-service-catalog/service-catalog-controller-manager...
I0614 09:18:23.660185 1 leaderelection.go:194] successfully acquired lease kube-service-catalog/service-catalog-controller-manager
I0614 09:18:23.660283 1 event.go:221] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-service-catalog", Name:"service-catalog-controller-manager", UID:"ec653a7c-8e52-11e9-a7b9-42010af00079", APIVersion:"v1", ResourceVersion:"60766", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' controller-manager-gl4fk-external-service-catalog-controller became leader
F0614 09:18:23.700155 1 controller_manager.go:237] error running controllers: failed to get api versions from server: failed to get supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Logs of the apiserver in the kube-service-catalog project:
I0614 09:15:04.126886 1 logs.go:49] http: TLS handshake error from 10.2.2.1:55380: read tcp 10.2.0.5:6443->10.2.2.1:55380: read: connection timed out
I0614 09:15:04.126978 1 logs.go:49] http: TLS handshake error from 10.2.4.1:42130: read tcp 10.2.0.5:6443->10.2.4.1:42130: read: connection timed out

Expected results:
Installation succeeds.

Additional info:
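For reference, the catalog is toggled through the openshift-ansible inventory. A minimal sketch of the relevant [OSEv3:vars] entries used to reproduce (host groups and other required variables omitted; variable names are the usual openshift-ansible 3.11 ones and should be treated as an assumption here):

```
# Sketch: inventory vars that enable the service catalog for this install.
[OSEv3:vars]
openshift_deployment_type=openshift-enterprise
openshift_release=v3.11
openshift_enable_service_catalog=true
# The brokers follow the catalog setting by default; shown explicitly for clarity.
ansible_service_broker_install=true
template_service_broker_install=true
```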
Created attachment 1580836 [details]
tons of logs from the cluster

- logs from the master-api and controller
- logs from the service catalog apiserver pods (r7v7t was returning 200s, while the rest were failing)
- logs from the service catalog controller
- oc describe of the apiserver and controller-manager pods
[root@wmengahha311117-master-etcd-zone1-1 ~]# oc get pods -n kube-service-catalog
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-722dq            1/1       Running            0          16h
apiserver-qk56x            1/1       Running            0          16h
apiserver-r7v7t            1/1       Running            0          16h
controller-manager-6mkfb   0/1       CrashLoopBackOff   191        16h
controller-manager-gl4fk   0/1       CrashLoopBackOff   190        16h
controller-manager-qxsfw   0/1       CrashLoopBackOff   190        16h
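With the controller-manager pods restarting this often, the interesting log is usually the previous container's. A sketch of how to pull it (pod name taken from the listing above):

```
# Log of the crashed (previous) container for one of the controller-manager pods
oc logs controller-manager-gl4fk -n kube-service-catalog --previous

# Restart/backoff history from the pod's events
oc describe pod controller-manager-gl4fk -n kube-service-catalog
```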
Works sometimes:

# oc get --raw /apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/ --loglevel=8
I0614 20:08:57.948520 99085 loader.go:359] Config loaded from file /root/.kube/config
I0614 20:08:57.949503 99085 round_trippers.go:383] GET https://wmengahha311117-lb-1:443/apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/
I0614 20:08:57.949534 99085 round_trippers.go:390] Request Headers:
I0614 20:08:57.949546 99085 round_trippers.go:393] Accept: application/json, */*
I0614 20:08:57.949556 99085 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0614 20:08:57.987994 99085 round_trippers.go:408] Response Status: 200 OK in 38 milliseconds
I0614 20:08:57.988028 99085 round_trippers.go:411] Response Headers:
I0614 20:08:57.988037 99085 round_trippers.go:414] Date: Fri, 14 Jun 2019 20:08:57 GMT
I0614 20:08:57.988044 99085 round_trippers.go:414] Content-Length: 204
I0614 20:08:57.988050 99085 round_trippers.go:414] Cache-Control: no-store
I0614 20:08:57.988057 99085 round_trippers.go:414] Content-Type: application/json
{"kind":"ClusterServiceBrokerList","apiVersion":"servicecatalog.k8s.io/v1beta1","metadata":{"selfLink":"/apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/","resourceVersion":"159833"},"items":[]}

Fails a lot:

# oc get --raw /apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/ --loglevel=8
I0614 20:09:18.732062 99540 loader.go:359] Config loaded from file /root/.kube/config
I0614 20:09:18.733569 99540 round_trippers.go:383] GET https://wmengahha311117-lb-1:443/apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/
I0614 20:09:18.733710 99540 round_trippers.go:390] Request Headers:
I0614 20:09:18.733810 99540 round_trippers.go:393] Accept: application/json, */*
I0614 20:09:18.733880 99540 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0614 20:09:18.830132 99540 round_trippers.go:408] Response Status: 503 Service Unavailable in 96 milliseconds
I0614 20:09:18.830175 99540 round_trippers.go:411] Response Headers:
I0614 20:09:18.830183 99540 round_trippers.go:414] X-Content-Type-Options: nosniff
I0614 20:09:18.830190 99540 round_trippers.go:414] Content-Length: 20
I0614 20:09:18.830211 99540 round_trippers.go:414] Date: Fri, 14 Jun 2019 20:09:18 GMT
I0614 20:09:18.830218 99540 round_trippers.go:414] Cache-Control: no-store
I0614 20:09:18.830225 99540 round_trippers.go:414] Content-Type: text/plain; charset=utf-8
I0614 20:09:18.830292 99540 request.go:897] Response Body: service unavailable
I0614 20:09:18.830466 99540 helpers.go:201] server response object: [{
  "metadata": {},
  "status": "Failure",
  "message": "the server is currently unable to handle the request",
  "reason": "ServiceUnavailable",
  "details": {
    "causes": [
      {
        "reason": "UnexpectedServerResponse",
        "message": "service unavailable"
      }
    ]
  },
  "code": 503
}]
F0614 20:09:18.830509 99540 helpers.go:119] Error from server (ServiceUnavailable): the server is currently unable to handle the request
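To quantify how flaky the aggregated endpoint is, a quick probe loop can count successes against failures (sketch; run from a master with a working kubeconfig):

```
# Hit the aggregated service-catalog API 20 times and count how often it fails.
ok=0; fail=0
for i in $(seq 1 20); do
  if oc get --raw /apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/ >/dev/null 2>&1; then
    ok=$((ok+1))
  else
    fail=$((fail+1))
  fi
  sleep 1
done
echo "ok=${ok} fail=${fail}"
```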
# oc get --raw /apis/servicecatalog.k8s.io/v1beta1/ --loglevel=8
I0614 20:10:22.584209 100681 loader.go:359] Config loaded from file /root/.kube/config
I0614 20:10:22.585199 100681 round_trippers.go:383] GET https://wmengahha311117-lb-1:443/apis/servicecatalog.k8s.io/v1beta1/
I0614 20:10:22.585231 100681 round_trippers.go:390] Request Headers:
I0614 20:10:22.585243 100681 round_trippers.go:393] Accept: application/json, */*
I0614 20:10:22.585254 100681 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0614 20:10:22.608623 100681 round_trippers.go:408] Response Status: 200 OK in 23 milliseconds
I0614 20:10:22.608779 100681 round_trippers.go:411] Response Headers:
I0614 20:10:22.608795 100681 round_trippers.go:414] Cache-Control: no-store
I0614 20:10:22.608807 100681 round_trippers.go:414] Content-Type: application/json
I0614 20:10:22.608822 100681 round_trippers.go:414] Date: Fri, 14 Jun 2019 20:10:22 GMT
I0614 20:10:22.608833 100681 round_trippers.go:414] Content-Length: 2680
{"kind":"APIResourceList","apiVersion":"v1","groupVersion":"servicecatalog.k8s.io/v1beta1","resources":[{"name":"clusterservicebrokers","singularName":"","namespaced":false,"kind":"ClusterServiceBroker","verbs":["create","delete","deletecollection","get","list","patch","update","watch"]},{"name":"clusterservicebrokers/status","singularName":"","namespaced":false,"kind":"ClusterServiceBroker","verbs":["get","patch","update"]},{"name":"clusterserviceclasses","singularName":"","namespaced":false,"kind":"ClusterServiceClass","verbs":["create","delete","deletecollection","get","list","patch","update","watch"]},{"name":"clusterserviceclasses/status","singularName":"","namespaced":false,"kind":"ClusterServiceClass","verbs":["get","patch","update"]},{"name":"clusterserviceplans","singularName":"","namespaced":false,"kind":"ClusterServicePlan","verbs":["create","delete","deletecollection","get","list","patch","update","watch"]},{"name":"clusterserviceplans/status","singularName":"","namespaced":false,"kind":"ClusterServicePlan","verbs":["get","patch","update"]},{"name":"servicebindings","singularName":"","namespaced":true,"kind":"ServiceBinding","verbs":["create","delete","deletecollection","get","list","patch","update","watch"]},{"name":"servicebindings/status","singularName":"","namespaced":true,"kind":"ServiceBinding","verbs":["get","patch","update"]},{"name":"servicebrokers","singularName":"","namespaced":true,"kind":"ServiceBroker","verbs":["create","delete","deletecollection","get","list","patch","update","watch"]},{"name":"servicebrokers/status","singularName":"","namespaced":true,"kind":"ServiceBroker","verbs":["get","patch","update"]},{"name":"serviceclasses","singularName":"","namespaced":true,"kind":"ServiceClass","verbs":["create","delete","deletecollection","get","list","patch","update","watch"]},{"name":"serviceclasses/status","singularName":"","namespaced":true,"kind":"ServiceClass","verbs":["get","patch","update"]},{"name":"serviceinstances","singularName":"","namespaced":true,"kind":"ServiceInstance","verbs":["create","delete","deletecollection","get","list","patch","update","watch"]},{"name":"serviceinstances/reference","singularName":"","namespaced":true,"kind":"ServiceInstance","verbs":["get","patch","update"]},{"name":"serviceinstances/status","singularName":"","namespaced":true,"kind":"ServiceInstance","verbs":["get","patch","update"]},{"name":"serviceplans","singularName":"","namespaced":true,"kind":"ServicePlan","verbs":["create","delete","deletecollection","get","list","patch","update","watch"]},{"name":"serviceplans/status","singularName":"","namespaced":true,"kind":"ServicePlan","verbs":["get","patch","update"]}]}
I've been restarting controllers and api containers in the kube-service-catalog namespace to see what was going on. Restarting and checking logs on the catalog controller, I see this:

```
I0618 09:24:36.006024 104860 loader.go:359] Config loaded from file /root/.kube/config
I0618 09:24:36.007040 104860 loader.go:359] Config loaded from file /root/.kube/config
I0618 09:24:36.010757 104860 loader.go:359] Config loaded from file /root/.kube/config
I0618 09:24:36.018679 104860 round_trippers.go:383] GET https://wmengahha311117-lb-1:443/apis/metrics.k8s.io/v1beta1?timeout=32s
I0618 09:24:36.018774 104860 round_trippers.go:390] Request Headers:
I0618 09:24:36.018786 104860 round_trippers.go:393] Accept: application/json, */*
I0618 09:24:36.018808 104860 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0618 09:24:36.021532 104860 round_trippers.go:408] Response Status: 503 Service Unavailable in 2 milliseconds
I0618 09:24:36.021583 104860 round_trippers.go:411] Response Headers:
I0618 09:24:36.021604 104860 round_trippers.go:414] Cache-Control: no-store
I0618 09:24:36.021625 104860 round_trippers.go:414] Content-Type: text/plain; charset=utf-8
I0618 09:24:36.021635 104860 round_trippers.go:414] X-Content-Type-Options: nosniff
I0618 09:24:36.021646 104860 round_trippers.go:414] Content-Length: 20
I0618 09:24:36.021658 104860 round_trippers.go:414] Date: Tue, 18 Jun 2019 09:24:36 GMT
I0618 09:24:36.021721 104860 request.go:897] Response Body: service unavailable
I0618 09:24:36.021894 104860 request.go:1099] body was not decodable (unable to check for Status): couldn't get version/kind; json parse error: json: cannot unmarshal string into Go value of type struct { APIVersion string "json:\"apiVersion,omitempty\""; Kind string "json:\"kind,omitempty\"" }
I0618 09:24:36.021929 104860 cached_discovery.go:77] skipped caching discovery info due to the server is currently unable to handle the request
I0618 09:24:36.023399 104860 shortcut.go:89] Error loading discovery information: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
I0618 09:24:36.026221 104860 round_trippers.go:383] GET https://wmengahha311117-lb-1:443/apis/metrics.k8s.io/v1beta1?timeout=32s
I0618 09:24:36.026252 104860 round_trippers.go:390] Request Headers:
I0618 09:24:36.026261 104860 round_trippers.go:393] Accept: application/json, */*
I0618 09:24:36.026269 104860 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0618 09:24:36.028314 104860 round_trippers.go:408] Response Status: 503 Service Unavailable in 2 milliseconds
I0618 09:24:36.028343 104860 round_trippers.go:411] Response Headers:
I0618 09:24:36.028352 104860 round_trippers.go:414] Cache-Control: no-store
I0618 09:24:36.028360 104860 round_trippers.go:414] Content-Type: text/plain; charset=utf-8
I0618 09:24:36.028368 104860 round_trippers.go:414] X-Content-Type-Options: nosniff
I0618 09:24:36.028375 104860 round_trippers.go:414] Content-Length: 20
I0618 09:24:36.028382 104860 round_trippers.go:414] Date: Tue, 18 Jun 2019 09:24:36 GMT
I0618 09:24:36.028419 104860 request.go:897] Response Body: service unavailable
I0618 09:24:36.028536 104860 request.go:1099] body was not decodable (unable to check for Status): couldn't get version/kind; json parse error: json: cannot unmarshal string into Go value of type struct { APIVersion string "json:\"apiVersion,omitempty\""; Kind string "json:\"kind,omitempty\"" }
I0618 09:24:36.028571 104860 cached_discovery.go:77] skipped caching discovery info due to the server is currently unable to handle the request
I0618 09:24:36.030991 104860 loader.go:359] Config loaded from file /root/.kube/config
I0618 09:24:36.032145 104860 loader.go:359] Config loaded from file /root/.kube/config
I0618 09:24:36.033230 104860 loader.go:359] Config loaded from file /root/.kube/config
I0618 09:24:36.033537 104860 round_trippers.go:383] GET https://wmengahha311117-lb-1:443/api/v1/namespaces/kube-service-catalog/pods/controller-manager-bbgm5/log?follow=true
I0618 09:24:36.033559 104860 round_trippers.go:390] Request Headers:
I0618 09:24:36.033567 104860 round_trippers.go:393] Accept: application/json, */*
I0618 09:24:36.033575 104860 round_trippers.go:393] User-Agent: oc/v1.11.0+d4cacc0 (linux/amd64) kubernetes/d4cacc0
I0618 09:24:36.051402 104860 round_trippers.go:408] Response Status: 200 OK in 17 milliseconds
I0618 09:24:36.051462 104860 round_trippers.go:411] Response Headers:
I0618 09:24:36.051476 104860 round_trippers.go:414] Cache-Control: no-store
I0618 09:24:36.051488 104860 round_trippers.go:414] Content-Type: text/plain
I0618 09:24:36.051499 104860 round_trippers.go:414] Date: Tue, 18 Jun 2019 09:24:36 GMT
I0618 09:24:21.710295 1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true]
I0618 09:24:21.710593 1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true AsyncBindingOperations:true]
I0618 09:24:21.710623 1 feature_gate.go:194] feature gates: map[OriginatingIdentity:true AsyncBindingOperations:true NamespacedServiceBroker:true]
I0618 09:24:21.710650 1 hyperkube.go:192] Service Catalog version v3.11.117;Upstream:v0.1.35 (built 2019-06-09T23:47:53Z)
I0618 09:24:21.713644 1 leaderelection.go:185] attempting to acquire leader lease kube-service-catalog/service-catalog-controller-manager...
I0618 09:24:21.746672 1 leaderelection.go:194] successfully acquired lease kube-service-catalog/service-catalog-controller-manager
I0618 09:24:21.747902 1 event.go:221] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"kube-service-catalog", Name:"service-catalog-controller-manager", UID:"ec653a7c-8e52-11e9-a7b9-42010af00079", APIVersion:"v1", ResourceVersion:"895074", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' controller-manager-bbgm5-external-service-catalog-controller became leader
F0618 09:24:21.788731 1 controller_manager.go:237] error running controllers: failed to get api versions from server: failed to get supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request, servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
```

As you can see, hitting kube-service-catalog API calls returns 200s, but hitting the metrics API returns 503.
In other words, this isn't a networking issue. If you manage to get another cluster in this state, feel free to poke us.
We can curl some of the servicecatalog apiservers from the master, where the main apiserver runs:

```
[root@wmeng195ahha311-master-etcd-zone1-1 ~]# curl -kv https://10.2.0.5:6443/healthz
* About to connect() to 10.2.0.5 port 6443 (#0)
*   Trying 10.2.0.5...
* Connected to 10.2.0.5 (10.2.0.5) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
* skipping SSL peer certificate verification
* NSS: client certificate not found (nickname not specified)
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*   subject: CN=apiserver.kube-service-catalog
*   start date: Jun 19 06:18:35 2019 GMT
*   expire date: Jun 18 06:18:36 2021 GMT
*   common name: apiserver.kube-service-catalog
*   issuer: CN=service-catalog-signer
> GET /healthz HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.2.0.5:6443
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 19 Jun 2019 09:21:50 GMT
< Content-Length: 2
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host 10.2.0.5 left intact
ok[root@wmeng195ahha311-master-etcd-zone1-1 ~]# curl -kv https://10.2.2.5:6443/healthz
* About to connect() to 10.2.2.5 port 6443 (#0)
*   Trying 10.2.2.5...
* Connected to 10.2.2.5 (10.2.2.5) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
^C
[root@wmeng195ahha311-master-etcd-zone1-1 ~]# curl -kv https://10.2.4.4:6443/healthz
* About to connect() to 10.2.4.4 port 6443 (#0)
*   Trying 10.2.4.4...
* Connected to 10.2.4.4 (10.2.4.4) port 6443 (#0)
* Initializing NSS with certpath: sql:/etc/pki/nssdb
```

On the apiservers where the request does not complete, the TCP connection is established but the request then hangs, so the servicecatalog apiservers appear to be hung. Moving the bug to the apiserver team.
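To make the hang visible without having to Ctrl-C each probe, the same check can be run with a per-request timeout against each apiserver pod address (sketch; addresses taken from the output above):

```
# /healthz of each kube-service-catalog apiserver pod, capped at 5 seconds per request.
for ip in 10.2.0.5 10.2.2.5 10.2.4.4; do
  printf '%s: ' "${ip}"
  curl -sk --max-time 5 "https://${ip}:6443/healthz" || printf 'no response within 5s'
  echo
done
```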
Pings are also fine, so basic connectivity is not the problem:

```
[root@wmeng195ahha311-master-etcd-zone1-1 ~]# ping 10.2.4.4
PING 10.2.4.4 (10.2.4.4) 56(84) bytes of data.
64 bytes from 10.2.4.4: icmp_seq=1 ttl=64 time=0.785 ms
^C
--- 10.2.4.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.785/0.785/0.785/0.000 ms
[root@wmeng195ahha311-master-etcd-zone1-1 ~]# ping 10.2.0.5
PING 10.2.0.5 (10.2.0.5) 56(84) bytes of data.
64 bytes from 10.2.0.5: icmp_seq=1 ttl=64 time=0.099 ms
^C
--- 10.2.0.5 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.099/0.099/0.099/0.000 ms
[root@wmeng195ahha311-master-etcd-zone1-1 ~]# ping 10.2.2.5
PING 10.2.2.5 (10.2.2.5) 56(84) bytes of data.
64 bytes from 10.2.2.5: icmp_seq=1 ttl=64 time=0.765 ms
^C
--- 10.2.2.5 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.765/0.765/0.765/0.000 ms
```
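Worth noting: the default ping payload is only 56 bytes, so it succeeds even if larger frames are silently dropped. A don't-fragment probe near the SDN MTU would distinguish a clean path from an MTU problem (sketch; sizes assume the 1450-byte SDN MTU configured here and the same pod-network addresses as above):

```
# 1422-byte payload + 8-byte ICMP header + 20-byte IP header = 1450 bytes,
# which after ~50 bytes of VXLAN overhead exceeds a 1460-byte underlay MTU,
# so this probe fails if the SDN MTU is set too high.
ping -c 3 -M do -s 1422 10.2.2.5

# 1382 + 8 + 20 = 1410 bytes, which still fits after VXLAN overhead; this should succeed.
ping -c 3 -M do -s 1382 10.2.2.5
```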
(In reply to Ricardo Carrillo Cruz from comment #11)
> F0618 09:24:21.788731 1 controller_manager.go:237] error running controllers: failed to get api versions from server: failed to get supported resources from server: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request, servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request
> ```
>
> As you can see, hitting kube-service-catalog API calls return 200s, but hitting the metrics API return 503.

Per this hint, I checked the metrics.k8s.io component and found its apiservice condition is False:

[root@wmeng195ahha311-master-etcd-zone1-1 ~]# oc get apiservices -o=custom-columns="name:.metadata.name,status:.status.conditions[0].status"
v1beta1.metrics.k8s.io    False

[root@wmeng195ahha311-master-etcd-zone1-1 ~]# oc logs deploy/metrics-server -n openshift-metrics-server
I0619 06:18:22.137440 1 heapster.go:71] /usr/bin/metrics-server --source=kubernetes.summary_api:?useServiceAccount=true&kubeletHttps=true&kubeletPort=10250 --tls-cert-file=/certs/tls.crt --tls-private-key-file=/certs/tls.key --metric_resolution=30s --secure-port=8443
...
I0619 06:18:22.786769 1 serve.go:85] Serving securely on 0.0.0.0:8443
I0619 06:33:51.058649 1 logs.go:41] http: TLS handshake error from 10.2.4.1:52634: read tcp 10.2.12.7:8443->10.2.4.1:52634: read: connection timed out
...
I0619 10:14:22.674626 1 logs.go:41] http: TLS handshake error from 10.2.2.1:52684: read tcp 10.2.12.7:8443->10.2.2.1:52684: read: connection timed out

From my memory of bug 1625194#c25, the metrics failure above may have a side impact on the service catalog (SC).
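For completeness, the unavailable condition's message can be read directly from the apiservice objects. A sketch of the checks (the metrics apiservice name is from the output above; the servicecatalog one is the conventional group/version name and is an assumption here):

```
# Full status, including the condition message, for the failing aggregated APIs
oc get apiservice v1beta1.metrics.k8s.io -o yaml
oc get apiservice v1beta1.servicecatalog.k8s.io -o yaml
```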
> Per my memory of other bug 1625194#c25, above metrics failure may produce side impact on SC.

I believe this is only related to deleting namespaces.

Can we re-try this with the new changes here: https://github.com/openshift/openshift-ansible/pull/11704
Found the bug. GCP nodes are coming up with an eth0 MTU of 1460, but openshift-sdn is trying to set up its VXLAN with an MTU of 1450. That is not going to work: the VXLAN MTU needs to be the interface MTU minus 50, i.e. 1410. openshift-ansible should detect this.
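For reference, a sketch of how the correct SDN MTU follows from the node's primary interface (the interface name eth0 is an assumption; VXLAN encapsulation adds 50 bytes of overhead):

```
# Derive the openshift-sdn MTU from the underlay interface MTU.
IFACE=eth0
IFACE_MTU=$(cat /sys/class/net/${IFACE}/mtu)   # 1460 on these GCP nodes
SDN_MTU=$((IFACE_MTU - 50))                    # 1410
echo "openshift-sdn VXLAN MTU should be ${SDN_MTU}"
```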
And, it looks like someone in the community caught this a few hours ago too.
Fixed: openshift-ansible-3.11.123-1.git.0.db681ba.el7

Installation successful. The cluster is working fine.

networkConfig:
  mtu: 1410

It was 1450 before the fix.
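For clusters that were already deployed with the wrong value, the setting lives in the node configuration; a sketch of the relevant excerpt (file path and surrounding key are from a standard 3.11 node config and should be treated as assumptions):

```
# /etc/origin/node/node-config.yaml (excerpt)
networkConfig:
  mtu: 1410
  networkPluginName: redhat/openshift-ovs-subnet
```

New installs with the fixed openshift-ansible pick the value up automatically; it can also be pinned in the inventory with openshift_node_sdn_mtu=1410.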
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1605