Description of problem: The controller-manager pod on one of the nodes stopped responding. Because the controller-manager stopped responding, app deployment using the template was not working. Error received:
~~~
The service is not yet ready. The provision call failed and will be retried: Error communicating with broker for provisioning: Put https://apiserver.openshift-template-service-broker.svc:443/brokers/template.openshift.io/v2/service_instances/8af1f3b4-ebaa-4ffe-b2ef-abbc6456a56d?accepts_incomplete=true: dial tcp X.X.226.105:443: connect: cannot assign requested address
~~~
Logs on the pod:
~~~
$ oc logs apiserver-f2xx4 -n openshift-template-service-broker
I0910 05:53:05.906810       1 serve.go:89] Serving securely on [::]:8443
I0910 05:53:05.907700       1 controller_utils.go:1019] Waiting for caches to sync for tsb controller
I0910 05:53:06.007863       1 controller_utils.go:1026] Caches are synced for tsb controller
W0910 06:00:06.012356       1 reflector.go:341] github.com/openshift/origin/vendor/github.com/openshift/client-go/template/informers/externalversions/factory.go:57: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
~~~
To troubleshoot, ran a curl check from each controller-manager pod in the kube-service-catalog namespace:
~~~
$ oc project kube-service-catalog
$ oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-bxlxf            1/1       Running   0          1d
apiserver-dkhz5            1/1       Running   0          1d
apiserver-hmk5r            1/1       Running   0          1d
controller-manager-9rh5l   1/1       Running   0          1d
controller-manager-mhdqk   1/1       Running   0          1d
controller-manager-s4662   1/1       Running   0          1d

$ for pod in $(oc get pod -o name | grep controller-manager | awk '{print $1}'); do echo "--> $pod"; oc rsh $pod curl -kv https://apiserver.openshift-template-service-broker.svc:443 ; done
--> pods/controller-manager-9rh5l
* About to connect() to apiserver.openshift-template-service-broker.svc port 443 (#0)
*   Trying X.X.226.105...
* Failed to connect to X.X.226.105: Cannot assign requested address
* couldn't connect to host at apiserver.openshift-template-service-broker.svc:443
* Closing connection 0
curl: (7) Failed to connect to X.X.226.105: Cannot assign requested address
command terminated with exit code 7
--> pods/controller-manager-mhdqk
* About to connect() to apiserver.openshift-template-service-broker.svc port 443 (#0)
*   Trying X.X.226.105...
* Connected to apiserver.openshift-template-service-broker.svc (X.X.226.105) port 443 (#0)
[ curl output omitted ]
--> pods/controller-manager-s4662
* About to connect() to apiserver.openshift-template-service-broker.svc port 443 (#0)
*   Trying X.X.226.105...
* Connected to apiserver.openshift-template-service-broker.svc (X.X.226.105) port 443 (#0)
[ curl output omitted ]
~~~
After deleting the pod where the curl command failed, template provisioning worked as expected.

Query: The controller-manager is deployed via a daemonset, and the service inside the pod needs to be running properly. Can readiness/liveness probes be added to check whether the service is running properly or not?

Expected results: The daemonset should have a liveness probe to check whether the service is responsive or not.
Both the Service Catalog API Server and the Service Catalog Controller Manager have liveness & readiness probes that cause the pods to be restarted if they fail. From what I can tell here, you are encountering some infrastructure error that is preventing the Service Catalog controller from sending HTTP requests to the Template Service Broker. Failure to reach the Template Service Broker may or may not be an indication of Service Catalog health. I don't think it would be appropriate for the probes to indicate Service Catalog failure just because Catalog can't reach one specific broker (there are usually several brokers that Service Catalog communicates with). I'm inclined to close this as not a bug. Do you disagree?
Hi, we experienced this issue originally. We're running this version of OpenShift:
```
> oc version
oc v3.9.41
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://prd-rose-console.runtastic.com:443
openshift v3.9.27
kubernetes v1.9.1+a0ce1bc657
```
Neither the controller-manager nor the apiserver in the kube-service-catalog namespace has any readiness or liveness probes, AFAICS. I attached the output of `oc export daemonsets -n kube-service-catalog > kube-service-catalog-daemonsets.yml`.

The Template Service Broker itself was healthy at the time the service catalog stopped working. A few days earlier we had to force a restart of one of the pods of the apiserver daemonset in openshift-template-service-broker (that one does seem to have a readiness probe, but it didn't prevent the pod from being effectively dead). When we noticed we could no longer provision templates, we first forced another restart of the Template Service Broker apiserver because we assumed it had stopped working again, but that didn't fix the issue. Only a forced restart of the controller-manager in kube-service-catalog resolved it. During the issue we also could not find any infrastructure/networking problem.
Created attachment 1484581 [details] oc export daemonsets -n kube-service-catalog
I 100% agree there should be probes on both the API Server and the Controller Manager - we have them upstream in Kubernetes, which I verified when I wrote comment #2, but I failed to verify in OpenShift. We'll get this addressed, thanks much for the bug report.
3.11.z fixed by https://github.com/openshift/openshift-ansible/pull/10625
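For reference, the fix adds liveness and readiness probes to the service-catalog daemonsets. The verification logs later in this bug show the kubelet probing `https://<pod-ip>:6443/healthz`, so the container spec gains stanzas roughly like the sketch below. This is illustrative only, not the exact manifest from the PR; the path, port, and scheme match the probe URLs seen in the verification output, while the timing and threshold values are assumptions:

```yaml
# Sketch of the probe stanzas on the service-catalog container spec.
# Path/port/scheme match the probe URLs in the verification logs;
# timing and threshold values here are illustrative assumptions.
livenessProbe:
  httpGet:
    path: /healthz
    port: 6443
    scheme: HTTPS
  initialDelaySeconds: 30   # allow the process time to start
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /healthz
    port: 6443
    scheme: HTTPS
  failureThreshold: 3       # drop pod from endpoints after repeated failures
```

When the liveness probe fails, the kubelet kills and recreates the container; when the readiness probe fails, the pod is removed from the service endpoints. Both behaviors are exactly what the verification steps below exercise.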
merged into 3.11.z
LGTM, verified.

1. I installed/uninstalled the service catalog successfully via the openshift-ansible release-3.11 branch. Details below:
~~~
mac:openshift-ansible jianzhang$ git branch
  master
  release-3.10
* release-3.11
mac:openshift-ansible jianzhang$ git log
commit d96c05e14bf15882083acf15cb4a1018575037df (HEAD -> release-3.11, origin/release-3.11)
Merge: 51c90a343 de46e6cb5
Author: Scott Dodson <sdodson>
Date:   Mon Dec 3 16:30:40 2018 -0500

    Merge pull request #10789 from sdodson/bz1644416-2

    Also set etcd_cert_config_dir for calico

commit 51c90a34397afc65a8ffbd08c8a61c4a17298557
Author: AOS Automation Release Team <aos-team-art>
Date:   Mon Dec 3 00:24:50 2018 -0500

    Automatic commit of package [openshift-ansible] release [3.11.51-1].

    Created by command:

    /usr/bin/tito tag --debug --accept-auto-changelog --keep-version --debug
~~~
Image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1
~~~
[root@ip-172-18-1-1 ~]# oc exec controller-manager-wb4sh -- service-catalog --version
v3.11.51;Upstream:v0.1.35
~~~

2. Changed the port of the apiserver to 6444 and checked the events of the apiserver pod. The readiness/liveness probes start to work:
~~~
[root@ip-172-18-1-1 ~]# oc describe pods apiserver-6fhnp
...
Warning  Unhealthy  4s (x5 over 24s)  kubelet, ip-172-18-1-1.ec2.internal  Readiness probe failed: Get https://10.128.0.11:6443/healthz: dial tcp 10.128.0.11:6443: connect: connection refused
Normal   Pulled     2s (x2 over 58s)  kubelet, ip-172-18-1-1.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1" already present on machine
Normal   Created    2s (x2 over 58s)  kubelet, ip-172-18-1-1.ec2.internal  Created container
Normal   Started    2s (x2 over 57s)  kubelet, ip-172-18-1-1.ec2.internal  Started container
Warning  Unhealthy  2s (x3 over 22s)  kubelet, ip-172-18-1-1.ec2.internal  Liveness probe failed: Get https://10.128.0.11:6443/healthz: dial tcp 10.128.0.11:6443: connect: connection refused
Normal   Killing    2s                kubelet, ip-172-18-1-1.ec2.internal  Killing container with id docker://apiserver:Container failed liveness probe.. Container will be killed and recreated.
...
~~~
The apiserver restarted, so the liveness probe works well:
~~~
[root@ip-172-18-1-1 ~]# oc get pods
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-6fhnp            0/1       Running            1          1m
controller-manager-k7zcs   0/1       CrashLoopBackOff   3          4m
~~~
The apiserver cannot serve traffic, so the readiness probe works well:
~~~
[root@ip-172-18-1-1 ~]# oc get ep
NAME                 ENDPOINTS   AGE
apiserver                        3h
controller-manager               3h
~~~

3. Performed the same operation on the controller-manager of the service catalog. The readiness/liveness probes start to work:
~~~
[root@ip-172-18-1-1 ~]# oc describe pods controller-manager-xmwrj
...
Warning  Unhealthy  20s (x4 over 35s)  kubelet, ip-172-18-1-1.ec2.internal  Readiness probe failed: Get https://10.128.0.13:6443/healthz: dial tcp 10.128.0.13:6443: connect: connection refused
Warning  Unhealthy  19s (x3 over 39s)  kubelet, ip-172-18-1-1.ec2.internal  Liveness probe failed: Get https://10.128.0.13:6443/healthz: dial tcp 10.128.0.13:6443: connect: connection refused
Normal   Pulled     18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1" already present on machine
Normal   Created    18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Created container
Normal   Started    18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Started container
Normal   Killing    18s                kubelet, ip-172-18-1-1.ec2.internal  Killing container with id docker://controller-manager:Container failed liveness probe.. Container will be killed and recreated.
~~~
The pod restarted, so the liveness probe works well:
~~~
[root@ip-172-18-1-1 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-2sbn7            1/1       Running   0          9m
controller-manager-xmwrj   0/1       Running   1          1m
~~~
The controller-manager cannot serve traffic now, so the readiness probe works well:
~~~
[root@ip-172-18-1-1 ~]# oc get ep
NAME                 ENDPOINTS          AGE
apiserver            10.128.0.12:6443   3h
controller-manager                      3h
~~~
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3743