Bug 1630324

Summary: Requirement of Liveness or Readiness probe in ds/controller-manager
Product: OpenShift Container Platform
Reporter: Sudarshan Chaudhari <suchaudh>
Component: Service Catalog
Assignee: Jay Boyd <jaboyd>
Status: CLOSED ERRATA
QA Contact: Jian Zhang <jiazha>
Severity: medium
Priority: unspecified
Version: 3.9.0
CC: andreas.eger, jaboyd, jiazha, mrobson, rbost, steven.barre, suchaudh, zitang
Target Release: 3.11.z
Doc Type: Bug Fix
Doc Text:
Liveness & Readiness probes have been added for the Service Catalog API Server and Controller Manager. If these pods stop responding, OpenShift will restart them. Previously there were no probes to monitor the health of Service Catalog.
Clones: 1647511
Bug Blocks: 1647511
Last Closed: 2018-12-12 14:15:51 UTC
Type: Bug
Attachments:
oc export daemonsets -n kube-service-catalog

Description Sudarshan Chaudhari 2018-09-18 12:09:51 UTC
Description of problem:

The controller-manager pod on one of the nodes stopped responding.

Because the controller-manager stopped responding, application deployment using a template was not working.

Error received:
~~~
The service is not yet ready. The provision call failed and will be retried: Error communicating with broker for provisioning: Put https://apiserver.openshift-template-service-broker.svc:443/brokers/template.openshift.io/v2/service_instances/8af1f3b4-ebaa-4ffe-b2ef-abbc6456a56d?accepts_incomplete=true: dial tcp X.X.226.105:443: connect: cannot assign requested address
~~~

Logs from the Template Service Broker apiserver pod:
~~~
$ oc logs apiserver-f2xx4 -n openshift-template-service-broker
I0910 05:53:05.906810       1 serve.go:89] Serving securely on [::]:8443
I0910 05:53:05.907700       1 controller_utils.go:1019] Waiting for caches to sync for tsb controller
I0910 05:53:06.007863       1 controller_utils.go:1026] Caches are synced for tsb controller
W0910 06:00:06.012356       1 reflector.go:341] github.com/openshift/origin/vendor/github.com/openshift/client-go/template/informers/externalversions/factory.go:57: watch of *v1.Template ended with: The resourceVersion for the provided watch is too old.
~~~

To troubleshoot, ran the following commands against the controller-manager pods in the kube-service-catalog namespace:
~~~
$ oc project kube-service-catalog

$ oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-bxlxf            1/1       Running   0          1d
apiserver-dkhz5            1/1       Running   0          1d
apiserver-hmk5r            1/1       Running   0          1d
controller-manager-9rh5l   1/1       Running   0          1d
controller-manager-mhdqk   1/1       Running   0          1d
controller-manager-s4662   1/1       Running   0          1d

$ for pod in $(oc get pod -o name | grep controller-manager | awk '{print $1}'); do echo "--> $pod"; oc rsh $pod curl -kv https://apiserver.openshift-template-service-broker.svc:443 ; done
--> pods/controller-manager-9rh5l
* About to connect() to apiserver.openshift-template-service-broker.svc port 443 (#0)
*   Trying X.X.226.105...
* Failed to connect to X.X.226.105: Cannot assign requested address
* couldn't connect to host at apiserver.openshift-template-service-broker.svc:443
* Closing connection 0
curl: (7) Failed to connect to X.X.226.105: Cannot assign requested address
command terminated with exit code 7
--> pods/controller-manager-mhdqk
* About to connect() to apiserver.openshift-template-service-broker.svc port 443 (#0)
*   Trying X.X.226.105...
* Connected to apiserver.openshift-template-service-broker.svc (X.X.226.105) port 443 (#0)
[ curl output omitted ]
--> pods/controller-manager-s4662
* About to connect() to apiserver.openshift-template-service-broker.svc port 443 (#0)
*   Trying X.X.226.105...
* Connected to apiserver.openshift-template-service-broker.svc (X.X.226.105) port 443 (#0)
[ curl output omitted ]
~~~

After deleting the pod where the curl command was failing, the template provisioning worked as expected.
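
(For reference, that amounts to something like the following, using the pod that failed the curl above:)
~~~
# Delete the unresponsive pod; the daemonset schedules a replacement.
oc -n kube-service-catalog delete pod controller-manager-9rh5l
~~~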

Query:

The controller-manager is deployed using a daemonset, and the service inside the pod needs to be running properly. Can readiness/liveness probes be added to the daemonset to check whether the service is healthy?

Expected results:
The daemonset should have a liveness probe to check whether the service is responsive; see the sketch below.
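
For illustration only, probes along these lines could even be added by hand with `oc set probe`; the port and path are assumptions (the controller-manager exposing /healthz on its secure port 6443), not a statement of the actual fix:
~~~
# Sketch, not the actual fix: add liveness/readiness probes to the daemonset.
# Assumes the controller-manager serves /healthz over HTTPS on port 6443.
oc -n kube-service-catalog set probe ds/controller-manager \
    --liveness --readiness \
    --get-url=https://:6443/healthz \
    --initial-delay-seconds=30 --timeout-seconds=5
~~~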

Comment 1 Jay Boyd 2018-09-18 14:29:19 UTC
Both the Service Catalog API Server and the Service Catalog Controller Manager have liveness & readiness probes that cause the pods to be restarted if they fail. From what I can tell here, you are encountering some infrastructure error that is preventing the Service Catalog controller from sending HTTP requests to the Template Service Broker. Failures to reach the Template Service Broker may or may not be an indication of Service Catalog health. I don't think it would be appropriate for the probes to indicate Service Catalog failure just because Catalog can't reach a specific Broker (there are usually several brokers that Service Catalog communicates with).

I'm inclined to close this as not a bug.  Do you disagree?

Comment 2 andreas.eger 2018-09-19 07:08:08 UTC
Hi, we experienced this issue originally.

We're running this version of openshift:

```
> oc version
oc v3.9.41
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://prd-rose-console.runtastic.com:443
openshift v3.9.27
kubernetes v1.9.1+a0ce1bc657
```

Neither the controller-manager nor the apiserver in the kube-service-catalog namespace has any readiness or liveness probes, AFAICS. I attached the output of `oc export daemonsets -n kube-service-catalog > kube-service-catalog-daemonsets.yml`.


The Template Service Broker itself was healthy at the time the service-catalog stopped working. A few days before, we had to force a restart of one of the pods of the apiserver daemonset in openshift-template-service-broker. (That one seems to have a readiness probe, but it didn't prevent the pod from being effectively dead.)

When we noticed the problems with provisioning templates, we first forced another restart of the apiserver of the template service broker, because we assumed it had stopped working again. But that didn't fix the issue. Only a forced restart of the controller-manager in kube-service-catalog solved it.

During the issue we also could not find any infrastructure / networking issue.

Comment 3 andreas.eger 2018-09-19 07:08:38 UTC
Created attachment 1484581 [details]
oc export daemonsets -n kube-service-catalog

Comment 5 Jay Boyd 2018-09-25 13:09:37 UTC
I 100% agree there should be probes on both the API Server and the Controller Manager - we have them upstream in Kubernetes, which I verified when I wrote comment #2, but I failed to verify this in OpenShift. We'll get this addressed, thanks much for the bug report.
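
(Once the fix lands, a quick way to confirm the probes are in place might be something like this; the jsonpath is illustrative:)
~~~
# Illustrative check: print the liveness probe defined on each catalog daemonset.
oc -n kube-service-catalog get ds \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[0].livenessProbe}{"\n"}{end}'
~~~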

Comment 8 Jay Boyd 2018-11-07 16:18:44 UTC
3.11.z fixed by https://github.com/openshift/openshift-ansible/pull/10625

Comment 9 Jay Boyd 2018-11-08 16:11:54 UTC
merged into 3.11.z

Comment 15 Jian Zhang 2018-12-04 13:20:15 UTC
LGTM, verified it.
1, I installed/uninstalled the service catalog successfully via the openshift-ansible release-3.11 branch. Details below:
mac:openshift-ansible jianzhang$ git branch
  master
  release-3.10
* release-3.11
mac:openshift-ansible jianzhang$ git log
commit d96c05e14bf15882083acf15cb4a1018575037df (HEAD -> release-3.11, origin/release-3.11)
Merge: 51c90a343 de46e6cb5
Author: Scott Dodson <sdodson>
Date:   Mon Dec 3 16:30:40 2018 -0500

    Merge pull request #10789 from sdodson/bz1644416-2
    
    Also set etcd_cert_config_dir for calico

commit 51c90a34397afc65a8ffbd08c8a61c4a17298557
Author: AOS Automation Release Team <aos-team-art>
Date:   Mon Dec 3 00:24:50 2018 -0500

    Automatic commit of package [openshift-ansible] release [3.11.51-1].
    
    Created by command:
    
    /usr/bin/tito tag --debug --accept-auto-changelog --keep-version --debug

image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1
[root@ip-172-18-1-1 ~]# oc exec controller-manager-wb4sh -- service-catalog --version
v3.11.51;Upstream:v0.1.35

2, Change the port of the apiserver to 6444 so that the probes, which target 6443, begin to fail; one way to make the change is sketched below. Afterwards, check the events of the apiserver pod: the Readiness/Liveness probes start to work.
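
~~~
# Sketch of the reproduction step (assumed, not necessarily the exact command used):
# edit the daemonset and change the apiserver's secure port, e.g.
#   --secure-port=6443  ->  --secure-port=6444
# so that the probes, which still target 6443, begin to fail.
oc -n kube-service-catalog edit ds/apiserver
~~~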
[root@ip-172-18-1-1 ~]# oc describe pods apiserver-6fhnp
...
  Warning  Unhealthy  4s (x5 over 24s)  kubelet, ip-172-18-1-1.ec2.internal  Readiness probe failed: Get https://10.128.0.11:6443/healthz: dial tcp 10.128.0.11:6443: connect: connection refused
  Normal   Pulled     2s (x2 over 58s)  kubelet, ip-172-18-1-1.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1" already present on machine
  Normal   Created    2s (x2 over 58s)  kubelet, ip-172-18-1-1.ec2.internal  Created container
  Normal   Started    2s (x2 over 57s)  kubelet, ip-172-18-1-1.ec2.internal  Started container
  Warning  Unhealthy  2s (x3 over 22s)  kubelet, ip-172-18-1-1.ec2.internal  Liveness probe failed: Get https://10.128.0.11:6443/healthz: dial tcp 10.128.0.11:6443: connect: connection refused
  Normal   Killing    2s                kubelet, ip-172-18-1-1.ec2.internal  Killing container with id docker://apiserver:Container failed liveness probe.. Container will be killed and recreated.
...

The apiserver restarts, so the Liveness probe works well.
[root@ip-172-18-1-1 ~]# oc get pods
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-6fhnp            0/1       Running            1          1m
controller-manager-k7zcs   0/1       CrashLoopBackOff   3          4m

The apiserver cannot serve traffic, so it is removed from the endpoints; the Readiness probe works well.
[root@ip-172-18-1-1 ~]# oc get ep
NAME                 ENDPOINTS   AGE
apiserver                        3h
controller-manager               3h

3, Apply the same operation to the controller-manager of the service catalog.
We can see the Readiness/Liveness probes start to work.

[root@ip-172-18-1-1 ~]# oc describe pods controller-manager-xmwrj
...
  Warning  Unhealthy  20s (x4 over 35s)  kubelet, ip-172-18-1-1.ec2.internal  Readiness probe failed: Get https://10.128.0.13:6443/healthz: dial tcp 10.128.0.13:6443: connect: connection refused
  Warning  Unhealthy  19s (x3 over 39s)  kubelet, ip-172-18-1-1.ec2.internal  Liveness probe failed: Get https://10.128.0.13:6443/healthz: dial tcp 10.128.0.13:6443: connect: connection refused
  Normal   Pulled     18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1" already present on machine
  Normal   Created    18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Created container
  Normal   Started    18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Started container
  Normal   Killing    18s                kubelet, ip-172-18-1-1.ec2.internal  Killing container with id docker://controller-manager:Container failed liveness probe.. Container will be killed and recreated.

The pod restarts, so the Liveness probe works well.
[root@ip-172-18-1-1 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-2sbn7            1/1       Running   0          9m
controller-manager-xmwrj   0/1       Running   1          1m

The controller-manager cannot serve traffic now, so the Readiness probe works well.
[root@ip-172-18-1-1 ~]# oc get ep
NAME                 ENDPOINTS          AGE
apiserver            10.128.0.12:6443   3h
controller-manager                      3h

Comment 17 errata-xmlrpc 2018-12-12 14:15:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3743