Bug 1630324
| Field | Value |
|---|---|
| Summary: | Requirement of Liveness or Readiness probe in ds/controller-manager |
| Product: | OpenShift Container Platform |
| Component: | Service Catalog |
| Version: | 3.9.0 |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | unspecified |
| Reporter: | Sudarshan Chaudhari <suchaudh> |
| Assignee: | Jay Boyd <jaboyd> |
| QA Contact: | Jian Zhang <jiazha> |
| CC: | andreas.eger, jaboyd, jiazha, mrobson, rbost, steven.barre, suchaudh, zitang |
| Target Milestone: | --- |
| Target Release: | 3.11.z |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Doc Type: | Bug Fix |
| Doc Text: | Liveness and readiness probes have been added for the Service Catalog API Server and Controller Manager. If these pods stop responding, OpenShift will restart them. Previously there were no probes to monitor the health of Service Catalog. |
| Clones: | 1647511 (view as bug list) |
| Bug Blocks: | 1647511 |
| Type: | Bug |
| Last Closed: | 2018-12-12 14:15:51 UTC |
Description — Sudarshan Chaudhari, 2018-09-18 12:09:51 UTC
Both the Service Catalog API Server and the Service Catalog Controller Manager have liveness & readiness probes that cause the pods to be restarted if they fail. From what I can tell here, you are encountering some infrastructure error that is preventing the Service Catalog controller from sending HTTP requests to the Template Service Broker. Failures to reach the Template Service Broker may or may not be an indication of Service Catalog health. I don't think it would be appropriate for the probes to indicate Service Catalog failure just because Catalog can't reach a specific broker (there are usually several brokers that Service Catalog communicates with). I'm inclined to close this as not a bug. Do you disagree?

Hi, we experienced this issue originally. We're running this version of OpenShift:

```
> oc version
oc v3.9.41
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://prd-rose-console.runtastic.com:443
openshift v3.9.27
kubernetes v1.9.1+a0ce1bc657
```

Neither the controller-manager nor the apiserver in the kube-service-catalog namespace has any readiness or liveness probes, as far as I can see. I attached the output of `oc export daemonsets -n kube-service-catalog > kube-service-catalog-daemonsets.yml`.

The Template Service Broker itself was healthy at the time the service catalog stopped working. A few days earlier we had to force a restart of one of the pods of the apiserver daemonset in openshift-template-service-broker (that one does seem to have a readiness probe, but it didn't prevent the pod from being effectively dead). When we noticed we could no longer provision templates, we first forced another restart of the Template Service Broker apiserver, assuming it had stopped working again, but that didn't fix the issue. Only a forced restart of the controller-manager in kube-service-catalog resolved it. During the incident we also could not find any infrastructure or networking issue.

Created attachment 1484581 [details]
oc export daemonsets -n kube-service-catalog
I 100% agree there should be probes on both the API Server and the Controller Manager - we have them upstream in Kubernetes, which I verified when I wrote comment #2 but failed to verify in OpenShift. We'll get this addressed, thanks much for the bug report.

3.11.z fixed by https://github.com/openshift/openshift-ansible/pull/10625, merged into 3.11.z.

LGTM, verify it.
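For reference, a daemonset container gains such probes via `livenessProbe`/`readinessProbe` stanzas in its pod spec. The sketch below is illustrative only — the path, port, and timing values are assumptions for this write-up, not necessarily the exact values merged in the openshift-ansible PR:

```yaml
# Illustrative sketch — values are assumptions, not the exact ones from the PR.
livenessProbe:
  httpGet:
    path: /healthz
    port: 6443
    scheme: HTTPS
  initialDelaySeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 6443
    scheme: HTTPS
  periodSeconds: 10
  failureThreshold: 1
```

A failing liveness probe makes the kubelet kill and recreate the container; a failing readiness probe leaves the container running but removes the pod from its service endpoints.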
1, I installed/uninstalled the service catalog successfully via the openshift-ansible release-3.11 branch. Details as below:

```
mac:openshift-ansible jianzhang$ git branch
  master
  release-3.10
* release-3.11
mac:openshift-ansible jianzhang$ git log
commit d96c05e14bf15882083acf15cb4a1018575037df (HEAD -> release-3.11, origin/release-3.11)
Merge: 51c90a343 de46e6cb5
Author: Scott Dodson <sdodson>
Date:   Mon Dec 3 16:30:40 2018 -0500

    Merge pull request #10789 from sdodson/bz1644416-2

    Also set etcd_cert_config_dir for calico

commit 51c90a34397afc65a8ffbd08c8a61c4a17298557
Author: AOS Automation Release Team <aos-team-art>
Date:   Mon Dec 3 00:24:50 2018 -0500

    Automatic commit of package [openshift-ansible] release [3.11.51-1].

    Created by command:

    /usr/bin/tito tag --debug --accept-auto-changelog --keep-version --debug
```

image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1

```
[root@ip-172-18-1-1 ~]# oc exec controller-manager-wb4sh -- service-catalog --version
v3.11.51;Upstream:v0.1.35
```
2, Change the port of the apiserver to 6444, then check the events of the apiserver pod. We can see the Readiness/Liveness probes start to work:

```
[root@ip-172-18-1-1 ~]# oc describe pods apiserver-6fhnp
...
Warning  Unhealthy  4s (x5 over 24s)  kubelet, ip-172-18-1-1.ec2.internal  Readiness probe failed: Get https://10.128.0.11:6443/healthz: dial tcp 10.128.0.11:6443: connect: connection refused
Normal   Pulled     2s (x2 over 58s)  kubelet, ip-172-18-1-1.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1" already present on machine
Normal   Created    2s (x2 over 58s)  kubelet, ip-172-18-1-1.ec2.internal  Created container
Normal   Started    2s (x2 over 57s)  kubelet, ip-172-18-1-1.ec2.internal  Started container
Warning  Unhealthy  2s (x3 over 22s)  kubelet, ip-172-18-1-1.ec2.internal  Liveness probe failed: Get https://10.128.0.11:6443/healthz: dial tcp 10.128.0.11:6443: connect: connection refused
Normal   Killing    2s                kubelet, ip-172-18-1-1.ec2.internal  Killing container with id docker://apiserver:Container failed liveness probe.. Container will be killed and recreated.
...
```

The apiserver restarts, so the Liveness probe works well:

```
[root@ip-172-18-1-1 ~]# oc get pods
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-6fhnp            0/1       Running            1          1m
controller-manager-k7zcs   0/1       CrashLoopBackOff   3          4m
```

The apiserver cannot serve traffic, so the Readiness probe works well — the pod has been removed from the endpoints:

```
[root@ip-172-18-1-1 ~]# oc get ep
NAME                 ENDPOINTS   AGE
apiserver                        3h
controller-manager               3h
```
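The "connection refused" failures in the events above are exactly what the kubelet's HTTPS probe sees when nothing answers on the probed port. As a rough illustration (not the kubelet's actual implementation), an httpGet probe boils down to a GET against /healthz where any connection error or bad status counts as a failure:

```python
# Rough illustration of HTTPS liveness/readiness probe semantics —
# NOT the kubelet's actual code.
import http.client
import ssl

def probe(host: str, port: int, path: str = "/healthz", timeout: float = 1.0) -> bool:
    """Return True if the endpoint answers with a 2xx/3xx status, False otherwise."""
    # The kubelet does not verify the serving certificate for HTTPS probes,
    # so this sketch disables verification as well.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=ctx)
        conn.request("GET", path)
        ok = 200 <= conn.getresponse().status < 400
        conn.close()
        return ok
    except OSError:
        # "connection refused", timeouts, and TLS errors all count as a
        # probe failure, matching the events quoted above.
        return False

# With the server moved to 6444 and the probe still aimed at the old port,
# nothing is listening there and the probe fails:
print(probe("127.0.0.1", 6444))  # False when nothing is listening on that port
```

After `failureThreshold` consecutive failures the kubelet kills the container (liveness) or marks the pod NotReady and drops it from the endpoints (readiness), which is the behavior verified in steps 2 and 3.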
3, Apply the same operation to the controller-manager of the service catalog. We can see the Readiness/Liveness probes start to work:

```
[root@ip-172-18-1-1 ~]# oc describe pods controller-manager-xmwrj
...
Warning  Unhealthy  20s (x4 over 35s)  kubelet, ip-172-18-1-1.ec2.internal  Readiness probe failed: Get https://10.128.0.13:6443/healthz: dial tcp 10.128.0.13:6443: connect: connection refused
Warning  Unhealthy  19s (x3 over 39s)  kubelet, ip-172-18-1-1.ec2.internal  Liveness probe failed: Get https://10.128.0.13:6443/healthz: dial tcp 10.128.0.13:6443: connect: connection refused
Normal   Pulled     18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1" already present on machine
Normal   Created    18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Created container
Normal   Started    18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Started container
Normal   Killing    18s                kubelet, ip-172-18-1-1.ec2.internal  Killing container with id docker://controller-manager:Container failed liveness probe.. Container will be killed and recreated.
```

The pod restarts, so the Liveness probe works well:

```
[root@ip-172-18-1-1 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-2sbn7            1/1       Running   0          9m
controller-manager-xmwrj   0/1       Running   1          1m
```

The controller-manager cannot serve traffic now, so the Readiness probe works well:

```
[root@ip-172-18-1-1 ~]# oc get ep
NAME                 ENDPOINTS          AGE
apiserver            10.128.0.12:6443   3h
controller-manager                      3h
```
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3743