Bug 1630324

| Field | Value |
|---|---|
| Summary | Requirement of Liveness or Readiness probe in ds/controller-manager |
| Product | OpenShift Container Platform |
| Component | Service Catalog |
| Version | 3.9.0 |
| Target Release | 3.11.z |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Sudarshan Chaudhari <suchaudh> |
| Assignee | Jay Boyd <jaboyd> |
| QA Contact | Jian Zhang <jiazha> |
| CC | andreas.eger, jaboyd, jiazha, mrobson, rbost, steven.barre, suchaudh, zitang |
| Type | Bug |
| Doc Type | Bug Fix |
| Doc Text | Liveness & readiness probes have been added for the Service Catalog API Server and Controller Manager. If these pods stop responding, OpenShift will restart them. Previously there were no probes to monitor the health of Service Catalog. |
| Clones | 1647511 |
| Bug Blocks | 1647511 |
| Last Closed | 2018-12-12 14:15:51 UTC |
Description (Sudarshan Chaudhari, 2018-09-18 12:09:51 UTC)

Both the Service Catalog API Server and the Service Catalog Controller Manager have liveness & readiness probes that cause the pods to be restarted if they fail. From what I can tell here, you are encountering some infrastructure error that is preventing the Service Catalog controller from sending HTTP requests to the Template Service Broker. A failure to reach the Template Service Broker may or may not indicate a problem with Service Catalog itself. I don't think it would be appropriate for the probes to report Service Catalog as failed just because Catalog can't reach a specific broker (there are usually several brokers that Service Catalog communicates with). I'm inclined to close this as not a bug. Do you disagree?

Hi, we experienced this issue originally. We're running this version of OpenShift:

```
> oc version
oc v3.9.41
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://prd-rose-console.runtastic.com:443
openshift v3.9.27
kubernetes v1.9.1+a0ce1bc657
```

Neither the controller-manager nor the apiserver in the kube-service-catalog namespace has any readiness or liveness probes, as far as I can see. I attached the output of `oc export daemonsets -n kube-service-catalog > kube-service-catalog-daemonsets.yml`.

The Template Service Broker itself was healthy at the time the service catalog stopped working. A few days earlier we had to force a restart on one of the pods of the apiserver daemonset in openshift-template-service-broker (that one does seem to have a readiness probe, but it didn't prevent the pod from being effectively dead). When we noticed we could no longer provision templates, we first forced another restart of the Template Service Broker apiserver because we assumed it had stopped working again, but that didn't fix the issue. Only a forced restart of the controller-manager in kube-service-catalog resolved it. During the issue we also could not find any infrastructure or networking problem.

Created attachment 1484581 [details]
oc export daemonsets -n kube-service-catalog
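As a point of reference (not something from the original report), probes of this kind could also be added to the existing DaemonSets by hand while waiting for a fixed build. This is only a sketch, assuming the same `/healthz` endpoint on port 6443 that the probes added by the fix check in the verification below; the values that actually shipped may differ:

```
# Hypothetical workaround sketch: add liveness/readiness probes to the
# Service Catalog DaemonSets manually. Path, port, and delay are
# assumptions, not the shipped defaults.
oc set probe ds/apiserver -n kube-service-catalog \
  --liveness --readiness \
  --get-url=https://:6443/healthz \
  --initial-delay-seconds=30

oc set probe ds/controller-manager -n kube-service-catalog \
  --liveness --readiness \
  --get-url=https://:6443/healthz \
  --initial-delay-seconds=30
```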
I 100% agree there should be probes on both the API Server and the Controller Manager - we have them upstream in Kubernetes, which I verified when I wrote comment #2, but I failed to verify them in OpenShift. We'll get this addressed, thanks much for the bug report.

3.11.z: fixed by https://github.com/openshift/openshift-ansible/pull/10625, merged into 3.11.z.
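A hypothetical spot-check (not part of the QA verification that follows) to confirm that a given cluster actually picked up the new probe definitions after upgrading:

```
# Hypothetical spot-check: read the probe definitions back off the
# Service Catalog DaemonSets after upgrading to a fixed build.
oc get ds apiserver -n kube-service-catalog \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}'
oc get ds controller-manager -n kube-service-catalog \
  -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}{"\n"}'
```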
LGTM, verified it.

1. I installed/uninstalled the service catalog successfully via the openshift-ansible release-3.11 branch. Details below:

```
mac:openshift-ansible jianzhang$ git branch
  master
  release-3.10
* release-3.11
mac:openshift-ansible jianzhang$ git log
commit d96c05e14bf15882083acf15cb4a1018575037df (HEAD -> release-3.11, origin/release-3.11)
Merge: 51c90a343 de46e6cb5
Author: Scott Dodson <sdodson>
Date:   Mon Dec 3 16:30:40 2018 -0500

    Merge pull request #10789 from sdodson/bz1644416-2

    Also set etcd_cert_config_dir for calico

commit 51c90a34397afc65a8ffbd08c8a61c4a17298557
Author: AOS Automation Release Team <aos-team-art>
Date:   Mon Dec 3 00:24:50 2018 -0500

    Automatic commit of package [openshift-ansible] release [3.11.51-1].

    Created by command:
    /usr/bin/tito tag --debug --accept-auto-changelog --keep-version --debug
```

image: registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1

```
[root@ip-172-18-1-1 ~]# oc exec controller-manager-wb4sh -- service-catalog --version
v3.11.51;Upstream:v0.1.35
```

2. Change the port of the apiserver to 6444, then check the events of the apiserver pod. The readiness/liveness probes start to work:

```
[root@ip-172-18-1-1 ~]# oc describe pods apiserver-6fhnp
...
Warning  Unhealthy  4s (x5 over 24s)  kubelet, ip-172-18-1-1.ec2.internal  Readiness probe failed: Get https://10.128.0.11:6443/healthz: dial tcp 10.128.0.11:6443: connect: connection refused
Normal   Pulled     2s (x2 over 58s)  kubelet, ip-172-18-1-1.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1" already present on machine
Normal   Created    2s (x2 over 58s)  kubelet, ip-172-18-1-1.ec2.internal  Created container
Normal   Started    2s (x2 over 57s)  kubelet, ip-172-18-1-1.ec2.internal  Started container
Warning  Unhealthy  2s (x3 over 22s)  kubelet, ip-172-18-1-1.ec2.internal  Liveness probe failed: Get https://10.128.0.11:6443/healthz: dial tcp 10.128.0.11:6443: connect: connection refused
Normal   Killing    2s                kubelet, ip-172-18-1-1.ec2.internal  Killing container with id docker://apiserver:Container failed liveness probe.. Container will be killed and recreated.
...
```

The apiserver restarts, so the liveness probe works well:

```
[root@ip-172-18-1-1 ~]# oc get pods
NAME                       READY     STATUS             RESTARTS   AGE
apiserver-6fhnp            0/1       Running            1          1m
controller-manager-k7zcs   0/1       CrashLoopBackOff   3          4m
```

The apiserver cannot serve traffic, so the readiness probe works well (the pod is removed from the endpoints):

```
[root@ip-172-18-1-1 ~]# oc get ep
NAME                 ENDPOINTS   AGE
apiserver                        3h
controller-manager               3h
```

3. The same operation applied to the controller-manager of the service catalog. The readiness/liveness probes start to work:

```
[root@ip-172-18-1-1 ~]# oc describe pods controller-manager-xmwrj
...
Warning  Unhealthy  20s (x4 over 35s)  kubelet, ip-172-18-1-1.ec2.internal  Readiness probe failed: Get https://10.128.0.13:6443/healthz: dial tcp 10.128.0.13:6443: connect: connection refused
Warning  Unhealthy  19s (x3 over 39s)  kubelet, ip-172-18-1-1.ec2.internal  Liveness probe failed: Get https://10.128.0.13:6443/healthz: dial tcp 10.128.0.13:6443: connect: connection refused
Normal   Pulled     18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Container image "registry.reg-aws.openshift.com:443/openshift3/ose-service-catalog:v3.11.51-1" already present on machine
Normal   Created    18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Created container
Normal   Started    18s (x2 over 1m)   kubelet, ip-172-18-1-1.ec2.internal  Started container
Normal   Killing    18s                kubelet, ip-172-18-1-1.ec2.internal  Killing container with id docker://controller-manager:Container failed liveness probe.. Container will be killed and recreated.
```

The pod restarts, so the liveness probe works well:

```
[root@ip-172-18-1-1 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-2sbn7            1/1       Running   0          9m
controller-manager-xmwrj   0/1       Running   1          1m
```

The controller-manager cannot serve traffic now, so the readiness probe works well:

```
[root@ip-172-18-1-1 ~]# oc get ep
NAME                 ENDPOINTS          AGE
apiserver            10.128.0.12:6443   3h
controller-manager                      3h
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3743