Bug 1647511

Summary: Requirement of Liveness or Readiness probe in ds/controller-manager
Product: OpenShift Container Platform Reporter: Jay Boyd <jaboyd>
Component: Service Catalog Assignee: Jay Boyd <jaboyd>
Status: CLOSED ERRATA QA Contact: Jian Zhang <jiazha>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 3.10.0 CC: andreas.eger, chezhang, jaboyd, jiazha, mrobson, rbost, steven.barre, suchaudh, zitang
Target Milestone: ---   
Target Release: 3.10.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Liveness and readiness probes have been added for the Service Catalog API Server and Controller Manager. If these pods stop responding, OpenShift restarts them. Previously there were no probes to monitor the health of Service Catalog.
Story Points: ---
Clone Of: 1630324 Environment:
Last Closed: 2018-12-13 17:09:08 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1630324    
Bug Blocks:    

Comment 1 Jay Boyd 2018-11-07 18:43:59 UTC
Tracks delivering fix to 3.10.z.

Comment 2 Jay Boyd 2018-11-07 18:55:50 UTC
Fixed in 3.10.z by https://github.com/openshift/openshift-ansible/pull/10629

Comment 12 Jian Zhang 2018-12-04 06:24:07 UTC
I installed and uninstalled the Service Catalog successfully using the openshift-ansible release-3.10 branch. It works as expected. Marking as verified.

The openshift-ansible info:
mac:openshift-ansible jianzhang$ git branch
  master
* release-3.10
mac:openshift-ansible jianzhang$ git log
commit 12699eb551747059c7db622cadd9237dde84205b (HEAD -> release-3.10, origin/release-3.10)
Author: AOS Automation Release Team <aos-team-art>
Date:   Sat Dec 1 07:38:28 2018 -0500

    Automatic commit of package [openshift-ansible] release [3.10.83-1].
...


When I configure the Service Catalog controller-manager to use a different port (for example, 6444) while the probes still target 6443, we see the following:
1) The liveness probe works well.
[root@ip-172-18-9-32 ~]# oc describe pods controller-manager-6qr4k
...
  Normal   Created    13s (x3 over 1m)  kubelet, ip-172-18-9-32.ec2.internal  Created container
  Warning  Unhealthy  13s (x5 over 1m)  kubelet, ip-172-18-9-32.ec2.internal  Liveness probe failed: Get https://10.128.0.10:6443/healthz: dial tcp 10.128.0.10:6443: getsockopt: connection refused
  Normal   Killing    13s (x2 over 1m)  kubelet, ip-172-18-9-32.ec2.internal  Killing container with id docker://controller-manager:Container failed liveness probe.. Container will be killed and recreated.
  Normal   Started    12s (x3 over 1m)  kubelet, ip-172-18-9-32.ec2.internal  Started container

[root@ip-172-18-9-32 ~]# oc get pods
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-gkfjf            1/1       Running   0          42m
controller-manager-6qr4k   0/1       Running   2          1m

2) The readiness probe works as well: the pod no longer serves traffic, and the controller-manager service has no endpoints.
[root@ip-172-18-9-32 ~]# oc get ep
NAME                 ENDPOINTS         AGE
apiserver            10.128.0.8:6443   42m
controller-manager                     41m
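
The readiness check above can also be scripted. This is a minimal sketch, assuming the output of `oc get ep -o json` has already been parsed into a dict. The function name and the sample data are hypothetical; only the standard Endpoints fields `subsets`, `addresses`, and `notReadyAddresses` are relied on.

```python
import json


def endpoints_without_ready_addresses(ep_list):
    """Return the names of Endpoints objects that have no ready addresses."""
    not_serving = []
    for item in ep_list.get("items", []):
        subsets = item.get("subsets") or []
        # Only `addresses` holds pods that passed readiness;
        # `notReadyAddresses` holds pods that are failing it.
        ready = [a for s in subsets for a in (s.get("addresses") or [])]
        if not ready:
            not_serving.append(item["metadata"]["name"])
    return not_serving


# Hypothetical sample shaped like `oc get ep -o json` for the state shown above.
SAMPLE = json.loads("""
{
  "items": [
    {"metadata": {"name": "apiserver"},
     "subsets": [{"addresses": [{"ip": "10.128.0.8"}],
                  "ports": [{"port": 6443}]}]},
    {"metadata": {"name": "controller-manager"},
     "subsets": [{"notReadyAddresses": [{"ip": "10.128.0.10"}],
                  "ports": [{"port": 6443}]}]}
  ]
}
""")

print(endpoints_without_ready_addresses(SAMPLE))
```

With the sample data this reports only controller-manager, which matches the empty ENDPOINTS column in the `oc get ep` output.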

The same operations on the Service Catalog apiserver also work as expected.

[root@ip-172-18-9-32 ~]# oc exec controller-manager-sqbcz -- service-catalog --version
v3.10.83;Upstream:v0.1.19
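
For reproducing the probe result by hand, the following is a rough approximation of an HTTP(S) health probe, not the kubelet's actual implementation. It assumes the usual Kubernetes rule that a response status of at least 200 and below 400 counts as success, and it skips certificate verification as the kubelet does for HTTPS probes. The function name `http_probe` and the default timeout are hypothetical.

```python
import http.client
import ssl
import urllib.error
import urllib.request


def http_probe(url, timeout=1.0):
    """Return True if a GET to `url` answers with a status in [200, 400)."""
    # Assumption: certificate verification is skipped, as for kubelet HTTPS probes.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeouts, and 4xx/5xx responses all count as failure.
        return False
```

Run from a host that can reach the pod network, `http_probe("https://10.128.0.10:6443/healthz")` would be expected to return False in the misconfigured state above, because the connection is refused. Note that `urllib` follows redirects, so this differs slightly from a real probe.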

Comment 14 errata-xmlrpc 2018-12-13 17:09:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3750