Bug 1656925

Summary: Upgrade Fails at TASK [ansible_service_broker : Create the Broker resource in the catalog]
Product: OpenShift Container Platform Reporter: Josh Foots <jfoots>
Component: Service Catalog Assignee: Jay Boyd <jaboyd>
Status: CLOSED ERRATA QA Contact: Jian Zhang <jiazha>
Severity: high Docs Contact:
Priority: high    
Version: 3.10.0 CC: aos-bugs, chezhang, cshereme, cstark, erjones, jaboyd, jack.ottofaro, jmatthew, jmontleo, jokerman, knakayam, mirollin, mmariyan, mmccomas, ndordet, rbost, sudpande, zitang
Target Milestone: ---   
Target Release: 3.10.z   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The 3.10.72 update added health and liveness probes for the Service Catalog pods. Install was not waiting for the update rollout to finish before proceeding to update Ansible Service Broker. Because of timing, the Service Catalog pods were unavailable when the Broker attempted to register.
Consequence: Ansible Service Broker update failed with an error indicating "the server is currently unable to handle the request (post clusterservicebrokers.servicecatalog.k8s.io)".
Fix: Installation was updated to wait for the Service Catalog update rollout to finish before proceeding with installing Ansible Service Broker.
Story Points: ---
Clone Of:
: 1661569 (view as bug list) Environment:
Last Closed: 2019-01-30 15:13:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1661569    
Attachments:
Description | Flags
oc describe output in answer to comment #17 | none
oc describe etc output in answer to comment #21 | none
test fix - wait for Service Catalog rollout to finish | none

Comment 17 Jay Boyd 2018-12-12 13:49:48 UTC
@Jack- I'm looking for the 'oc describe pod' output for the apiserver pods in the kube-service-catalog namespace, i.e.:

oc get pods -n kube-service-catalog

and then for each of the apiserver pods listed:

oc describe pod -n kube-service-catalog  api-server-pod-name

Given https://bugzilla.redhat.com/show_bug.cgi?id=1656925#c16 I have a feeling the events listed are going to indicate the pod was restarted or taken out of service because of a liveness or readiness probe failure.  It would help if anyone can confirm this.
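
The two commands above can be combined into a single loop. This is only a minimal sketch: the function name is hypothetical, and it assumes the apiserver pod names begin with `apiserver-` and that `oc get -o name` prints one TYPE/NAME entry per line.

```shell
# Hypothetical helper (not part of openshift-ansible): run `oc describe`
# on every apiserver pod in the kube-service-catalog namespace.
describe_catalog_apiservers() {
    ns="kube-service-catalog"
    # `-o name` prints one TYPE/NAME entry per line; keep only the
    # entries whose name starts with "apiserver-" (an assumption).
    oc get pods -n "$ns" -o name | grep '/apiserver-' | while read -r pod; do
        oc describe -n "$ns" "$pod"
    done
}
```

Calling `describe_catalog_apiservers > apiserver-describe.txt` would collect the output in one file for attaching to the bug; the file name is only an example.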

Comment 18 Jay Boyd 2018-12-12 14:21:34 UTC
Regarding comment #17: if the oc describe output indicates the pods are being restarted because of liveness probe failures, I'd really like to get the associated log output for the `apiserver` container within the affected pods during the time interval of the failures.
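
One possible way to collect those logs is sketched below. The function name and the default one-hour window are assumptions for illustration; `-c`, `--since` and `--previous` are standard `oc logs` options.

```shell
# Hypothetical helper: print logs of the `apiserver` container for one pod.
#   $1 = pod name
#   $2 = optional look-back window (default 1h is an assumed value)
catalog_apiserver_logs() {
    pod="$1"
    since="${2:-1h}"
    ns="kube-service-catalog"
    # Logs of the currently running container instance.
    oc logs -n "$ns" "$pod" -c apiserver --since="$since"
    # Logs of the previous instance, which exist only if the container was
    # restarted (for example after a liveness probe failure); the error
    # returned when there is no previous instance is ignored.
    oc logs -n "$ns" "$pod" -c apiserver --previous || true
}
```

The window should be chosen to cover the time of the failed upgrade task.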

Comment 19 Jack Ottofaro 2018-12-12 16:53:40 UTC
Created attachment 1513720 [details]
oc describe output in answer to comment #17

Comment 23 Jack Ottofaro 2018-12-13 14:24:08 UTC
Created attachment 1514073 [details]
oc describe etc output in answer to comment #21

Comment 27 btai 2018-12-13 17:39:48 UTC
I stumbled on the same bug when trying to upgrade from 3.9 to 3.10; it failed in the post control plane upgrade. For us it was also a timing issue,
because the run of this play
/usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/upgrade_components.yml
launches the api-server and controller manager successfully. But after that the ASB post fails with this error: "stderr": "Error from server (ServiceUnavailable):

Having verified that both the api-server and the controller-manager were running in kube-service-catalog, I launched the playbook /usr/share/ansible/openshift-ansible/playbooks/common/openshift-cluster/upgrades/upgrade_components.yml again, but this time with the service catalog part commented out:

 tasks:
#  - import_role:
#      name: openshift_service_catalog
#      tasks_from: install.yml
#    when:
#    - openshift_enable_service_catalog | default(true) | bool

Then the playbook run succeeded.

Comment 30 Jay Boyd 2018-12-14 10:06:20 UTC
*** Bug 1658018 has been marked as a duplicate of this bug. ***

Comment 31 Jay Boyd 2018-12-14 10:07:14 UTC
*** Bug 1659198 has been marked as a duplicate of this bug. ***

Comment 32 Jay Boyd 2018-12-14 10:20:56 UTC
Created attachment 1514320 [details]
test fix - wait for Service Catalog rollout to finish

Test fix that waits for the rollout of Service Catalog before proceeding.  This file would replace openshift-ansible/roles/openshift_service_catalog/tasks/start.yml in both 3.10 and 3.11.

Comment 33 Jay Boyd 2018-12-14 10:23:14 UTC
I have attached a test fix that waits for the rollout of Service Catalog before proceeding.  This file would replace openshift-ansible/roles/openshift_service_catalog/tasks/start.yml in both 3.10 and 3.11.  I'd appreciate feedback from anyone that is encountering this error and is willing to retry with this in place.
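
The attached file is not reproduced here. As a rough shell equivalent of what such a wait does, the sketch below retries the `oc rollout status` command that appears in the verification log in comment 39. The function name, attempt count and delay are hypothetical and are not taken from the fix.

```shell
# Hypothetical retry wrapper around `oc rollout status`.
#   $1 = rollout target, e.g. ds/apiserver or ds/controller-manager
#   $2 = maximum attempts (illustrative default)
#   $3 = seconds to sleep between attempts (illustrative default)
wait_for_catalog_rollout() {
    target="$1"
    attempts="${2:-5}"
    delay="${3:-10}"
    i=1
    while :; do
        # Command and kubeconfig path as shown in the comment 39 log.
        if oc rollout status --config=/etc/origin/master/admin.kubeconfig \
              -n kube-service-catalog "$target"; then
            return 0
        fi
        if [ "$i" -ge "$attempts" ]; then
            break
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "rollout of $target did not finish after $attempts attempts" >&2
    return 1
}
```

Running it for `ds/apiserver` and then `ds/controller-manager` before the broker is registered mirrors the order of the two wait tasks seen in the log.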

Comment 35 Jay Boyd 2018-12-17 17:17:18 UTC
Has anyone else attempted to work through this issue with the attached fix?  We haven't been able to reproduce the original issue here and I'd like to get additional confirmation this test fix works for multiple deployments.

Comment 36 Robert Bost 2018-12-20 22:04:46 UTC
(In reply to Jay Boyd from comment #35)
> Has anyone else attempted to work through this issue with the attached fix? 
> We haven't been able to reproduce the original issue here and I'd like to
> get additional confirmation this test fix works for multiple deployments.

I had a customer seeing this issue. Using the patched start.yml that introduced the wait tasks allowed us to move past the "Create the Broker resource in the catalog" task.

Comment 37 Jay Boyd 2018-12-21 15:37:06 UTC
Thanks Robert.

I have delivered this fix to 3.10.z with https://github.com/openshift/openshift-ansible/pull/10883

and created https://bugzilla.redhat.com/show_bug.cgi?id=1661569 for tracking delivery to 3.11.z

Comment 39 Jian Zhang 2019-01-17 11:25:47 UTC
LGTM, verified. Details below:

1. Original OCP cluster, 3.10.45:
[root@ip-172-18-4-171 ~]# oc version
oc v3.10.45
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-4-171.ec2.internal:8443
openshift v3.10.45
kubernetes v1.10.0+b81c8f8

[root@ip-172-18-4-171 ~]# oc get pods -n kube-service-catalog 
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-x79fv            1/1       Running   0          11m
controller-manager-cptk9   1/1       Running   0          11m
[root@ip-172-18-4-171 ~]# oc get clusterservicebroker
NAME                      AGE
ansible-service-broker    10m
template-service-broker   10m

2. Upgraded it to the latest version of 3.10. The upgrade succeeded.
[root@ip-172-18-4-171 ~]# oc version
oc v3.10.101
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-4-171.ec2.internal:8443
openshift v3.10.102
kubernetes v1.10.0+b81c8f8

[root@ip-172-18-4-171 ~]# oc get pods -n kube-service-catalog
NAME                       READY     STATUS    RESTARTS   AGE
apiserver-s2xt9            1/1       Running   0          12m
controller-manager-k8fjk   1/1       Running   0          12m

Correlating logs:

TASK [openshift_service_catalog : Wait for API Server rollout success] *********
task path: /usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/start.yml:2
Thursday 17 January 2019  11:04:12 +0000 (0:00:00.084)       0:18:35.484 ****** 
ok: [ec2-3-90-13-179.compute-1.amazonaws.com] => {"attempts": 1, "changed": false, "cmd": ["oc", "rollout", "status", "--config=/etc/origin/master/admin.kubeconfig", "-n", "kube-service-catalog", "ds/apiserver"], "delta": "0:00:37.036247", "end": "2019-01-17 06:05:21.965431", "failed": false, "rc": 0, "start": "2019-01-17 06:04:44.929184", "stderr": "", "stderr_lines": [], "stdout": "Waiting for rollout to finish: 0 out of 1 new pods have been updated...\nWaiting for rollout to finish: 0 out of 1 new pods have been updated...\nWaiting for rollout to finish: 0 of 1 updated pods are available...\ndaemon set \"apiserver\" successfully rolled out", "stdout_lines": ["Waiting for rollout to finish: 0 out of 1 new pods have been updated...", "Waiting for rollout to finish: 0 out of 1 new pods have been updated...", "Waiting for rollout to finish: 0 of 1 updated pods are available...", "daemon set \"apiserver\" successfully rolled out"]}

TASK [openshift_service_catalog : Wait for Controller Manager rollout success] ***
task path: /usr/share/ansible/openshift-ansible/roles/openshift_service_catalog/tasks/start.yml:14
Thursday 17 January 2019  11:04:49 +0000 (0:00:37.294)       0:19:12.779 ****** 
ok: [ec2-3-90-13-179.compute-1.amazonaws.com] => {"attempts": 1, "changed": false, "cmd": ["oc", "rollout", "status", "--config=/etc/origin/master/admin.kubeconfig", "-n", "kube-service-catalog", "ds/controller-manager"], "delta": "0:00:07.944249", "end": "2019-01-17 06:05:30.194916", "failed": false, "rc": 0, "start": "2019-01-17 06:05:22.250667", "stderr": "", "stderr_lines": [], "stdout": "Waiting for rollout to finish: 0 of 1 updated pods are available...\ndaemon set \"controller-manager\" successfully rolled out", "stdout_lines": ["Waiting for rollout to finish: 0 of 1 updated pods are available...", "daemon set \"controller-manager\" successfully rolled out"]}

...

TASK [ansible_service_broker : Create the Broker resource in the catalog] ******
task path: /usr/share/ansible/openshift-ansible/roles/ansible_service_broker/tasks/install.yml:217
Thursday 17 January 2019  11:05:23 +0000 (0:00:00.035)       0:19:46.793 ****** 
changed: [ec2-3-90-13-179.compute-1.amazonaws.com] => {"changed": true, "failed": false, "results": {"cmd": "/usr/bin/oc get ClusterServiceBroker ansible-service-broker -o json -n default", "results": [{"apiVersion": "servicecatalog.k8s.io/v1beta1", "kind": "ClusterServiceBroker", "metadata": {"creationTimestamp": "2019-01-17T09:49:30Z", "generation": 1, "name": "ansible-service-broker", "resourceVersion": "14742", "selfLink": "/apis/servicecatalog.k8s.io/v1beta1/clusterservicebrokers/ansible-service-broker", "uid": "2f5888c4-1a3d-11e9-9bbe-0a580a800005"}, "spec": {"authInfo": {"bearer": {"secretRef": {"name": "asb-client", "namespace": "openshift-ansible-service-broker"}}}, "caBundle": "xxx", "relistBehavior": "Duration", "relistDuration": "15m0s", "relistRequests": 0, "url": "https://asb.openshift-ansible-service-broker.svc:1338/ansible-service-broker"}, "status": {"conditions": [{"lastTransitionTime": "2019-01-17T09:50:00Z", "message": "Successfully fetched catalog entries from broker.", "reason": "FetchedCatalog", "status": "True", "type": "Ready"}], "lastCatalogRetrievalTime": "2019-01-17T10:57:40Z", "reconciledGeneration": 1}}], "returncode": 0}, "state": "present"}
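
The task result above includes the broker's `Ready` condition. A minimal sketch of checking that condition directly is shown below; the function name is hypothetical, and it assumes the `oc` client in use supports jsonpath filter expressions.

```shell
# Hypothetical check: succeed only if the named ClusterServiceBroker
# reports a condition of type Ready with status True.
#   $1 = broker name, e.g. ansible-service-broker
broker_is_ready() {
    status=$(oc get clusterservicebroker "$1" \
        -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
    [ "$status" = "True" ]
}
```

For example, `broker_is_ready ansible-service-broker && echo ready` prints `ready` only when the condition is True.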

Comment 41 errata-xmlrpc 2019-01-30 15:13:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0206