Bug 1541524 - Service Broker does not retry manifest fetch if failed and removedFromBrokerCatalog: true
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Service Broker
Version: 3.7.0
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.9.0
Assignee: Jesus M. Rodriguez
QA Contact: Zhang Cheng
URL:
Whiteboard:
Depends On: 1548122
Blocks:
Reported: 2018-02-02 19:15 UTC by Matthew Robson
Modified: 2018-02-22 19:35 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-02-22 19:35:28 UTC


Description Matthew Robson 2018-02-02 19:15:27 UTC
Description of problem:

Installed the service broker; originally all of the example APBs existed. After a few days, fetch errors started appearing, and now only MySQL, MariaDB and Postgres show up in the console.

[root@server ~]# oc logs asb-1-s6cgx | egrep -v 'INFO|NOTICE|GET'         

[2018-01-23T19:43:33.983Z] [ERROR] unable to retrieve image names for registry rh - Get https://registry.access.redhat.com/v1/search?q="*-apb":  read tcp 10.131.23.207:43756->209.132.182.63:443: read: connection reset by peer
[2018-01-23T19:43:33.983Z] [WARNING] registry: 0x1ce93f0 was unable to complete bootstrap - Get https://registry.access.redhat.com/v1/search?q="*-apb":  read tcp 10.131.23.207:43756->209.132.182.63:443: read: connection reset by peer

And

[2018-01-31T16:13:42.987Z] [ERROR] unable to fetch specs for registry rh - Get https://registry.access.redhat.com/v2/openshift3/mediawiki-apb/manifests/v3.7: read tcp 10.131.23.207:47606->209.132.182.63:443: read: connection reset by peer

[2018-01-31T16:13:42.987Z] [WARNING] registry: 0x1ce93f0 was unable to complete bootstrap - Get https://registry.access.redhat.com/v2/openshift3/mediawiki-apb/manifests/v3.7: read tcp 10.131.23.207:47606->209.132.182.63:443: read: connection reset by peer

[root@ociopf-t-301 ~]# oc logs asb-1-tvwpd | egrep 'WARN|ERROR'
[2018-02-01T12:32:06.67Z] [ERROR] unable to retrieve image names for registry rh - invalid character 'S' looking for beginning of value
[2018-02-01T12:32:06.67Z] [WARNING] registry: 0x1ce93f0 was unable to complete bootstrap - invalid character 'S' looking for beginning of value

Objects:

[root@server ~]# oc get clusterserviceclasses --all-namespaces -o custom-columns=NAME:.metadata.name,DISPLAYNAME:spec.externalMetadata.displayName | grep APB
03b69500305d9859bb9440d9f9023784       Mediawiki (APB)
2c259ddd8059b9bc65081e07bf20058f       MariaDB (APB)
73ead67495322cc462794387fa9884f5       MySQL (APB)
d5915e05b253df421efe6e41fb6a66ba       PostgreSQL (APB)

Shows MediaWiki as removed (unsure why): removedFromBrokerCatalog: true

[root@server ~]# oc get clusterserviceclasses 03b69500305d9859bb9440d9f9023784 -o yaml
apiVersion: servicecatalog.k8s.io/v1beta1
kind: ClusterServiceClass
metadata:
  creationTimestamp: 2018-01-13T04:07:19Z
  name: 03b69500305d9859bb9440d9f9023784
  resourceVersion: "159618080"
  selfLink: /apis/servicecatalog.k8s.io/v1beta1/clusterserviceclasses/03b69500305d9859bb9440d9f9023784
  uid: 3ffe8bf6-f817-11e7-a25c-0a580a831206
spec:
  bindable: false
  clusterServiceBrokerName: ansible-service-broker
  description: Mediawiki123 apb implementation
  externalID: 03b69500305d9859bb9440d9f9023784
  externalMetadata:
    console.openshift.io/iconClass: icon-mediawiki
    dependencies:
    - registry.access.redhat.com/openshift3/mediawiki-123:latest
    displayName: Mediawiki (APB)
    documentationUrl: https://www.mediawiki.org/wiki/Documentation
    longDescription: An apb that deploys Mediawiki 1.23
    providerDisplayName: Red Hat, Inc.
  externalName: rh-mediawiki-apb
  planUpdatable: false
status:
  removedFromBrokerCatalog: true

Hitting the URL from the node and from inside the pod works fine, so the failure may have been transient, but the missing APBs do not come back.

Deleting and re-spawning the pod did not bring them back.


Version-Release number of selected component (if applicable):

3.7.14-1.git.0.593a50e


How reproducible:

Random

Steps to Reproduce:
1.
2.
3.

Actual results:

Missing APB examples.

Expected results:

Even if there is a transient error, the broker should retry or recover.
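
To make the expected behaviour concrete, here is a minimal sketch of a fetch wrapped in retries with exponential backoff. It is an illustration only, not the broker's code (the broker is not written in Python): `fetch_with_retry`, its parameters, and the injectable `sleep` are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts run out.

    The wait doubles after every failure (base_delay, 2x, 4x, ...).
    All names here are hypothetical; this only sketches the idea.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as err:
            # ConnectionResetError ("connection reset by peer") is an OSError.
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    assert last_error is not None
    raise last_error
```

With this shape, a connection reset on one attempt costs a short delay instead of leaving the registry bootstrap incomplete until the next full refresh.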

Additional info:

Comment 1 Jesus M. Rodriguez 2018-02-06 15:07:55 UTC
I was able to look at the sosreport, but its /var/log/messages only covers February 2nd, while the logs with the errors are dated January 31st. Is there any possibility of getting /var/log/messages from January 31st?

Comment 3 Jesus M. Rodriguez 2018-02-22 18:53:42 UTC
After much debugging, I found that the problem is not with the Automation (Ansible) Broker but with how the Service Catalog resyncs the ServiceClasses. As seen in the original description, if a ServiceClass was provisioned and is not seen by the Service Catalog during a resync, it is marked as removedFromBrokerCatalog. When the Broker recovers and offers the services again, the Service Catalog SHOULD flip the flag back, but it does not.

I will open a bug against the Service Catalog and make this bug depend on that new one.
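
The resync behaviour described above can be sketched as follows. This is a simplified model of what the resync is expected to do, not the Service Catalog's implementation: the `ServiceClass` dataclass, the `resync` function, and the dictionary keyed by external name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class ServiceClass:
    """Hypothetical stand-in for a ClusterServiceClass."""
    external_name: str
    removed_from_broker_catalog: bool = False


def resync(known: Dict[str, ServiceClass], offered: Iterable[str]) -> None:
    """Reconcile known classes with the names the broker currently offers.

    A class missing from the broker's list is flagged as removed; a class
    that reappears has the flag cleared again, which is the step reported
    as not happening.
    """
    offered_names = set(offered)
    for name, service_class in known.items():
        service_class.removed_from_broker_catalog = name not in offered_names
    # Classes offered for the first time are added unflagged.
    for name in offered_names:
        if name not in known:
            known[name] = ServiceClass(external_name=name)
```

Running `resync` once without `rh-mediawiki-apb` in the offered list and once with it should leave the flag cleared at the end, which is the recovery this bug reports as missing.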

