Bug 1541524

Summary: Service Broker does not retry manifest fetch if failed and removedFromBrokerCatalog: true
Product: OpenShift Container Platform
Reporter: Matthew Robson <mrobson>
Component: Service Broker
Assignee: Jesus M. Rodriguez <jesusr>
Status: CLOSED NOTABUG
QA Contact: Zhang Cheng <chezhang>
Severity: high
Priority: unspecified
Version: 3.7.0
CC: aos-bugs, jesusr, mrobson
Target Milestone: ---   
Target Release: 3.9.0   
Hardware: All   
OS: Linux   
Last Closed: 2018-02-22 19:35:28 UTC
Type: Bug
Bug Depends On: 1548122    
Bug Blocks:    

Description Matthew Robson 2018-02-02 19:15:27 UTC
Description of problem:

Installed the service broker; originally all of the example APBs existed. After a few days, we started seeing fetch errors, and now only MySQL, MariaDB and Postgres show up in the console.

[root@server ~]# oc logs asb-1-s6cgx | egrep -v 'INFO|NOTICE|GET'         

[2018-01-23T19:43:33.983Z] [ERROR] unable to retrieve image names for registry rh - Get https://registry.access.redhat.com/v1/search?q="*-apb":  read tcp 10.131.23.207:43756->209.132.182.63:443: read: connection reset by peer
[2018-01-23T19:43:33.983Z] [WARNING] registry: 0x1ce93f0 was unable to complete bootstrap - Get https://registry.access.redhat.com/v1/search?q="*-apb":  read tcp 10.131.23.207:43756->209.132.182.63:443: read: connection reset by peer

And

[2018-01-31T16:13:42.987Z] [ERROR] unable to fetch specs for registry rh - Get https://registry.access.redhat.com/v2/openshift3/mediawiki-apb/manifests/v3.7: read tcp 10.131.23.207:47606->209.132.182.63:443: read: connection reset by peer

[2018-01-31T16:13:42.987Z] [WARNING] registry: 0x1ce93f0 was unable to complete bootstrap - Get https://registry.access.redhat.com/v2/openshift3/mediawiki-apb/manifests/v3.7: read tcp 10.131.23.207:47606->209.132.182.63:443: read: connection reset by peer

[root@ociopf-t-301 ~]# oc logs asb-1-tvwpd | egrep 'WARN|ERROR'
[2018-02-01T12:32:06.67Z] [ERROR] unable to retrieve image names for registry rh - invalid character 'S' looking for beginning of value
[2018-02-01T12:32:06.67Z] [WARNING] registry: 0x1ce93f0 was unable to complete bootstrap - invalid character 'S' looking for beginning of value

Objects:

[root@server ~]# oc get clusterserviceclasses --all-namespaces -o custom-columns=NAME:.metadata.name,DISPLAYNAME:spec.externalMetadata.displayName | grep APB
03b69500305d9859bb9440d9f9023784       Mediawiki (APB)
2c259ddd8059b9bc65081e07bf20058f       MariaDB (APB)
73ead67495322cc462794387fa9884f5       MySQL (APB)
d5915e05b253df421efe6e41fb6a66ba       PostgreSQL (APB)

Shows MediaWiki as removed (unsure why): removedFromBrokerCatalog: true

[root@server ~]# oc get clusterserviceclasses 03b69500305d9859bb9440d9f9023784 -o yaml
apiVersion: servicecatalog.k8s.io/v1beta1
kind: ClusterServiceClass
metadata:
  creationTimestamp: 2018-01-13T04:07:19Z
  name: 03b69500305d9859bb9440d9f9023784
  resourceVersion: "159618080"
  selfLink: /apis/servicecatalog.k8s.io/v1beta1/clusterserviceclasses/03b69500305d9859bb9440d9f9023784
  uid: 3ffe8bf6-f817-11e7-a25c-0a580a831206
spec:
  bindable: false
  clusterServiceBrokerName: ansible-service-broker
  description: Mediawiki123 apb implementation
  externalID: 03b69500305d9859bb9440d9f9023784
  externalMetadata:
    console.openshift.io/iconClass: icon-mediawiki
    dependencies:
    - registry.access.redhat.com/openshift3/mediawiki-123:latest
    displayName: Mediawiki (APB)
    documentationUrl: https://www.mediawiki.org/wiki/Documentation
    longDescription: An apb that deploys Mediawiki 1.23
    providerDisplayName: Red Hat, Inc.
  externalName: rh-mediawiki-apb
  planUpdatable: false
status:
  removedFromBrokerCatalog: true
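
To check every class at once rather than one object at a time, the same status field can be read from the JSON form of the listing. This is a minimal sketch, not part of the original report: removed_classes is a hypothetical helper, and it assumes the input is the List document that `oc get clusterserviceclasses -o json` prints, with the objects under "items".

```python
import json
from typing import List


def removed_classes(oc_json: str) -> List[str]:
    """Return the externalName of every class flagged as removed.

    Hypothetical helper: expects the List document printed by
    `oc get clusterserviceclasses -o json` (objects under "items").
    """
    doc = json.loads(oc_json)
    removed = []
    for item in doc.get("items", []):
        # status.removedFromBrokerCatalog may be absent; treat that as False.
        if item.get("status", {}).get("removedFromBrokerCatalog", False):
            removed.append(item["spec"]["externalName"])
    return removed


if __name__ == "__main__":
    import sys

    # Usage: oc get clusterserviceclasses -o json | python removed_classes.py
    for name in removed_classes(sys.stdin.read()):
        print(name)
```

For the object shown above, this would report rh-mediawiki-apb.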

Hitting the URL from the node and from inside the pod works fine, so the failure may have been transient, but the missing APBs do not come back.

Deleting and re-spawning the pod had no impact on bringing the APBs back.


Version-Release number of selected component (if applicable):

3.7.14-1.git.0.593a50e


How reproducible:

Random

Steps to Reproduce:
1.
2.
3.

Actual results:

Missing APB examples.

Expected results:

Even if there is a transient error, the broker should retry or recover.
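
The retry behaviour asked for here could look roughly like the sketch below. It is only an illustration of the expectation, not the broker's code: fetch_with_retry and its parameters are hypothetical names, and the choice to retry only connection-level errors with a doubling delay is an assumption.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper: only connection-level errors (for example
    "connection reset by peer") are retried, and the delay doubles
    after each failure.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the helper testable without real waiting; a caller would wrap the manifest request in a zero-argument function and pass that as fetch.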

Additional info:

Comment 1 Jesus M. Rodriguez 2018-02-06 15:07:55 UTC
I was able to look at the sosreport, but the /var/log/messages in it only covers February 2nd. The logs with the errors are dated January 31st. Is there any possibility of getting the /var/log/messages from January 31st?

Comment 3 Jesus M. Rodriguez 2018-02-22 18:53:42 UTC
After much debugging, I found that the problem is not with the Automation (Ansible) Broker but with how the Service Catalog resyncs the ServiceClasses. As seen in the original comment, if a ServiceClass was provisioned and is then not seen by the Service Catalog during a resync, it is marked removedFromBrokerCatalog: true. When the Broker recovers and offers the services again, the Service Catalog SHOULD flip the flag back, but it does not.

I will open a bug against the Service Catalog and make this bug depend on that new one.
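
The resync behaviour described in this comment can be modelled in a few lines. This is a simplified sketch of the expected logic only, not the Service Catalog's implementation: ServiceClass and resync are hypothetical names, and the broker catalog is reduced to a set of class names.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass
class ServiceClass:
    name: str
    removed_from_broker_catalog: bool = False


def resync(known: Dict[str, ServiceClass], broker_catalog: Set[str]) -> None:
    """Reconcile known classes against the names the broker returned."""
    for name, cls in known.items():
        # Absent from the catalog: flag the class as removed.
        # Present again: clear the flag. This second half is the step
        # the comment above reports as not happening.
        cls.removed_from_broker_catalog = name not in broker_catalog
    # Classes the broker offers that were not known yet are added.
    for name in broker_catalog - set(known):
        known[name] = ServiceClass(name)
```

Under this model, a class that drops out of one resync and reappears in the next ends with the flag cleared.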