Bug 1541350
| Summary: | Namespace goes in "terminating" state due to unprovisioned ServiceInstance | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Suhaas Bhat <subhat> |
| Component: | Service Broker | Assignee: | Jay Boyd <jaboyd> |
| Status: | CLOSED ERRATA | QA Contact: | Zihan Tang <zitang> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.7.1 | CC: | andcosta, aos-bugs, asolanas, chezhang, clasohm, dcaldwel, dmoessne, erich, gucore, jaboyd, jmatthew, knakayam, pmorie, rekhan, snalawad, subhat |
| Target Milestone: | --- | | |
| Target Release: | 3.9.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-06-27 18:01:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Suhaas Bhat, 2018-02-02 10:50:21 UTC)
Looks similar to bug 1539308.

As a workaround for this issue, you can fix a resource in this state by editing it and deleting the finalizer token (see the command-line sketch after this comment). For example, edit this:

```yaml
apiVersion: servicecatalog.k8s.io/xxx
kind: ServiceInstance
metadata:
  creationTimestamp: null
  deletionGracePeriodSeconds: 0
  deletionTimestamp: null
  finalizers:
  - kubernetes-incubator/service-catalog
  generateName: xxx
  generation: 2
```

to be like so:

```yaml
apiVersion: servicecatalog.k8s.io/xxx
kind: ServiceInstance
metadata:
  creationTimestamp: null
  deletionGracePeriodSeconds: 0
  deletionTimestamp: null
  generateName: xxx
  generation: 2
```

After this, the resource should be deleted immediately. This should be a viable stopgap measure to unblock people experiencing this issue. We are working on changes to make force deletion work correctly.

Just to level set and make an important concept clear here:

- It is fully expected and correct that when a namespace containing a ServiceInstance (and possibly ServiceBindings) is deleted, the namespace may not be fully deleted immediately; it cannot go away until the catalog resources inside it are deleted.

- When a ServiceBinding is deleted by the user, the ServiceBinding resource is not fully deleted immediately because the catalog has to contact the broker and invoke the Unbind operation there. The finalizer 'kubernetes-incubator/service-catalog' represents that the service catalog still has work to perform for the ServiceBinding resource before it can be fully deleted. At this stage, if the broker is unreachable, or the broker itself has a bug, the ServiceBinding will remain in the catalog until either the unbind operation is retried and completes successfully at the broker (the catalog retries operations that fail), in which case the catalog removes the finalizer and the binding is fully deleted, or the user manually removes the finalizer. When the user manually removes the finalizer, the catalog controller does no more work for that resource, meaning that resources created by the broker may not be cleaned up.

- Similarly, when a ServiceInstance is deleted by the user, the ServiceInstance resource is not fully deleted immediately because the catalog has to contact the broker and invoke the Deprovision operation there. The finalizer 'kubernetes-incubator/service-catalog' represents that the service catalog still has work to perform for the ServiceInstance resource before it can be fully deleted. Note that if a ServiceInstance is deleted while it still has ServiceBindings remaining, the catalog will not contact the broker to deprovision the ServiceInstance until the ServiceBinding resources are fully deleted from the catalog. Additionally, once the catalog contacts the broker to perform the deprovision, if the broker is unreachable or the broker itself has a bug, the ServiceInstance will remain in the catalog until either the deprovision operation completes successfully (the catalog retries operations that fail) or the user manually removes the finalizer. When the user manually removes the finalizer, the catalog controller does no more work for that resource, meaning that resources created by the broker may not be cleaned up.
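A minimal command-line sketch of the workaround above, assuming the `oc` client and a ServiceInstance named `my-instance` in namespace `my-project` (hypothetical names). The JSON merge patch is simply a non-interactive way of applying the same edit shown in the YAML above:

```shell
# Inspect the stuck resource first: a set deletionTimestamp plus the remaining
# finalizer means the catalog still thinks it has broker work to do, and the
# status conditions show whether the deprovision call is still being retried.
oc get serviceinstance my-instance -n my-project -o yaml

# Last resort only: strip the service-catalog finalizer with a merge patch so
# the resource (and then the namespace) can finish deleting. After this the
# catalog does no further cleanup, so broker-side resources may be orphaned.
oc patch serviceinstance my-instance -n my-project --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```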
To be very clear: removing the finalizer should be considered a last resort, used only if a broker is unreachable (i.e., permanently down), the broker has a bug that has left the service catalog resource in a bad state, or the catalog itself has a bug that prevents the resource from being fully deleted after the unbind/deprovision operation has completed successfully.

We should set the expectation with users that this system is eventually consistent and that async operations (for example, an async deprovision) may take time to complete. A user should only manually remove a finalizer if they are 100% certain that the system is in a bad state. We will work to improve our documentation so that this is clear, and work to resolve bugs in the catalog and our brokers that have resulted in customers encountering this situation.

(In reply to Paul Morie from comment #5)
> We should set the expectation with users that this system is eventually
> consistent and that async operations (for example, an async deprovision) may
> take time to complete. A user should only manually remove a finalizer if
> they are 100% certain that the system is in a bad state.

I think we need to focus on this and implement some form of diagnostics (oadm diagnostics) that lets us check for this state simply. At the very least, with tooling like this, we should be able to determine whether we are in a state where removing the finalizer is safe and produces the results we want.

(In reply to Paul Morie from comment #5)
> We will work to improve our documentation so that this is clear, and work to
> resolve bugs in the catalog and our brokers that have resulted in customers
> encountering this situation.

I logged a bug to track the doc improvement: https://bugzilla.redhat.com/show_bug.cgi?id=1548618

We are doing some upstream work to make the ServiceInstance lifecycle more robust and make failures more understandable and fixable by the end user:

* 4xx, 5xx and connection timeout should be retriable (not terminal errors): https://github.com/kubernetes-incubator/service-catalog/pull/1765
* Allow retries for instances with Failed condition after spec changes: https://github.com/kubernetes-incubator/service-catalog/pull/1751
* Add ObservedGeneration and Provisioned into ServiceInstanceStatus: https://github.com/kubernetes-incubator/service-catalog/pull/1748
* Add the ability to sync a service instance: https://github.com/kubernetes-incubator/service-catalog/pull/1762
* Handle instance deletion that occurs during async provisioning or async update: https://github.com/kubernetes-incubator/service-catalog/pull/1708

In short, a lot of effort has been put in recently to fix non-happy-path scenarios. I'm closing this as the core issue has been addressed, and we continue to enhance robustness and serviceability in this area.

According to comment 4 and comment 8, it's OK for QE, so marking it as VERIFIED. The related doc update bug is https://bugzilla.redhat.com/show_bug.cgi?id=1548618

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2013
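On the point above about only removing a finalizer when you are certain the system is stuck, a rough sketch of the manual checks an admin can run today with `oc`; names such as `my-instance`, `my-project`, and `my-broker` are hypothetical, and the exact resource kinds depend on the service catalog API version in use:

```shell
# List instances; anything the user has already deleted but that still shows
# up here is waiting on catalog/broker work (or is stuck).
oc get serviceinstances --all-namespaces

# Look at the status conditions and events of one stuck instance: repeated
# deprovision failures with the same broker error suggest a broker problem.
oc describe serviceinstance my-instance -n my-project

# Confirm the broker itself is registered and reachable before concluding the
# state is unrecoverable.
oc get clusterservicebrokers
oc describe clusterservicebroker my-broker
```

Only if these checks show that the broker is permanently unreachable, or that the operation can never succeed, does removing the finalizer as described in the workaround make sense.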