Description of problem:
Whenever provisioning of a ServiceInstance is aborted or fails, deleting the project leaves the project stuck in the "Terminating" state, and the ServiceInstance cannot be deleted either. A bug for 3.9 is already filed: https://bugzilla.redhat.com/show_bug.cgi?id=1476173

Version-Release number of selected component (if applicable):
oc v3.7.14
kubernetes v1.7.6+a08f5eeb62

How reproducible: Always

[root@master1 ~]# oc export project xxx
apiVersion: v1
kind: Project
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: xxx
    openshift.io/sa.scc.mcs: s0:c11,c10
    openshift.io/sa.scc.supplemental-groups: 1000130000/10000
    openshift.io/sa.scc.uid-range: 1000130000/10000
  creationTimestamp: null
  name: xxx
spec:
  finalizers:
  - kubernetes
status:
  phase: Terminating

[root@master1 ~]# oc export -n xxx ServiceInstance xxx-8j757
apiVersion: servicecatalog.k8s.io/xxx
kind: ServiceInstance
metadata:
  creationTimestamp: null
  deletionGracePeriodSeconds: 0
  deletionTimestamp: null
  finalizers:
  - kubernetes-incubator/service-catalog
  generateName: xxx
  generation: 2
  name: ""
  namespace: ""
  resourceVersion: ""
  selfLink: ""
  uid: ""
spec:
  clusterServiceClassExternalName: xxx
  clusterServiceClassRef:
    name: a54f9162-f61e-11e7-a6d7-0050569f6bb4
  clusterServicePlanExternalName: default
  clusterServicePlanRef:
    name: a54f9162-f61e-11e7-a6d7-0050569f6bb4
  externalID: 7053ccc2-3afa-4095-9a44-cb1bc50fb6ed
  parametersFrom:
  - secretKeyRef:
      key: parameters
      name: xxx-parametersqvhz8
  updateRequests: 0
  userInfo:
    groups:
    - system:serviceaccounts
    - system:serviceaccounts:kube-system
    - system:authenticated
    uid: ""
    username: system:serviceaccount:kube-system:namespace-controller
status:
  asyncOpInProgress: true
  conditions:
  - lastTransitionTime: 2018-01-23T13:13:53Z
    message: The instance is being provisioned asynchronously
    reason: Provisioning
    status: "False"
    type: Ready
  currentOperation: Provision
  deprovisionStatus: Required
  inProgressProperties:
    clusterServicePlanExternalID: a54f9162-f61e-11e7-a6d7-0050569f6bb4
    clusterServicePlanExternalName: default
    parameterChecksum: af58cded7f0dd7c2152e54a2e0944fd0a89932258b1a30c16231b070575fea10
    parameters:
      DATABASE_ENGINE: <redacted>
      DATABASE_NAME: <redacted>
      DATABASE_SERVICE_NAME: <redacted>
      DATABASE_USER: <redacted>
      MEMORY_LIMIT: <redacted>
      MEMORY_MYSQL_LIMIT: <redacted>
      NAME: <redacted>
      NAMESPACE: <redacted>
      OPCACHE_REVALIDATE_FREQ: <redacted>
      SOURCE_REPOSITORY_URL: <redacted>
    userInfo:
      extra:
        scopes.authorization.openshift.io:
        - user:full
      groups:
      - basefarm
      - ziggo
      - system:authenticated:oauth
      - system:authenticated
      uid: ""
      username: xxx
  lastOperation: provisioning    <----- stuck/aborted
  operationStartTime: 2018-01-23T13:13:53Z
  orphanMitigationInProgress: false
  reconciledGeneration: 0

[root@master1 ~]# oc delete -n xxx ServiceInstance xxx-8j757 --force --grace-period=0
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
serviceinstance "xxx-8j757" deleted

[root@spc-ocps-master1 ~]# oc get -n xxx ServiceInstance
NAME          AGE
xxx-8j757     1d

Actual results:
1. The bindings and instance still exist after the project is deleted.
2. The project remains in the Terminating state.

Expected results:
1. The bindings and instance should be removed after the project is deleted.

Additional info:
https://github.com/openshift/origin/issues/18125
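For anyone hitting the same symptom, a minimal check (a sketch reusing the "xxx" project and "xxx-8j757" instance names from this report) can confirm that a leftover ServiceInstance is what is holding the project in Terminating:

# Show the project's phase; it stays Terminating while dependent resources remain.
oc get project xxx -o jsonpath='{.status.phase}{"\n"}'

# List the catalog resources still present in the namespace and the instance's finalizers.
oc get -n xxx serviceinstances,servicebindings
oc get -n xxx serviceinstance xxx-8j757 -o jsonpath='{.metadata.finalizers}{"\n"}'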
Looks similar to bug 1539308.
As a workaround for this issue, you can fix a resource in this state by editing it and deleting the finalizer token. For example, edit this:

apiVersion: servicecatalog.k8s.io/xxx
kind: ServiceInstance
metadata:
  creationTimestamp: null
  deletionGracePeriodSeconds: 0
  deletionTimestamp: null
  finalizers:
  - kubernetes-incubator/service-catalog
  generateName: xxx
  generation: 2

to be like so:

apiVersion: servicecatalog.k8s.io/xxx
kind: ServiceInstance
metadata:
  creationTimestamp: null
  deletionGracePeriodSeconds: 0
  deletionTimestamp: null
  generateName: xxx
  generation: 2

After this, the resource should be deleted immediately. This should be a viable stopgap measure to unblock people experiencing this issue. We are working on changes to make force deletion work correctly.
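If an interactive edit is inconvenient, a non-interactive patch along these lines should have the same effect. This is a sketch only, reusing the namespace and instance name from this report, and it carries the same caveat that broker-side resources may be left behind:

# Interactive: opens the resource in an editor; delete the 'finalizers' entries under metadata.
oc edit -n xxx serviceinstance xxx-8j757

# Non-interactive equivalent: clear the finalizers list with a merge patch.
oc patch -n xxx serviceinstance xxx-8j757 --type merge -p '{"metadata":{"finalizers":null}}'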
Just to level set and make an important concept clear here:

- It is fully expected and correct that when a namespace with a ServiceInstance (and possibly service bindings) is deleted, the namespace may not be fully deleted until the catalog resources are deleted.

- When a ServiceBinding is deleted by the user, the ServiceBinding resource is not fully deleted immediately, because the catalog has to contact the broker and invoke the Unbind operation at the broker. The finalizer 'kubernetes-incubator/service-catalog' represents that the service catalog still has work to perform for the ServiceBinding resource before that resource can be fully deleted. At this stage, if the broker is unreachable, or the broker itself has a bug, the ServiceBinding will remain in the catalog until either the unbind operation is retried and completes successfully at the broker (the catalog will retry operations that fail), in which case the catalog will remove the finalizer and the binding will be fully deleted, OR the user manually removes the finalizer. When the user manually removes the finalizer, the catalog controller will do no more work for that resource, meaning that resources created by the broker may not be cleaned up.

- Similarly, when a ServiceInstance is deleted by the user, the ServiceInstance resource is not fully deleted immediately, because the catalog has to contact the broker and invoke the Deprovision operation at the broker. The finalizer 'kubernetes-incubator/service-catalog' represents that the service catalog still has work to perform for the ServiceInstance resource before that resource can be fully deleted. Note that if a ServiceInstance is deleted and still has ServiceBindings remaining, the catalog will not contact the broker to deprovision the ServiceInstance until the ServiceBinding resources are fully deleted from the catalog. Additionally, once the catalog contacts the broker to perform the deprovision, if the broker is unreachable or the broker itself has a bug, the ServiceInstance will remain in the catalog until either the deprovision operation completes successfully (the catalog will retry operations that fail) OR the user manually removes the finalizer. When the user manually removes the finalizer, the catalog controller will do no more work for that resource, meaning that resources created by the broker may not be cleaned up.

To be very clear: removing the finalizer should be considered a last resort, used only if a broker is unreachable (i.e., permanently down), the broker has a bug that has resulted in the service catalog resource entering a bad state, or the catalog itself has a bug that prevents the resource from being fully deleted after the unbind/deprovision operation has completed successfully.

We should set the expectation with users that this system is eventually consistent and that async operations (for example, an async deprovision) may take time to complete. A user should only manually remove a finalizer if they are 100% certain that the system is in a bad state. We will work to improve our documentation so that this is clear, and work to resolve bugs in the catalog and our brokers that have resulted in customers encountering this situation.
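Before removing a finalizer, it is worth confirming that the catalog really is stuck rather than still working. A minimal sketch, again assuming the namespace and instance names from this report:

# The conditions, currentOperation and lastOperation show whether an unbind/deprovision
# is still in flight, being retried, or has failed permanently.
oc describe -n xxx serviceinstance xxx-8j757

# Deprovision will not even be attempted while bindings remain, so check for those too.
oc get -n xxx servicebindings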
(In reply to Paul Morie from comment #5)
> We should set the expectation with users that this system is eventually
> consistent and that async operations (for example, an async deprovision) may
> take time to complete. A user should only manually remove a finalizer if
> they are 100% certain that the system is in a bad state.

I think we need to focus on this and implement some form of diagnostics (oadm diagnostics) that allows us to check for this state simply. At the very least, with tooling like this we should be able to determine whether we are in a state where removing the finalizer is safe and produces the results we want.
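A rough, hypothetical sketch of the kind of check such a diagnostic could perform (this is not an existing oadm command): walk the Terminating projects and report the catalog resources that are still blocking each one.

# Hypothetical helper script; relies only on standard 'oc get' output.
for ns in $(oc get projects --no-headers | awk '$NF=="Terminating" {print $1}'); do
  echo "Project $ns is Terminating; remaining catalog resources:"
  oc get -n "$ns" serviceinstances,servicebindings 2>/dev/null
done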
(In reply to Paul Morie from comment #5)
> We will work to improve our documentation so that this is clear, and work to
> resolve bugs in the catalog and our brokers that have resulted in customers
> encountering this situation.

I logged a bug to track the doc improvement: https://bugzilla.redhat.com/show_bug.cgi?id=1548618
We are doing some upstream work to make the ServiceInstance lifecycle more robust and to make failures more understandable and fixable by the end user:

* 4xx, 5xx and connection timeouts should be retriable (not terminal errors): https://github.com/kubernetes-incubator/service-catalog/pull/1765
* Allow retries for instances with Failed condition after spec changes: https://github.com/kubernetes-incubator/service-catalog/pull/1751
* Add ObservedGeneration and Provisioned into ServiceInstanceStatus: https://github.com/kubernetes-incubator/service-catalog/pull/1748
* Adding the ability to sync a service instance: https://github.com/kubernetes-incubator/service-catalog/pull/1762
* Handle instance deletion that occurs during async provisioning or async update: https://github.com/kubernetes-incubator/service-catalog/pull/1708

In other words, a lot of effort has gone in recently to fix non-happy-path scenarios. I'm closing this, as the core issue has been addressed, and we continue to enhance robustness in this area and address serviceability.
According to comment 4 and comment 8, it's OK for QE, so marking it as VERIFIED. The related doc update bug is https://bugzilla.redhat.com/show_bug.cgi?id=1548618
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2013