Bug 1541350
| Summary: | Namespace goes in "terminating" state due to unprovisioned ServiceInstance | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Suhaas Bhat <subhat> |
| Component: | Service Broker | Assignee: | Jay Boyd <jaboyd> |
| Status: | CLOSED ERRATA | QA Contact: | Zihan Tang <zitang> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.7.1 | CC: | andcosta, aos-bugs, asolanas, chezhang, clasohm, dcaldwel, dmoessne, erich, gucore, jaboyd, jmatthew, knakayam, pmorie, rekhan, snalawad, subhat |
| Target Milestone: | --- | | |
| Target Release: | 3.9.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-06-27 18:01:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Suhaas Bhat, 2018-02-02 10:50:21 UTC)
Looks similar to bug 1539308.

As a workaround for this issue, you can fix a resource in this state by editing it and deleting the finalizer token (see the command-line sketch after this comment). For example, edit this:

```yaml
apiVersion: servicecatalog.k8s.io/xxx
kind: ServiceInstance
metadata:
  creationTimestamp: null
  deletionGracePeriodSeconds: 0
  deletionTimestamp: null
  finalizers:
  - kubernetes-incubator/service-catalog
  generateName: xxx
  generation: 2
```

to be like so:

```yaml
apiVersion: servicecatalog.k8s.io/xxx
kind: ServiceInstance
metadata:
  creationTimestamp: null
  deletionGracePeriodSeconds: 0
  deletionTimestamp: null
  generateName: xxx
  generation: 2
```

After this, the resource should be deleted immediately. This should be a viable stopgap measure to unblock people experiencing this issue. We are working on changes to make force deletion work correctly.

Just to level set and make an important concept clear here:

- It is fully expected and correct that when a namespace containing a ServiceInstance (and possibly ServiceBindings) is deleted, the namespace may not be fully deleted immediately; it cannot go away until the catalog resources inside it are deleted.

- When a ServiceBinding is deleted by the user, the ServiceBinding resource is not fully deleted immediately because the catalog has to contact the broker and invoke the Unbind operation there. The finalizer 'kubernetes-incubator/service-catalog' represents that the service catalog still has work to perform for the ServiceBinding resource before it can be fully deleted. At this stage, if the broker is unreachable, or the broker itself has a bug, the ServiceBinding will remain in the catalog until either the unbind operation is retried and completes successfully at the broker (the catalog retries operations that fail), in which case the catalog removes the finalizer and the binding is fully deleted, or the user manually removes the finalizer. When the user manually removes the finalizer, the catalog controller does no more work for that resource, meaning that resources created by the broker may not be cleaned up.

- Similarly, when a ServiceInstance is deleted by the user, the ServiceInstance resource is not fully deleted immediately because the catalog has to contact the broker and invoke the Deprovision operation there. The finalizer 'kubernetes-incubator/service-catalog' represents that the service catalog still has work to perform for the ServiceInstance resource before it can be fully deleted. Note that if a ServiceInstance is deleted while it still has ServiceBindings remaining, the catalog will not contact the broker to deprovision the ServiceInstance until the ServiceBinding resources are fully deleted from the catalog. Additionally, once the catalog contacts the broker to perform the deprovision, if the broker is unreachable or the broker itself has a bug, the ServiceInstance will remain in the catalog until either the deprovision operation completes successfully (the catalog retries operations that fail) or the user manually removes the finalizer. When the user manually removes the finalizer, the catalog controller does no more work for that resource, meaning that resources created by the broker may not be cleaned up.
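A minimal command-line sketch of the workaround above, assuming the `oc` client and a ServiceInstance named `my-instance` in namespace `my-project` (hypothetical names). The JSON merge patch is simply a non-interactive way of applying the same edit shown in the YAML above:

```shell
# Inspect the stuck resource first: a set deletionTimestamp plus the remaining
# finalizer means the catalog still thinks it has broker work to do, and the
# status conditions show whether the deprovision call is still being retried.
oc get serviceinstance my-instance -n my-project -o yaml

# Last resort only: strip the service-catalog finalizer with a merge patch so
# the resource (and then the namespace) can finish deleting. After this the
# catalog does no further cleanup, so broker-side resources may be orphaned.
oc patch serviceinstance my-instance -n my-project --type=merge \
  -p '{"metadata":{"finalizers":null}}'
```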
To be very clear: removing the finalizer should be considered a last resort, used only if a broker is unreachable (i.e., permanently down), the broker has a bug that has left the service catalog resource in a bad state, or the catalog itself has a bug that prevents the resource from being fully deleted after the unbind/deprovision operation has completed successfully.

We should set the expectation with users that this system is eventually consistent and that async operations (for example, an async deprovision) may take time to complete. A user should only manually remove a finalizer if they are 100% certain that the system is in a bad state. We will work to improve our documentation so that this is clear, and work to resolve bugs in the catalog and our brokers that have resulted in customers encountering this situation.

(In reply to Paul Morie from comment #5)
> We should set the expectation with users that this system is eventually
> consistent and that async operations (for example, an async deprovision) may
> take time to complete. A user should only manually remove a finalizer if
> they are 100% certain that the system is in a bad state.

I think we need to focus on this and implement some form of diagnostics (oadm diagnostics) that lets us check for this state simply. At the very least, with tooling like this, we should be able to determine whether we are in a state where removing the finalizer is safe and produces the results we want.

(In reply to Paul Morie from comment #5)
> We will work to improve our documentation so that this is clear, and work to
> resolve bugs in the catalog and our brokers that have resulted in customers
> encountering this situation.

I logged a bug to track the doc improvement: https://bugzilla.redhat.com/show_bug.cgi?id=1548618

We are doing some upstream work to make the ServiceInstance lifecycle more robust and make failures more understandable and fixable by the end user:

* 4xx, 5xx and connection timeout should be retriable (not terminal errors): https://github.com/kubernetes-incubator/service-catalog/pull/1765
* Allow retries for instances with Failed condition after spec changes: https://github.com/kubernetes-incubator/service-catalog/pull/1751
* Add ObservedGeneration and Provisioned into ServiceInstanceStatus: https://github.com/kubernetes-incubator/service-catalog/pull/1748
* Add the ability to sync a service instance: https://github.com/kubernetes-incubator/service-catalog/pull/1762
* Handle instance deletion that occurs during async provisioning or async update: https://github.com/kubernetes-incubator/service-catalog/pull/1708

In short, a lot of effort has been put in recently to fix non-happy-path scenarios. I'm closing this as the core issue has been addressed, and we continue to enhance robustness and serviceability in this area.

According to comment 4 and comment 8, it's OK for QE, so marking it as VERIFIED. The related doc update bug is https://bugzilla.redhat.com/show_bug.cgi?id=1548618

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2013
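On the point above about only removing a finalizer when you are certain the system is stuck, a rough sketch of the manual checks an admin can run today with `oc`; names such as `my-instance`, `my-project`, and `my-broker` are hypothetical, and the exact resource kinds depend on the service catalog API version in use:

```shell
# List instances; anything the user has already deleted but that still shows
# up here is waiting on catalog/broker work (or is stuck).
oc get serviceinstances --all-namespaces

# Look at the status conditions and events of one stuck instance: repeated
# deprovision failures with the same broker error suggest a broker problem.
oc describe serviceinstance my-instance -n my-project

# Confirm the broker itself is registered and reachable before concluding the
# state is unrecoverable.
oc get clusterservicebrokers
oc describe clusterservicebroker my-broker
```

Only if these checks show that the broker is permanently unreachable, or that the operation can never succeed, does removing the finalizer as described in the workaround make sense.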