Bug 1577076
Summary: Should successfully deprovision an instance that returned an orphaned resource

Product: OpenShift Container Platform
Component: Service Catalog
Reporter: Jian Zhang <jiazha>
Assignee: Jay Boyd <jaboyd>
QA Contact: Jian Zhang <jiazha>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Version: 3.10.0
Target Milestone: ---
Target Release: 3.11.0
CC: chezhang, jaboyd, pmorie, zitang
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Doc Text:
Cause: If Service Catalog hit a non-terminal error during provisioning (for example, if the request to the broker timed out or the broker returned a 500-series error), it would immediately retry provisioning, potentially first sending a request to delete the instance to ensure it was not left in an invalid state. There was no pause between these operations, and the cycle repeated until the broker successfully provisioned the instance or the admin deleted the OpenShift service instance.
Consequence: Service Catalog would continuously retry provisioning with no backoff. This could cause high CPU usage on both the Service Catalog controller pod and the broker.
Fix: Service Catalog now applies an exponential backoff between retries, with a maximum delay of 20 minutes before retrying.
Story Points: ---
Last Closed: 2018-12-21 15:16:43 UTC
Type: Bug
Regression: ---
Description (Jian Zhang, 2018-05-11 07:23:07 UTC)
Reported upstream as https://github.com/kubernetes-incubator/service-catalog/issues/2025.

The issue is that the Catalog is in a tight loop: it attempts to provision the instance, hits a failure error from the broker, deprovisions the instance, and then immediately attempts to provision the instance again. Issue 2025 asks for an exponential backoff and retry, which I think is the correct behavior.

Jay, yes, I agree with you. The correct behavior should be: `(start provisioning -> fail, requires orphan mitigation -> perform orphan mitigation -> exponential delay) x maxRetries`.

This is now in place for service instance provisioning and updates - only for service instances. This is a temporary workaround; there are still issues with properly backing off and retrying for other operations, see https://github.com/kubernetes-incubator/service-catalog/issues/2076 for further details. Fixed by https://github.com/kubernetes-incubator/service-catalog/pull/2065. For verification, you need a build of Service Catalog with v0.1.29 or greater. This was merged into OpenShift on August 20.

Regarding comment 3: this is fixed in 3.11, not yet backported to 3.10.z.

Jay, I tested with the latest version (v0.1.29) on OCP 3.11 but still got the same errors:

[root@sharer4-master-etcd-1 ~]# oc get serviceinstance -n jian
...
Events:
  Type     Reason                            Age              From                                Message
  ----     ------                            ----             ----                                -------
  Warning  RetryBackoff                      2m               service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-08-27 07:20:33.919087076 +0000 UTC m=+15392.451927528
  Warning  RetryBackoff                      2m (x2 over 2m)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-08-27 07:20:36.281338712 +0000 UTC m=+15394.814179159
  Warning  RetryBackoff                      2m (x2 over 2m)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-08-27 07:20:40.364137474 +0000 UTC m=+15398.896977916
  Warning  RetryBackoff                      2m (x2 over 2m)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-08-27 07:20:48.565192547 +0000 UTC m=+15407.098032980
  Warning  StartingInstanceOrphanMitigation  2m (x5 over 2m)  service-catalog-controller-manager  The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
  Normal   OrphanMitigationSuccessful        2m (x5 over 2m)  service-catalog-controller-manager  Orphan mitigation was completed successfully
  Warning  RetryBackoff                      2m (x2 over 2m)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-08-27 07:21:04.759359755 +0000 UTC m=+15423.292200224
  Warning  ProvisionCallFailed               1m (x6 over 2m)  service-catalog-controller-manager  Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 205; ErrorMessage: <nil>; Description: <nil>; ResponseError: <nil>

[root@sharer4-master-etcd-1 ~]# oc get pods -n jian
No resources found.
[root@sharer4-master-etcd-1 ~]# oc logs -f ups-broker-6b69ddfbb8-fwxd5
I0827 06:42:08.089938 1 server.go:95] Starting server on %!d(string=:8080)
I0827 06:44:16.780775 1 server.go:112] Get Service Broker Catalog...
I0827 06:44:16.781254 1 controller.go:57] Catalog()
...
I0827 07:20:32.799704 1 server.go:138] CreateServiceInstance ac9cfc99-a9c9-11e8-a183-0a580a020004...
W0827 07:20:32.801562 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 205
I0827 07:20:32.803770 1 controller.go:81] CreateServiceInstance()
I0827 07:20:32.803979 1 controller.go:112] Created User Provided Service Instance: &{ac9cfc99-a9c9-11e8-a183-0a580a020004 0xc42000e5d8}
I0827 07:20:32.932481 1 server.go:177] RemoveServiceInstance ac9cfc99-a9c9-11e8-a183-0a580a020004...
W0827 07:20:32.945228 1 server.go:180] (ignore)removeServiceInstance operation, the fake status code is: 200
I0827 07:20:32.945804 1 controller.go:132] RemoveServiceInstance()
I0827 07:20:34.238594 1 server.go:138] CreateServiceInstance ac9cfc99-a9c9-11e8-a183-0a580a020004...
W0827 07:20:34.239318 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 205
I0827 07:20:34.239531 1 controller.go:81] CreateServiceInstance()
I0827 07:20:34.239670 1 controller.go:112] Created User Provided Service Instance: &{ac9cfc99-a9c9-11e8-a183-0a580a020004 0xc420094488}
I0827 07:20:34.315865 1 server.go:177] RemoveServiceInstance ac9cfc99-a9c9-11e8-a183-0a580a020004...
W0827 07:20:34.316126 1 server.go:180] (ignore)removeServiceInstance operation, the fake status code is: 200
I0827 07:20:34.316160 1 controller.go:132] RemoveServiceInstance()
I0827 07:20:36.321635 1 server.go:138] CreateServiceInstance ac9cfc99-a9c9-11e8-a183-0a580a020004...
W0827 07:20:36.322405 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 205
I0827 07:20:36.322638 1 controller.go:81] CreateServiceInstance()
...

Also, this instance always stays in "Pending" status and could NOT be deprovisioned successfully.

From the excerpt above it looks like the backoff is working properly. It's doing a retry after an exponentially increasing backoff.
It's a 20-minute maximum backoff, and it will keep retrying indefinitely, once every 20 minutes. You said it can not deprovision - can you please provide Service Catalog controller logs and the full output from the describe serviceinstance after you attempt to delete it? Thanks!

Jay, sorry for the late reply. The serviceinstance can be deleted successfully by manually running the Deprovision command, and the output from the describe serviceinstance looks good:

  Warning  RetryBackoff                      1m (x2 over 1m)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 06:50:27.005226809 +0000 UTC m=+17950.124172904
  Warning  RetryBackoff                      1m (x2 over 1m)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 06:50:29.171831166 +0000 UTC m=+17952.290777279
  Warning  RetryBackoff                      1m (x2 over 1m)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 06:50:33.373362083 +0000 UTC m=+17956.492308194
  Warning  RetryBackoff                      1m (x2 over 1m)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 06:50:41.610866126 +0000 UTC m=+17964.729812273
  Warning  StartingInstanceOrphanMitigation  1m (x5 over 1m)  service-catalog-controller-manager  The instance provision call failed with an ambiguous error; attempting to deprovision the instance in order to mitigate an orphaned resource
  Normal   OrphanMitigationSuccessful        1m (x5 over 1m)  service-catalog-controller-manager  Orphan mitigation was completed successfully
  Warning  ProvisionCallFailed               1m (x5 over 1m)  service-catalog-controller-manager  Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 205; ErrorMessage: <nil>; Description: <nil>; ResponseError: <nil>
  Warning  RetryBackoff                      1m (x2 over 1m)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 06:50:57.82222646 +0000 UTC m=+17980.941172566

But the output from the broker pod keeps showing "RemoveServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b..." - is this expected?

[root@qe-jiazha-master-etcd-1 ~]# oc logs -f ups-broker-6b69ddfbb8-2cwg4 -n kube-service-catalog
I0926 06:47:30.850742 1 server.go:95] Starting server on %!d(string=:8080)
I0926 06:48:06.332782 1 server.go:112] Get Service Broker Catalog...
I0926 06:48:06.332811 1 controller.go:57] Catalog()
I0926 06:50:25.988339 1 server.go:138] CreateServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:25.988548 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 205
I0926 06:50:25.988790 1 controller.go:81] CreateServiceInstance()
I0926 06:50:25.988804 1 controller.go:112] Created User Provided Service Instance: &{72629cc8-c158-11e8-9145-0a580a80000b 0xc42000e1a8}
I0926 06:50:26.148178 1 server.go:177] RemoveServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:26.148204 1 server.go:180] (ignore)removeServiceInstance operation, the fake status code is: 200
I0926 06:50:26.148216 1 controller.go:132] RemoveServiceInstance()
I0926 06:50:27.152162 1 server.go:138] CreateServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:27.152298 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 205
I0926 06:50:27.152317 1 controller.go:81] CreateServiceInstance()
I0926 06:50:27.152328 1 controller.go:112] Created User Provided Service Instance: &{72629cc8-c158-11e8-9145-0a580a80000b 0xc4200fe280}
I0926 06:50:27.347895 1 server.go:177] RemoveServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:27.347919 1 server.go:180] (ignore)removeServiceInstance operation, the fake status code is: 200
I0926 06:50:27.347931 1 controller.go:132] RemoveServiceInstance()
I0926 06:50:29.350544 1 server.go:138] CreateServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:29.350680 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 205
I0926 06:50:29.350725 1 controller.go:81] CreateServiceInstance()
I0926 06:50:29.350740 1 controller.go:112] Created User Provided Service Instance: &{72629cc8-c158-11e8-9145-0a580a80000b 0xc42000e228}
I0926 06:50:29.548142 1 server.go:177] RemoveServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:29.548170 1 server.go:180] (ignore)removeServiceInstance operation, the fake status code is: 200
I0926 06:50:29.548182 1 controller.go:132] RemoveServiceInstance()
I0926 06:50:33.577449 1 server.go:138] CreateServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:33.577658 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 205
I0926 06:50:33.577700 1 controller.go:81] CreateServiceInstance()
I0926 06:50:33.577711 1 controller.go:112] Created User Provided Service Instance: &{72629cc8-c158-11e8-9145-0a580a80000b 0xc42000e238}
I0926 06:50:33.749339 1 server.go:177] RemoveServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:33.749366 1 server.go:180] (ignore)removeServiceInstance operation, the fake status code is: 200
I0926 06:50:33.749378 1 controller.go:132] RemoveServiceInstance()
I0926 06:50:41.785505 1 server.go:138] CreateServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:41.785749 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 205
I0926 06:50:41.785784 1 controller.go:81] CreateServiceInstance()
I0926 06:50:41.785796 1 controller.go:112] Created User Provided Service Instance: &{72629cc8-c158-11e8-9145-0a580a80000b 0xc4200fe2a0}
I0926 06:50:41.949806 1 server.go:177] RemoveServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:41.949982 1 server.go:180] (ignore)removeServiceInstance operation, the fake status code is: 200
I0926 06:50:41.950039 1 controller.go:132] RemoveServiceInstance()
I0926 06:50:57.978349 1 server.go:138] CreateServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b...
W0926 06:50:57.978466 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 205
I0926 06:50:57.978479 1 controller.go:81] CreateServiceInstance()
I0926 06:50:57.978489 1 controller.go:112] Created User Provided Service Instance: &{72629cc8-c158-11e8-9145-0a580a80000b 0xc4200fe2c0}

One more thing: the orphan mitigation status for returned code 408 is "NotRequired" - is that correct? Per the doc https://github.com/openservicebrokerapi/servicebroker/blob/v2.12/spec.md#orphans, the platform should run orphan mitigation. Correct me if I'm wrong.

The version info:
sh-4.2$ service-catalog --version
v3.11.15;Upstream:v0.1.31

# oc logs -f ups-broker-587776d4f9-fw7c8
...
I0926 07:18:51.286501 1 server.go:138] CreateServiceInstance 6ac1cd63-c15c-11e8-9145-0a580a80000b...
W0926 07:18:51.286627 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 408
I0926 07:18:51.286641 1 controller.go:81] CreateServiceInstance()
I0926 07:18:51.286653 1 controller.go:112] Created User Provided Service Instance: &{6ac1cd63-c15c-11e8-9145-0a580a80000b 0xc42000e1d8}
I0926 07:18:52.459793 1 server.go:138] CreateServiceInstance 6ac1cd63-c15c-11e8-9145-0a580a80000b...
W0926 07:18:52.459890 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 408
I0926 07:18:52.459903 1 controller.go:81] CreateServiceInstance()
I0926 07:18:52.459914 1 controller.go:112] Created User Provided Service Instance: &{6ac1cd63-c15c-11e8-9145-0a580a80000b 0xc42000e1e8}
I0926 07:18:54.659717 1 server.go:138] CreateServiceInstance 6ac1cd63-c15c-11e8-9145-0a580a80000b...
W0926 07:18:54.659852 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 408
I0926 07:18:54.659866 1 controller.go:81] CreateServiceInstance()
I0926 07:18:54.659877 1 controller.go:112] Created User Provided Service Instance: &{6ac1cd63-c15c-11e8-9145-0a580a80000b 0xc42000e1f8}
I0926 07:18:58.886426 1 server.go:138] CreateServiceInstance 6ac1cd63-c15c-11e8-9145-0a580a80000b...
W0926 07:18:58.886549 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 408
I0926 07:18:58.886562 1 controller.go:81] CreateServiceInstance()
I0926 07:18:58.886574 1 controller.go:112] Created User Provided Service Instance: &{6ac1cd63-c15c-11e8-9145-0a580a80000b 0xc42000e208}
I0926 07:19:07.089188 1 server.go:138] CreateServiceInstance 6ac1cd63-c15c-11e8-9145-0a580a80000b...
W0926 07:19:07.089279 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 408
I0926 07:19:07.089296 1 controller.go:81] CreateServiceInstance()
I0926 07:19:07.089305 1 controller.go:112] Created User Provided Service Instance: &{6ac1cd63-c15c-11e8-9145-0a580a80000b 0xc42007a4a0}
I0926 07:19:23.288389 1 server.go:138] CreateServiceInstance 6ac1cd63-c15c-11e8-9145-0a580a80000b...
W0926 07:19:23.288492 1 server.go:155] (ignore)createServiceInstance operation, the fake status code is: 408
I0926 07:19:23.288506 1 controller.go:81] CreateServiceInstance()
I0926 07:19:23.288518 1 controller.go:112] Created User Provided Service Instance: &{6ac1cd63-c15c-11e8-9145-0a580a80000b 0xc42007a4b8}

[root@qe-jiazha-master-etcd-1 ~]# oc get serviceinstance -n jian
NAME                          CLASS                                       PLAN      STATUS                AGE
user-provided-service-pwf6l   ClusterServiceClass/user-provided-service   default   ProvisionCallFailed   1m
[root@qe-jiazha-master-etcd-1 ~]# oc describe serviceinstance user-provided-service-pwf6l -n jian
Name:         user-provided-service-pwf6l
Namespace:    jian
Labels:       <none>
Annotations:  <none>
API Version:  servicecatalog.k8s.io/v1beta1
Kind:         ServiceInstance
Metadata:
  Creation Timestamp:  2018-09-26T07:18:50Z
  Finalizers:
    kubernetes-incubator/service-catalog
  Generate Name:     user-provided-service-
  Generation:        1
  Resource Version:  51665
  Self Link:         /apis/servicecatalog.k8s.io/v1beta1/namespaces/jian/serviceinstances/user-provided-service-pwf6l
  UID:               6ac1cdf5-c15c-11e8-9145-0a580a80000b
Spec:
  Cluster Service Class External Name:  user-provided-service
  Cluster Service Class Ref:
    Name:  4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468
  Cluster Service Plan External Name:  default
  Cluster Service Plan Ref:
    Name:           86064792-7ea2-467b-af93-ac9694d96d52
  External ID:      6ac1cd63-c15c-11e8-9145-0a580a80000b
  Update Requests:  0
  User Info:
    Extra:
      scopes.authorization.openshift.io:  user:full
    Groups:
      system:authenticated:oauth
      system:authenticated
    UID:
    Username:  jiazha
Status:
  Async Op In Progress:  false
  Conditions:
    Last Transition Time:  2018-09-26T07:18:51Z
    Message:               Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 408; ErrorMessage: <nil>; Description: <nil>; ResponseError: <nil>
    Reason:                ProvisionCallFailed
    Status:                False
    Type:                  Ready
  Current Operation:   Provision
  Deprovision Status:  NotRequired
  In Progress Properties:
    Cluster Service Plan External ID:    86064792-7ea2-467b-af93-ac9694d96d52
    Cluster Service Plan External Name:  default
    User Info:
      Extra:
        scopes.authorization.openshift.io:  user:full
      Groups:
        system:authenticated:oauth
        system:authenticated
      UID:
      Username:  jiazha
  Observed Generation:            1
  Operation Start Time:           2018-09-26T07:18:51Z
  Orphan Mitigation In Progress:  false
  Provision Status:
  Reconciled Generation:          0
Events:
  Type     Reason               Age                From                                Message
  ----     ------               ----               ----                                -------
  Warning  RetryBackoff         1m (x2 over 1m)    service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 07:18:52.302415558 +0000 UTC m=+19655.421361658
  Warning  RetryBackoff         1m (x2 over 1m)    service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 07:18:54.47884747 +0000 UTC m=+19657.597793563
  Warning  RetryBackoff         1m (x2 over 1m)    service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 07:18:58.676695836 +0000 UTC m=+19661.795641932
  Warning  RetryBackoff         1m (x2 over 1m)    service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 07:19:06.905254366 +0000 UTC m=+19670.024200470
  Warning  RetryBackoff         1m (x2 over 1m)    service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 07:19:23.107375213 +0000 UTC m=+19686.226321316
  Warning  RetryBackoff         44s (x2 over 44s)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 07:19:55.317066994 +0000 UTC m=+19718.436013105
  Warning  ProvisionCallFailed  12s (x7 over 1m)   service-catalog-controller-manager  Error provisioning ServiceInstance of ClusterServiceClass (K8S: "4f6e6cf6-ffdd-425f-a2c7-3c9258ad2468" ExternalName: "user-provided-service") at ClusterServiceBroker "ups-broker": Status: 408; ErrorMessage: <nil>; Description: <nil>; ResponseError: <nil>
  Warning  RetryBackoff         12s (x2 over 12s)  service-catalog-controller-manager  Delaying provision retry, next attempt will be after 2018-09-26 07:20:59.49015212 +0000 UTC m=+19782.609098228

Regarding comment #7: this behavior looks good - exactly what I was hoping to see. When the controller attempts to create the instance, the broker returns status 205. The controller interprets this as an error, which causes it to perform orphan mitigation, so in the broker log you see the "RemoveServiceInstance 72629cc8-c158-11e8-9145-0a580a80000b..." immediately following. After orphan mitigation, we see an exponential backoff and then a retry, and this repeats over and over. This all looks good to me, working as we would expect. Additionally, if you delete the K8s ServiceInstance while it is doing a backoff (waiting/sleeping before the next attempt), the K8s instance is successfully deleted and that ends the retries.

Regarding comment #8: you are referencing version 2.12 of the OSB API spec. If you look at master (https://github.com/openservicebrokerapi/servicebroker/blob/master/spec.md) and search for 408, you will see the spec was updated: at one point (version 2.12) a 408 was supposed to trigger orphan mitigation, but in 2.13 that was changed. The correct behavior, as specified in the latest spec, is to NOT do orphan mitigation for any 400-series status code. PR https://github.com/openservicebrokerapi/servicebroker/pull/456 contains the change for that.
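The spec reading described above can be summarized in a small decision function. This is an illustrative sketch of the OSB v2.13 rules as discussed in the comments (an unexpected 2xx such as the 205 in the logs requires orphan mitigation; any 4xx, including 408, does not), not the actual service-catalog implementation:

```go
package main

import "fmt"

// requiresOrphanMitigation sketches whether a provision response status
// should trigger orphan mitigation, per the comments' reading of the OSB
// API spec v2.13+. Illustrative only; not the service-catalog code.
func requiresOrphanMitigation(status int) bool {
	switch {
	case status == 200 || status == 201 || status == 202:
		return false // expected provision responses, no cleanup needed
	case status >= 200 && status < 300:
		return true // ambiguous success, e.g. the 205 seen in the logs
	case status >= 400 && status < 500:
		return false // changed in OSB v2.13; in v2.12 a 408 required it
	case status >= 500:
		return true // broker may have partially provisioned the instance
	default:
		return false
	}
}

func main() {
	for _, s := range []int{201, 205, 408, 500} {
		fmt.Printf("status %d -> orphan mitigation: %v\n", s, requiresOrphanMitigation(s))
	}
}
```

This matches the observed behavior: the 205 runs orphan mitigation (hence Deprovision Status "Required" work in the broker log), while the 408 reports "NotRequired".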
Thanks for all the details in comments 7 & 8 - with this information, I'm going to mark this as fixed.

Jay, thanks a lot for your clarification! It works on version Upstream:v0.1.31, 3.11. LGTM; verified.

Closing bugs that were verified and targeted for GA but for some reason were not picked up by errata. This bug fix should be present in current 3.11 release content.