Created attachment 1406979: oc get servicebindings -n demo-project -o yaml

I'm using v3.9 images for everything:

docker.io/openshift/origin-service-catalog:v3.9: docker.io/openshift/origin-service-catalog@sha256:8fcc079089663c5f31a9764a4ad6d6cafe7e577dcf85d5c01840523b8e767b67
docker.io/openshift/origin-template-service-broker:v3.9: docker.io/openshift/origin-template-service-broker@sha256:7cce2960621d567161f5e72421b3d11972b0975683b042900fd8d74a5b532f8b

I used the web UI to create a (non-ephemeral) Jenkins and told it to create a secret/binding. I played around with Jenkins, then deleted the namespace. The namespace won't delete. Looking at the servicebinding, I see:

'Error unbinding from ServiceInstance "demo-project/jenkins-persistent-wdbkx" of ClusterServiceClass (K8S: "93cfef20-2480-11e8-b305-0adf9c2d8066" ExternalName: "jenkins-persistent") at ClusterServiceBroker "template-service-broker": Status: 403; ErrorMessage: <nil>; Description: templateinstances.template.openshift.io "f6fa12e8-a5e1-4a20-be3f-085ff85d6177" is forbidden: User "system:serviceaccount:kube-system:namespace-controller" cannot update templateinstances.template.openshift.io in project "demo-project"; ResponseError: <nil>'

Now I can't delete the serviceinstance at all...
I think the reporter is not using a downstream APB registry, and the instance was created by a Jenkins APB that is not one of the APB images we publish for OCP 3.9.
BTW, I logged a doc bug for service-catalog resource cleanup: https://bugzilla.redhat.com/show_bug.cgi?id=1548618

As a workaround for this issue, you can fix a resource in this state by editing it and deleting the finalizer token. Change this:

    apiVersion: servicecatalog.k8s.io/xxx
    kind: ServiceInstance
    metadata:
      creationTimestamp: null
      deletionGracePeriodSeconds: 0
      deletionTimestamp: null
      finalizers:
      - kubernetes-incubator/service-catalog
      generateName: xxx
      generation: 2

to be like so:

    apiVersion: servicecatalog.k8s.io/xxx
    kind: ServiceInstance
    metadata:
      creationTimestamp: null
      deletionGracePeriodSeconds: 0
      deletionTimestamp: null
      generateName: xxx
      generation: 2

After this, the resource should be deleted immediately. This should be a viable stopgap measure to unblock people experiencing this issue.
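For anyone scripting the workaround above, here is a minimal Python sketch of the edit itself, operating on the resource as a plain dict (for example, one parsed from JSON output of the resource). The helper name strip_finalizer is hypothetical, and writing the edited object back to the cluster is deliberately left out.

```python
import copy

# Finalizer token quoted in the workaround above.
FINALIZER = "kubernetes-incubator/service-catalog"


def strip_finalizer(resource, token=FINALIZER):
    """Return a copy of `resource` with `token` removed from metadata.finalizers.

    `strip_finalizer` is a hypothetical helper for illustration; it only edits
    the in-memory dict and does not talk to the cluster.
    """
    edited = copy.deepcopy(resource)
    metadata = edited.get("metadata", {})
    remaining = [f for f in metadata.get("finalizers", []) if f != token]
    if remaining:
        metadata["finalizers"] = remaining
    else:
        # Drop the key entirely, matching the "to be like so" manifest above.
        metadata.pop("finalizers", None)
    return edited


stuck = {
    "kind": "ServiceInstance",
    "metadata": {
        "finalizers": ["kubernetes-incubator/service-catalog"],
        "generateName": "xxx",
        "generation": 2,
    },
}
fixed = strip_finalizer(stuck)
print("finalizers" in fixed["metadata"])  # False
```

The function returns a copy rather than mutating its input, so the original object can still be inspected or diffed before anything is saved.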
@chezhang - from eparis's description on the bug I can tell this was not created using the ASB and APBs. It was done using the Template Service Broker and the Jenkins persistent (non-ephemeral) template. The error seems to be coming from the TSB not having permissions to delete template instances. @eparis - how did you install this cluster?
Moving over to Ben as this looks to be a template broker issue with 3.9
eparis confirmed this was an ansible install
@zhang, thanks, you are mostly correct. In this case it was the servicebinding, not the serviceinstance, where I had to delete the finalizer. But doing so caused things to clean up.
To summarize the theory from the irc discussion:

1) namespace is deleted, finalizer (kubernetes-incubator/service-catalog) is invoked
2) finalizer attempts to unbind the service instance; it apparently does so using the system:serviceaccount:kube-system:namespace-controller user
3) the TSB rejects the unbind request because the system:serviceaccount:kube-system:namespace-controller user does not have permission to update templateinstances (which is a necessary step in performing an unbind)

imho the TSB is doing the right thing here. If you want to unbind something, you need to do so with a user that has appropriate permissions to perform an unbind.
Possible solutions:

1) give the finalizer account (system:serviceaccount:kube-system:namespace-controller?) permission to update templateinstances
2) find a way to make the finalizer run as a different user (or at least invoke the unbind as a different user, namely the user who deleted the namespace)
3) update the TSB to allow unbinds to occur w/o confirming the user has templateinstance update permissions

(2) is the right answer imho, because this won't necessarily be the last time we hit an issue like this. We're essentially saying "system:serviceaccount:kube-system:namespace-controller" should be able to invoke any broker's deprovision and unbind apis successfully. That means it needs to have any permission those apis require, which could be different for any given broker, or change over time.
Also i'm unable to recreate this behavior using a v3.9 cluster using oc cluster up, though i do see the finalizer reference on both the serviceinstance and the servicebinding, so it would be good to understand if there is either a possible race condition here, or something has changed in the finalizer behavior (either the service catalog's specific finalizer implementation, or the openshift finalizer framework).
After my demo on wednesday I can give the cluster where this is happening. I'm trying to get another cluster running, if I do, I'll see if I can recreate it there.
unless someone disagrees w/ my summary of the current theory in comment 8, i don't know that we need an env for reproducing it, though I am curious why i'm not seeing it in my cluster.
correct, I guess I can recreate this w/ cluster up. You don't get an error from "oc delete project" but the project is not actually deleted.
So the reason the unbind is being done w/ the namespace-controller user is that the servicebinding object was updated to have the namespace-controller value in the ServiceBinding.spec.userInfo field:

    $ oc get servicebindings -o yaml
    apiVersion: v1
    items:
    - apiVersion: servicecatalog.k8s.io/v1beta1
      kind: ServiceBinding
      metadata:
        creationTimestamp: 2018-03-12T17:05:59Z
        deletionGracePeriodSeconds: 0
        deletionTimestamp: 2018-03-12T17:25:53Z
        finalizers:
        - kubernetes-incubator/service-catalog
        generateName: jenkins-persistent-fhqt9-
        generation: 2
        name: jenkins-persistent-fhqt9-ctcdv
        namespace: j1
        resourceVersion: "1228"
        selfLink: /apis/servicecatalog.k8s.io/v1beta1/namespaces/j1/servicebindings/jenkins-persistent-fhqt9-ctcdv
        uid: a30da6ca-2617-11e8-9c63-0242ac110002
      spec:
        externalID: 40554559-f241-4dff-bcaa-dfa4497d4eb3
        instanceRef:
          name: jenkins-persistent-fhqt9
        secretName: jenkins-persistent-fhqt9-credentials-8b8sn
        userInfo:
          groups:
          - system:serviceaccounts
          - system:serviceaccounts:kube-system
          - system:authenticated
          uid: ""
          username: system:serviceaccount:kube-system:namespace-controller

Clayton thinks this is because there is an update performed before the delete. The update is performed by the namespace-controller, and the service catalog sets the userInfo field to the identity of whoever did the update. This ultimately leads to the SC performing the unbind call with the binding.spec.userInfo value. So this is an issue w/ the SC: it needs to somehow protect the userInfo field during updates.
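To make the theory above concrete, here is a toy Python model of it. This is not service catalog code: the function names, the protect_user_info switch, and the "developer" user are all hypothetical, and the only behaviour modelled is "every update stamps spec.userInfo with the requester, and unbind uses whatever is stored there".

```python
import copy

NAMESPACE_CONTROLLER = "system:serviceaccount:kube-system:namespace-controller"


def apply_update(binding, requester, protect_user_info=False):
    """Toy model of an update to a ServiceBinding.

    Mimics the behaviour described above: each update overwrites
    spec.userInfo with the requester, unless the hypothetical
    `protect_user_info` switch is on.
    """
    updated = copy.deepcopy(binding)
    updated["metadata"]["generation"] += 1
    if not protect_user_info:
        updated["spec"]["userInfo"] = {"username": requester}
    return updated


def unbind_identity(binding):
    """Identity the unbind call would be made with, per the theory above."""
    return binding["spec"]["userInfo"]["username"]


# "developer" is a made-up name standing in for whoever created the binding.
binding = {
    "metadata": {"generation": 1},
    "spec": {"userInfo": {"username": "developer"}},
}

# Namespace deletion: the namespace-controller touches the binding first.
after_ns_delete = apply_update(binding, NAMESPACE_CONTROLLER)
print(unbind_identity(after_ns_delete))

# Same update with userInfo protected: the original identity survives.
protected = apply_update(binding, NAMESPACE_CONTROLLER, protect_user_info=True)
print(unbind_identity(protected))
```

In the first case the unbind would go out as the namespace-controller, which is the identity the broker rejected; in the second it would go out as the user who created the binding.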
to recreate:

1) do a provision/bind
2) confirm the servicebinding userinfo has the info of the user who did the bind
3) delete the namespace (it won't actually go away)
4) retrieve the binding, see that userinfo now contains the id of the namespace-controller, not the user from (1).
fix/workaround just for the TSB (other brokers could have this problem too): https://github.com/openshift/origin/pull/18948
taking this back to do the fix/workaround in the TSB (comment 15) but it sounds like ASB can run into this also.
Commit pushed to master at https://github.com/openshift/origin

https://github.com/openshift/origin/commit/b296e3d62ad6f61dc6e7e97ee6a72739c0856e92
require templateinstance delete, not update, on unbind

bug 1554141 https://bugzilla.redhat.com/show_bug.cgi?id=1554141
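A rough sketch of why changing the required verb would help, assuming the namespace-controller service account holds delete but not update on templateinstances. That grant is an assumption inferred from the commit title and the 403 reported earlier, not something confirmed in this thread; the permission table and function names below are hypothetical.

```python
NAMESPACE_CONTROLLER = "system:serviceaccount:kube-system:namespace-controller"

# Hypothetical permission table: user -> verbs allowed on templateinstances.
# The namespace-controller entry is an assumption (delete but no update).
ALLOWED_VERBS = {
    NAMESPACE_CONTROLLER: {"get", "list", "delete"},
    "developer": {"get", "list", "update", "delete"},
}


def can(user, verb):
    """True if `user` may perform `verb` in this toy permission table."""
    return verb in ALLOWED_VERBS.get(user, set())


def unbind_status(user, required_verb):
    """HTTP-style status a broker would return for an unbind by `user`."""
    return 200 if can(user, required_verb) else 403


# Before the change, unbind required update; after it, unbind requires delete.
before_fix = unbind_status(NAMESPACE_CONTROLLER, "update")
after_fix = unbind_status(NAMESPACE_CONTROLLER, "delete")
print(before_fix, after_fix)
```

Under that assumption the unbind issued during namespace deletion is rejected when update is required and accepted when delete is required, while a user who created the binding passes either check.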
I can reproduce this in both v3.7.23 and v3.9.4 with the steps below:

1. Log in to the cluster as a cluster-admin user.
2. In the web UI, provision/bind the jenkins-persistent template.
3. Delete the project.
4. Describe the servicebinding; the message below will appear:

Message: Error unbinding from ServiceInstance "lasttry/jenkins-persistent-kmb58" of ClusterServiceClass (K8S: "5d2ccbdd-2599-11e8-bb2d-fa163e154d74" ExternalName: "jenkins-persistent") at ClusterServiceBroker "template-service-broker": Status: 403; ErrorMessage: <nil>; Description: templateinstances.template.openshift.io "3598d5ef-1ea3-47d8-ad8d-ec0bf068380a" is forbidden: User "system:serviceaccount:kube-system:namespace-controller" cannot update templateinstances.template.openshift.io in project "lasttry"; ResponseError: <nil>

This is the error from v3.7.23, which is different from 3.9 with the same steps:

Message: Unbind call failed. Error unbinding from ServiceInstance "37tsb/jenkins-persistent-gwkfc" of ClusterServiceClass (K8S: "0b04c401-2669-11e8-9745-fa163ed402fb" ExternalName: "jenkins-persistent") at ClusterServiceBroker "template-service-broker": Status: 500; ErrorMessage: <nil>; Description: templateinstances.template.openshift.io "202a5160-0187-4830-961c-f2e927b5a95b" not found; ResponseError: <nil>
A normal user can reproduce this too. The project is not actually deleted (as cluster-admin, it can be seen stuck in Terminating).
As Ben mentioned in comment 16 and in mail ("However Shawn has noted that the ASB also appears to have a similar issue (as could any broker implementation)"), do we need any changes on the ASB side?
The broker issue is being tracked here: https://bugzilla.redhat.com/process_bug.cgi
*** Bug 1555053 has been marked as a duplicate of this bug. ***
Shawn, maybe you have a typo in comment 21; I cannot access the bug link.
This is the ASB issue being tracked: https://bugzilla.redhat.com/show_bug.cgi?id=1554239
Verified with the version below:

openshift v3.9.9
kubernetes v1.9.1+a0ce1bc657
etcd 3.2.16
*** Bug 1567047 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2013