Summary: Cluster upgrade fails due to orphaned serviceinstance object
Product: OpenShift Container Platform
Component: Service Catalog
Version: 3.9.0
Status: CLOSED ERRATA
Type: Bug
Reporter: Luke Stanton <lstanton>
Assignee: Dan Geoffroy <dageoffr>
QA Contact: Jian Zhang <jiazha>
CC: aos-bugs, cshereme, dcaldwel, jokerman, jpeeler, lstanton, mmccomas, nils.ketelsen, pmorie
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Last Closed: 2019-06-04 10:40:21 UTC
oVirt Team: ---
Cloudforms Team: ---
RHEL 7.3 requirements from Atomic Host:
Target Upstream Version:
Description Luke Stanton 2018-06-12 20:16:48 UTC
Description of problem:

When trying to upgrade to OpenShift 3.9.30, the installer failed with:

-----
"E0607 17:59:45.531019 error: -n ns-is-17 serviceinstances/mysql-ephemeral-vckw5: namespaces \"nps-is-17\" not found\nsummary: total=2452 errors=1 ignored=0 unchanged=2449 migrated=2\ninfo: to rerun only failing resources, add --include=serviceinstances\nerror: 1 resources failed to migrate"
-----

Manually trying to migrate storage showed the same error:

-----
oc adm migrate storage --confirm --include='*' --loglevel=6
...
E0608 15:02:26.805744 error: -n nps-is-17 serviceinstances/mysql-ephemeral-vckw5: namespaces "nps-is-17" not found
-----

Further investigation showed that the 'mysql-ephemeral-vckw5' serviceinstance object existed but its project did not. Somehow the serviceinstance had been provisioned but was not cleanly removed during deprovisioning, as shown in its status:

"message: 'Error deprovisioning, ClusterServiceClass (K8S: "169971c5-4267-11e8-ae0e-001a4aa8660c" ExternalName: "mysql-ephemeral") at ClusterServiceBroker "template-service-broker": Delete https://apiserver.openshift-template-service-broker.svc:443/brokers/template.openshift.io/v2/service_instances/505d29f7-da94-4d76-adef-443612258cbf?accepts_incomplete=true&plan_id=169971c5-4267-11e8-ae0e-001a4aa8660c&service_id=169971c5-4267-11e8-ae0e-001a4aa8660c: dial tcp 172.24.0.38:443: connect: cannot assign requested address' reason: DeprovisionCallFailed"

How reproducible:
Uncertain

Steps to Reproduce:
Uncertain. However, it seems that it should not be possible to get into this state.

Actual results:
The 'serviceinstance' object gets orphaned, causing other errors.

Expected results:
The 'serviceinstance' object is either deleted cleanly, or its parent namespace/project remains intact, preventing errors related to orphaned objects.
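One way to check a cluster for this condition before an upgrade is to list every serviceinstance with its namespace and compare against the namespaces that actually exist. This is a hypothetical sketch, not part of the original report: it only prints the `oc` commands (they need a logged-in session on a live cluster), and the comparison of the two lists is left to the operator.

```shell
# Hypothetical dry-run sketch: print the commands that reveal serviceinstances
# whose namespace no longer exists. Run the printed commands by hand against a
# live cluster and compare the two namespace columns.
print_orphan_check_cmds() {
  cat <<'EOF'
oc get serviceinstances --all-namespaces -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name
oc get namespaces -o custom-columns=NS:.metadata.name
EOF
}
print_orphan_check_cmds
```

Any namespace that appears in the first listing but not in the second indicates an orphaned serviceinstance like the one in this report.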
Comment 2 Jay Boyd 2018-06-22 19:06:50 UTC
Luke, the service catalog team agrees: this shouldn't be possible. Namespaces are prevented from being deleted until all the objects within them are removed. Are you able to get any details on the past history of the service mysql-ephemeral-vckw5 or the nps-is-17 namespace? Was there prior difficulty with deleting the namespace or the service?

The error encountered on the delete ('cannot assign requested address') may indicate a resource issue, but it should have been surfaced as an error and retried. Have you tried deleting it again? i.e. oc delete serviceinstance ...

I imagine the delete will fail. You can try removing the finalizer (oc edit serviceinstance mysql-ephemeral-vckw5 -n nps-is-17, then delete the service catalog finalizer); the instance should then be removed the next time the reconciler runs, but there could be errors instead since the namespace doesn't exist. If that is the case we may need to manually delete it from etcd storage.
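The interactive `oc edit` step above can also be done non-interactively with `oc patch` and a JSON patch. The sketch below is an assumption-laden illustration, not a step from this bug: it only prints the command (running it requires a live cluster), and it drops the whole finalizers list, which is safe only when the service-catalog finalizer is the sole entry.

```shell
# Sketch: build (but do not run) an `oc patch` command equivalent to editing the
# serviceinstance and deleting its finalizers. Instance/namespace names are the
# ones from this bug; substitute your own.
build_finalizer_patch_cmd() {
  ns="$1"; instance="$2"
  printf "oc patch serviceinstance %s -n %s --type=json -p '[{\"op\":\"remove\",\"path\":\"/metadata/finalizers\"}]'\n" \
    "$instance" "$ns"
}
build_finalizer_patch_cmd nps-is-17 mysql-ephemeral-vckw5
```

As Jay notes, removing the finalizer only unblocks deletion on the cluster side; the broker may still hold an orphaned instance afterwards.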
Comment 3 Jeff Peeler 2018-06-29 15:47:03 UTC
Just wanted to confirm that the finalizer on the namespace wasn't removed in order to attempt a force delete? The type of validation being hit during storage migration isn't a bug; the bug is that the existing resources were not deleted before the namespace was completely terminated. On the catalog side, even in situations where the instance has timed out due to an "ErrorReconciliationRetryTimeout", the cluster administrator should clean up the instances. This ensures that the broker doesn't end up with orphaned instances, which could cost customers money if the instance isn't hosted on their cloud.
Comment 4 Nils Ketelsen 2018-07-02 09:36:47 UTC
(In reply to Jay Boyd from comment #2)
> That error encountered on the delete ('cannot assign requested address') may
> indicate a resource issue, but it should have been an error and retried.
> Have you tried deleting it again? ie oc delete serviceinstance...
>
> I imagine the delete will fail.

I seem to have the same issue and can confirm your assumption:

$ oc adm migrate storage --include=* --confirm
E0702 09:19:53.820965 error: -n test serviceinstances/nginx-example-prjjq: namespaces "test" not found
summary: total=2670 errors=1 ignored=0 unchanged=2667 migrated=2
info: to rerun only failing resources, add --include=serviceinstances
error: 1 resources failed to migrate

$ oc delete project test
Error from server (NotFound): namespaces "test" not found

I also first encountered this when trying to update (in this case from 3.9.30 to 3.9.31).

Nils
Comment 5 Nils Ketelsen 2018-07-02 11:31:41 UTC
Possible workaround for others running into this issue: re-create the missing namespace, run the storage migration manually, then remove the namespace.

I suspect the manual storage migration is not required and it would also work without it, but I cannot test this anymore because the issue is now gone and I have no idea how to reproduce it. The update playbook now executes the storage migration step successfully, so this is a workaround that seems to help. (In my case the resource was in the deleted namespace "test".)

$ oc new-project test
Now using project "test" on server "https://console-openshift-test.example.com:8443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

to build a new example application in Ruby.

$ oc adm migrate storage --include=* --confirm
summary: total=2754 errors=0 ignored=0 unchanged=2753 migrated=1

$ oc delete project test
project "test" deleted
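The three-command workaround above can be wrapped into a small reusable sketch. This is an illustration layered on the comment, not part of it: the function only prints the commands (they need a logged-in `oc` session on a live cluster), and the namespace is passed in as a parameter.

```shell
# Dry-run sketch of the workaround: recreate the deleted namespace, rerun the
# storage migration, then delete the namespace again. Pipe the output to `sh`
# only on a cluster where `oc` is logged in and the orphan has been confirmed.
print_workaround() {
  ns="$1"   # the already-deleted namespace the orphaned serviceinstance references
  cat <<EOF
oc new-project $ns
oc adm migrate storage --include='*' --confirm
oc delete project $ns
EOF
}
print_workaround test
```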
Comment 6 Jeff Peeler 2018-07-02 15:04:24 UTC
Nils, any chance you know more details about what state your cluster was in pre-upgrade, specifically with the test namespace? Sounds like you didn't modify the namespace finalizers, in which case something during the upgrade process is causing the namespace to be deleted before all resources are cleaned up.
Comment 7 Nils Ketelsen 2018-07-03 05:09:42 UTC
(In reply to Jeff Peeler from comment #6)
> Nils, any chance you know more details about what state your cluster was in
> pre-upgrade, specifically with the test namespace? Sounds like you didn't
> modify the namespace finalizers, in which case something during the upgrade
> process is causing the namespace to be deleted before all resources are
> cleaned up.

Hi Jeff,

the namespace had been deleted long before the update. I cannot say whether the delete ever worked or whether it was pending the whole time, with the update doing something that resulted in the orphaned object. Maybe this state existed before and had just not been noticed.

I have not modified namespace finalizers; I am rather new to OpenShift and until reading this ticket I did not even know such a thing existed ;-)

Sorry I can give no more details,
Nils
Comment 8 Jeff Peeler 2018-07-03 16:19:14 UTC
Nils - your feedback has been valuable, thanks!
Comment 9 Jeff Peeler 2018-07-03 16:35:49 UTC
I'm not sure which component to assign this to. The bug here is that a namespace shouldn't be able to be deleted without deleting all the resources in it first.
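When a namespace deletion does hang on remaining resources (the behavior that should have happened here), the namespace object itself records why. The sketch below is a hypothetical aid, not from this thread: it only prints inspection commands, and the availability of detailed `.status.conditions` on namespaces varies by Kubernetes version (the richer deletion conditions are a later addition).

```shell
# Dry-run sketch: print commands that show what is blocking (or should be
# blocking) a namespace's deletion. Run them by hand on a live cluster.
print_ns_inspect_cmds() {
  ns="$1"
  cat <<EOF
oc get namespace $ns -o jsonpath='{.spec.finalizers}'
oc get namespace $ns -o jsonpath='{.status.conditions}'
EOF
}
print_ns_inspect_cmds test
```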
Comment 11 Luke Stanton 2018-07-12 14:21:28 UTC
I apologize for not responding before now. I've been out of the office quite a bit over the last few weeks. Nils, thank you for sharing your experience! We were able to use the same workaround you mentioned of temporarily creating the namespace so the object could be deleted. Unfortunately I wasn't able to get much detail on the service history that caused the issue, just that it was a test environment with a lot of developer activity, and at some point a serviceinstance was left behind even though its namespace had been removed.
Comment 13 Jeff Peeler 2018-08-03 15:35:46 UTC
This looks like the same issue (with a reproducer and fix included): https://github.com/kubernetes-incubator/service-catalog/issues/2254
Comment 14 Jay Boyd 2019-03-07 13:25:44 UTC
This is a core bug in K8s, fixed in Kube 1.11.3 and 1.12+ by https://github.com/kubernetes/kubernetes/pull/67154 (see https://github.com/kubernetes-incubator/service-catalog/issues/2254 for lots of details). Marking as fixed in 4.0. If this needs to be addressed in earlier versions, we'd need the master team to investigate backporting these two fixes:

https://github.com/kubernetes/kubernetes/pull/66932 - Include unavailable API services in discovery response
https://github.com/kubernetes/kubernetes/pull/67433 - allow failed discovery on initial quota controller start
Comment 16 Jian Zhang 2019-03-11 07:54:41 UTC
[jzhang@dhcp-140-18 ocp-09]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-06-074438   True        False         41h     Cluster version is 4.0.0-0.nightly-2019-03-06-074438

Verify steps:

1. Install the Service Catalog and ups-broker.
[jzhang@dhcp-140-18 ocp-09]$ oc get pods -n kube-service-catalog
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-4zx64              1/1     Running   0          10m
apiserver-ptdwp              1/1     Running   0          10m
apiserver-vpwq8              1/1     Running   0          10m
ups-broker-677dd6cdd-524fl   1/1     Running   0          28m

2. Create a namespace called "test".

3. Create a serviceinstance in the "test" namespace.
[jzhang@dhcp-140-18 ocp-09]$ oc get serviceinstance -n test
NAME           CLASS                                       PLAN      STATUS   AGE
ups-instance   ClusterServiceClass/user-provided-service   default   Ready    20s

4. Undeploy the Service Catalog API server.
[jzhang@dhcp-140-18 ocp-09]$ oc scale --replicas=0 deployment/openshift-svcat-apiserver-operator -n openshift-svcat-apiserver-operator
deployment.extensions/openshift-svcat-apiserver-operator scaled
[jzhang@dhcp-140-18 ocp-09]$ oc get pods -n openshift-svcat-apiserver-operator
No resources found.
[jzhang@dhcp-140-18 ocp-09]$ oc delete ds/apiserver -n kube-service-catalog
daemonset.extensions "apiserver" deleted
[jzhang@dhcp-140-18 ocp-09]$ oc get pods -n kube-service-catalog
NAME                         READY   STATUS    RESTARTS   AGE
ups-broker-677dd6cdd-524fl   1/1     Running   0          9m5s

5. Delete the test namespace.
[jzhang@dhcp-140-18 ocp-09]$ oc delete ns test
namespace "test" deleted
The "test" namespace cannot be fully deleted; it stays in Terminating. LGTM.
[jzhang@dhcp-140-18 ocp-09]$ oc get ns
...
test   Terminating   18m

6. Redeploy the Service Catalog API server.
[jzhang@dhcp-140-18 ocp-09]$ oc get pods -n kube-service-catalog
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-4zx64              1/1     Running   0          10m
apiserver-ptdwp              1/1     Running   0          10m
apiserver-vpwq8              1/1     Running   0          10m
ups-broker-677dd6cdd-524fl   1/1     Running   0          28m
[jzhang@dhcp-140-18 ocp-09]$ oc get serviceinstance
NAME           CLASS                                       PLAN      STATUS   AGE
ups-instance   ClusterServiceClass/user-provided-service   default   Ready    16m
Edit the ups-instance and delete these two lines:
finalizers:
- kubernetes-incubator/service-catalog
The "test" namespace was then deleted.

7. Recreate the "test" namespace; no serviceinstance is found. LGTM, verifying it.
[jzhang@dhcp-140-18 ocp-09]$ oc get serviceinstance -n test
No resources found.
Comment 20 errata-xmlrpc 2019-06-04 10:40:21 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758