Bug 1590546

Summary: Cluster upgrade fails due to orphaned serviceinstance object
Product: OpenShift Container Platform
Component: Service Catalog
Version: 3.9.0
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Reporter: Luke Stanton <lstanton>
Assignee: Dan Geoffroy <dageoffr>
QA Contact: Jian Zhang <jiazha>
CC: aos-bugs, cshereme, dcaldwel, jokerman, jpeeler, lstanton, mmccomas, nils.ketelsen, pmorie
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Type: Bug
Doc Type: No Doc Update
Last Closed: 2019-06-04 10:40:21 UTC

Description Luke Stanton 2018-06-12 20:16:48 UTC
Description of problem:

When trying to upgrade to OpenShift 3.9.30, the installer failed with:
-----
"E0607 17:59:45.531019 error:     -n ns-is-17 serviceinstances/mysql-ephemeral-vckw5: namespaces \"nps-is-17\" not found\nsummary: total=2452 errors=1 ignored=0 unchanged=2449 migrated=2\ninfo: to rerun only failing resources, add --include=serviceinstances\nerror: 1 resources failed to migrate"
-----

Manually trying to migrate storage showed the same error:
-----
oc adm migrate storage --confirm --include='*' --loglevel=6
...
E0608 15:02:26.805744 error:     -n nps-is-17 serviceinstances/mysql-ephemeral-vckw5: namespaces "nps-is-17" not found
-----

Further investigation showed that the 'mysql-ephemeral-vckw5' serviceinstance object existed but its project did not. Somehow the serviceinstance had been provisioned but was not cleanly removed during deprovisioning, as shown in its status:

"message: 'Error deprovisioning, ClusterServiceClass (K8S: "169971c5-4267-11e8-ae0e-001a4aa8660c"
  ExternalName: "mysql-ephemeral") at ClusterServiceBroker "template-service-broker":
  Delete https://apiserver.openshift-template-service-broker.svc:443/brokers/template.openshift.io/v2/service_instances/505d29f7-da94-4d76-adef-443612258cbf?accepts_incomplete=true&plan_id=169971c5-4267-11e8-ae0e-001a4aa8660c&service_id=169971c5-4267-11e8-ae0e-001a4aa8660c:
  dial tcp 172.24.0.38:443: connect: cannot assign requested address'
reason: DeprovisionCallFailed"

How reproducible:
Uncertain

Steps to Reproduce:
Uncertain. However, it seems that it shouldn't be possible to get into this state.

Actual results:
'serviceinstance' object gets orphaned, causing other errors.

Expected results:
'serviceinstance' object either gets deleted cleanly, or else its parent namespace/project remains intact so that errors related to orphaned objects are prevented.

Comment 2 Jay Boyd 2018-06-22 19:06:50 UTC
Luke, the service catalog team agrees: this shouldn't be possible.  Namespaces are prevented from being deleted until all the objects within them are removed.  Are you able to get any details on the past history of the service mysql-ephemeral-vckw5 or the nps-is-17 namespace?  Was there prior difficulty with deleting the namespace or the service?

That error encountered on the delete ('cannot assign requested address') may indicate a resource issue, but it should have been recorded as an error and retried. Have you tried deleting it again? i.e. oc delete serviceinstance...

I imagine the delete will fail. You can try removing the finalizer (oc edit serviceinstance mysql-ephemeral-vckw5 -n nps-is-17, then delete the service catalog finalizer); the instance should then be removed the next time the reconciler runs, but there could be errors instead since the namespace doesn't exist. If that is the case we may need to manually delete it from etcd storage.
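
An equivalent one-liner for the finalizer removal would be something like this (just a sketch; it assumes the service catalog API server still accepts the patch even though the namespace is gone):

oc patch serviceinstance mysql-ephemeral-vckw5 -n nps-is-17 --type merge -p '{"metadata":{"finalizers":null}}'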

Comment 3 Jeff Peeler 2018-06-29 15:47:03 UTC
Just wanted to confirm: the finalizer on the namespace wasn't removed in order to attempt a force delete?

The type of validation being hit during storage migration isn't a bug; the bug is that the existing resources were not deleted before the namespace was completely terminated. On the catalog side, even in situations where the instance has timed out due to an "ErrorReconciliationRetryTimeout", the cluster administrator should clean up the instances. This ensures that the broker doesn't end up with orphaned instances, potentially costing customers money if the instance isn't hosted on their own cloud.
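
A rough way for a cluster administrator to spot serviceinstances whose namespace no longer exists (just a sketch, not part of any supported tooling; assumes jq is installed):

oc get serviceinstances --all-namespaces -o json \
  | jq -r '.items[].metadata.namespace' | sort -u \
  | while read ns; do
      # flag any namespace referenced by a serviceinstance that no longer exists
      oc get ns "$ns" >/dev/null 2>&1 || echo "missing namespace: $ns"
    done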

Comment 4 Nils Ketelsen 2018-07-02 09:36:47 UTC
(In reply to Jay Boyd from comment #2)

> That error encountered on the delete ('cannot assign requested address') may
> indicate a resource issue, but it should have been an error and retried. 
> Have you tried deleting it again?  ie oc delete serviceinstance...
> 
> I imagine the delete will fail.  

I seem to have the same issue and can confirm your assumption:

$ oc adm migrate storage --include=* --confirm
E0702 09:19:53.820965 error:     -n test serviceinstances/nginx-example-prjjq: namespaces "test" not found
summary: total=2670 errors=1 ignored=0 unchanged=2667 migrated=2
info: to rerun only failing resources, add --include=serviceinstances
error: 1 resources failed to migrate

$ oc delete project test
Error from server (NotFound): namespaces "test" not found

Also first encountered this when trying to update (in this case from 3.9.30 to 3.9.31).


Nils

Comment 5 Nils Ketelsen 2018-07-02 11:31:41 UTC
Possible workaround for others running into this issue:

Re-create the missing namespace, run the storage migration manually, then remove the namespace. The manual storage migration may not actually be required and the update might work without it, but I can no longer test this because the issue is now gone and I have no idea how to reproduce it. In any case, the update playbook now executes the storage migrate step successfully, so this workaround seems to help:

(In my case the resource was in the (deleted) namespace "test")

$ oc new-project test
Now using project "test" on server "https://console-openshift-test.example.com:8443".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app centos/ruby-22-centos7~https://github.com/openshift/ruby-ex.git

to build a new example application in Ruby.

$ oc adm migrate storage --include=* --confirm
summary: total=2754 errors=0 ignored=0 unchanged=2753 migrated=1

$ oc delete project test
project "test" deleted

Comment 6 Jeff Peeler 2018-07-02 15:04:24 UTC
Nils, any chance you know more details about what state your cluster was in pre-upgrade, specifically with the test namespace? Sounds like you didn't modify the namespace finalizers, in which case something during the upgrade process is causing the namespace to be deleted before all resources are cleaned up.

Comment 7 Nils Ketelsen 2018-07-03 05:09:42 UTC
(In reply to Jeff Peeler from comment #6)
> Nils, any chance you know more details about what state your cluster was in
> pre-upgrade, specifically with the test namespace? Sounds like you didn't
> modify the namespace finalizers, in which case something during the upgrade
> process is causing the namespace to be deleted before all resources are
> cleaned up.

Hi Jeff,

The namespace had been deleted long before the update.

I cannot say whether the delete ever completed or whether it was pending the whole time, with the update doing something that resulted in the orphaned object. Maybe this state existed before the update and just wasn't noticed.

I have not modified namespace finalizers - I am rather new to OpenShift and until reading this ticket I did not even know such a thing existed ;-)

Sorry I can give no more details,

      Nils

Comment 8 Jeff Peeler 2018-07-03 16:19:14 UTC
Nils - your feedback has been valuable, thanks!

Comment 9 Jeff Peeler 2018-07-03 16:35:49 UTC
I'm not sure which component to assign this to. The bug here is that a namespace can be completely deleted without all of the resources in it being deleted first, which shouldn't be possible.

Comment 11 Luke Stanton 2018-07-12 14:21:28 UTC
I apologize for not responding before now. I've been out of the office quite a bit the last few weeks. Nils, thank you for sharing your experience! We were able to use the same workaround you mentioned of temporarily creating the namespace so the object could be deleted.

Unfortunately I wasn't able to get much detail on the service history that caused the issue, just that it was a test environment with a lot of developer activity and that at some point a serviceinstance was left behind even though its namespace had been removed.

Comment 13 Jeff Peeler 2018-08-03 15:35:46 UTC
This looks like the same issue (with a reproducer and fix included):

https://github.com/kubernetes-incubator/service-catalog/issues/2254

Comment 14 Jay Boyd 2019-03-07 13:25:44 UTC
This is a core bug in K8s, fixed in Kube 1.11.3 or 1.12+ by https://github.com/kubernetes/kubernetes/pull/67154  (see https://github.com/kubernetes-incubator/service-catalog/issues/2254 for lots of details).

Marking as fixed in 4.0. If this needs to be addressed in earlier versions we'd need the master team to investigate backporting these two fixes:

https://github.com/kubernetes/kubernetes/pull/66932  Include unavailable API services in discovery response
https://github.com/kubernetes/kubernetes/pull/67433  allow failed discovery on initial quota controller start

Comment 16 Jian Zhang 2019-03-11 07:54:41 UTC
[jzhang@dhcp-140-18 ocp-09]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-06-074438   True        False         41h     Cluster version is 4.0.0-0.nightly-2019-03-06-074438

Verify steps:
1, Install the Service Catalog and ups-broker.
[jzhang@dhcp-140-18 ocp-09]$ oc get pods -n kube-service-catalog  
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-4zx64              1/1     Running   0          10m
apiserver-ptdwp              1/1     Running   0          10m
apiserver-vpwq8              1/1     Running   0          10m
ups-broker-677dd6cdd-524fl   1/1     Running   0          28m

2, Create a namespace called "test".
3, Create a serviceinstance in the "test" namespace.
[jzhang@dhcp-140-18 ocp-09]$ oc get serviceinstance -n test
NAME           CLASS                                       PLAN      STATUS   AGE
ups-instance   ClusterServiceClass/user-provided-service   default   Ready    20s
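
For reference, a minimal sketch of the instance created in step 3, assumed from the standard user-provided-service example rather than taken from this report:

oc create -f - <<EOF
apiVersion: servicecatalog.k8s.io/v1beta1
kind: ServiceInstance
metadata:
  name: ups-instance
  namespace: test
spec:
  clusterServiceClassExternalName: user-provided-service
  clusterServicePlanExternalName: default
EOF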

4, Undeploy Service Catalog API server 
[jzhang@dhcp-140-18 ocp-09]$ oc scale --replicas=0  deployment/openshift-svcat-apiserver-operator -n openshift-svcat-apiserver-operator  
deployment.extensions/openshift-svcat-apiserver-operator scaled
[jzhang@dhcp-140-18 ocp-09]$ oc get pods -n openshift-svcat-apiserver-operator  
No resources found.
[jzhang@dhcp-140-18 ocp-09]$ oc delete  ds/apiserver  -n kube-service-catalog  
daemonset.extensions "apiserver" deleted
[jzhang@dhcp-140-18 ocp-09]$ oc get pods -n kube-service-catalog  
NAME                         READY   STATUS    RESTARTS   AGE
ups-broker-677dd6cdd-524fl   1/1     Running   0          9m5s

5, Delete the test namespace.
[jzhang@dhcp-140-18 ocp-09]$ oc delete ns test
namespace "test" deleted

The "test" namespace cannot be deleted. LGTM.
[jzhang@dhcp-140-18 ocp-09]$ oc get ns
...
test                                          Terminating   18m

6, Redeploy Service Catalog API server
[jzhang@dhcp-140-18 ocp-09]$ oc get pods -n kube-service-catalog  
NAME                         READY   STATUS    RESTARTS   AGE
apiserver-4zx64              1/1     Running   0          10m
apiserver-ptdwp              1/1     Running   0          10m
apiserver-vpwq8              1/1     Running   0          10m
ups-broker-677dd6cdd-524fl   1/1     Running   0          28m

[jzhang@dhcp-140-18 ocp-09]$ oc get serviceinstance
NAME           CLASS                                       PLAN      STATUS   AGE
ups-instance   ClusterServiceClass/user-provided-service   default   Ready    16m

Edit the ups-instance, delete the two lines:
  finalizers:
  - kubernetes-incubator/service-catalog

The "test" namespace was deleted.
7, Recreate the "test" namespace; no serviceinstance is found. LGTM, marking it verified.
[jzhang@dhcp-140-18 ocp-09]$ oc get serviceinstance -n test
No resources found.

Comment 20 errata-xmlrpc 2019-06-04 10:40:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758