Bug 1746174 - Projects do not terminate on 4.1.x (including 4.1.11)
Summary: Projects do not terminate on 4.1.x (including 4.1.11)
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Service Catalog
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jesus M. Rodriguez
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-27 20:34 UTC by Wolfgang Kulhanek
Modified: 2019-09-10 14:53 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: The apiserver would not get deleted by service catalog, which caused deletions to hang, so a last-minute change was made to service catalog to prevent the hang by not having it delete itself.
Consequence: On affected clusters, service catalog leaves ServiceInstance objects behind with the kubernetes-incubator/service-catalog finalizer and does not do its job of deleting them, so the namespace controller cannot finish namespace deletion and projects stay in Terminating.
Workaround (if any): Manually remove the finalizers from the ServiceBindings and ServiceInstances in the affected namespaces and delete those objects, then remove the APIService for service catalog (see the sketch below).
Result: With the workaround applied, projects can be deleted again without service catalog blocking them.
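A minimal sketch of that workaround, assuming a default service catalog install (the resource names and the v1beta1.servicecatalog.k8s.io APIService name are assumptions; replace <stuck-namespace> with the namespace that will not terminate):

# Clear the service catalog finalizer and remove the leftover objects,
# bindings first, then instances.
for kind in servicebindings serviceinstances; do
  for obj in $(oc get "$kind" -n <stuck-namespace> -o name); do
    oc patch "$obj" -n <stuck-namespace> --type merge -p '{"metadata":{"finalizers":null}}'
    oc delete "$obj" -n <stuck-namespace> --ignore-not-found
  done
done
# Remove the aggregated API registration for service catalog.
oc delete apiservice v1beta1.servicecatalog.k8s.io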
Clone Of:
Environment:
Last Closed: 2019-09-06 20:48:06 UTC
Target Upstream Version:
Embargoed:



Description Wolfgang Kulhanek 2019-08-27 20:34:32 UTC
Description of problem:
Projects that have been deleted stay in "Terminating" forever.

Version-Release number of selected component (if applicable):
4.1.11


How reproducible:
Randomly at first. Once it starts happening, it reproduces 100% of the time.

Steps to Reproduce:
1. Create and delete projects
2. At some point projects will refuse to terminate
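A rough reproduction/detection sketch (project names are just examples; the point at which deletions start hanging is random):

for i in $(seq 1 20); do
  oc new-project "term-test-$i"
  oc delete project "term-test-$i"
done
# Namespaces that never finish deleting show up as Terminating:
oc get namespaces | grep Terminating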

Actual results:
Deleted projects stay in "Terminating" indefinitely and are never removed.

Expected results:
Deleted projects terminate and are removed.

Additional info:

Had a conversation with David Eads about this topic. He walked me through some debugging.

oc get apiservices shows all API services as Available.
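For reference, a hedged way to run that check (cluster-admin assumed):

# List aggregated API registrations and their Available condition;
# anything reporting False is the broken one.
oc get apiservices
oc get apiservices | grep False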

However, reading the kube-controller-manager pod logs shows a few errors. Initially there are a lot of these:

kube-controller-manager-ip-10-0-144-91.us-west-2.compute.internal kube-controller-manager-12 E0827 17:57:55.771687       1 namespace_controller.go:148] unable to retrieve the complete list of server APIs: apps.openshift.io/v1: the server is currently unable to handle the request, authorization.openshift.io/v1: the server is currently unable to handle the request, build.openshift.io/v1: the server is currently unable to handle the request, image.openshift.io/v1: the server is currently unable to handle the request, mutators.kubedb.com/v1alpha1: the server is currently unable to handle the request, oauth.openshift.io/v1: the server is currently unable to handle the request, packages.operators.coreos.com/v1: the server is currently unable to handle the request, project.openshift.io/v1: the server is currently unable to handle the request, quota.openshift.io/v1: the server is currently unable to handle the request, route.openshift.io/v1: the server is currently unable to handle the request, security.openshift.io/v1: the server is currently unable to handle the request, servicecatalog.k8s.io/v1beta1: the server is currently unable to handle the request, template.openshift.io/v1: the server is currently unable to handle the request, user.openshift.io/v1: the server is currently unable to handle the request, validators.kubedb.com/v1alpha1: the server is currently unable to handle the request


After a while the error changes to:

kube-controller-manager-ip-10-0-144-91.us-west-2.compute.internal kube-controller-manager-12 E0827 17:59:24.537248       1 namespace_controller.go:148] unable to retrieve the complete list of server APIs: mutators.kubedb.com/v1alpha1: the server could not find the requested resource, packages.operators.coreos.com/v1: the server is currently unable to handle the request, validators.kubedb.com/v1alpha1: the server could not find the requested resource
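For reference, a hedged way to pull those namespace controller errors from the static pods (the pod name is the one from the log above; --all-containers avoids having to guess the container name):

oc get pods -n openshift-kube-controller-manager
oc logs -n openshift-kube-controller-manager \
  kube-controller-manager-ip-10-0-144-91.us-west-2.compute.internal \
  --all-containers | grep namespace_controller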


Also commit 3193c39722126914c05a6f6d22c1eb3c04a2b9d6 should have produced an annotation outlining the deletion failure in the status of the namespace:

namespace-controller.kcm.openshift.io/deletion-error

This annotation is missing even in 4.1.11.
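A quick way to check for that annotation on a stuck namespace (the namespace name is a placeholder):

oc get namespace <stuck-namespace> -o yaml | grep deletion-error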

Comment 1 Stefan Schimanski 2019-08-27 21:08:56 UTC
We need must-gather output to analyse this (https://docs.openshift.com/container-platform/4.1/cli_reference/administrator-cli-commands.html#must-gather).

Comment 2 Wolfgang Kulhanek 2019-08-28 12:33:43 UTC
Uploaded must-gather output from one of our worst clusters to https://drive.google.com/open?id=1xTguCi9pHZ6IkWqhvq_NybPikdyHAIkq

Also shared with Stefan and David directly (just in case)

Comment 3 Stefan Schimanski 2019-08-28 15:25:36 UTC
Analyzed the two clusters:

- test cluster: kubedb is installed, which provides mutating and validating admission webhooks served through an aggregated API server. This API server does not serve /apis/{mutators,validators}.kubedb.com/v1alpha1. Hence, the namespace controller inside kube-controller-manager falls over and stops deleting namespaces.
- prod cluster: service catalog leaves serviceinstance objects with the kubernetes-incubator/service-catalog finalizer and does not do its job of deleting them. Hence, the namespace controller cannot finish namespace deletion.
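Hedged checks matching the two findings above (<stuck-namespace> is a placeholder):

# test cluster: does the kubedb aggregated API actually serve its groups?
oc get --raw /apis/mutators.kubedb.com/v1alpha1
oc get --raw /apis/validators.kubedb.com/v1alpha1
# prod cluster: which service catalog objects still carry the finalizer?
oc get serviceinstances,servicebindings -n <stuck-namespace> \
  -o custom-columns=KIND:.kind,NAME:.metadata.name,FINALIZERS:.metadata.finalizers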

Comment 4 Stefan Schimanski 2019-08-28 15:26:25 UTC
From the service catalog controller manager:

0828 15:16:00.829683       1 event.go:221] Event(v1.ObjectReference{Kind:"ServiceInstance", Namespace:"f025-demo-templates", Name:"f025-nodejs-mongodb-demo", UID:"03af3c2a-c8e2-11e9-9770-0a580a800134", APIVersion:"servicecatalog.k8s.io/v1beta1", ResourceVersion:"64034520", FieldPath:""}): type: 'Warning' reason: 'DeprovisionBlockedByExistingCredentials' All associated ServiceBindings must be removed before this ServiceInstance can be deleted

