Bug 1684652

Summary: Running "oc delete kubeapiserver.operator cluster" leads to permanent cluster death
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.1.0
Target Release: 4.2.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: low
Status: CLOSED ERRATA
Keywords: Reopened
Reporter: Mo <mkhan>
Assignee: David Eads <deads>
QA Contact: Xingxing Xia <xxia>
CC: aos-bugs, ccoleman, decarr, erich, jokerman, jupierce, mfojtik, mkhan, mmccomas, yinzhou
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-10-16 06:27:41 UTC

Description Mo 2019-03-01 18:49:44 UTC
Description of problem:

It is impossible to recover from deleting the kube api server operator's config.

Version-Release number of selected component (if applicable):
Any 4.0+

How reproducible:
Always

Steps to Reproduce:
1. oc delete kubeapiserver.operator cluster

Actual results:
Cluster death.

Expected results:
Not cluster death.

Additional info:
In general, what is the expected behavior of an operator when its config is deleted?
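
For context, the obvious manual recovery attempt is to re-create the deleted config by hand. A minimal manifest, assuming the operator.openshift.io/v1 KubeAPIServer schema with a default managementState, would look roughly like this; whether re-applying it alone brings the cluster back is exactly what is in question here:

$ oc apply -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  managementState: Managed
EOF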

Comment 4 Eric Rich 2019-04-05 15:00:59 UTC
Why don't we whitelist/blacklist resources like this (in oc) to prompt a user for confirmation? We do this with `oc delete all --all-namespaces`. 

While I understand this is "working as designed" from an API perspective, we need to find a way to put safeguards in place to keep a cluster admin or other user (with sufficient privileges) from crippling a cluster. 

> Resetting flags for re-consideration in 4.1, CEE see this as a Customer Impacting issue.
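
A rough, purely client-side illustration of that idea (the wrapper name and the hard-coded resource list are hypothetical; this is not an existing oc feature):

ocsafe() {
  # Prompt before deleting cluster-critical operator configs; fall through to oc otherwise.
  case "$1 $2" in
    "delete kubeapiserver.operator"|"delete kubeapiservers.operator.openshift.io")
      read -r -p "Deleting this operator config can cripple the cluster. Continue? [y/N] " answer
      [ "$answer" = "y" ] || { echo "Aborted."; return 1; }
      ;;
  esac
  oc "$@"
}

$ ocsafe delete kubeapiserver.operator cluster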

Comment 8 Clayton Coleman 2019-04-08 14:20:33 UTC
Clusters must be recoverable from all scenarios.  You can defer bugs like this but not close them "as designed".

Comment 10 Eric Rich 2019-04-08 21:39:36 UTC
As far as I can tell, it puts you into a state where bad things could happen:

$ oc get events -n openshift-kube-apiserver-operator
> LAST SEEN   TYPE      REASON           OBJECT                               MESSAGE
> 3m34s       Warning   StatusNotFound   deployment/kube-apiserver-operator   Unable to determine current operator status for kube-apiserver

$ oc logs kube-apiserver-operator-586d9c8944-dsxhw -n openshift-kube-apiserver-operator
> I0408 21:17:51.028172       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"d9a9d1e0-5a23-11e9-be53-0af7472a96cc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'StatusNotFound' Unable to determine current operator status for kube-apiserver
> E0408 21:17:51.032291       1 targetconfigcontroller.go:343] key failed with : kubeapiservers.operator.openshift.io "cluster" not found
> E0408 21:17:51.034993       1 revision_controller.go:316] key failed with : kubeapiservers.operator.openshift.io "cluster" not found

I see no alerts or warnings from Alertmanager, etc.

The operator does not even show up in the cluster operator list to tell you that you have an issue (for example, if this pod crashes).

$ oc get clusteroperator
NAME                                 VERSION     AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.0.0-0.9   True        False         False     3h50m
cloud-credential                     4.0.0-0.9   True        False         False     4h1m
cluster-autoscaler                   4.0.0-0.9   True        False         False     4h
console                              4.0.0-0.9   True        False         False     3h50m
dns                                  4.0.0-0.9   True        False         False     4h
image-registry                       4.0.0-0.9   True        False         False     3h54m
ingress                              4.0.0-0.9   True        False         False     3h54m
kube-controller-manager              4.0.0-0.9   True        False         False     3h57m
kube-scheduler                       4.0.0-0.9   True        False         False     3h58m
machine-api                          4.0.0-0.9   True        False         False     4h
machine-config                       4.0.0-0.9   True        False         False     4h
marketplace                          4.0.0-0.9   True        False         False     3h54m
monitoring                           4.0.0-0.9   True        False         False     3h49m
network                              4.0.0-0.9   True        False         False     4h1m
node-tuning                          4.0.0-0.9   True        False         False     3h54m
openshift-apiserver                  4.0.0-0.9   True        False         False     3h56m
openshift-controller-manager         4.0.0-0.9   True        False         False     3h59m
openshift-samples                    4.0.0-0.9   True        False         False     3h53m
operator-lifecycle-manager           4.0.0-0.9   True        False         False     4h1m
operator-lifecycle-manager-catalog   4.0.0-0.9   True        False         False     4h1m
service-ca                           4.0.0-0.9   True        False         False     4h
service-catalog-apiserver            4.0.0-0.9   True        False         False     3h54m
service-catalog-controller-manager   4.0.0-0.9   True        False         False     3h55m
storage                              4.0.0-0.9   True        False         False     3h55m

In short, you don't lose your cluster until you delete the kube-apiserver itself (and I'm not even sure you lose the cluster even then; I never had an oc command fail in my testing of this).

> $ oc delete $(oc get pods -n openshift-kube-apiserver-operator -o name) -n openshift-kube-apiserver-operator

Running the above command triggers the operator to restart and show up again (and begin to correct the incorrect state).
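
A quick way to confirm that recovery, using only commands already shown in this bug (the clusteroperator name "kube-apiserver" is assumed here; it is missing from the list above precisely because status could not be reported):

$ oc get pods -n openshift-kube-apiserver-operator -w   # wait for the replacement operator pod to come up
$ oc get clusteroperator kube-apiserver                 # it should now appear and report status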

Comment 12 David Eads 2019-07-29 12:27:39 UTC
At some point we fixed this.  It took about 20 minutes for some cycle to hit it, but it did auto-recover for me locally.

Comment 13 zhou ying 2019-07-30 05:20:38 UTC
Confirmed with payload 4.2.0-0.nightly-2019-07-28-222114: the cluster can auto-recover:

[root@dhcp-140-138 ~]# oc delete kubeapiserver.operator cluster
kubeapiserver.operator.openshift.io "cluster" deleted

[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-07-28-222114   True        False         26h     Cluster version is 4.2.0-0.nightly-2019-07-28-222114
[root@dhcp-140-138 ~]# oc get clusteroperator
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
cloud-credential                           4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
cluster-autoscaler                         4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
console                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
dns                                        4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
image-registry                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      107m
ingress                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
kube-controller-manager                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
kube-scheduler                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
machine-api                                4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
machine-config                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
marketplace                                4.2.0-0.nightly-2019-07-28-222114   True        False         False      101m
monitoring                                 4.2.0-0.nightly-2019-07-28-222114   True        False         False      98m
network                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
node-tuning                                4.2.0-0.nightly-2019-07-28-222114   True        False         False      101m
openshift-apiserver                        4.2.0-0.nightly-2019-07-28-222114   True        False         False      101m
openshift-controller-manager               4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
openshift-samples                          4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
operator-lifecycle-manager                 4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-07-28-222114   True        False         False      102m
service-ca                                 4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
service-catalog-apiserver                  4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
service-catalog-controller-manager         4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
storage                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
support                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
[root@dhcp-140-138 ~]# oc get kubeapiserver.operator
No resources found.
[root@dhcp-140-138 ~]# oc get kubeapiserver.operator cluster
NAME      AGE
cluster   7s
[root@dhcp-140-138 ~]# oc get kubeapiserver.operator cluster
NAME      AGE
cluster   88m
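
For anyone re-verifying, the same check condensed into a watch (timing will vary; recreation took seconds here but about 20 minutes in comment 12):

$ oc delete kubeapiserver.operator cluster
$ oc get kubeapiserver.operator -w   # watch until the operator recreates the "cluster" object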

Comment 15 errata-xmlrpc 2019-10-16 06:27:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922