Bug 1684652 - Running "oc delete kubeapiserver.operator cluster" leads to permanent cluster death
Summary: Running "oc delete kubeapiserver.operator cluster" leads to permanent cluster death
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: David Eads
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-03-01 18:49 UTC by Mo
Modified: 2019-10-16 06:27 UTC (History)
CC: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:27:41 UTC
Target Upstream Version:
Embargoed:




Links:
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:27:49 UTC)

Description Mo 2019-03-01 18:49:44 UTC
Description of problem:

It is impossible to recover from deleting the kube-apiserver operator's config.

Version-Release number of selected component (if applicable):
Any 4.0+

How reproducible:
Always

Steps to Reproduce:
1. oc delete kubeapiserver.operator cluster

Actual results:
Cluster death.

Expected results:
Not cluster death.

Additional info:
In general, what is the expected behavior of an operator when its config is deleted?
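
For reference, the object being deleted here is the cluster-scoped KubeAPIServer resource in operator.openshift.io/v1. Below is a minimal sketch of re-creating it by hand; whether spec.managementState: Managed alone is enough for the operator to repopulate the rest of the spec is an assumption, not something verified in this report.

$ cat <<EOF | oc apply -f -
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  managementState: Managed
EOF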

Comment 4 Eric Rich 2019-04-05 15:00:59 UTC
Why don't we whitelist/blacklist resources like this in oc and prompt the user for confirmation? We already do this with `oc delete all --all-namespaces`. 

While I understand this is "working as designed" from an API perspective, we need to find a way to put safeguards in place to keep a cluster admin or other user (with sufficient privileges) from crippling a cluster. 

> Resetting flags for reconsideration in 4.1; CEE sees this as a customer-impacting issue.
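
Until something like that exists in oc itself, a purely local shell wrapper can approximate the confirmation prompt. This is only a sketch: the oc-safe-delete function and its protected-kind list below are hypothetical and not part of oc.

# Hypothetical local wrapper (not an oc feature): prompt before deleting cluster-critical configs.
oc-safe-delete() {
  case "$1" in
    kubeapiserver.operator|kubecontrollermanager.operator|kubescheduler.operator|clusterversion)
      read -r -p "Deleting '$*' can cripple the cluster. Type 'yes' to continue: " answer
      [ "$answer" = "yes" ] || { echo "Aborted."; return 1; }
      ;;
  esac
  oc delete "$@"
}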

Comment 8 Clayton Coleman 2019-04-08 14:20:33 UTC
Clusters must be recoverable from all scenarios.  You can defer bugs like this but not close them "as designed".

Comment 10 Eric Rich 2019-04-08 21:39:36 UTC
As far as I can tell, it puts you into a state where bad things could happen. 

$ oc get events -n openshift-kube-apiserver-operator
> LAST SEEN   TYPE      REASON           OBJECT                               MESSAGE
> 3m34s       Warning   StatusNotFound   deployment/kube-apiserver-operator   Unable to determine current operator status for kube-apiserver

# oc logs kube-apiserver-operator-586d9c8944-dsxhw -n openshift-kube-apiserver-operator
>I0408 21:17:51.028172       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"d9a9d1e0-5a23-11e9-be53-0af7472a96cc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'StatusNotFound' Unable to determine current operator status for kube-apiserver
> E0408 21:17:51.032291       1 targetconfigcontroller.go:343] key failed with : kubeapiservers.operator.openshift.io "cluster" not found
> E0408 21:17:51.034993       1 revision_controller.go:316] key failed with : kubeapiservers.operator.openshift.io "cluster" not found

I see no alerts or warnings from Alertmanager, etc. 

The operator does not even show up in the cluster operator list to tell you that you have an issue (for example, if this pod crashes). 

$ oc get clusteroperator
NAME                                 VERSION     AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.0.0-0.9   True        False         False     3h50m
cloud-credential                     4.0.0-0.9   True        False         False     4h1m
cluster-autoscaler                   4.0.0-0.9   True        False         False     4h
console                              4.0.0-0.9   True        False         False     3h50m
dns                                  4.0.0-0.9   True        False         False     4h
image-registry                       4.0.0-0.9   True        False         False     3h54m
ingress                              4.0.0-0.9   True        False         False     3h54m
kube-controller-manager              4.0.0-0.9   True        False         False     3h57m
kube-scheduler                       4.0.0-0.9   True        False         False     3h58m
machine-api                          4.0.0-0.9   True        False         False     4h
machine-config                       4.0.0-0.9   True        False         False     4h
marketplace                          4.0.0-0.9   True        False         False     3h54m
monitoring                           4.0.0-0.9   True        False         False     3h49m
network                              4.0.0-0.9   True        False         False     4h1m
node-tuning                          4.0.0-0.9   True        False         False     3h54m
openshift-apiserver                  4.0.0-0.9   True        False         False     3h56m
openshift-controller-manager         4.0.0-0.9   True        False         False     3h59m
openshift-samples                    4.0.0-0.9   True        False         False     3h53m
operator-lifecycle-manager           4.0.0-0.9   True        False         False     4h1m
operator-lifecycle-manager-catalog   4.0.0-0.9   True        False         False     4h1m
service-ca                           4.0.0-0.9   True        False         False     4h
service-catalog-apiserver            4.0.0-0.9   True        False         False     3h54m
service-catalog-controller-manager   4.0.0-0.9   True        False         False     3h55m
storage                              4.0.0-0.9   True        False         False     3h55m

In short, you don't lose your cluster until you delete the kube-apiserver itself (I'm not even sure you lose the cluster; I never had an oc command fail in my testing of this). 

> $ oc delete $(oc get pods -n openshift-kube-apiserver-operator -o name) -n openshift-kube-apiserver-operator

Running the above command triggers the operator to restart and show up again (and to begin correcting the incorrect state).
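
For completeness, a few read-only checks (standard oc commands only) that can be used after forcing that restart to see whether the operator has picked its config back up:

$ oc get pods -n openshift-kube-apiserver-operator
$ oc get kubeapiserver.operator cluster
$ oc get events -n openshift-kube-apiserver-operator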

Comment 12 David Eads 2019-07-29 12:27:39 UTC
At some point we fixed this.  It took about 20 minutes for some cycle to hit it, but it did auto-recover for me locally.

Comment 13 zhou ying 2019-07-30 05:20:38 UTC
Confirmed with payload 4.2.0-0.nightly-2019-07-28-222114: the cluster can auto-recover:

[root@dhcp-140-138 ~]# oc delete kubeapiserver.operator cluster
kubeapiserver.operator.openshift.io "cluster" deleted

[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-07-28-222114   True        False         26h     Cluster version is 4.2.0-0.nightly-2019-07-28-222114
[root@dhcp-140-138 ~]# oc get clusteroperator
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
cloud-credential                           4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
cluster-autoscaler                         4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
console                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
dns                                        4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
image-registry                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      107m
ingress                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
kube-controller-manager                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
kube-scheduler                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
machine-api                                4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
machine-config                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
marketplace                                4.2.0-0.nightly-2019-07-28-222114   True        False         False      101m
monitoring                                 4.2.0-0.nightly-2019-07-28-222114   True        False         False      98m
network                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
node-tuning                                4.2.0-0.nightly-2019-07-28-222114   True        False         False      101m
openshift-apiserver                        4.2.0-0.nightly-2019-07-28-222114   True        False         False      101m
openshift-controller-manager               4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
openshift-samples                          4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
operator-lifecycle-manager                 4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-07-28-222114   True        False         False      102m
service-ca                                 4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
service-catalog-apiserver                  4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
service-catalog-controller-manager         4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
storage                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
support                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
[root@dhcp-140-138 ~]# oc get kubeapiserver.operator
No resources found.
[root@dhcp-140-138 ~]# oc get kubeapiserver.operator cluster
NAME      AGE
cluster   7s
[root@dhcp-140-138 ~]# oc get kubeapiserver.operator cluster
NAME      AGE
cluster   88m

Comment 15 errata-xmlrpc 2019-10-16 06:27:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

