Description of problem:
It is impossible to recover from deleting the kube-apiserver operator's config.

Version-Release number of selected component (if applicable):
Any 4.0+

How reproducible:
Always

Steps to Reproduce:
1. oc delete kubeapiserver.operator cluster

Actual results:
Cluster death.

Expected results:
Not cluster death.

Additional info:
In general, what is the expected behavior of an operator when its config is deleted?
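For reference, the object being deleted here is the operator's cluster-scoped config CR. Recreating it by hand would look roughly like the sketch below (assuming the KubeAPIServer kind in the operator.openshift.io/v1 group and a Managed management state; the exact spec the operator expects is an assumption on my part and may not be enough to recover the cluster):

$ oc apply -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: KubeAPIServer
metadata:
  name: cluster
spec:
  managementState: Managed
EOF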
Why don't we whitelist/blacklist resources like this in oc and prompt the user for confirmation? We already do this with `oc delete all --all-namespaces`. While I understand this is "working as designed" from an API perspective, we need to find a way to put safeguards in place to keep a cluster admin or other user (with sufficient privileges) from crippling a cluster.

> Resetting flags for re-consideration in 4.1; CEE sees this as a Customer Impacting issue.
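To illustrate the kind of guard being asked for (this is not an existing oc feature; the wrapper name and the list of protected resources are hypothetical), a client-side shell wrapper could prompt before deleting a cluster-critical operator config:

# Hypothetical wrapper, not an oc feature: prompt before deleting
# cluster-critical operator configs such as kubeapiserver.operator.
ocdelete() {
  case "$*" in
    *kubeapiserver.operator*)
      printf 'This deletes a cluster-critical operator config. Continue? [y/N] '
      read -r answer
      [ "$answer" = "y" ] || return 1
      ;;
  esac
  oc delete "$@"
}

# Usage: ocdelete kubeapiserver.operator cluster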
Clusters must be recoverable from all scenarios. You can defer bugs like this but not close them "as designed".
As far as I can tell, it puts you into a state where bad things could happen:

$ oc get events -n openshift-kube-apiserver-operator
> LAST SEEN   TYPE      REASON           OBJECT                               MESSAGE
> 3m34s       Warning   StatusNotFound   deployment/kube-apiserver-operator   Unable to determine current operator status for kube-apiserver

$ oc logs kube-apiserver-operator-586d9c8944-dsxhw -n openshift-kube-apiserver-operator
> I0408 21:17:51.028172 1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"d9a9d1e0-5a23-11e9-be53-0af7472a96cc", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'StatusNotFound' Unable to determine current operator status for kube-apiserver
> E0408 21:17:51.032291 1 targetconfigcontroller.go:343] key failed with : kubeapiservers.operator.openshift.io "cluster" not found
> E0408 21:17:51.034993 1 revision_controller.go:316] key failed with : kubeapiservers.operator.openshift.io "cluster" not found

I see no alerts or warnings from Alertmanager, etc. The operator does not even show up in the clusteroperator list to tell you that, if this pod crashes, you have an issue.

$ oc get clusteroperator
NAME                                 VERSION     AVAILABLE   PROGRESSING   FAILING   SINCE
authentication                       4.0.0-0.9   True        False         False     3h50m
cloud-credential                     4.0.0-0.9   True        False         False     4h1m
cluster-autoscaler                   4.0.0-0.9   True        False         False     4h
console                              4.0.0-0.9   True        False         False     3h50m
dns                                  4.0.0-0.9   True        False         False     4h
image-registry                       4.0.0-0.9   True        False         False     3h54m
ingress                              4.0.0-0.9   True        False         False     3h54m
kube-controller-manager              4.0.0-0.9   True        False         False     3h57m
kube-scheduler                       4.0.0-0.9   True        False         False     3h58m
machine-api                          4.0.0-0.9   True        False         False     4h
machine-config                       4.0.0-0.9   True        False         False     4h
marketplace                          4.0.0-0.9   True        False         False     3h54m
monitoring                           4.0.0-0.9   True        False         False     3h49m
network                              4.0.0-0.9   True        False         False     4h1m
node-tuning                          4.0.0-0.9   True        False         False     3h54m
openshift-apiserver                  4.0.0-0.9   True        False         False     3h56m
openshift-controller-manager         4.0.0-0.9   True        False         False     3h59m
openshift-samples                    4.0.0-0.9   True        False         False     3h53m
operator-lifecycle-manager           4.0.0-0.9   True        False         False     4h1m
operator-lifecycle-manager-catalog   4.0.0-0.9   True        False         False     4h1m
service-ca                           4.0.0-0.9   True        False         False     4h
service-catalog-apiserver            4.0.0-0.9   True        False         False     3h54m
service-catalog-controller-manager   4.0.0-0.9   True        False         False     3h55m
storage                              4.0.0-0.9   True        False         False     3h55m

In short, you don't lose your cluster until you delete the kube-apiserver itself. (I'm not even sure you lose the cluster; I never had an oc command fail in my testing of this.)

$ oc delete $(oc get pods -n openshift-kube-apiserver-operator -o name) -n openshift-kube-apiserver-operator

Running the above command triggers the operator to restart, show up in the clusteroperator list, and begin to correct the incorrect state.
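After bouncing the operator pods as above, a rough way to confirm recovery (a sketch; the resync can take a while):

$ oc get kubeapiserver.operator -w                                   # watch for the "cluster" config object to be re-created
$ oc get clusteroperator kube-apiserver                              # the operator should reappear in the list and report status
$ oc get events -n openshift-kube-apiserver-operator --sort-by=.lastTimestamp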
At some point we fixed this. It took about 20 minutes for some cycle to hit it, but it did auto-recover for me locally.
Confirmed with payload 4.2.0-0.nightly-2019-07-28-222114; the cluster can auto-recover:

[root@dhcp-140-138 ~]# oc delete kubeapiserver.operator cluster
kubeapiserver.operator.openshift.io "cluster" deleted

[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-07-28-222114   True        False         26h     Cluster version is 4.2.0-0.nightly-2019-07-28-222114

[root@dhcp-140-138 ~]# oc get clusteroperator
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
cloud-credential                           4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
cluster-autoscaler                         4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
console                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
dns                                        4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
image-registry                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      107m
ingress                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
kube-controller-manager                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
kube-scheduler                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
machine-api                                4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
machine-config                             4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
marketplace                                4.2.0-0.nightly-2019-07-28-222114   True        False         False      101m
monitoring                                 4.2.0-0.nightly-2019-07-28-222114   True        False         False      98m
network                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
node-tuning                                4.2.0-0.nightly-2019-07-28-222114   True        False         False      101m
openshift-apiserver                        4.2.0-0.nightly-2019-07-28-222114   True        False         False      101m
openshift-controller-manager               4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
openshift-samples                          4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
operator-lifecycle-manager                 4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-07-28-222114   True        False         False      102m
service-ca                                 4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
service-catalog-apiserver                  4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
service-catalog-controller-manager         4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
storage                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h
support                                    4.2.0-0.nightly-2019-07-28-222114   True        False         False      26h

[root@dhcp-140-138 ~]# oc get kubeapiserver.operator
No resources found.
[root@dhcp-140-138 ~]# oc get kubeapiserver.operator cluster
NAME      AGE
cluster   7s
[root@dhcp-140-138 ~]# oc get kubeapiserver.operator cluster
NAME      AGE
cluster   88m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922