Description of problem:
The CVO is unable to repair the deletion of the openshift-monitoring namespace. The CVO will report that the monitoring clusteroperator is healthy when, in fact, it is not.

How reproducible:
Consistent

Steps to Reproduce:
1. `oc delete namespace openshift-monitoring`
2. Watch the CVO logs to observe that it is unable to recover from this condition.

Actual results:
The CVO is unable to recover from this condition for three reasons:
- The aggregated metrics server (apiservices/v1beta1.metrics.k8s.io) is unreachable, so the namespace never actually completes deletion.
- The validating webhook (validatingwebhookconfigurations/prometheusrules.openshift.io) prevents creation of prometheusrules, causing the CVO reconcile loop to hang. This is because the webhook server is offline.
- The cluster operator (clusteroperator/monitoring) thinks the namespace has already been created, so it attempts to apply resources without first checking that the namespace actually exists.

Expected results:
The CVO is able to self-heal from this and to accurately observe the monitoring clusteroperator status.

Additional info:
To work around this issue, the following command can be used:

oc delete validatingwebhookconfigurations/prometheusrules.openshift.io apiservices/v1beta1.metrics.k8s.io clusteroperator/monitoring

This deletes the objects that prevent the CVO from restoring the monitoring clusteroperator to its proper status; they will be recreated as the monitoring stack redeploys.
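A quick way to confirm the stuck state before applying the workaround (a diagnostic sketch of my own, assuming cluster-admin access; the jsonpath queries are illustrative):

oc get namespace openshift-monitoring -o jsonpath='{.status.phase}{"\n"}'
oc get apiservice v1beta1.metrics.k8s.io -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
oc get validatingwebhookconfiguration prometheusrules.openshift.io

A Terminating namespace together with Available=False on the APIService matches the failure mode described above.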
Recreated this issue on a 4.7 cluster:

1. Deleted namespace openshift-monitoring. The delete does not complete, and the CVO gets the following errors:

task.go:112] error running apply for operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (586 of 669): operatorgroups.operators.coreos.com "openshift-cluster-monitoring" is forbidden: unable to create new content in namespace openshift-monitoring because it is being terminated

sync_worker.go:941] Update error 610 of 669: UpdatePayloadClusterError Could not update prometheusrule "openshift-dns-operator/dns" (610 of 669): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": service "prometheus-operator" not found)

2. Removed the namespace finalizer, which allowed the namespace deletion to complete. The CVO then gets this error:

task.go:112] error running apply for operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (586 of 669): namespaces "openshift-monitoring" not found

and still gets:

sync_worker.go:941] Update error 610 of 669: UpdatePayloadClusterError Could not update prometheusrule "openshift-dns-operator/dns" (610 of 669): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": service "prometheus-operator" not found)

3. Recreating namespace openshift-monitoring did not help.
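For reference, the finalizer removal in step 2 can be done through the namespace finalize subresource. A minimal sketch (assumes jq is available; this illustrates the approach rather than the exact command that was run):

oc get namespace openshift-monitoring -o json \
  | jq '.spec.finalizers = []' \
  | oc replace --raw /api/v1/namespaces/openshift-monitoring/finalize -f -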
IMO this is not a high-priority bug, as deleting the namespace does not cause workload disruptions, and deleting the namespace is not something we suggest customers do.
As stated before, the openshift-monitoring namespace should never be deleted. But if it is, we should be able to recover. The first thing that has to happen for the cluster to recover is that the namespace deletion must be allowed to complete by removing the kube finalizers. Once the namespace deletion completes, the CVO is still unable to recreate the namespace resources and continues to get:

Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": service "prometheus-operator" not found

However, if the prometheusrules.openshift.io webhook's failurePolicy is changed from Fail to Ignore (see the sketch below), the CVO is able to recreate the openshift-monitoring resources and the cluster recovers. Reached out to #forum-monitoring (https://coreos.slack.com/archives/C0VMT03S5/p1622819064320400) to let them know that the bug is being transferred to them to investigate making the failurePolicy change.
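For clarity, the failurePolicy change can be applied with a JSON patch like the following (a sketch; it assumes the prometheusrules.openshift.io configuration has a single webhook entry at index 0):

oc patch validatingwebhookconfiguration prometheusrules.openshift.io \
  --type=json \
  -p '[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'

Note that if the webhook configuration is reconciled by the operator, a manual patch may be reverted, which is why the proper fix is to ship the webhook with failurePolicy: Ignore.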
We've already talked about switching the prometheusrules.openshift.io webhook's failurePolicy from Fail to Ignore for other reasons [1]. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1949840#c1
The blog post https://www.openshift.com/blog/the-hidden-dangers-of-terminating-namespaces details the implications of terminating namespaces.
Tested with 4.8.0-0.nightly-2021-06-10-224448; the issue is fixed.

# oc get ValidatingWebhookConfiguration prometheusrules.openshift.io -oyaml | grep failurePolicy
  failurePolicy: Ignore

Delete the openshift-monitoring project:

# oc delete apiservices v1beta1.metrics.k8s.io; oc delete project openshift-monitoring
apiservice.apiregistration.k8s.io "v1beta1.metrics.k8s.io" deleted
project.project.openshift.io "openshift-monitoring" deleted

Wait for a while:

# oc get ns openshift-monitoring
Error from server (NotFound): namespaces "openshift-monitoring" not found

Wait for recovery of openshift-monitoring:

# oc get ns openshift-monitoring
NAME                   STATUS   AGE
openshift-monitoring   Active   64s

# oc -n openshift-monitoring get pod
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            5/5     Running   0          56s
alertmanager-main-1                            5/5     Running   0          55s
alertmanager-main-2                            5/5     Running   0          55s
cluster-monitoring-operator-848fbf4664-kgk8d   2/2     Running   1          65s
grafana-694d7988cb-2wb4d                       2/2     Running   0          55s
kube-state-metrics-76c9787585-46p8g            3/3     Running   0          62s
node-exporter-4vcwn                            2/2     Running   0          61s
node-exporter-72qmq                            2/2     Running   0          61s
node-exporter-9rl8w                            2/2     Running   0          61s
node-exporter-b4jzp                            2/2     Running   0          62s
node-exporter-g8bch                            2/2     Running   0          61s
node-exporter-ghrpn                            2/2     Running   0          61s
openshift-state-metrics-5cd4f4fdd5-r5bqb       3/3     Running   0          62s
prometheus-adapter-d565dcc4b-k9pcw             1/1     Running   0          42s
prometheus-adapter-d565dcc4b-lfmcd             1/1     Running   0          42s
prometheus-k8s-0                               7/7     Running   1          53s
prometheus-k8s-1                               7/7     Running   1          53s
prometheus-operator-85d498d687-fkjgh           2/2     Running   0          62s
telemeter-client-579c9fdc96-k5scn              3/3     Running   0          56s
thanos-querier-7ff77445db-7tg57                5/5     Running   0          54s
thanos-querier-7ff77445db-z68qb                5/5     Running   0          54s
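As an additional check beyond the output above, the monitoring ClusterOperator status can be confirmed once the pods are back:

# oc get clusteroperator monitoring

It should report AVAILABLE=True and DEGRADED=False once recovery completes.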
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438