Bug 1949711

Summary: cvo unable to reconcile deletion of openshift-monitoring namespace
Product: OpenShift Container Platform
Reporter: Christoph Blecker <cblecker>
Component: Monitoring
Assignee: Arunprasad Rajkumar <arajkuma>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Priority: medium
Version: 4.7
CC: alegrand, anpicker, aos-bugs, dofinn, erooth, jack.ottofaro, jokerman, kakkoyun, lmohanty, pkrupa, sthaha, wgordon, wking, yanyang
Target Milestone: ---
Target Release: 4.8.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2021-07-27 23:00:48 UTC
Type: Bug

Description Christoph Blecker 2021-04-14 21:21:21 UTC
Description of problem:
The CVO is unable to recover from deletion of the openshift-monitoring namespace. The CVO will report that the monitoring clusteroperator is healthy when, in fact, it is not.

How reproducible:
consistent

Steps to Reproduce:
1. `oc delete namespace openshift-monitoring`
2. Watch the CVO logs to observe that it is unable to recover from this condition (see the example command after this list)
3.
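
For reference, one way to follow the CVO logs (assuming the default deployment name cluster-version-operator in the openshift-cluster-version namespace):

oc -n openshift-cluster-version logs -f deployment/cluster-version-operator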

Actual results:
CVO will be unable to recover from this condition for three reasons:
- The aggregated metrics server (apiservices/v1beta1.metrics.k8s.io) will be unreachable, and as such, the namespace will never actually complete deletion.
- The validating webhook (validatingwebhookconfigurations/prometheusrules.openshift.io) will prevent creation of prometheusrules, causing the CVO reconcile loop to hang. This is because the webhook server is offline.
- The cluster operator (clusteroperator/monitoring) will think the namespace has already been created, and will attempt to apply resources without first checking that the namespace exists.
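
For reference, the stuck objects described above can be inspected with the following commands (a sketch; exact output will vary by cluster):

oc get apiservice v1beta1.metrics.k8s.io
oc get validatingwebhookconfiguration prometheusrules.openshift.io
oc get clusteroperator monitoring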

Expected results:
The CVO is able to self-heal from this condition and accurately report the monitoring clusteroperator status.

Additional info:
To workaround this issue, the following command can be used:
oc delete validatingwebhookconfigurations/prometheusrules.openshift.io apiservices/v1beta1.metrics.k8s.io clusteroperator/monitoring

This deletes the configurations that prevent the CVO from restoring the monitoring clusteroperator to its proper status; those objects will be recreated as the monitoring stack redeploys.

Comment 1 Jack Ottofaro 2021-05-07 14:24:21 UTC
Recreated this issue on 4.7 cluster:

1. Deleted namespace openshift-monitoring. Delete does not complete. CVO gets the following error:

task.go:112] error running apply for operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (586 of 669): operatorgroups.operators.coreos.com "openshift-cluster-monitoring" is forbidden: unable to create new content in namespace openshift-monitoring because it is being terminated

and

sync_worker.go:941] Update error 610 of 669: UpdatePayloadClusterError Could not update prometheusrule "openshift-dns-operator/dns" (610 of 669): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": service "prometheus-operator" not found)

2. Removed namespace finalizer which allowed the namespace deletion to complete. Get this error:

task.go:112] error running apply for operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (586 of 669): namespaces "openshift-monitoring" not found

and still get:

sync_worker.go:941] Update error 610 of 669: UpdatePayloadClusterError Could not update prometheusrule "openshift-dns-operator/dns" (610 of 669): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": service "prometheus-operator" not found)

3. Recreating namespace openshift-monitoring was no help.

Comment 2 Lalatendu Mohanty 2021-05-26 20:29:46 UTC
IMO this is not a high-priority bug, as deleting the namespace does not cause workload disruption, and deleting the namespace is not something we suggest customers do.

Comment 3 Jack Ottofaro 2021-06-04 15:39:16 UTC
As stated before, the openshift-monitoring namespace should never be deleted. But if it is, we should be able to recover. The first thing that has to happen for the cluster to recover is that the namespace deletion must be allowed to complete by removing the kube finalizers.
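
For reference, one way to let the namespace deletion complete is to clear the finalizers through the namespace finalize subresource (a sketch; assumes jq is available on the workstation):

oc get namespace openshift-monitoring -o json \
  | jq '.spec.finalizers = []' \
  | oc replace --raw /api/v1/namespaces/openshift-monitoring/finalize -f -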

Once the namespace deletion completes CVO is still unable to recreate the namespace resources and CVO continues to get:

Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": service "prometheus-operator" not found

However, if the prometheusrules.openshift.io webhook's failurePolicy is changed from Fail to Ignore, the CVO is able to recreate the openshift-monitoring resources and the cluster recovers.
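
For reference, a sketch of that manual change (assumes the webhook configuration has a single entry at index 0):

oc patch validatingwebhookconfiguration prometheusrules.openshift.io --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/failurePolicy", "value": "Ignore"}]'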

Reached out to #forum-monitoring (https://coreos.slack.com/archives/C0VMT03S5/p1622819064320400) to let them know that the bug is being transferred to them to investigate making the failurePolicy change.

Comment 4 Simon Pasquier 2021-06-04 15:48:38 UTC
We've already talked about switching the prometheusrules.openshift.io webhook's failurePolicy from Fail to Ignore for other reasons [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1949840#c1

Comment 6 Arunprasad Rajkumar 2021-06-10 07:14:37 UTC
The blog post https://www.openshift.com/blog/the-hidden-dangers-of-terminating-namespaces details the implications of a terminating namespace.

Comment 9 Junqi Zhao 2021-06-11 08:11:58 UTC
Tested with 4.8.0-0.nightly-2021-06-10-224448; the issue is fixed.
# oc get ValidatingWebhookConfiguration prometheusrules.openshift.io  -oyaml | grep failurePolicy
  failurePolicy: Ignore

delete openshift-monitoring project
# oc delete apiservices v1beta1.metrics.k8s.io; oc delete project openshift-monitoring
apiservice.apiregistration.k8s.io "v1beta1.metrics.k8s.io" deleted
project.project.openshift.io "openshift-monitoring" deleted

wait for a while
# oc get ns openshift-monitoring
Error from server (NotFound): namespaces "openshift-monitoring" not found

wait for recovery of openshift-monitoring
# oc get ns openshift-monitoring
NAME                   STATUS   AGE
openshift-monitoring   Active   64s
# oc -n openshift-monitoring get pod
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            5/5     Running   0          56s
alertmanager-main-1                            5/5     Running   0          55s
alertmanager-main-2                            5/5     Running   0          55s
cluster-monitoring-operator-848fbf4664-kgk8d   2/2     Running   1          65s
grafana-694d7988cb-2wb4d                       2/2     Running   0          55s
kube-state-metrics-76c9787585-46p8g            3/3     Running   0          62s
node-exporter-4vcwn                            2/2     Running   0          61s
node-exporter-72qmq                            2/2     Running   0          61s
node-exporter-9rl8w                            2/2     Running   0          61s
node-exporter-b4jzp                            2/2     Running   0          62s
node-exporter-g8bch                            2/2     Running   0          61s
node-exporter-ghrpn                            2/2     Running   0          61s
openshift-state-metrics-5cd4f4fdd5-r5bqb       3/3     Running   0          62s
prometheus-adapter-d565dcc4b-k9pcw             1/1     Running   0          42s
prometheus-adapter-d565dcc4b-lfmcd             1/1     Running   0          42s
prometheus-k8s-0                               7/7     Running   1          53s
prometheus-k8s-1                               7/7     Running   1          53s
prometheus-operator-85d498d687-fkjgh           2/2     Running   0          62s
telemeter-client-579c9fdc96-k5scn              3/3     Running   0          56s
thanos-querier-7ff77445db-7tg57                5/5     Running   0          54s
thanos-querier-7ff77445db-z68qb                5/5     Running   0          54s

Comment 12 errata-xmlrpc 2021-07-27 23:00:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438