Bug 1949711
| Summary: | cvo unable to reconcile deletion of openshift-monitoring namespace | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Christoph Blecker <cblecker> |
| Component: | Monitoring | Assignee: | Arunprasad Rajkumar <arajkuma> |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.7 | CC: | alegrand, anpicker, aos-bugs, dofinn, erooth, jack.ottofaro, jokerman, kakkoyun, lmohanty, pkrupa, sthaha, wgordon, wking, yanyang |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 23:00:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description (Christoph Blecker, 2021-04-14 21:21:21 UTC)
Recreated this issue on a 4.7 cluster:

1. Deleted the openshift-monitoring namespace. The delete does not complete, and the CVO gets the following errors:

```
task.go:112] error running apply for operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (586 of 669): operatorgroups.operators.coreos.com "openshift-cluster-monitoring" is forbidden: unable to create new content in namespace openshift-monitoring because it is being terminated
```

and

```
sync_worker.go:941] Update error 610 of 669: UpdatePayloadClusterError Could not update prometheusrule "openshift-dns-operator/dns" (610 of 669): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": service "prometheus-operator" not found)
```

2. Removed the namespace finalizer, which allowed the namespace deletion to complete. The CVO then gets this error:

```
task.go:112] error running apply for operatorgroup "openshift-monitoring/openshift-cluster-monitoring" (586 of 669): namespaces "openshift-monitoring" not found
```

and still gets:

```
sync_worker.go:941] Update error 610 of 669: UpdatePayloadClusterError Could not update prometheusrule "openshift-dns-operator/dns" (610 of 669): the server is reporting an internal error (*errors.StatusError: Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": service "prometheus-operator" not found)
```

3. Recreating the openshift-monitoring namespace did not help.

IMO this is not a high-priority bug, as deleting the namespace does not cause workload disruptions and deleting the namespace is not something we suggest customers do.

As stated before, the openshift-monitoring namespace should never be deleted. But if it is, we should be able to recover.
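The finalizer removal in step 2 works against the namespace's `finalize` subresource. A minimal sketch of building that request body, assuming an illustrative manifest shape (the real one comes from `oc get ns openshift-monitoring -o json`):

```python
import json


def strip_kubernetes_finalizer(ns_manifest: dict) -> dict:
    """Return a copy of a Namespace manifest with the 'kubernetes'
    finalizer removed from spec.finalizers, so the terminating
    namespace is allowed to finish deleting."""
    ns = json.loads(json.dumps(ns_manifest))  # deep copy via JSON round-trip
    finalizers = ns.get("spec", {}).get("finalizers", [])
    ns.setdefault("spec", {})["finalizers"] = [
        f for f in finalizers if f != "kubernetes"
    ]
    return ns


# Illustrative manifest, not taken from a live cluster:
manifest = {
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {"name": "openshift-monitoring"},
    "spec": {"finalizers": ["kubernetes"]},
    "status": {"phase": "Terminating"},
}

cleaned = strip_kubernetes_finalizer(manifest)
print(cleaned["spec"]["finalizers"])  # -> []
```

In practice the cleaned manifest would be sent back to the API server's `/api/v1/namespaces/openshift-monitoring/finalize` endpoint (e.g. via `oc replace --raw`); the sketch only shows the payload transformation.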
The first thing that has to happen for the cluster to recover is that the namespace deletion must be allowed to complete by removing the kube finalizers. Once the namespace deletion completes, the CVO is still unable to recreate the namespace resources and continues to get:

```
Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": service "prometheus-operator" not found
```

However, if the prometheusrules.openshift.io webhook's failurePolicy is changed from Fail to Ignore, the CVO is able to recreate the openshift-monitoring resources and the cluster recovers.

Reached out to #forum-monitoring (https://coreos.slack.com/archives/C0VMT03S5/p1622819064320400) to let them know that the bug is being transferred to them to investigate making the failurePolicy change. We've already talked about switching the prometheusrules.openshift.io webhook's failurePolicy from Fail to Ignore for other reasons [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1949840#c1

The blog post https://www.openshift.com/blog/the-hidden-dangers-of-terminating-namespaces details the implications of a terminating namespace.
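The failurePolicy flip described above amounts to a small JSON patch against the ValidatingWebhookConfiguration. A sketch of that patch body (the webhook index 0 is an assumption; check the object first, since a configuration can hold several webhooks):

```python
import json

# JSON patch that flips the first webhook's failurePolicy from Fail to
# Ignore. Path index /webhooks/0/ is assumed, not verified against a
# live cluster.
patch = [
    {
        "op": "replace",
        "path": "/webhooks/0/failurePolicy",
        "value": "Ignore",
    }
]

print(json.dumps(patch))
```

A payload like this is what a JSON-type patch (e.g. `oc patch validatingwebhookconfiguration prometheusrules.openshift.io --type=json -p '<patch>'`) would carry; the fix shipped in the product changed the default instead of requiring a manual patch.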
Tested with 4.8.0-0.nightly-2021-06-10-224448; the issue is fixed.

```
# oc get ValidatingWebhookConfiguration prometheusrules.openshift.io -oyaml | grep failurePolicy
  failurePolicy: Ignore
```

Delete the openshift-monitoring project:

```
# oc delete apiservices v1beta1.metrics.k8s.io; oc delete project openshift-monitoring
apiservice.apiregistration.k8s.io "v1beta1.metrics.k8s.io" deleted
project.project.openshift.io "openshift-monitoring" deleted
```

Wait for a while:

```
# oc get ns openshift-monitoring
Error from server (NotFound): namespaces "openshift-monitoring" not found
```

Wait for recovery of openshift-monitoring:

```
# oc get ns openshift-monitoring
NAME                   STATUS   AGE
openshift-monitoring   Active   64s

# oc -n openshift-monitoring get pod
NAME                                           READY   STATUS    RESTARTS   AGE
alertmanager-main-0                            5/5     Running   0          56s
alertmanager-main-1                            5/5     Running   0          55s
alertmanager-main-2                            5/5     Running   0          55s
cluster-monitoring-operator-848fbf4664-kgk8d   2/2     Running   1          65s
grafana-694d7988cb-2wb4d                       2/2     Running   0          55s
kube-state-metrics-76c9787585-46p8g            3/3     Running   0          62s
node-exporter-4vcwn                            2/2     Running   0          61s
node-exporter-72qmq                            2/2     Running   0          61s
node-exporter-9rl8w                            2/2     Running   0          61s
node-exporter-b4jzp                            2/2     Running   0          62s
node-exporter-g8bch                            2/2     Running   0          61s
node-exporter-ghrpn                            2/2     Running   0          61s
openshift-state-metrics-5cd4f4fdd5-r5bqb       3/3     Running   0          62s
prometheus-adapter-d565dcc4b-k9pcw             1/1     Running   0          42s
prometheus-adapter-d565dcc4b-lfmcd             1/1     Running   0          42s
prometheus-k8s-0                               7/7     Running   1          53s
prometheus-k8s-1                               7/7     Running   1          53s
prometheus-operator-85d498d687-fkjgh           2/2     Running   0          62s
telemeter-client-579c9fdc96-k5scn              3/3     Running   0          56s
thanos-querier-7ff77445db-7tg57                5/5     Running   0          54s
thanos-querier-7ff77445db-z68qb                5/5     Running   0          54s
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.
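The two "wait" steps in the verification above amount to polling until the namespace disappears and then reappears as Active. A generic retry helper, with the actual cluster check stubbed out as an assumption (in practice it would shell out to `oc get ns openshift-monitoring`):

```python
import time


def wait_until(check, timeout=300.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns True or timeout seconds elapse.

    clock and sleep are injectable so the helper can be exercised
    without real waiting.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if check():
            return True
        sleep(interval)
    return False


# Stubbed check: succeeds on the third poll, standing in for the
# namespace eventually reporting STATUS Active.
states = iter([False, False, True])
print(wait_until(lambda: next(states), timeout=10, interval=0))  # -> True
```

The same helper covers both waits: first with a check for NotFound, then with a check for Active.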
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438