Bug 1899459
| Field | Value |
|---|---|
| Summary: | Failed to start monitoring pods once the operator is removed from the CVO override list |
| Product: | OpenShift Container Platform |
| Component: | crc |
| Reporter: | Praveen Kumar <prkumar> |
| Assignee: | Praveen Kumar <prkumar> |
| Status: | CLOSED ERRATA |
| QA Contact: | Tomáš Sedmík <tsedmik> |
| Docs Contact: | Kevin Owen <kowen> |
| Severity: | low |
| Priority: | low |
| Version: | 4.6.z |
| Target Release: | 4.7.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Type: | Bug |
| CC: | alegrand, anpicker, aos-bugs, bbrownin, cfergeau, erooth, gbraad, jokerman, kakkoyun, kowen, lcosic, pkrupa, surbania, tsedmik, veillard, wking, yanyang |
| Last Closed: | 2021-02-24 15:34:22 UTC |
Description: Praveen Kumar, 2020-11-19 10:01:30 UTC
After the monitoring stack is removed from the CVO override list, are the following objects created?

- the openshift-monitoring namespace
- the openshift-user-workload-monitoring namespace
- the cluster-monitoring-operator Deployment in the openshift-monitoring namespace

@Pawel, the following is what we have in the monitoring namespaces (I also attached the must-gather logs):

```
$ oc get ns | grep -i monitor
openshift-monitoring                 Active   5d23h
openshift-user-workload-monitoring   Active   5d23h

$ oc get all -n openshift-monitoring
NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main             ClusterIP   172.25.107.41    <none>        9094/TCP,9092/TCP            5d23h
service/alertmanager-operated         ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   5d23h
service/cluster-monitoring-operator   ClusterIP   None             <none>        8443/TCP                     5d23h
service/grafana                       ClusterIP   172.25.83.98     <none>        3000/TCP                     5d23h
service/kube-state-metrics            ClusterIP   None             <none>        8443/TCP,9443/TCP            5d23h
service/node-exporter                 ClusterIP   None             <none>        9100/TCP                     5d23h
service/openshift-state-metrics       ClusterIP   None             <none>        8443/TCP,9443/TCP            5d23h
service/prometheus-adapter            ClusterIP   172.25.9.169     <none>        443/TCP                      5d23h
service/prometheus-k8s                ClusterIP   172.25.165.157   <none>        9091/TCP,9092/TCP            5d23h
service/prometheus-operated           ClusterIP   None             <none>        9090/TCP,10901/TCP           5d23h
service/prometheus-operator           ClusterIP   None             <none>        8443/TCP,8080/TCP            5d23h
service/telemeter-client              ClusterIP   None             <none>        8443/TCP                     5d23h
service/thanos-querier                ClusterIP   172.25.47.12     <none>        9091/TCP,9092/TCP,9093/TCP   5d23h

NAME                                         HOST/PORT                                                 PATH   SERVICES            PORT    TERMINATION          WILDCARD
route.route.openshift.io/alertmanager-main   alertmanager-main-openshift-monitoring.apps-crc.testing          alertmanager-main   web     reencrypt/Redirect   None
route.route.openshift.io/grafana             grafana-openshift-monitoring.apps-crc.testing                    grafana             https   reencrypt/Redirect   None
route.route.openshift.io/prometheus-k8s      prometheus-k8s-openshift-monitoring.apps-crc.testing             prometheus-k8s      web     reencrypt/Redirect   None
route.route.openshift.io/thanos-querier      thanos-querier-openshift-monitoring.apps-crc.testing             thanos-querier      web     reencrypt/Redirect   None
```

It looks like the CVO did not create a Deployment for cluster-monitoring-operator, so this appears to be a bug in the CVO. Reassigning to the CVO team for further investigation.

An observation from the CVO pod logs: when we make the change to the override list (removing monitoring from it), the following error appears in the pod log:

```
$ oc logs cluster-version-operator-7f8f59786d-b8pbz -n openshift-cluster-version | grep ^E1
[...]
E1119 14:44:37.767574 1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:45:01.509417 1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
[... the same error repeats at 14:45:19, 14:45:37, 14:45:53, 14:46:12, 14:46:36, 14:46:54, 14:47:17, 14:47:39, and 14:47:53 ...]
```

CRC will need to ensure that the admission webhook for PrometheusRule does not exist while the prometheus-operator pod is not deployed, and that it does exist once the prometheus-operator pod is deployed. Otherwise you run into the issue seen here: the prometheus-operator is disabled but the admission webhook is not, and any admission request that attempts to create a PrometheusRule instance fails.

Tested with the generated bundle; marking it verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633
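For context, "removing the operator from the CVO override list" refers to editing the `spec.overrides` stanza of the ClusterVersion resource. A minimal sketch of such an entry, assuming the standard ComponentOverride schema for the cluster-monitoring-operator Deployment (the exact entry CRC uses is not shown in this report):

```yaml
# Sketch: a ClusterVersion override that tells the CVO to stop managing
# the cluster-monitoring-operator Deployment. Removing this entry (or
# setting unmanaged: false) hands the component back to the CVO -- the
# step that triggered the behavior described in this bug.
apiVersion: config.openshift.io/v1
kind: ClusterVersion
metadata:
  name: version
spec:
  overrides:
  - kind: Deployment
    group: apps
    namespace: openshift-monitoring
    name: cluster-monitoring-operator
    unmanaged: true
```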
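The suggested fix for CRC could be sketched with standard `oc` commands. The webhook configuration name below is taken from the error messages above; treat these commands as an illustration of the idea, not the shipped fix:

```shell
# While the prometheus-operator pod is not deployed, the validating
# webhook that fronts it must also be absent; otherwise PrometheusRule
# admission requests fail with "no endpoints available", as in the CVO
# logs above. (Assumes the ValidatingWebhookConfiguration is named after
# the webhook, prometheusrules.openshift.io.)
oc delete validatingwebhookconfiguration prometheusrules.openshift.io

# Once prometheus-operator is redeployed, verify its service has
# endpoints again before expecting PrometheusRule admission to work:
oc -n openshift-monitoring get endpoints prometheus-operator
```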