Created attachment 1730864 [details]
must-gather from cluster

Description of problem:

As part of CRC we provision the OpenShift cluster on a single node and then add some of the operators to the CVO override list so we can remove the workloads of those operators to save resources. One of these operators is monitoring, which is part of this override list; we remove all the workloads in that namespace.

- https://github.com/code-ready/snc/blob/master/snc.sh#L430-L435
- https://github.com/code-ready/snc/blob/master/snc.sh#L269-L279

Up to 4.5.x, when a user of CRC removed monitoring from the CVO override list, monitoring was enabled again on the cluster and the user could use it. With 4.6.x, even after the user removes monitoring from the override list, CVO is not able to provision monitoring back on the cluster.

Version-Release number of selected component (if applicable):

$ oc version
Client Version: 4.6.3
Server Version: 4.6.3
Kubernetes Version: v1.19.0+9f84db3

Steps to Reproduce:
1. Download the latest CRC release from http://mirror.openshift.com/pub/openshift-v4/clients/crc/latest/
2. Extract it
3. crc setup && crc start
4. Follow https://code-ready.github.io/crc/#starting-monitoring-alerting-telemetry_gsg (which used to work up to 4.5.x)

Actual results:
CVO is not able to provision monitoring even though it is no longer in the override list.

Expected results:
Monitoring should run once it is removed from the CVO override list.

Additional info:
Attached the must-gather from a CRC instance.
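For context, a minimal sketch of the override entry CRC adds for monitoring, assuming the standard `ClusterVersion` `spec.overrides` schema (exact values in snc.sh may differ); the `oc patch` command is shown only as a comment since it requires a live cluster:

```shell
# Sketch of the CVO override entry that disables management of the
# cluster-monitoring-operator (field names follow spec.overrides).
cat <<'EOF' > /tmp/monitoring-override.yaml
- kind: Deployment
  group: apps
  namespace: openshift-monitoring
  name: cluster-monitoring-operator
  unmanaged: true
EOF

# Re-enabling monitoring means removing this entry (index 0 assumed here),
# e.g. on a live cluster (not executed in this sketch):
#   oc patch clusterversion version --type json \
#     -p '[{"op": "remove", "path": "/spec/overrides/0"}]'
cat /tmp/monitoring-override.yaml
```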
After the monitoring stack is removed from the CVO override list, are the following objects created:
- openshift-monitoring namespace
- openshift-user-workload-monitoring namespace
- cluster-monitoring-operator Deployment in the openshift-monitoring namespace
@Pawel following is what we have in the monitoring namespaces (I also attached the must-gather logs):

```
$ oc get ns | grep -i monitor
openshift-monitoring                 Active   5d23h
openshift-user-workload-monitoring   Active   5d23h

$ oc get all -n openshift-monitoring
NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main             ClusterIP   172.25.107.41    <none>        9094/TCP,9092/TCP            5d23h
service/alertmanager-operated         ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   5d23h
service/cluster-monitoring-operator   ClusterIP   None             <none>        8443/TCP                     5d23h
service/grafana                       ClusterIP   172.25.83.98     <none>        3000/TCP                     5d23h
service/kube-state-metrics            ClusterIP   None             <none>        8443/TCP,9443/TCP            5d23h
service/node-exporter                 ClusterIP   None             <none>        9100/TCP                     5d23h
service/openshift-state-metrics       ClusterIP   None             <none>        8443/TCP,9443/TCP            5d23h
service/prometheus-adapter            ClusterIP   172.25.9.169     <none>        443/TCP                      5d23h
service/prometheus-k8s                ClusterIP   172.25.165.157   <none>        9091/TCP,9092/TCP            5d23h
service/prometheus-operated           ClusterIP   None             <none>        9090/TCP,10901/TCP           5d23h
service/prometheus-operator           ClusterIP   None             <none>        8443/TCP,8080/TCP            5d23h
service/telemeter-client              ClusterIP   None             <none>        8443/TCP                     5d23h
service/thanos-querier                ClusterIP   172.25.47.12     <none>        9091/TCP,9092/TCP,9093/TCP   5d23h

NAME                                         HOST/PORT                                                 PATH   SERVICES            PORT    TERMINATION          WILDCARD
route.route.openshift.io/alertmanager-main   alertmanager-main-openshift-monitoring.apps-crc.testing          alertmanager-main   web     reencrypt/Redirect   None
route.route.openshift.io/grafana             grafana-openshift-monitoring.apps-crc.testing                    grafana             https   reencrypt/Redirect   None
route.route.openshift.io/prometheus-k8s      prometheus-k8s-openshift-monitoring.apps-crc.testing             prometheus-k8s      web     reencrypt/Redirect   None
route.route.openshift.io/thanos-querier      thanos-querier-openshift-monitoring.apps-crc.testing             thanos-querier      web     reencrypt/Redirect   None
```
It looks like CVO didn't create a Deployment for cluster-monitoring-operator, so this seems like a bug in CVO. Reassigning to the CVO team for further investigation.
Just an observation from the CVO pod logs: when we make the change in the override list (removing monitoring from it), the following error occurs in the pod log.

```
$ oc logs cluster-version-operator-7f8f59786d-b8pbz -n openshift-cluster-version | grep ^E1
[...]
E1119 14:44:37.767574       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:45:01.509417       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
[... the same error repeats roughly every 20 seconds through 14:47:53 ...]
E1119 14:47:53.123715       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
[...]
```
CRC will need to ensure the admission webhook for PrometheusRule does not exist when the prometheus-operator pod is not deployed, and that it does exist once the prometheus-operator pod is deployed. Otherwise you run into the issue here: the prometheus-operator is disabled but the admission webhook is not, so any admission request that attempts to create a PrometheusRule instance fails.
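A rough sketch of that lifecycle, assuming the webhook object is a ValidatingWebhookConfiguration named after the webhook in the CVO logs ("prometheusrules.openshift.io") — the helper names are illustrative, not from the CRC codebase, and the functions only echo the `oc` commands rather than run them:

```shell
# Hypothetical helpers keeping the PrometheusRule admission webhook in sync
# with the prometheus-operator workload (commands echoed, not executed).
disable_monitoring_webhook() {
  # Remove the webhook so PrometheusRule writes don't fail while
  # prometheus-operator is scaled down:
  echo "oc delete validatingwebhookconfiguration prometheusrules.openshift.io"
}

enable_monitoring_webhook() {
  # After CVO re-provisions monitoring, wait for the operator to have
  # endpoints before expecting the webhook to admit PrometheusRule objects:
  echo "oc -n openshift-monitoring rollout status deployment/prometheus-operator"
}

disable_monitoring_webhook
enable_monitoring_webhook
```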
Seems like a CRC fix per comment 4 and comment 5.
Tested with the generated bundle; marking it verified.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633