Bug 1899459 - Failed to start monitoring pods once the operator is removed from the CVO override list
Summary: Failed to start monitoring pods once the operator is removed from the override list ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: crc
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.7.0
Assignee: Praveen Kumar
QA Contact: Tomáš Sedmík
Docs Contact: Kevin Owen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-19 10:01 UTC by Praveen Kumar
Modified: 2021-02-24 15:35 UTC (History)
17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:34:22 UTC
Target Upstream Version:
Embargoed:


Attachments
must gather from cluster (7.90 MB, application/x-xz)
2020-11-19 10:01 UTC, Praveen Kumar


Links
System ID Private Priority Status Summary Last Updated
Github code-ready snc pull 277 0 None closed BUG 1899459: Delete prometheus validation admission webhook 2021-02-07 04:45:53 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:35:06 UTC

Description Praveen Kumar 2020-11-19 10:01:30 UTC
Created attachment 1730864 [details]
must gather from cluster

Description of problem: 

As part of CRC we provision the OpenShift cluster on a single node and then add some of the operators to the CVO override list so that we can remove the workloads those operators manage and save resources. One of these operators is monitoring, which is part of this override list; we remove all the workloads in its namespace (a sketch of the override stanza follows the links below).

- https://github.com/code-ready/snc/blob/master/snc.sh#L430-L435
- https://github.com/code-ready/snc/blob/master/snc.sh#L269-L279
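
For context, a ClusterVersion override entry looks roughly like the sketch below (illustrative only; the exact entries CRC sets are in the snc.sh links above). With `unmanaged: true` the CVO stops reconciling that object, which is how CRC keeps the monitoring workloads scaled down:

```
# Sketch of the spec.overrides stanza on the ClusterVersion object (assumed example;
# the real list CRC writes lives in snc.sh linked above).
spec:
  overrides:
  - kind: Deployment
    group: apps
    name: cluster-monitoring-operator
    namespace: openshift-monitoring
    unmanaged: true
```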

Up to 4.5.x, when a user running CRC removed monitoring from the CVO override list, monitoring was enabled again on the cluster and the user was able to use it.

With 4.6.x, even after the user removes monitoring from the override list, the CVO is not able to provision monitoring back onto the cluster.

Version-Release number of selected component (if applicable):
$ oc version
Client Version: 4.6.3
Server Version: 4.6.3
Kubernetes Version: v1.19.0+9f84db3


Steps to Reproduce:
1. Download the latest version of CRC release from http://mirror.openshift.com/pub/openshift-v4/clients/crc/latest/
2. Extract it 
3. crc setup && crc start
4. Follow https://code-ready.github.io/crc/#starting-monitoring-alerting-telemetry_gsg (which used to work up to 4.5.x); a sketch of the relevant patch command follows these steps.
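
A sketch of what the linked guide boils down to, assuming the monitoring entry is still present in the ClusterVersion overrides (the index is illustrative; find the right one first):

```
# List the current override entries to find the index of the monitoring entry.
$ oc get clusterversion version -o jsonpath='{range .spec.overrides[*]}{.name}{"\n"}{end}'

# Remove the monitoring entry by its index (0 is an example; adjust to match the list above).
$ oc patch clusterversion/version --type='json' -p '[{"op":"remove", "path":"/spec/overrides/0"}]'
```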

Actual results:
The CVO is not able to provision monitoring even though it is no longer in the override list.


Expected results:
Monitoring should be able to run once it is removed from the CVO override list.

Additional info:
Attached the must-gather from a CRC instance.

Comment 1 Pawel Krupa 2020-11-19 10:11:55 UTC
After the monitoring stack is removed from the CVO override list, are the following objects created?
- openshift-monitoring namespace
- openshift-user-workload-monitoring namespace
- cluster-monitoring-operator Deployment in openshift-monitoring namespace

Comment 2 Praveen Kumar 2020-11-19 11:17:50 UTC
@Pawel, the following is what we have in the monitoring namespaces (I also attached the must-gather logs):

```
$ oc get ns | grep -i monitor
openshift-monitoring                               Active   5d23h
openshift-user-workload-monitoring                 Active   5d23h

$ oc get all -n openshift-monitoring
NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/alertmanager-main             ClusterIP   172.25.107.41    <none>        9094/TCP,9092/TCP            5d23h
service/alertmanager-operated         ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   5d23h
service/cluster-monitoring-operator   ClusterIP   None             <none>        8443/TCP                     5d23h
service/grafana                       ClusterIP   172.25.83.98     <none>        3000/TCP                     5d23h
service/kube-state-metrics            ClusterIP   None             <none>        8443/TCP,9443/TCP            5d23h
service/node-exporter                 ClusterIP   None             <none>        9100/TCP                     5d23h
service/openshift-state-metrics       ClusterIP   None             <none>        8443/TCP,9443/TCP            5d23h
service/prometheus-adapter            ClusterIP   172.25.9.169     <none>        443/TCP                      5d23h
service/prometheus-k8s                ClusterIP   172.25.165.157   <none>        9091/TCP,9092/TCP            5d23h
service/prometheus-operated           ClusterIP   None             <none>        9090/TCP,10901/TCP           5d23h
service/prometheus-operator           ClusterIP   None             <none>        8443/TCP,8080/TCP            5d23h
service/telemeter-client              ClusterIP   None             <none>        8443/TCP                     5d23h
service/thanos-querier                ClusterIP   172.25.47.12     <none>        9091/TCP,9092/TCP,9093/TCP   5d23h

NAME                                         HOST/PORT                                                 PATH   SERVICES            PORT    TERMINATION          WILDCARD
route.route.openshift.io/alertmanager-main   alertmanager-main-openshift-monitoring.apps-crc.testing          alertmanager-main   web     reencrypt/Redirect   None
route.route.openshift.io/grafana             grafana-openshift-monitoring.apps-crc.testing                    grafana             https   reencrypt/Redirect   None
route.route.openshift.io/prometheus-k8s      prometheus-k8s-openshift-monitoring.apps-crc.testing             prometheus-k8s      web     reencrypt/Redirect   None
route.route.openshift.io/thanos-querier      thanos-querier-openshift-monitoring.apps-crc.testing             thanos-querier      web     reencrypt/Redirect   None
```

Comment 3 Pawel Krupa 2020-11-19 12:11:41 UTC
It looks like the CVO didn't create a Deployment for cluster-monitoring-operator, so this seems like a bug in the CVO. Reassigning to the CVO team for further investigation.

Comment 4 Praveen Kumar 2020-11-19 14:53:47 UTC
Just an observation: when we make the change to the override list (removing monitoring from it), the following errors appear in the CVO pod log.

```
$ oc logs cluster-version-operator-7f8f59786d-b8pbz -n openshift-cluster-version  | grep ^E1
[...]
E1119 14:44:37.767574       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:45:01.509417       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:45:19.080031       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:45:37.299890       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:45:53.845171       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:46:12.888237       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:46:36.996495       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:46:54.885644       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:47:17.741783       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:47:39.737876       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
E1119 14:47:53.123715       1 task.go:81] error running apply for prometheusrule "openshift-cluster-version/cluster-version-operator" (9 of 617): Internal error occurred: failed calling webhook "prometheusrules.openshift.io": Post "https://prometheus-operator.openshift-monitoring.svc:8080/admission-prometheusrules/validate?timeout=5s": no endpoints available for service "prometheus-operator"
[...]
```
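
The errors suggest the PrometheusRule validating webhook is still registered while its backing service has no endpoints. Something like the following (a sketch; resource names are taken from the error message above) should confirm that:

```
# Is the PrometheusRule validating webhook still registered?
$ oc get validatingwebhookconfigurations | grep -i prometheus

# Does the prometheus-operator service have any endpoints backing the webhook?
$ oc get endpoints prometheus-operator -n openshift-monitoring
```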

Comment 5 Ben Browning 2020-11-19 16:56:19 UTC
CRC will need to ensure the admission webhook for PrometheusRule does not exist when the prometheus-operator pod is not deployed, and that it exists again once the prometheus-operator pod is deployed. Otherwise you run into the issue seen here: the prometheus-operator is disabled but the admission webhook is not, so any admission requests that attempt to create PrometheusRule objects fail.
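
In practice that likely means deleting the ValidatingWebhookConfiguration alongside the monitoring workloads and letting it be recreated when monitoring is re-enabled. A rough sketch (the configuration name is assumed from the webhook name in the error above and may differ):

```
# Remove the stale PrometheusRule admission webhook while prometheus-operator is scaled down.
$ oc delete validatingwebhookconfiguration prometheusrules.openshift.io
```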

Comment 6 W. Trevor King 2020-11-19 17:18:10 UTC
Seems like a CRC fix per comment 4 and comment 5.

Comment 7 Praveen Kumar 2020-11-20 13:59:18 UTC
Tested with the generated bundle, marking it verified.

Comment 11 errata-xmlrpc 2021-02-24 15:34:22 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

