Description of problem:

The CVO should apply its own ServiceMonitor and not rely on the Cluster Monitoring Operator to do it. Work on this was started in [1], but the ServiceMonitor was blacklisted in a follow-up PR [2] and the change was reverted, causing CMO to apply this ServiceMonitor (as seen by the failing e2e test in [3]).

[1]: https://github.com/openshift/cluster-version-operator/pull/214
[2]: https://github.com/openshift/cluster-version-operator/pull/221
[3]: https://github.com/openshift/cluster-monitoring-operator/pull/390
(In reply to Pawel Krupa from comment #0)
> Description of problem:
>
> CVO should apply its ServiceMonitor and not rely on Cluster Monitoring
> Operator to do it.
>
> Work on this was started [1] but SM was blacklisted in next PR [2]

[2] only turns off rendering the ServiceMonitor to disk when the CVO is in bootstrap mode. The ServiceMonitor is definitely being applied. Looking at the logs from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-6cff966c8b-926jf/cluster-version-operator/cluster-version-operator/logs/current.log (https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227):

```
2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
```

The ServiceMonitor was applied to the cluster.

> and
> change was reverted causing CMO to apply this SM (as seen by failing e2e
> test [3]).
>
> [1]: https://github.com/openshift/cluster-version-operator/pull/214
> [2]: https://github.com/openshift/cluster-version-operator/pull/221
> [3]: https://github.com/openshift/cluster-monitoring-operator/pull/390
Checking a recent run, the failing error was [1]:

```
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel/minimal]
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:156]: Unexpected error:
    <*errors.errorString | 0xc0002733a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
```

Grepping the build log:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/build-log.txt | grep 'missing some targets:' | tail -n1
Aug  6 15:18:27.110: INFO: missing some targets: [no match for map[job:cluster-version-operator] with health up and scrape URL ^http://.*/metrics$]
```

Checking the logs Abhinav was looking at in comment 1:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-6cff966c8b-926jf/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'servicemonitor.*cluster-version-operator' | head -n 6
2019-08-06T14:52:28.131616402Z I0806 14:52:28.131527       1 sync_worker.go:574] Running sync for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
2019-08-06T14:52:35.439216286Z E0806 14:52:35.439154       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:00.214679873Z E0806 14:53:00.214022       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:23.162464672Z E0806 14:53:23.162435       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:49.259272268Z I0806 14:53:49.258902       1 request.go:530] Throttling request took 91.807834ms, request: GET:https://127.0.0.1:6443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-version/servicemonitors/cluster-version-operator
2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
```

So that's landing comfortably before the final missing-target log at 15:18.

The ServiceMonitor being removed in 390 had the openshift-monitoring namespace, while the version being pushed by the CVO has the openshift-cluster-version namespace; I dunno if that matters. Otherwise they look the same.

Unfortunately, that job does not seem to have a pods/ directory with gathered pod logs [2], and must-gather doesn't list the ServiceMonitor where I can find it [3]. Would we expect to see something in one of these monitoring pods [4]?
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227#0:build-log.txt%3A7406
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/
[4]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-monitoring/pods/
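The log excerpt above shows the usual CVO retry pattern: repeated "error running apply" lines until the ServiceMonitor CRD is registered, then a final "Done syncing". A minimal sketch of pulling that pattern out of a saved log copy, assuming the log has been downloaded to current.log (simulated here with a heredoc containing two of the lines quoted above):

```shell
# Simulate a downloaded CVO log with two of the lines quoted above.
cat > current.log <<'EOF'
2019-08-06T14:53:23.162464672Z E0806 14:53:23.162435       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
EOF

# Count apply errors, then confirm the sync eventually succeeded.
errors=$(grep -c 'error running apply for servicemonitor' current.log)
echo "apply errors: ${errors}"
grep -o 'Done syncing for servicemonitor "[^"]*"' current.log
```

On a real run the same two greps, pointed at the full current.log fetched with curl, distinguish "CVO never applied the ServiceMonitor" from "CVO applied it after the CRD showed up".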
As of now this should be fixed. The problem was that the openshift-cluster-version namespace didn't have the openshift.io/cluster-monitoring label, so cluster-monitoring-operator couldn't pick up the ServiceMonitor. Right now we are seeing data flowing into Prometheus from the CVO, and the manifests are no longer in the CMO repository but in the CVO one.
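Since CMO only discovers ServiceMonitors in namespaces labeled for cluster monitoring, the fix amounts to carrying that label on the namespace manifest. A minimal sketch of the relevant fragment; the label key and value are taken from the `oc get Namespace` output quoted in this bug, and the rest of the manifest is elided:

```
# Namespace fragment: the openshift.io/cluster-monitoring label is what lets
# cluster-monitoring-operator discover ServiceMonitors in this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-cluster-version
  labels:
    openshift.io/cluster-monitoring: "true"
```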
Created attachment 1605706 [details] screenshot
It is verified with version `4.2.0-0.nightly-2019-08-18-222019` and the following steps:

```
# oc get Role -n openshift-cluster-version prometheus-k8s -o yaml
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch

# oc get RoleBinding -n openshift-cluster-version prometheus-k8s -o yaml
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring

# oc get Namespace openshift-cluster-version -o yaml | grep cluster-monitoring
    openshift.io/cluster-monitoring: "true"

# oc logs -f cluster-version-operator-956b48c68-swkfb -n openshift-cluster-version | grep 'servicemonitor.*cluster-version-operator'
I0819 07:04:59.951493       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 410)
I0819 07:08:48.067461       1 sync_worker.go:574] Running sync for servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 410)
```
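The e2e failure in comment 2 was Prometheus reporting no up target for job cluster-version-operator, so a final check is that the target now appears in Prometheus's targets API. A minimal sketch of that check against a saved response, assuming targets.json holds output from the /api/v1/targets endpoint (simulated here with a hand-written fragment in the shape that endpoint returns; the scrape URL is made up):

```shell
# Simulated fragment of a Prometheus /api/v1/targets response; on a live
# cluster this would be fetched from the prometheus-k8s service instead.
cat > targets.json <<'EOF'
{"status":"success","data":{"activeTargets":[
  {"labels":{"job":"cluster-version-operator"},
   "scrapeUrl":"http://10.0.143.2:9099/metrics","health":"up"}
]}}
EOF

# Roughly the condition the e2e test waits for: a cluster-version-operator
# target exists and is healthy. (A real check would parse the JSON and tie
# job and health to the same target entry.)
if grep -q '"job":"cluster-version-operator"' targets.json &&
   grep -q '"health":"up"' targets.json; then
    echo "cluster-version-operator target is up"
else
    echo "missing cluster-version-operator target" >&2
    exit 1
fi
```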
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922