Description of problem:

The CVO should apply its own ServiceMonitor and not rely on the Cluster Monitoring Operator to do it. Work on this was started in [1], but the ServiceMonitor was blacklisted in a follow-up PR [2] and the change was reverted, causing CMO to apply this ServiceMonitor (as seen by the failing e2e test in [3]).

[1]: https://github.com/openshift/cluster-version-operator/pull/214
[2]: https://github.com/openshift/cluster-version-operator/pull/221
[3]: https://github.com/openshift/cluster-monitoring-operator/pull/390
(In reply to Pawel Krupa from comment #0)
> Description of problem:
>
> CVO should apply its ServiceMonitor and not rely on Cluster Monitoring
> Operator to do it.
>
> Work on this was started [1] but SM was blacklisted in next PR [2]

[2] only turns off rendering the ServiceMonitor to disk when the CVO is in bootstrap mode. The ServiceMonitor is definitely being applied. Looking at the logs from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-6cff966c8b-926jf/cluster-version-operator/cluster-version-operator/logs/current.log (https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227):

```
2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
```

The ServiceMonitor was applied to the cluster.

> and
> change was reverted causing CMO to apply this SM (as seen by failing e2e
> test [3]).
>
> [1]: https://github.com/openshift/cluster-version-operator/pull/214
> [2]: https://github.com/openshift/cluster-version-operator/pull/221
> [3]: https://github.com/openshift/cluster-monitoring-operator/pull/390
Checking a recent run, the failing error was [1]:

```
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel/minimal]
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:156]: Unexpected error:
    <*errors.errorString | 0xc0002733a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
```

Grepping the build log:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/build-log.txt | grep 'missing some targets:' | tail -n1
Aug  6 15:18:27.110: INFO: missing some targets: [no match for map[job:cluster-version-operator] with health up and scrape URL ^http://.*/metrics$]
```

Checking the logs Abhinav was looking at in comment 1:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-6cff966c8b-926jf/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'servicemonitor.*cluster-version-operator' | head -n 6
2019-08-06T14:52:28.131616402Z I0806 14:52:28.131527       1 sync_worker.go:574] Running sync for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
2019-08-06T14:52:35.439216286Z E0806 14:52:35.439154       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:00.214679873Z E0806 14:53:00.214022       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:23.162464672Z E0806 14:53:23.162435       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:49.259272268Z I0806 14:53:49.258902       1 request.go:530] Throttling request took 91.807834ms, request: GET:https://127.0.0.1:6443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-version/servicemonitors/cluster-version-operator
2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
```

So that's landing comfortably before the final missing-target log at 15:18.

The ServiceMonitor being removed in 390 had the openshift-monitoring namespace, while the version being pushed by the CVO has the openshift-cluster-version namespace; I dunno if that matters. Otherwise they look the same.

Unfortunately, that job does not seem to have a pods/ directory with gathered pod logs [2], and must-gather doesn't list the ServiceMonitor where I can find it [3]. Would we expect to see something in one of these monitoring pods [4]?
[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227#0:build-log.txt%3A7406
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/
[4]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-monitoring/pods/
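The log excerpt above shows the usual CVO retry pattern: repeated "error running apply" lines until the ServiceMonitor CRD is registered, then a final "Done syncing". A minimal sketch of pulling that pattern out of a saved log copy, assuming the log has been downloaded to current.log (simulated here with a heredoc containing two of the lines quoted above):

```shell
# Simulate a downloaded CVO log with two of the lines quoted above.
cat > current.log <<'EOF'
2019-08-06T14:53:23.162464672Z E0806 14:53:23.162435       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
EOF

# Count apply errors, then confirm the sync eventually succeeded.
errors=$(grep -c 'error running apply for servicemonitor' current.log)
echo "apply errors: ${errors}"
grep -o 'Done syncing for servicemonitor "[^"]*"' current.log
```

On a real run the same two greps, pointed at the full current.log fetched with curl, distinguish "CVO never applied the ServiceMonitor" from "CVO applied it after the CRD showed up".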
As of now this should be fixed. The problem was that the openshift-cluster-version namespace didn't have the openshift.io/cluster-monitoring label, so cluster-monitoring-operator couldn't pick up the ServiceMonitor. Right now we are seeing data flowing into Prometheus from the CVO, and the manifests are no longer in the CMO repository but in the CVO one.
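Since CMO only discovers ServiceMonitors in namespaces labeled for cluster monitoring, the fix amounts to carrying that label on the namespace manifest. A minimal sketch of the relevant fragment; the label key and value are taken from the `oc get Namespace` output quoted in this bug, and the rest of the manifest is elided:

```
# Namespace fragment: the openshift.io/cluster-monitoring label is what lets
# cluster-monitoring-operator discover ServiceMonitors in this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-cluster-version
  labels:
    openshift.io/cluster-monitoring: "true"
```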
Created attachment 1605706 [details] screenshot
It is verified with version `4.2.0-0.nightly-2019-08-18-222019` and the following steps:

```
# oc get Role -n openshift-cluster-version prometheus-k8s -o yaml
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch

# oc get RoleBinding -n openshift-cluster-version prometheus-k8s -o yaml
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring

# oc get Namespace openshift-cluster-version -o yaml | grep cluster-monitoring
    openshift.io/cluster-monitoring: "true"

# oc logs -f cluster-version-operator-956b48c68-swkfb -n openshift-cluster-version | grep 'servicemonitor.*cluster-version-operator'
I0819 07:04:59.951493       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 410)
I0819 07:08:48.067461       1 sync_worker.go:574] Running sync for servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 410)
```
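The e2e failure in comment 2 was Prometheus reporting no up target for job cluster-version-operator, so a final check is that the target now appears in Prometheus's targets API. A minimal sketch of that check against a saved response, assuming targets.json holds output from the /api/v1/targets endpoint (simulated here with a hand-written fragment in the shape that endpoint returns; the scrape URL is made up):

```shell
# Simulated fragment of a Prometheus /api/v1/targets response; on a live
# cluster this would be fetched from the prometheus-k8s service instead.
cat > targets.json <<'EOF'
{"status":"success","data":{"activeTargets":[
  {"labels":{"job":"cluster-version-operator"},
   "scrapeUrl":"http://10.0.143.2:9099/metrics","health":"up"}
]}}
EOF

# Roughly the condition the e2e test waits for: a cluster-version-operator
# target exists and is healthy. (A real check would parse the JSON and tie
# job and health to the same target entry.)
if grep -q '"job":"cluster-version-operator"' targets.json &&
   grep -q '"health":"up"' targets.json; then
    echo "cluster-version-operator target is up"
else
    echo "missing cluster-version-operator target" >&2
    exit 1
fi
```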
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922