Bug 1738527

Summary: cluster-version-operator is not applying Service Monitor
Product: OpenShift Container Platform
Component: Installer
Sub component: openshift-installer
Reporter: Pawel Krupa <pkrupa>
Assignee: Abhinav Dahiya <adahiya>
QA Contact: sheng.lao <shlao>
Status: CLOSED ERRATA
Severity: unspecified
Priority: unspecified
CC: wking
Version: 4.2.0
Target Release: 4.2.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-10-16 06:35:15 UTC
Type: Bug
Attachments: screenshot (no flags; see comment 7)

Description Pawel Krupa 2019-08-07 11:38:14 UTC
Description of problem:

CVO should apply its ServiceMonitor and not rely on Cluster Monitoring Operator to do it.

Work on this was started in [1], but the ServiceMonitor was blacklisted in the next PR [2], and the change was reverted, leaving CMO to apply this ServiceMonitor (as seen by the failing e2e test [3]).


[1]: https://github.com/openshift/cluster-version-operator/pull/214
[2]: https://github.com/openshift/cluster-version-operator/pull/221
[3]: https://github.com/openshift/cluster-monitoring-operator/pull/390
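
For context, the object in question is a ServiceMonitor named "cluster-version-operator" in the openshift-cluster-version namespace. A minimal sketch of what such a manifest looks like; the endpoint port, interval, and selector label below are illustrative assumptions, not copied from the CVO repository:

```
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cluster-version-operator
  namespace: openshift-cluster-version
spec:
  endpoints:
  - port: metrics                        # assumed port name on the CVO Service
    interval: 30s                        # assumed scrape interval
  selector:
    matchLabels:
      k8s-app: cluster-version-operator  # assumed Service label
```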

Comment 1 Abhinav Dahiya 2019-08-07 22:34:58 UTC
(In reply to Pawel Krupa from comment #0)
> Description of problem:
> 
> CVO should apply its ServiceMonitor and not rely on Cluster Monitoring
> Operator to do it.
> 
> Work on this was started in [1], but the ServiceMonitor was blacklisted
> in the next PR [2],

[2] only prevents the service monitor from being rendered to disk when the CVO is in bootstrap mode. The service monitor is definitely being applied.
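
One way to confirm that on a live cluster (a sketch; the name and namespace match the sync log below):

```
$ oc get servicemonitor cluster-version-operator -n openshift-cluster-version
```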

Looking at the logs from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-6cff966c8b-926jf/cluster-version-operator/cluster-version-operator/logs/current.log

(https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227)

```
2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
```

The service monitor was applied to the cluster.

> and the change was reverted, leaving CMO to apply this ServiceMonitor
> (as seen by the failing e2e test [3]).
> 
> 
> [1]: https://github.com/openshift/cluster-version-operator/pull/214
> [2]: https://github.com/openshift/cluster-version-operator/pull/221
> [3]: https://github.com/openshift/cluster-monitoring-operator/pull/390

Comment 2 W. Trevor King 2019-08-12 22:49:53 UTC
Checking a recent run, the failing error was [1]

  [Feature:Prometheus][Conformance] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel/minimal]

  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:156]: Unexpected error:
    <*errors.errorString | 0xc0002733a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
  occurred 

Grepping the build log:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/build-log.txt | grep 'missing some targets:' | tail -n1
  Aug  6 15:18:27.110: INFO: missing some targets: [no match for map[job:cluster-version-operator] with health up and scrape URL ^http://.*/metrics$]

Checking the logs Abhinav was looking at in comment 1:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-6cff966c8b-926jf/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'servicemonitor.*cluster-version-operator' | head -n 6
  2019-08-06T14:52:28.131616402Z I0806 14:52:28.131527       1 sync_worker.go:574] Running sync for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
  2019-08-06T14:52:35.439216286Z E0806 14:52:35.439154       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
  2019-08-06T14:53:00.214679873Z E0806 14:53:00.214022       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
  2019-08-06T14:53:23.162464672Z E0806 14:53:23.162435       1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
  2019-08-06T14:53:49.259272268Z I0806 14:53:49.258902       1 request.go:530] Throttling request took 91.807834ms, request: GET:https://127.0.0.1:6443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-version/servicemonitors/cluster-version-operator
  2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)

So that's landing comfortably before the final missing-target log at 15:18.
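
The earlier `no matches for kind "ServiceMonitor"` errors just mean the ServiceMonitor CRD had not been registered yet when the CVO first attempted the apply; the sync succeeded on a later attempt once it was. On a live cluster, a sketch of how to confirm the CRD is present:

  $ oc get crd servicemonitors.monitoring.coreos.com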

The ServiceMonitor being removed in 390 had the openshift-monitoring namespace, while the version being pushed by the CVO has the openshift-cluster-version namespace; I dunno if that matters.  Otherwise they look the same.

Unfortunately, that job does not seem to have a pods/ with gathered pod logs [2], and must-gather doesn't list the ServiceMonitor where I can find it [3].  Would we expect to see something in one of these monitoring pods [4]?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227#0:build-log.txt%3A7406
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/
[4]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-monitoring/pods/

Comment 6 Pawel Krupa 2019-08-19 08:52:17 UTC
As of now this should be fixed. The problem was that the namespace didn't have the required label, so cluster-monitoring-operator couldn't pick up the ServiceMonitor. We are now seeing data flowing into Prometheus from the CVO, and the manifests are no longer in the CMO repository but in the CVO one.
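
For reference, the label involved is openshift.io/cluster-monitoring: "true" on the openshift-cluster-version namespace (verified in comment 8 below). Purely as an illustration, on a cluster where it was missing it could be set by hand:

  $ oc label namespace openshift-cluster-version openshift.io/cluster-monitoring=true

In the fix itself the label is presumably carried by the namespace manifest in the CVO repository.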

Comment 7 sheng.lao 2019-08-19 09:21:32 UTC
Created attachment 1605706 [details]
screenshot

Comment 8 sheng.lao 2019-08-19 09:23:02 UTC
Verified with version `4.2.0-0.nightly-2019-08-18-222019` using the following steps:

# oc get Role -n openshift-cluster-version prometheus-k8s -o yaml
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch

# oc get RoleBinding -n openshift-cluster-version prometheus-k8s -o yaml    
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring

# oc get Namespace  openshift-cluster-version -o yaml |grep cluster-monitoring
    openshift.io/cluster-monitoring: "true"

# oc logs -f cluster-version-operator-956b48c68-swkfb -n openshift-cluster-version | grep 'servicemonitor.*cluster-version-operator'

I0819 07:04:59.951493       1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 410)
I0819 07:08:48.067461       1 sync_worker.go:574] Running sync for servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 410)
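
As an extra check beyond the steps above (a sketch: it assumes the default prometheus-k8s-0 pod name and that curl is available in the prometheus container), the target the e2e test was looking for can be queried directly:

# oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- curl -s http://localhost:9090/api/v1/targets | grep '"job":"cluster-version-operator"'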

Comment 9 errata-xmlrpc 2019-10-16 06:35:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922