Bug 1738527

| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | cluster-version-operator is not applying Service Monitor | | |
| Product: | OpenShift Container Platform | Reporter: | Pawel Krupa <pkrupa> |
| Component: | Installer | Assignee: | Abhinav Dahiya <adahiya> |
| Installer sub component: | openshift-installer | QA Contact: | sheng.lao <shlao> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | wking |
| Version: | 4.2.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.2.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-10-16 06:35:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description

Pawel Krupa 2019-08-07 11:38:14 UTC

Description of problem:

CVO should apply its ServiceMonitor and not rely on Cluster Monitoring Operator to do it.

Work on this was started [1] but the ServiceMonitor was blacklisted in the next PR [2] and the change was reverted, causing CMO to apply this ServiceMonitor (as seen by the failing e2e test [3]).

[1]: https://github.com/openshift/cluster-version-operator/pull/214
[2]: https://github.com/openshift/cluster-version-operator/pull/221
[3]: https://github.com/openshift/cluster-monitoring-operator/pull/390
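For reference, the ServiceMonitor at issue is a monitoring.coreos.com/v1 object that tells the Prometheus operator which Service endpoints to scrape. Below is a minimal sketch of such a manifest for the CVO; only the apiVersion, kind, name, and namespace appear in this report, while the selector, port name, and scrape interval are assumptions for illustration.

```
# Sketch only: selector, port, and interval are assumed, not taken from the report.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: cluster-version-operator
  namespace: openshift-cluster-version
spec:
  endpoints:
  - port: metrics            # assumed port name on the CVO Service
    interval: 30s            # assumed scrape interval
  selector:
    matchLabels:
      k8s-app: cluster-version-operator   # assumed Service label
```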
(In reply to Pawel Krupa from comment #0)
> Description of problem:
>
> CVO should apply its ServiceMonitor and not rely on Cluster Monitoring
> Operator to do it.
>
> Work on this was started [1] but SM was blacklisted in next PR [2]

[2] only turns off the service monitor from being rendered to disk when CVO is in bootstrap mode. The service monitor is definitely being applied.

Looking at the logs from https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-6cff966c8b-926jf/cluster-version-operator/cluster-version-operator/logs/current.log (https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227):

```
2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350 1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
```

The service monitor was applied to the cluster.

> and
> change was reverted causing CMO to apply this SM (as seen by failing e2e
> test [3]).
>
> [1]: https://github.com/openshift/cluster-version-operator/pull/214
> [2]: https://github.com/openshift/cluster-version-operator/pull/221
> [3]: https://github.com/openshift/cluster-monitoring-operator/pull/390

Checking a recent run, the failing error was [1]:

```
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel/minimal]
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:156]: Unexpected error:
    <*errors.errorString | 0xc0002733a0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred
```

Grepping the build log:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/build-log.txt | grep 'missing some targets:' | tail -n1
Aug  6 15:18:27.110: INFO: missing some targets: [no match for map[job:cluster-version-operator] with health up and scrape URL ^http://.*/metrics$]
```

Checking the logs Abhinav was looking at in comment 1:

```
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-6cff966c8b-926jf/cluster-version-operator/cluster-version-operator/logs/current.log | grep 'servicemonitor.*cluster-version-operator' | head -n 6
2019-08-06T14:52:28.131616402Z I0806 14:52:28.131527 1 sync_worker.go:574] Running sync for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
2019-08-06T14:52:35.439216286Z E0806 14:52:35.439154 1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:00.214679873Z E0806 14:53:00.214022 1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:23.162464672Z E0806 14:53:23.162435 1 task.go:77] error running apply for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431): failed to get resource type: no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
2019-08-06T14:53:49.259272268Z I0806 14:53:49.258902 1 request.go:530] Throttling request took 91.807834ms, request: GET:https://127.0.0.1:6443/apis/monitoring.coreos.com/v1/namespaces/openshift-cluster-version/servicemonitors/cluster-version-operator
2019-08-06T14:53:49.593395277Z I0806 14:53:49.593350 1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (6 of 431)
```

so that sync lands comfortably before the final missing-target log at 15:18. The ServiceMonitor being removed in 390 had the openshift-monitoring namespace, while the version being pushed by the CVO has the openshift-cluster-version namespace; I don't know if that matters. Otherwise they look the same. Unfortunately, that job does not seem to have a pods/ directory with gathered pod logs [2], and must-gather doesn't list the ServiceMonitor anywhere I can find it [3]. Would we expect to see something in one of these monitoring pods [4]?

[1]: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227#0:build-log.txt%3A7406
[2]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/
[3]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-cluster-version/
[4]: https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/390/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1227/artifacts/e2e-aws/must-gather/namespaces/openshift-monitoring/pods/

As of now this should be fixed. The problem was that the namespace didn't have a label, so cluster-monitoring-operator couldn't pick up the ServiceMonitor. Right now we are seeing data flowing into Prometheus from the CVO, and the manifests are not in the CMO repository but in the CVO one.
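The label in question appears in the verification steps below as `openshift.io/cluster-monitoring: "true"` on the openshift-cluster-version namespace. A minimal sketch of the namespace manifest carrying that label; only the label key/value and namespace name come from this report, the rest of the manifest is assumed boilerplate.

```
# Sketch only: label and namespace name are from the verification output below,
# everything else is assumed.
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-cluster-version
  labels:
    openshift.io/cluster-monitoring: "true"
```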
Created attachment 1605706 [details]
screenshot
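The claim that data is flowing into Prometheus can also be spot-checked against the Prometheus query API. A rough sketch follows, assuming the prometheus-k8s-0 pod in openshift-monitoring, a prometheus container listening on localhost:9090, and curl being available inside that image; none of these details are confirmed in this report.

```
# Sketch only: pod name, container name, listen address, and curl availability are assumptions.
$ oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
    curl -s -G 'http://localhost:9090/api/v1/query' \
    --data-urlencode 'query=up{job="cluster-version-operator"}'
# A scraped, healthy CVO target shows up as a series whose value is "1".
```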
It is verified with version `4.2.0-0.nightly-2019-08-18-222019` and the following steps:

```
# oc get Role -n openshift-cluster-version prometheus-k8s -o yaml
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch

# oc get RoleBinding -n openshift-cluster-version prometheus-k8s -o yaml
subjects:
- kind: ServiceAccount
  name: prometheus-k8s
  namespace: openshift-monitoring

# oc get Namespace openshift-cluster-version -o yaml | grep cluster-monitoring
    openshift.io/cluster-monitoring: "true"

# oc logs -f cluster-version-operator-956b48c68-swkfb -n openshift-cluster-version | grep 'servicemonitor.*cluster-version-operator'
I0819 07:04:59.951493 1 sync_worker.go:587] Done syncing for servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 410)
I0819 07:08:48.067461 1 sync_worker.go:574] Running sync for servicemonitor "openshift-cluster-version/cluster-version-operator" (8 of 410)
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922