Bug 1666118 - openshift-controller-manager metrics are not scraped
Summary: openshift-controller-manager metrics are not scraped
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.1.0
Assignee: Ben Parees
QA Contact: Xingxing Xia
Depends On:
Reported: 2019-01-15 00:22 UTC by Clayton Coleman
Modified: 2019-06-04 10:42 UTC (History)
6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Last Closed: 2019-06-04 10:41:55 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:42:02 UTC

Description Clayton Coleman 2019-01-15 00:22:28 UTC
There is no scrape target for openshift controller manager, so build metrics aren't updated.  This prevents "[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]" from passing (it's currently disabled).

Metrics need to be reported.

Comment 1 Ben Parees 2019-01-19 00:16:55 UTC
I started working on this, but I'm seeing two issues:

first, the controller manager SA can't seem to validate the token prometheus is passing on the /metrics call:

E0118 22:20:03.258525       1 authentication.go:62] Unable to authenticate the request due to an error: tokenreviews.authentication.k8s.io is forbidden: User "system:serviceaccount:openshift-controller-manager:openshift-controller-manager-sa" cannot create tokenreviews.authentication.k8s.io at the cluster scope: no RBAC policy matched
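
One way to grant the delegated-authentication permission the error above describes is to bind the SA to the built-in system:auth-delegator ClusterRole, which allows creating tokenreviews at the cluster scope. A minimal sketch (the binding name here is hypothetical; the actual manifests added by the operator may differ):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  # Hypothetical name for illustration only.
  name: openshift-controller-manager-auth-delegator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  # Built-in role permitting delegated token authentication (TokenReview creation).
  name: system:auth-delegator
subjects:
- kind: ServiceAccount
  name: openshift-controller-manager-sa
  namespace: openshift-controller-manager
```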

After granting the SA more permissions, I got it scraping, but I'm only seeing a subset of the metrics I would expect. It seems like we're not getting metrics from a number of components: I see the infra metrics, like controller queue lengths, but not the component-level metrics, like build success counts. So something in the controller itself is no longer wired properly to report all the metrics it used to.

Comment 3 Ben Parees 2019-01-24 15:06:19 UTC
Should be working now.

Comment 5 Xingxing Xia 2019-01-25 10:51:44 UTC
Checked the https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/60 code.
It adds (cluster)roles, (cluster)rolebindings, servicemonitors, etc. for both the controller operator and the operand. But I'm not sure what needs to be verified for this bug; could you give some guidance?
PS: What does "scrape metrics" mean? Does it just mean "retrieve metrics"? And how does this bug differ from https://jira.coreos.com/browse/MSTR-225 ?

Comment 6 Ben Parees 2019-01-25 16:10:54 UTC
Yes, sorry: "scrape" means collect/retrieve.

https://jira.coreos.com/browse/MSTR-225 covered 3 other operators (kube api operator, kube controller operator, openshift api operator); this bug covers the openshift controller operator.

This bug also covers the openshift controller itself.

So you can look to that jira for the general idea of how to verify the metrics, but for this bug you should verify that we have:

1) metrics from the openshift controller operator (I'm not sure what the names of these metrics are, but hopefully that other jira ticket gives you some hints based on what those operators exposed; the openshift controller operator should expose similar metrics)

2) metrics from the openshift controller (such as build metrics)

Comment 7 Xingxing Xia 2019-01-28 09:55:11 UTC
The https://jira.coreos.com/browse/MSTR-225 test covered all 4 master kube/openshift api/controller operators; that case used the `cluster-monitoring-operator` SA in `openshift-monitoring` instead of the `prometheus-k8s` SA.
Below I switch to the `prometheus-k8s` SA. In the latest payload env, I checked the controller operator:
curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://metrics.openshift-controller-manager-operator.svc:443/metrics
It has the metrics depth, adds, queue_latency, work_duration, and retries, similar to the other master operators.

Then I checked operand:
oc get pod -n openshift-controller-manager -o wide                                                                
NAME                       READY     STATUS    RESTARTS   AGE       IP            NODE                                        NOMINATED
controller-manager-hhhcq   1/1       Running   1          1h   ip-10-0-20-131.us-east-2.compute.internal   <none>   
controller-manager-sc9t9   1/1       Running   1          1h   ip-10-0-15-55.us-east-2.compute.internal    <none>   
controller-manager-xhbrk   1/1       Running   2          1h   ip-10-0-36-223.us-east-2.compute.internal   <none>

curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)"
It has build metrics, consistent with `oc get build --all-namespaces`:
openshift_build_total{phase="Complete",reason="",strategy="Docker"} 1
openshift_build_total{phase="Complete",reason="",strategy="Source"} 10
openshift_build_total{phase="Failed",reason="GenericBuildFailed",strategy="Source"} 3

Per this, moving to VERIFIED.
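
The consistency check above can be scripted. A sketch that sums the openshift_build_total series and compares the total against the build count; the sample values here are copied from the output above, and in a live cluster you would pipe the curl response in instead:

```shell
# Sample /metrics lines standing in for the curl response above.
metrics='openshift_build_total{phase="Complete",reason="",strategy="Docker"} 1
openshift_build_total{phase="Complete",reason="",strategy="Source"} 10
openshift_build_total{phase="Failed",reason="GenericBuildFailed",strategy="Source"} 3'

# Sum the per-label counts; this total should match the number of builds
# reported by `oc get build --all-namespaces`.
echo "$metrics" | awk '/^openshift_build_total/ {sum += $2} END {print sum}'
```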

Comment 8 Xingxing Xia 2019-01-28 09:56:26 UTC
BTW, when curling the service or the other 2 pod instances, the output only has openshift_build_info and no other build metrics.
Is this OK? (Same question as https://url.corp.redhat.com/MSTR-240-comment)

curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://controller-manager.openshift-controller-manager.svc:443/metrics | grep build
openshift_build_info{gitCommit="8868a98a7b",gitVersion="v4.0.0-0.148.0",major="4",minor="0+"} 1

curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" | grep build
openshift_build_info{gitCommit="8868a98a7b",gitVersion="v4.0.0-0.148.0",major="4",minor="0+"} 1

Comment 11 errata-xmlrpc 2019-06-04 10:41:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

