Bug 1666118

Summary: openshift-controller-manager metrics are not scraped
Product: OpenShift Container Platform
Component: Master
Version: 4.1.0
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Clayton Coleman <ccoleman>
Assignee: Ben Parees <bparees>
QA Contact: Xingxing Xia <xxia>
CC: aos-bugs, bparees, deads, jokerman, mmccomas, yinzhou
Doc Type: No Doc Update
Type: Bug
Last Closed: 2019-06-04 10:41:55 UTC

Description Clayton Coleman 2019-01-15 00:22:28 UTC
There is no scrape target for openshift controller manager, so build metrics aren't updated.  This prevents "[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]" from passing (it's currently disabled).

Metrics need to be reported.
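For context, the missing piece is a Prometheus scrape configuration for the controller-manager service. A minimal sketch of what such a scrape target could look like as a ServiceMonitor (the name, label selector, and port below are assumptions for illustration, not the manifest that was actually added later):

oc apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: openshift-controller-manager        # name is an assumption
  namespace: openshift-controller-manager
spec:
  endpoints:
  - port: https                              # port name is an assumption
    scheme: https
    interval: 30s
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    tlsConfig:
      insecureSkipVerify: true               # illustration only; the real config should verify the serving cert
  selector:
    matchLabels:
      app: controller-manager                # label is an assumption
EOF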

Comment 1 Ben Parees 2019-01-19 00:16:55 UTC
I started working on this, but I'm seeing two issues:

First, the controller manager SA can't seem to validate the token Prometheus is passing on the /metrics call:

E0118 22:20:03.258525       1 authentication.go:62] Unable to authenticate the request due to an error: tokenreviews.authentication.k8s.io is forbidden: User "system:serviceaccount:openshift-controller-manager:openshift-controller-manager-sa" cannot create tokenreviews.authentication.k8s.io at the cluster scope: no RBAC policy matched

After granting the SA more permissions, I got it scraping, but I'm only seeing a subset of the metrics I would expect. It seems we're not getting metrics from a number of components: I see the infra metrics like "controller queue lengths" but not the component-level metrics like "build success counts". So something in the controller itself is no longer wired properly to report all the metrics it used to.
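The permission that is missing is the ability to create TokenReviews for delegated authentication. A rough sketch of the kind of grant involved (the SA name comes from the error above; the approach via the built-in system:auth-delegator role is an assumption, the actual operator change may differ):

# Bind the built-in system:auth-delegator cluster role (create tokenreviews /
# subjectaccessreviews) to the controller manager service account.
oc adm policy add-cluster-role-to-user system:auth-delegator \
    -z openshift-controller-manager-sa -n openshift-controller-manager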

Comment 3 Ben Parees 2019-01-24 15:06:19 UTC
should be working now.

Comment 5 Xingxing Xia 2019-01-25 10:51:44 UTC
Checked the https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/60 code.
It adds (cluster)roles, (cluster)rolebindings, servicemonitors etc. for both the controller operator and the operand, but I'm not sure what needs to be verified for this bug. Could you give some guidance?
PS: What does "scrape metrics" mean? Does it just mean "retrieve metrics"? And how does this bug differ from https://jira.coreos.com/browse/MSTR-225 ?

Comment 6 Ben Parees 2019-01-25 16:10:54 UTC
Yes, sorry: "scrape" means collect/retrieve.

https://jira.coreos.com/browse/MSTR-225 covered 3 other operators (kube api operator, kube controller operator, openshift api operator); this bug covers the openshift controller operator.

This bug also covers the openshift controller itself.

So you can look to that jira for the general idea of how to verify the metrics, but for this bug you should verify that we have:

1) metrics from the openshift controller operator (I'm not sure what the names of these metrics are, but hopefully that other jira ticket gives you some hints based on what those operators exposed; the openshift controller operator should expose similar metrics)

2) metrics from the openshift controller itself (such as build metrics); see the query sketch below
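As a rough end-to-end sketch of 2), you could also query Prometheus itself once scraping works (assuming the default prometheus-k8s route and service account in openshift-monitoring; the route name and metric here are illustrative):

TOKEN=$(oc sa get-token prometheus-k8s -n openshift-monitoring)
HOST=$(oc get route prometheus-k8s -n openshift-monitoring -o jsonpath='{.spec.host}')
# A non-empty result set means Prometheus is actually collecting the controller's build metrics.
curl -ks -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query?query=openshift_build_total"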

Comment 7 Xingxing Xia 2019-01-28 09:55:11 UTC
The https://jira.coreos.com/browse/MSTR-225 testing covered all 4 master kube/openshift api/controller operators; that case used `sa cluster-monitoring-operator -n openshift-monitoring` instead of sa prometheus-k8s.
Below I switch to sa prometheus-k8s. In the latest payload env, I checked the controller operator:
curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://metrics.openshift-controller-manager-operator.svc:443/metrics
It has the metrics depth, adds, queue_latency, work_duration, and retries, similar to the other master operators.

Then I checked the operand:
oc get pod -n openshift-controller-manager -o wide
NAME                       READY     STATUS    RESTARTS   AGE       IP            NODE                                        NOMINATED NODE
controller-manager-hhhcq   1/1       Running   1          1h        10.130.0.40   ip-10-0-20-131.us-east-2.compute.internal   <none>
controller-manager-sc9t9   1/1       Running   1          1h        10.129.0.36   ip-10-0-15-55.us-east-2.compute.internal    <none>
controller-manager-xhbrk   1/1       Running   2          1h        10.128.0.48   ip-10-0-36-223.us-east-2.compute.internal   <none>

curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://10.129.0.36:8443/metrics
It has build metrics, consistent with `oc get build --all-namespaces`:
openshift_build_total{phase="Complete",reason="",strategy="Docker"} 1
openshift_build_total{phase="Complete",reason="",strategy="Source"} 10
openshift_build_total{phase="Failed",reason="GenericBuildFailed",strategy="Source"} 3

Per this, moving to VERIFIED.

Comment 8 Xingxing Xia 2019-01-28 09:56:26 UTC
BTW, when curling the service or either of the other 2 pod instances, the output only has openshift_build_info and no other build metrics.
Is this OK? (Same question as https://url.corp.redhat.com/MSTR-240-comment)

curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://controller-manager.openshift-controller-manager.svc:443/metrics | grep build
openshift_build_info{gitCommit="8868a98a7b",gitVersion="v4.0.0-0.148.0",major="4",minor="0+"} 1

curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://10.130.0.40:8443/metrics | grep build
openshift_build_info{gitCommit="8868a98a7b",gitVersion="v4.0.0-0.148.0",major="4",minor="0+"} 1
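One possible explanation (an assumption, not confirmed in this bug) is that the controller-manager pods use leader election and only the active leader runs the build controllers, so only that pod reports the full set of build metrics. A hedged way to check which pod holds the lock (the lock object's kind and location are assumptions):

# Look for the standard leader-election annotation on configmaps in the namespace;
# its holderIdentity field names the active pod.
oc get configmaps -n openshift-controller-manager -o yaml | grep 'control-plane.alpha.kubernetes.io/leader'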

Comment 11 errata-xmlrpc 2019-06-04 10:41:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758