There is no scrape target for openshift controller manager, so build metrics aren't updated. This prevents "[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]" from passing (it's currently disabled).
Metrics need to be reported
I started working on this, but I'm seeing two issues.
First, the controller manager SA can't seem to validate the token Prometheus passes on the /metrics call:
E0118 22:20:03.258525 1 authentication.go:62] Unable to authenticate the request due to an error: tokenreviews.authentication.k8s.io is forbidden: User "system:serviceaccount:openshift-controller-manager:openshift-controller-manager-sa" cannot create tokenreviews.authentication.k8s.io at the cluster scope: no RBAC policy matched
Second, after granting the SA more permissions, I got it scraping, but I'm only seeing a subset of the metrics I would expect. It seems we're not getting metrics from a bunch of components: I see the infra metrics like "controller queue lengths" but not the component-level metrics like "build success counts". So something in the controller itself is no longer wired properly to report all the metrics it used to.
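For reference, a minimal sketch of one way to let the SA create tokenreviews. The comment above does not say which permissions were actually granted; binding the built-in system:auth-delegator cluster role is an assumption here, and grant_token_review is a hypothetical helper name.

```shell
# Hypothetical helper. Assumption: the system:auth-delegator cluster role is
# sufficient; the thread does not record which permissions were really added.
grant_token_review() {
  sa="${1:-openshift-controller-manager-sa}"
  ns="${2:-openshift-controller-manager}"

  # Bind the cluster role to the service account (-z names an SA in namespace -n).
  oc adm policy add-cluster-role-to-user system:auth-delegator -z "$sa" -n "$ns"

  # Ask the API server whether the SA may now create tokenreviews.
  oc auth can-i create tokenreviews.authentication.k8s.io \
    --as="system:serviceaccount:$ns:$sa"
}
```

Calling grant_token_review with no arguments targets the SA named in the error message; it needs a logged-in user with cluster-admin rights.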
Should be working now.
I checked the code in https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/60
It adds (cluster)roles, (cluster)rolebindings, servicemonitors, etc. for both the controller operator and the operand. But I'm not sure what needs to be verified for this bug; could you give some guidance?
PS: What does "scrape metrics" mean? Does it just mean "retrieve metrics"? And how does this bug differ from https://jira.coreos.com/browse/MSTR-225 ?
Yes, sorry: "scrape" means collect/retrieve.
https://jira.coreos.com/browse/MSTR-225 covered 3 other operators (kube api operator, kube controller operator, openshift api operator); this bug covers the openshift controller operator.
This bug also covers the openshift controller itself.
So you can look to that jira for the general idea of how to verify the metrics, but for this bug you should verify that we have:
1) metrics from the openshift controller operator (I'm not sure what the names of these metrics are, but hopefully that other jira ticket gives you some hints based on what those operators exposed; the openshift controller operator should expose similar metrics)
2) metrics from the openshift controller (such as build metrics)
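To see which metrics an endpoint actually exposes, the scrape output can be reduced to a list of unique metric names. This is only a sketch: metric_names is a hypothetical helper, and it assumes the endpoint returns the standard Prometheus text exposition format.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# each distinct metric name once, sorted.
metric_names() {
  # Skip "# HELP"/"# TYPE" comment lines and blank lines, then cut everything
  # from the first "{" (labels) or space (value) to the end of the line.
  awk '!/^#/ && NF { sub(/[{ ].*/, "", $0); print }' | sort -u
}
```

For example, piping the /metrics output of the operator or the operand through `metric_names | grep build` shows whether any build metrics are present.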
The https://jira.coreos.com/browse/MSTR-225 test covered all 4 master kube/openshift api/controller operators, but that case used `sa cluster-monitoring-operator -n openshift-monitoring` instead of sa prometheus-k8s.
Below I switch to sa prometheus-k8s. In the latest payload env, I checked the controller operator:
curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://metrics.openshift-controller-manager-operator.svc:443/metrics
It has the metrics depth, adds, queue_latency, work_duration, and retries, similar to the other master operators.
Then I checked the operand:
oc get pod -n openshift-controller-manager -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED
controller-manager-hhhcq   1/1     Running   1          1h    10.130.0.40   ip-10-0-20-131.us-east-2.compute.internal   <none>
controller-manager-sc9t9   1/1     Running   1          1h    10.129.0.36   ip-10-0-15-55.us-east-2.compute.internal    <none>
controller-manager-xhbrk   1/1     Running   2          1h    10.128.0.48   ip-10-0-36-223.us-east-2.compute.internal   <none>
curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://10.129.0.36:8443/metrics
It has build metrics, and they are consistent with `oc get build --all-namespaces`.
Per this, moving to VERIFIED.
BTW, when I curl the service or the other 2 pod instances, the output only has openshift_build_info and no other build metrics.
Is this OK? (Same question as https://url.corp.redhat.com/MSTR-240-comment)
curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://controller-manager.openshift-controller-manager.svc:443/metrics | grep build
curl -k -H "Authorization: Bearer $(oc sa get-token prometheus-k8s -n openshift-monitoring)" https://10.130.0.40:8443/metrics | grep build
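The per-pod comparison can be scripted instead of curling each IP by hand. A rough sketch, reusing the token command and port 8443 from the curls above; build_metrics_per_pod is a hypothetical helper name, and matching on the openshift_build prefix is an assumption based on the openshift_build_info name seen in the output.

```shell
# Hypothetical helper: for every pod in the operand namespace, print
# "<pod IP> <number of sample lines whose name starts with openshift_build>".
build_metrics_per_pod() {
  ns="${1:-openshift-controller-manager}"
  token="$(oc sa get-token prometheus-k8s -n openshift-monitoring)"

  # jsonpath prints the pod IPs separated by spaces.
  for ip in $(oc get pod -n "$ns" -o jsonpath='{.items[*].status.podIP}'); do
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    count="$(curl -sk -H "Authorization: Bearer $token" \
      "https://$ip:8443/metrics" | grep -c '^openshift_build' || true)"
    echo "$ip $count"
  done
}
```

Running build_metrics_per_pod from a host that can reach the pod network makes it easy to spot which instance reports the full set of build metrics and which report only openshift_build_info.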
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.