Bug 1565095 - Prometheus can't access router metrics
Summary: Prometheus can't access router metrics
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.9.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1588010 1619998
 
Reported: 2018-04-09 11:26 UTC by Josep 'Pep' Turro Mauri
Modified: 2018-10-08 12:44 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Prometheus service account doesn't have the required permissions to access the metrics endpoint of the router.
Consequence: Prometheus fails to scrape the router's metrics.
Fix: The Prometheus service account is granted an additional role with permissions to access the metrics endpoint.
Result: Prometheus can pull metrics from the router.
Clone Of:
Clones: 1588010
Environment:
Last Closed: 2018-10-08 12:44:07 UTC


Attachments
openshift-router target (136.68 KB, image/png)
2018-06-04 07:26 UTC, Junqi Zhao


Links
System ID Priority Status Summary Last Updated
Origin (Github) 17685 None None None 2018-04-09 11:25:59 UTC

Description Josep 'Pep' Turro Mauri 2018-04-09 11:26:00 UTC
Description of problem:

When deploying Prometheus on OCP 3.9 using openshift-ansible, the router's metrics are not available: the router metrics endpoint is protected and prometheus can't scrape it.

Version-Release number of selected component (if applicable):

atomic-openshift-3.9.14-1.git.0.4efa2ca.el7.x86_64
openshift-ansible-3.9.14-1.git.3.c62bc34.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy prometheus metrics on an OCP 3.9 cluster via openshift-ansible:

https://docs.openshift.com/container-platform/3.9/install_config/cluster_metrics.html#openshift-prometheus

2. Check the kubernetes-service-endpoints target for the router metrics endpoint

Actual results:

level=debug ts=2018-04-09T11:18:56.431809488Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://192.168.55.143:1936/metrics msg="Scrape failed" err="server returned HTTP status 403 Forbidden"

Expected results:

Router metrics can be scraped by prometheus

Additional info:

This is reported upstream in https://github.com/openshift/origin/issues/17685

Comment 3 Simon Pasquier 2018-05-18 09:35:29 UTC
The upstream bug is fixed on master (upcoming 3.10).

Comment 4 Junqi Zhao 2018-05-22 00:23:43 UTC
Doc is LGTM

Comment 5 Junqi Zhao 2018-05-22 00:33:17 UTC
@Oved

The Target Release is set to 3.11; I think it should be 3.10.

Comment 6 Junqi Zhao 2018-05-22 07:44:38 UTC
We need new Prometheus images to test this defect; the following configuration is not in /etc/prometheus/prometheus.yml of the prometheus container:
      # Scrape config for the router
      - job_name: 'openshift-router'
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt
          server_name: router.default.svc
        bearer_token_file: /var/run/secrets/kubernetes.io/scraper/token
        kubernetes_sd_configs:
        - role: endpoints
          namespaces:
            names:
            - default
        relabel_configs:
        - source_labels: [__meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: router;1936-tcp
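
The `keep` rule in the snippet above can be read as follows: Prometheus joins the values of the listed source labels with the default separator `;` and keeps the target only if the whole joined string matches the regex. Below is a minimal Python sketch of that behaviour, not Prometheus code. The function name `keep` and the sample label sets are made up for illustration, and Python's `re` module stands in for the RE2 engine Prometheus uses.

```python
import re

def keep(labels, source_labels, regex, separator=";"):
    """Return True if a target would survive a `keep` relabel rule (illustrative only)."""
    # Join the source label values with the separator; a missing label counts as "".
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, which fullmatch mimics here.
    return re.fullmatch(regex, joined) is not None

SOURCE_LABELS = [
    "__meta_kubernetes_service_name",
    "__meta_kubernetes_endpoint_port_name",
]

# Hypothetical discovered targets in the "default" namespace.
router_stats = {
    "__meta_kubernetes_service_name": "router",
    "__meta_kubernetes_endpoint_port_name": "1936-tcp",
}
router_http = {
    "__meta_kubernetes_service_name": "router",
    "__meta_kubernetes_endpoint_port_name": "80-tcp",
}

print(keep(router_stats, SOURCE_LABELS, "router;1936-tcp"))
print(keep(router_http, SOURCE_LABELS, "router;1936-tcp"))
```

Under these assumptions only the endpoint of the `router` service whose port is named `1936-tcp` stays in the `openshift-router` job; every other endpoint in the namespace is dropped.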

Comment 7 Simon Pasquier 2018-05-23 09:49:59 UTC
Right, there's a difference between the upstream issue that was focused on 'oc cluster up' + the example Prometheus template [1] and this BZ which targets OpenShift Ansible. IIUC the existing playbooks don't configure Prometheus to scrape the router endpoint: this is the configuration snippet that you're not getting currently. I'll address this.

That being said, the merged PR [2] is relevant for both cases.

[1] https://github.com/openshift/origin/tree/master/examples/prometheus
[2] https://github.com/openshift/origin/pull/19318

Comment 8 Simon Pasquier 2018-05-24 12:29:47 UTC
I've checked further: with the current openshift/origin and openshift-ansible, Prometheus doesn't scrape the router's metrics because the router's service no longer has the "prometheus.io/scrape: true" annotation.

I've submitted https://github.com/openshift/openshift-ansible/pull/8512 for Prometheus to scrape the metrics.
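
To illustrate why the missing annotation matters: the sketch below assumes the kubernetes-service-endpoints job follows the conventional Prometheus example configuration, which keeps a target only when the service annotation `prometheus.io/scrape` is `true`. That assumption is not confirmed in this bug. The helper names `annotation_label` and `is_scraped` are made up for the example.

```python
import re

PREFIX = "__meta_kubernetes_service_annotation_"

def annotation_label(annotation):
    """Map a service annotation name to the meta label service discovery exposes."""
    # Characters that are not valid in label names are replaced with underscores.
    return PREFIX + re.sub(r"[^a-zA-Z0-9_]", "_", annotation)

def is_scraped(annotations):
    """Mimic a `keep` rule on the scrape annotation with regex `true` (assumed rule)."""
    labels = {annotation_label(key): value for key, value in annotations.items()}
    value = labels.get(annotation_label("prometheus.io/scrape"), "")
    return re.fullmatch("true", value) is not None

print(annotation_label("prometheus.io/scrape"))
print(is_scraped({"prometheus.io/scrape": "true"}))  # annotated service
print(is_scraped({}))                                # router service without the annotation
```

With such a rule, a router service that lacks the annotation never becomes a target of the generic job, which is consistent with a dedicated scrape job for the router being needed.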

Comment 9 Simon Pasquier 2018-06-01 11:06:33 UTC
https://github.com/openshift/openshift-ansible/pull/8512 has been merged.

Comment 10 Junqi Zhao 2018-06-04 07:25:51 UTC
The router-metrics clusterrole is added in the prometheus namespace, and router metrics can be accessed.

openshift-ansible version:
openshift-ansible-3.10.0-0.58.0.git.0.d8f6377.el7.noarch.rpm

Comment 11 Junqi Zhao 2018-06-04 07:26:19 UTC
Created attachment 1447321
openshift-router target

