Bug 1698201 - Prometheus is unable to scrape control plane components
Summary: Prometheus is unable to scrape control plane components
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Michal Fojtik
QA Contact: Xingxing Xia
URL:
Whiteboard:
Duplicates: 1698722 1700060
Depends On:
Blocks:
Reported: 2019-04-09 19:43 UTC by Derek Carr
Modified: 2019-06-04 10:47 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:18 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:47:25 UTC

Description Derek Carr 2019-04-09 19:43:45 UTC
Description of problem:

Installed a 4.1 cluster, and Prometheus is unable to scrape the controller manager to report whether it's up or down. Further investigation into the Prometheus targets shows that this is happening for multiple targets:

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

openshift-operator-lifecycle-manager/catalog-operator/0 (0/1 up) 
Get https://10.128.0.8:8081/metrics: dial tcp 10.128.0.8:8081: connect: connection refused

openshift-operator-lifecycle-manager/olm-operator/0 (0/1 up) 
Get https://10.128.0.10:8081/metrics: dial tcp 10.128.0.10:8081: connect: connection refused

openshift-sdn/monitor-sdn/0 (0/6 up) 
Get http://10.0.133.168:9101/metrics: dial tcp 10.0.133.168:9101: connect: connection refused
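
For triage, a listing like the one above can be reduced to the target groups that have no healthy endpoints at all. The following is a minimal sketch, assuming the listing is available as plain text in the same "name (up/total up)" shape shown here; the names `TargetGroup`, `parse_target_groups` and `fully_down` are hypothetical helpers for illustration, not part of Prometheus.

```python
import re
from typing import List, NamedTuple


class TargetGroup(NamedTuple):
    name: str   # e.g. "openshift-sdn/monitor-sdn/0"
    up: int     # number of endpoints currently up
    total: int  # number of endpoints in the group


# Matches summary lines such as "openshift-sdn/monitor-sdn/0 (0/6 up)".
# Error lines ("Get http://...: dial tcp ...") do not match and are skipped.
_SUMMARY = re.compile(r"^(?P<name>\S+)\s+\((?P<up>\d+)/(?P<total>\d+) up\)$")


def parse_target_groups(text: str) -> List[TargetGroup]:
    """Extract the per-group up/total counts from a pasted target listing."""
    groups = []
    for line in text.splitlines():
        m = _SUMMARY.match(line.strip())
        if m:
            groups.append(TargetGroup(m["name"], int(m["up"]), int(m["total"])))
    return groups


def fully_down(groups: List[TargetGroup]) -> List[str]:
    """Return the names of groups where no endpoint is up."""
    return [g.name for g in groups if g.total > 0 and g.up == 0]
```

Applied to the listing in this description, every group would be reported as fully down, which matches the "connection refused" errors shown for each target.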

Version-Release number of selected component (if applicable):
4.0.0-0.alpha-2019-04-09-154213

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
All monitoring targets are scrapable.

Additional info:

Comment 1 Frederic Branczyk 2019-04-10 14:15:12 UTC
We should create a BZ for each of these components, as each team owns their respective monitoring components. We already have an open PR to add checks to the e2e-aws test suite so this doesn't happen again once fixed: https://github.com/openshift/origin/pull/22513.

The controller-manager/scheduler failure is due to the insecure port being disabled without migration. This is being fixed on the cluster-monitoring-operator side in https://github.com/openshift/cluster-monitoring-operator/pull/316, and the final fix for scraping to work is https://github.com/openshift/installer/pull/1576.

I'm opening separate bugs for OLM, catalog and SDN and will keep this bugzilla dedicated for scheduler/controller-manager.

Comment 3 Frederic Branczyk 2019-04-11 08:53:57 UTC
Reassigning this to the master team, as we've done everything on our side and we need their support now.

Comment 4 Xingxing Xia 2019-04-11 09:46:19 UTC
*** Bug 1698722 has been marked as a duplicate of this bug. ***

Comment 5 Frederic Branczyk 2019-04-11 15:49:16 UTC
As part of the fix for this, re-enable the e2e tests in the cluster-monitoring-operator repo that were disabled here: https://github.com/openshift/cluster-monitoring-operator/pull/318

In the future we will also make sure that this configuration is handled by the controller-manager-operator and scheduler-operator respectively, and remove any dependency for this from the cluster-monitoring-operator, as each component owns its own monitoring configuration. That's planned for 4.2; for 4.1 we just want working metrics for core components back :)

Comment 6 Michal Fojtik 2019-04-12 07:31:36 UTC
controller manager moved to secure port after rebase:

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

It is 10257 now (and HTTPS).
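
To illustrate the change, here is a minimal sketch of how the old scrape URL maps to the new one, assuming only what is stated in this comment (port 10252 over HTTP becomes port 10257 over HTTPS). The function name `to_secure_target` is hypothetical, and the sketch handles IPv4 addresses and hostnames only; it does not deal with the TLS or authentication settings the scraper would also need.

```python
from urllib.parse import urlsplit, urlunsplit

INSECURE_PORT = 10252  # old controller-manager port from the failing target
SECURE_PORT = 10257    # new port named in this comment (served over HTTPS)


def to_secure_target(url: str) -> str:
    """Rewrite an old insecure-port scrape URL to the secure port and scheme.

    URLs that do not use the old insecure port are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.port != INSECURE_PORT:
        return url
    netloc = f"{parts.hostname}:{SECURE_PORT}"
    return urlunsplit(("https", netloc, parts.path, parts.query, parts.fragment))
```

For the failing target above, `to_secure_target("http://10.0.139.202:10252/metrics")` yields `https://10.0.139.202:10257/metrics`.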

Comment 8 Frederic Branczyk 2019-04-16 07:02:11 UTC
*** Bug 1700060 has been marked as a duplicate of this bug. ***

Comment 9 Frederic Branczyk 2019-04-16 14:53:19 UTC
Moving to POST as all the PRs required to fix this have been opened.

Comment 10 Frederic Branczyk 2019-04-18 07:35:40 UTC
Status update: All PRs on the monitoring repos have been merged. The installer PR to open the new ports is still outstanding (https://github.com/openshift/installer/pull/1576). Once that's merged, the e2e tests that verify that scheduler and controller-manager metrics are collected have to be re-enabled. As monitoring involvement is done, moving back to the master component. Manually testing a cluster built from the above installer PR shows that this is the last thing needed for this to be resolved, but should there be any more issues on the cluster-monitoring-operator side, please feel free to move this back to us.

Comment 11 Michal Fojtik 2019-04-18 08:24:26 UTC
Thanks Frederic, I will move this to ON_QA when the installer pull request merges.

Comment 12 Xingxing Xia 2019-04-25 04:29:11 UTC
Verified in an environment with payload 4.1.0-0.nightly-2019-04-25-002910. The openshift-monitoring/kube-controller-manager targets are now UP on port 10257 and no longer have the issue.
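
A check like this could also be scripted against the result of an instant query for the `up` metric from the Prometheus HTTP API. The following is a minimal sketch, assuming the JSON response has already been fetched and decoded into a dict; the function name `all_targets_up` is hypothetical, and the `job` label value to pass in (for example "kube-controller-manager") is an assumption about how the targets are labelled in a given cluster.

```python
from typing import Any, Dict


def all_targets_up(response: Dict[str, Any], job: str) -> bool:
    """Return True if every `up` sample for the given job label equals 1.

    `response` is the decoded body of a Prometheus instant query for `up`.
    Returns False if the query failed or no samples exist for the job.
    """
    if response.get("status") != "success":
        return False
    samples = [
        s for s in response["data"]["result"]
        if s["metric"].get("job") == job
    ]
    # Each sample's "value" is a [timestamp, "<value as string>"] pair.
    return bool(samples) and all(float(s["value"][1]) == 1.0 for s in samples)
```

Returning False when no samples match is a deliberate choice here: a job that has disappeared from the results should not be mistaken for a healthy one.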

Comment 14 errata-xmlrpc 2019-06-04 10:47:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

