1698201 – Prometheus is unable to scrape control plane components

Summary: Prometheus is unable to scrape control plane components

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Master
Sub Component:
Version:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.1.0
Assignee:	Michal Fojtik
QA Contact:	Xingxing Xia
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1698722 1700060
Depends On:
Blocks:

Reported:	2019-04-09 19:43 UTC by Derek Carr
Modified:	2019-06-04 10:47 UTC
CC List:	12 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-06-04 10:47:18 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2019:0758	0	None	None	None	2019-06-04 10:47:25 UTC

Description Derek Carr 2019-04-09 19:43:45 UTC

Description of problem:

I installed a 4.1 cluster, and Prometheus is unable to scrape the controller manager to report whether it is up or down. Further investigation of the Prometheus targets shows that this is happening for multiple targets:

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

openshift-operator-lifecycle-manager/catalog-operator/0 (0/1 up) 
Get https://10.128.0.8:8081/metrics: dial tcp 10.128.0.8:8081: connect: connection refused

openshift-operator-lifecycle-manager/olm-operator/0 (0/1 up) 
Get https://10.128.0.10:8081/metrics: dial tcp 10.128.0.10:8081: connect: connection refused

openshift-sdn/monitor-sdn/0 (0/6 up) 
http://10.0.133.168:9101/metrics: dial tcp 10.0.133.168:9101: connect: connection refused
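
For triage, the target errors above can be grouped by port. The following is only a minimal sketch: the regular expression and the `parse_target_error` helper are hypothetical names written for this report, not part of Prometheus, and the leading "Get " is treated as optional because one pasted line lacks it.

```python
import re

# Matches lines such as:
#   Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused
# The leading "Get " is optional because one of the pasted lines omits it.
TARGET_ERROR = re.compile(
    r"^(?:Get\s+)?(?P<scheme>https?)://(?P<host>[^:/\s]+):(?P<port>\d+)"
    r"(?P<path>/\S*?):\s+(?P<error>.+)$"
)


def parse_target_error(line):
    """Return scheme, host, port, path and error text, or None if the line does not match."""
    match = TARGET_ERROR.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["port"] = int(fields["port"])
    return fields


# The four error lines quoted in this description.
lines = [
    "Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused",
    "Get https://10.128.0.8:8081/metrics: dial tcp 10.128.0.8:8081: connect: connection refused",
    "Get https://10.128.0.10:8081/metrics: dial tcp 10.128.0.10:8081: connect: connection refused",
    "http://10.0.133.168:9101/metrics: dial tcp 10.0.133.168:9101: connect: connection refused",
]

parsed = [parse_target_error(line) for line in lines]

# Distinct ports on which the connection was refused.
refused_ports = sorted(
    {p["port"] for p in parsed if p and p["error"].endswith("connection refused")}
)
print(refused_ports)
```

Grouping by port makes it easier to see that the failures belong to different components, each listening on its own port.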

Version-Release number of selected component (if applicable):
4.0.0-0.alpha-2019-04-09-154213

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
All monitoring targets are scrapable.

Additional info:

Comment 1 Frederic Branczyk 2019-04-10 14:15:12 UTC

We should create a BZ for each of these components, as each team owns its respective monitoring components. We already have an open PR to add checks to the e2e-aws test suite so this doesn't happen again once fixed: https://github.com/openshift/origin/pull/22513.

The controller-manager/scheduler failure is due to the insecure port being disabled without a migration. This is being fixed on the cluster-monitoring-operator side in https://github.com/openshift/cluster-monitoring-operator/pull/316, and the final fix needed for scraping to work is https://github.com/openshift/installer/pull/1576.

I'm opening separate bugs for OLM, catalog and SDN and will keep this Bugzilla dedicated to scheduler/controller-manager.

Comment 2 Frederic Branczyk 2019-04-10 14:26:44 UTC

SDN: https://bugzilla.redhat.com/show_bug.cgi?id=1698525
catalog-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1698530
olm-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1698533

Comment 3 Frederic Branczyk 2019-04-11 08:53:57 UTC

Reassigning this to the master team, as we've done everything on our side and we need their support now.

Comment 4 Xingxing Xia 2019-04-11 09:46:19 UTC

*** Bug 1698722 has been marked as a duplicate of this bug. ***

Comment 5 Frederic Branczyk 2019-04-11 15:49:16 UTC

As part of the fix for this, re-enable the e2e tests in the cluster-monitoring-operator repo that were disabled here: https://github.com/openshift/cluster-monitoring-operator/pull/318

In the future we will also make sure that this configuration is handled by the controller-manager-operator and scheduler-operator respectively, and remove any dependency on the cluster-monitoring-operator for this, as each component owns its own monitoring configuration. That's planned for 4.2; for 4.1 we just want working metrics for core components back :)

Comment 6 Michal Fojtik 2019-04-12 07:31:36 UTC

controller manager moved to secure port after rebase:

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

It is 10257 now (and HTTPS).
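
The port and scheme change described in this comment can be illustrated with a small sketch. `to_secure_endpoint` is a hypothetical helper written for this report; it only rewrites the URL and does not handle the TLS or authentication settings that a real HTTPS scrape would presumably also need.

```python
from urllib.parse import urlsplit, urlunsplit

# Port change reported in this bug for kube-controller-manager.
INSECURE_PORT = 10252  # old HTTP port, now refusing connections
SECURE_PORT = 10257    # new HTTPS port


def to_secure_endpoint(url):
    """Rewrite an insecure controller-manager scrape URL to the secure one.

    URLs that do not use the insecure port are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.port != INSECURE_PORT:
        return url
    netloc = f"{parts.hostname}:{SECURE_PORT}"
    return urlunsplit(("https", netloc, parts.path, parts.query, parts.fragment))


old_url = "http://10.0.139.202:10252/metrics"
new_url = to_secure_endpoint(old_url)
print(new_url)  # https://10.0.139.202:10257/metrics
```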

Comment 8 Frederic Branczyk 2019-04-16 07:02:11 UTC

*** Bug 1700060 has been marked as a duplicate of this bug. ***

Comment 9 Frederic Branczyk 2019-04-16 14:53:19 UTC

Moving to POST as all the PRs required to fix this have been opened.

Comment 10 Frederic Branczyk 2019-04-18 07:35:40 UTC

Status update: All PRs on the monitoring repos have been merged. The installer PR to open the new ports is still outstanding (https://github.com/openshift/installer/pull/1576). Once that's merged, the e2e tests that verify that scheduler and controller-manager metrics are collected have to be re-enabled. As monitoring's involvement is done, I'm moving this back to the master component. Manually testing a cluster built from the above installer PR shows that this is the last thing needed for this to be resolved, but should there be any more issues on the cluster-monitoring-operator side, please feel free to move this back to us.

Comment 11 Michal Fojtik 2019-04-18 08:24:26 UTC

Thanks Frederic, I will move this to ON_QA when the installer pull request merges.

Comment 12 Xingxing Xia 2019-04-25 04:29:11 UTC

Verified in an environment with payload 4.1.0-0.nightly-2019-04-25-002910. The openshift-monitoring/kube-controller-manager targets are now UP on port 10257 and the issue no longer occurs.
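
A check like this could also be scripted against the JSON returned by the Prometheus `/api/v1/targets` endpoint. The sketch below is an assumption-laden illustration: `summarize_targets` is a hypothetical helper, the `job` label value is assumed, and the sample payload is hand-written rather than captured from the verified cluster (only the first address appears in this bug).

```python
from collections import defaultdict


def summarize_targets(api_response):
    """Count up/total active targets per job label from a /api/v1/targets response body."""
    summary = defaultdict(lambda: {"up": 0, "total": 0})
    for target in api_response["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        summary[job]["total"] += 1
        if target["health"] == "up":
            summary[job]["up"] += 1
    return dict(summary)


# Hypothetical, hand-written payload; the job label and the last two addresses are assumptions.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "kube-controller-manager"},
             "scrapeUrl": "https://10.0.139.202:10257/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "kube-controller-manager"},
             "scrapeUrl": "https://10.0.0.2:10257/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "kube-controller-manager"},
             "scrapeUrl": "https://10.0.0.3:10257/metrics",
             "health": "up", "lastError": ""},
        ]
    },
}

summary = summarize_targets(sample)
print(summary)
```

A result of 3 up out of 3 total for the controller-manager job would correspond to the "(3/3 up)" state on the targets page, the opposite of the "(0/3 up)" originally reported.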

Comment 14 errata-xmlrpc 2019-06-04 10:47:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
