Bug 1698201

Summary:	Prometheus is unable to scrape control plane components
Product:	OpenShift Container Platform	Reporter:	Derek Carr <decarr>
Component:	Master	Assignee:	Michal Fojtik <mfojtik>
Status:	CLOSED ERRATA	QA Contact:	Xingxing Xia <xxia>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	4.1.0	CC:	anpicker, aos-bugs, erooth, fbranczy, jokerman, juzhao, mloibl, mmccomas, pkrupa, sjenning, surbania, xxia
Target Milestone:	---
Target Release:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-06-04 10:47:18 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Derek Carr 2019-04-09 19:43:45 UTC

Description of problem:

Installed a 4.1 cluster, and Prometheus is unable to scrape the controller manager status to report whether it's up or down. Further investigation into the Prometheus targets shows that this is happening for multiple targets.

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

openshift-operator-lifecycle-manager/catalog-operator/0 (0/1 up) 
Get https://10.128.0.8:8081/metrics: dial tcp 10.128.0.8:8081: connect: connection refused

openshift-operator-lifecycle-manager/olm-operator/0 (0/1 up) 
Get https://10.128.0.10:8081/metrics: dial tcp 10.128.0.10:8081: connect: connection refused

openshift-sdn/monitor-sdn/0 (0/6 up) 
Get http://10.0.133.168:9101/metrics: dial tcp 10.0.133.168:9101: connect: connection refused
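
The per-target counts above (for example "0/3 up") can also be derived from the Prometheus HTTP API instead of being read off the targets page. The following is a minimal sketch, assuming the JSON body of GET /api/v1/targets has already been fetched and parsed. The helper names summarize_targets and format_summary are illustrative and not part of this report, and the fallback to the job label is an assumption for Prometheus versions that do not return scrapePool.

```python
from collections import defaultdict


def summarize_targets(targets_response):
    """Group active targets by scrape pool; count healthy ones and keep errors.

    `targets_response` is the parsed JSON body of Prometheus' /api/v1/targets.
    """
    pools = defaultdict(lambda: {"up": 0, "total": 0, "errors": []})
    for target in targets_response["data"]["activeTargets"]:
        # Assumption: fall back to the job label when scrapePool is absent.
        name = target.get("scrapePool") or target["labels"]["job"]
        pool = pools[name]
        pool["total"] += 1
        if target["health"] == "up":
            pool["up"] += 1
        elif target.get("lastError"):
            pool["errors"].append(target["lastError"])
    return dict(pools)


def format_summary(pools):
    """Render one line per pool, followed by its indented scrape errors."""
    lines = []
    for name in sorted(pools):
        info = pools[name]
        lines.append(f"{name} ({info['up']}/{info['total']} up)")
        lines.extend(f"  {err}" for err in info["errors"])
    return "\n".join(lines)
```

Feeding the parsed response into format_summary(summarize_targets(response)) gives one line per scrape pool in the same shape as the listing above, which makes it easy to diff before and after a fix.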

Version-Release number of selected component (if applicable):
4.0.0-0.alpha-2019-04-09-154213

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:
All monitoring targets are scrapable.

Additional info:

Comment 1 Frederic Branczyk 2019-04-10 14:15:12 UTC

We should create a BZ for each of these components, as each team owns its respective monitoring components. We already have an open PR to add checks to the e2e-aws test suite so this doesn't happen again once fixed: https://github.com/openshift/origin/pull/22513.

The controller-manager/scheduler failure is due to the insecure port being disabled without migration. This is being fixed in https://github.com/openshift/cluster-monitoring-operator/pull/316 on the cluster-monitoring-operator, and the final fix for scraping to work is: https://github.com/openshift/installer/pull/1576.

I'm opening separate bugs for OLM, catalog and SDN and will keep this bugzilla dedicated to scheduler/controller-manager.

Comment 2 Frederic Branczyk 2019-04-10 14:26:44 UTC

SDN: https://bugzilla.redhat.com/show_bug.cgi?id=1698525
catalog-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1698530
olm-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1698533

Comment 3 Frederic Branczyk 2019-04-11 08:53:57 UTC

Reassigning this to the master team, as we've done everything on our side and we need their support now.

Comment 4 Xingxing Xia 2019-04-11 09:46:19 UTC

*** Bug 1698722 has been marked as a duplicate of this bug. ***

Comment 5 Frederic Branczyk 2019-04-11 15:49:16 UTC

As part of the fix for this, re-enable the e2e tests in the cluster-monitoring-operator repo that were disabled here: https://github.com/openshift/cluster-monitoring-operator/pull/318

In the future we will also make sure that this configuration is handled by the controller-manager-operator and scheduler-operator respectively, and remove any dependency for this from the cluster-monitoring-operator, as each component owns its own monitoring configuration. That's planned for 4.2; for 4.1 we just want working metrics for core components back :)

Comment 6 Michal Fojtik 2019-04-12 07:31:36 UTC

The controller manager moved to the secure port after the rebase:

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

It is 10257 now (and HTTPS).
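
The port and scheme change described above can be expressed as a small URL rewrite. This is a sketch only: the function name to_secure_metrics_url is hypothetical, the 10252 to 10257 mapping comes from this comment, and the scheduler pair (10251 to 10259) is the upstream default and is an assumption here because this bug does not state it. The sketch handles IPv4 or hostname targets; a real scrape over HTTPS would typically also need CA and credential configuration, which is not shown.

```python
from urllib.parse import urlsplit, urlunsplit

# Insecure port -> secure port.
# 10252 -> 10257 (kube-controller-manager) is taken from this bug.
# 10251 -> 10259 (kube-scheduler) is an assumption based on upstream defaults.
SECURE_PORTS = {10252: 10257, 10251: 10259}


def to_secure_metrics_url(url):
    """Rewrite an insecure http metrics URL to its https equivalent.

    URLs whose port is not in SECURE_PORTS are returned unchanged.
    """
    parts = urlsplit(url)
    secure_port = SECURE_PORTS.get(parts.port)
    if secure_port is None:
        return url
    netloc = f"{parts.hostname}:{secure_port}"
    return urlunsplit(("https", netloc, parts.path, parts.query, parts.fragment))
```

For the failing target quoted above, to_secure_metrics_url("http://10.0.139.202:10252/metrics") yields the HTTPS URL on port 10257, while unrelated targets such as the SDN one on port 9101 pass through untouched.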

Comment 8 Frederic Branczyk 2019-04-16 07:02:11 UTC

*** Bug 1700060 has been marked as a duplicate of this bug. ***

Comment 9 Frederic Branczyk 2019-04-16 14:53:19 UTC

Moving to POST as all the PRs required to fix this have been opened.

Comment 10 Frederic Branczyk 2019-04-18 07:35:40 UTC

Status update: All PRs on the monitoring repos have been merged. The installer PR to open the new ports is still outstanding (https://github.com/openshift/installer/pull/1576). Once that's merged, the e2e tests that verify that scheduler and controller-manager metrics are collected have to be re-enabled. As monitoring involvement is done, moving back to the master component. Manually testing a cluster built from the above installer PR shows that this is the last thing needed for this to be resolved, but should there be any more issues on the cluster-monitoring-operator side, please feel free to move this back to us.
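
When checking whether the new ports are reachable, it can help to separate "nothing is listening" from "traffic is being dropped". The sketch below is a generic TCP probe, not something taken from the PRs above, and the name probe_port is hypothetical. As a rule of thumb, a refused connection usually means the host was reached but nothing accepted the connection on that port, while a timeout often points to packets being dropped on the way, for example by a firewall rule.

```python
import socket


def probe_port(host, port, timeout=3.0):
    """Attempt a TCP connect and classify the outcome.

    Returns one of: 'open', 'refused', 'timeout', 'error'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with a reset: reachable, but nothing listening.
        return "refused"
    except socket.timeout:
        # No answer within `timeout` seconds: possibly filtered traffic.
        return "timeout"
    except OSError:
        # Anything else, e.g. no route to host or name resolution failure.
        return "error"
```

Running probe_port against a master node's address on port 10257 from a host inside the cluster network gives a quick first signal before looking at the Prometheus targets page.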

Comment 11 Michal Fojtik 2019-04-18 08:24:26 UTC

Thanks Frederic, I will move this to ON_QA when the installer pull request merges.

Comment 12 Xingxing Xia 2019-04-25 04:29:11 UTC

Verified in an environment with payload 4.1.0-0.nightly-2019-04-25-002910. The openshift-monitoring/kube-controller-manager targets are now UP on port 10257 and no longer show the issue.

Comment 14 errata-xmlrpc 2019-06-04 10:47:18 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758