|Summary:||Prometheus is unable to scrape control plane components|
|Product:||OpenShift Container Platform||Reporter:||Derek Carr <decarr>|
|Component:||Master||Assignee:||Michal Fojtik <mfojtik>|
|Status:||CLOSED ERRATA||QA Contact:||Xingxing Xia <xxia>|
|Version:||4.1.0||CC:||anpicker, aos-bugs, erooth, fbranczy, jokerman, juzhao, mloibl, mmccomas, pkrupa, sjenning, surbania, xxia|
|Fixed In Version:||Doc Type:||If docs needed, set a value|
|Doc Text:||Story Points:||---|
|Last Closed:||2019-06-04 10:47:18 UTC||Type:||Bug|
|oVirt Team:||---||RHEL 7.3 requirements from Atomic Host:|
|Cloudforms Team:||---||Target Upstream Version:|
Description Derek Carr 2019-04-09 19:43:45 UTC
Description of problem:
Installed a 4.1 cluster, and Prometheus is unable to scrape the controller manager status to report whether it is up or down. Further investigation into the Prometheus targets shows that this is happening for multiple targets:

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

openshift-operator-lifecycle-manager/catalog-operator/0 (0/1 up)
Get https://10.128.0.8:8081/metrics: dial tcp 10.128.0.8:8081: connect: connection refused

openshift-operator-lifecycle-manager/olm-operator/0 (0/1 up)
Get https://10.128.0.10:8081/metrics: dial tcp 10.128.0.10:8081: connect: connection refused

openshift-sdn/monitor-sdn/0 (0/6 up)
http://10.0.133.168:9101/metrics: dial tcp 10.0.133.168:9101: connect: connection refused

Version-Release number of selected component (if applicable):
4.0.0-0.alpha-2019-04-09-154213

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
All monitoring targets are scrapable.

Additional info:
Comment 1 Frederic Branczyk 2019-04-10 14:15:12 UTC
We should create a BZ for each of these components, as each team owns its respective monitoring components. We already have an open PR to add checks to the e2e-aws test suite so this doesn't happen again once fixed: https://github.com/openshift/origin/pull/22513. The controller-manager/scheduler failure is due to the insecure port being disabled without migration. This is being fixed in https://github.com/openshift/cluster-monitoring-operator/pull/316 on the cluster-monitoring-operator, and the final fix for scraping to work is: https://github.com/openshift/installer/pull/1576. I'm opening separate bugs for OLM, catalog and SDN and will keep this bugzilla dedicated to scheduler/controller-manager.
Comment 2 Frederic Branczyk 2019-04-10 14:26:44 UTC
SDN: https://bugzilla.redhat.com/show_bug.cgi?id=1698525
catalog-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1698530
olm-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1698533
Comment 3 Frederic Branczyk 2019-04-11 08:53:57 UTC
Reassigning this to the master team, as we've done everything on our side and we need their support now.
Comment 4 Xingxing Xia 2019-04-11 09:46:19 UTC
*** Bug 1698722 has been marked as a duplicate of this bug. ***
Comment 5 Frederic Branczyk 2019-04-11 15:49:16 UTC
As part of the fix for this, re-enable the e2e tests in the cluster-monitoring-operator repo that were disabled here: https://github.com/openshift/cluster-monitoring-operator/pull/318 In the future we will also make sure that this configuration is handled by the controller-manager-operator and scheduler-operator respectively, and remove any dependency for this from the cluster-monitoring-operator, as each component owns its own monitoring configuration. That's planned for 4.2; for 4.1 we just want working metrics for core components back :)
Comment 6 Michal Fojtik 2019-04-12 07:31:36 UTC
The controller manager moved to the secure port after the rebase:

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

It is 10257 now (and HTTPS).
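For illustration only, the change described above (insecure HTTP on 10252 replaced by HTTPS on 10257) can be sketched as a URL rewrite. This is not how the cluster-monitoring-operator or the installer implement the fix; the function name `secure_endpoint` and the constants are hypothetical, and only the two port numbers and the example address come from this report.

```python
from urllib.parse import urlsplit, urlunsplit

# Port numbers taken from this bug report; the constant names are hypothetical.
OLD_INSECURE_PORT = 10252
NEW_SECURE_PORT = 10257


def secure_endpoint(url: str) -> str:
    """Rewrite an old insecure controller-manager scrape URL to the secure one.

    URLs that do not use the old insecure port are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.port != OLD_INSECURE_PORT:
        return url
    netloc = f"{parts.hostname}:{NEW_SECURE_PORT}"
    return urlunsplit(("https", netloc, parts.path, parts.query, parts.fragment))


print(secure_endpoint("http://10.0.139.202:10252/metrics"))
```

Note that switching the scheme alone is not enough for a real scrape: an HTTPS metrics endpoint typically also requires TLS and authentication settings on the Prometheus side, which this sketch does not cover.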
Comment 8 Frederic Branczyk 2019-04-16 07:02:11 UTC
*** Bug 1700060 has been marked as a duplicate of this bug. ***
Comment 9 Frederic Branczyk 2019-04-16 14:53:19 UTC
Moving to POST as all the PRs required to fix this have been opened.
Comment 10 Frederic Branczyk 2019-04-18 07:35:40 UTC
Status update: All PRs on the monitoring repos have been merged. The installer PR to open the new ports is still outstanding (https://github.com/openshift/installer/pull/1576). Once that's merged, the e2e tests that verify that scheduler and controller-manager metrics are collected have to be re-enabled. As monitoring involvement is done, moving back to the master component. Manually testing a cluster built from the above installer PR shows that this is the last thing needed for this to be resolved, but should there be any more issues on the cluster-monitoring-operator side, please feel free to move this back to us.
Comment 11 Michal Fojtik 2019-04-18 08:24:26 UTC
Thanks Frederic, I will move this to ON_QA when the installer pull request merges.
Comment 12 Xingxing Xia 2019-04-25 04:29:11 UTC
Verified in an environment with payload 4.1.0-0.nightly-2019-04-25-002910. The openshift-monitoring/kube-controller-manager targets are now UP on port 10257 and no longer show the issue.
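A check like the one above can also be scripted against the Prometheus HTTP API, whose `/api/v1/targets` endpoint reports a `health`, `scrapeUrl` and `lastError` for each active target. The following is a minimal sketch, not the verification procedure actually used here: the helper name `down_targets` is hypothetical, and the sample payload is hand-written for illustration (it reuses addresses from this report and is not real output).

```python
from typing import Any


def down_targets(payload: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (scrapeUrl, lastError) for every active target whose health is not "up"."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("scrapeUrl", ""), t.get("lastError", ""))
        for t in active
        if t.get("health") != "up"
    ]


# Hand-written example shaped like a /api/v1/targets response; not real cluster output.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "scrapeUrl": "https://10.0.139.202:10257/metrics",
                "health": "up",
                "lastError": "",
            },
            {
                "scrapeUrl": "http://10.0.133.168:9101/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
        ]
    },
}

for url, err in down_targets(sample):
    print(f"DOWN {url}: {err}")
```

In practice the payload would come from the JSON body returned by the Prometheus server; fetching it is left out because the route and credentials depend on the cluster.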
Comment 14 errata-xmlrpc 2019-06-04 10:47:18 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758