Description of problem:
Installed a 4.1 cluster, and Prometheus is unable to scrape the controller manager status to report whether it's up or down. Further investigation into the Prometheus targets shows that this is happening for multiple targets:

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

openshift-operator-lifecycle-manager/catalog-operator/0 (0/1 up)
Get https://10.128.0.8:8081/metrics: dial tcp 10.128.0.8:8081: connect: connection refused

openshift-operator-lifecycle-manager/olm-operator/0 (0/1 up)
Get https://10.128.0.10:8081/metrics: dial tcp 10.128.0.10:8081: connect: connection refused

openshift-sdn/monitor-sdn/0 (0/6 up)
http://10.0.133.168:9101/metrics: dial tcp 10.0.133.168:9101: connect: connection refused

Version-Release number of selected component (if applicable):
4.0.0-0.alpha-2019-04-09-154213

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
All monitoring targets are scrapable.

Additional info:
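For anyone who wants to reproduce the per-target "(N/M up)" summary above without the Prometheus UI, here is a minimal sketch. It assumes the JSON was fetched from Prometheus' `/api/v1/targets` endpoint and that grouping by the `job` label is close enough to the scrape pool names shown in the UI. That grouping is an assumption, and the sample payload below is made up for illustration.

```python
from collections import defaultdict


def summarize_targets(payload):
    """Group active targets by their `job` label and count the healthy ones.

    `payload` is the decoded JSON body of GET /api/v1/targets.
    Grouping by `job` is an assumption; the UI groups by scrape pool.
    """
    summary = defaultdict(lambda: {"up": 0, "total": 0, "errors": []})
    for target in payload["data"]["activeTargets"]:
        job = target.get("labels", {}).get("job", "unknown")
        entry = summary[job]
        entry["total"] += 1
        if target.get("health") == "up":
            entry["up"] += 1
        elif target.get("lastError"):
            # Keep the scrape error so the cause is visible next to the count.
            entry["errors"].append(target["lastError"])
    return dict(summary)


def format_summary(summary):
    """Render one 'job (up/total up)' line per job, sorted by job name."""
    return [f"{job} ({s['up']}/{s['total']} up)" for job, s in sorted(summary.items())]


# Hypothetical sample shaped like the API response (not real cluster output).
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "kube-controller-manager"},
                "health": "down",
                "lastError": "connect: connection refused",
            },
            {"labels": {"job": "node-exporter"}, "health": "up", "lastError": ""},
        ]
    },
}

for line in format_summary(summarize_targets(sample)):
    print(line)
```

Any job whose up count is below its total is a candidate for a separate bug against the owning component.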
We should create a BZ for each of these components, as each team owns its respective monitoring components. We already have an open PR to add checks to the e2e-aws test suite so this doesn't happen again once fixed: https://github.com/openshift/origin/pull/22513. The controller-manager/scheduler failure is due to the insecure port being disabled without migration. This is being fixed in https://github.com/openshift/cluster-monitoring-operator/pull/316 on the cluster-monitoring-operator, and the final fix for scraping to work is: https://github.com/openshift/installer/pull/1576. I'm opening separate bugs for OLM, catalog and SDN and will keep this bugzilla dedicated to scheduler/controller-manager.
SDN: https://bugzilla.redhat.com/show_bug.cgi?id=1698525
catalog-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1698530
olm-operator: https://bugzilla.redhat.com/show_bug.cgi?id=1698533
Reassigning this to the master team, as we've done everything on our side and we need their support now.
*** Bug 1698722 has been marked as a duplicate of this bug. ***
As part of the fix for this, please re-enable the e2e tests in the cluster-monitoring-operator repo that were disabled here: https://github.com/openshift/cluster-monitoring-operator/pull/318 In the future we will also make sure that this configuration is handled by the controller-manager-operator and scheduler-operator respectively, and remove any dependency on the cluster-monitoring-operator for it, as each component owns its own monitoring configuration. That's planned for 4.2; for 4.1 we just want working metrics for core components back :)
The controller manager moved to the secure port after the rebase:

openshift-monitoring/kube-controller-manager/0 (0/3 up)
Get http://10.0.139.202:10252/metrics: dial tcp 10.0.139.202:10252: connect: connection refused

It is 10257 now (and HTTPS).
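A quick way to confirm which of the two ports is actually listening on a master is a plain TCP probe. This is only a sketch: it checks that the port accepts connections, not that `/metrics` answers over HTTPS or that Prometheus is authorized to scrape it. The host in the usage note is the address from the error above, and `port_open` and `probe` are names made up for this example.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


def probe(host, ports):
    """Map each port in `ports` to whether it accepts TCP connections."""
    return {port: port_open(host, port) for port in ports}
```

Calling `probe("10.0.139.202", [10252, 10257])` from a machine that can reach the master network would be expected to show 10252 closed and 10257 open on a cluster where the move to the secure port has happened.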
*** Bug 1700060 has been marked as a duplicate of this bug. ***
Moving to POST as all the PRs required to fix this have been opened.
Status update: All PRs on the monitoring repos have been merged. The installer PR to open the new ports is still outstanding (https://github.com/openshift/installer/pull/1576). Once that's merged, the e2e tests that verify that scheduler and controller-manager metrics are collected have to be re-enabled. As monitoring involvement is done, moving back to the master component. Manually testing a cluster built from the above installer PR shows that this is the last thing needed for this to be resolved, but should there be any more issues on the cluster-monitoring-operator side, please feel free to move this back to us.
Thanks Frederic, I will move this to ON_QA when the installer pull request merges.
Verified in an environment with payload 4.1.0-0.nightly-2019-04-25-002910. The openshift-monitoring/kube-controller-manager targets are now UP on port 10257 and the issue no longer occurs.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758