Description of problem:

In OCP 3.9, when you install Prometheus it sets up the node-exporter as a daemonset listening on hostport 9100. The problem is that the iptables rules are not configured to allow 9100, and thus scraping fails with "No route to host". For example, this is what I see with debug logging on in Prometheus:

level=debug ts=2018-04-05T00:49:22.480744133Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-nodes-exporter target=http://10.0.1.76:9100/metrics msg="Scrape failed" err="Get http://10.0.1.76:9100/metrics: dial tcp 10.0.1.76:9100: getsockopt: no route to host"
level=debug ts=2018-04-05T00:49:27.506758234Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-nodes-exporter target=http://10.0.1.65:9100/metrics msg="Scrape failed" err="Get http://10.0.1.65:9100/metrics: dial tcp 10.0.1.65:9100: getsockopt: no route to host"
...

Using the update_firewall.yml playbook from https://github.com/wkulhanek/openshift-prometheus/tree/master/node-exporter fixes the problem.

Version-Release number of selected component (if applicable):


How reproducible:

Always in AWS

Steps to Reproduce:
1. Install Prometheus using the advanced installer with openshift_hosted_prometheus_deploy=true in the inventory
2.
3.

Actual results:

Scraping fails due to the lack of an iptables rule for 9100.

Expected results:

Installer configures an iptables rule for 9100; scraping works.

Additional info:

I see other errors for scraping on port 1936; not sure if it's related:

level=debug ts=2018-04-05T01:04:27.626451171Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://10.0.1.152:1936/metrics msg="Scrape failed" err="server returned HTTP status 403 Forbidden"
level=debug ts=2018-04-05T01:05:27.626622466Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://10.0.1.152:1936/metrics msg="Scrape failed" err="server returned HTTP status 403 Forbidden"
...
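For anyone needing a stopgap before an installer fix: a minimal manual workaround, assuming the OS_FIREWALL_ALLOW chain that OpenShift manages on each node (the chain name may differ in other setups), is to add the rule by hand on every node:

# iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9100 -j ACCEPT

This appears to be roughly what the update_firewall.yml playbook linked above automates across nodes; note that the rule is not persisted to /etc/sysconfig/iptables unless it is also saved there.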
I'm a Prometheus newbie; I see now there is a status page for the targets, and port 1936 is for the HAProxy router, which is already being discussed on sme-openshift.
(In reply to Gerald Nunn from comment #0)
> Description of problem:
>
> In OCP 3.9, when you install Prometheus it sets up the node-exporter as a
> daemonset listening on hostport 9100. The problem is that the iptables rules
> are not configured to allow 9100, and thus scraping fails with "No route to
> host". For example, this is what I see with debug logging on in Prometheus:
>
> level=debug ts=2018-04-05T00:49:22.480744133Z caller=scrape.go:676
> component="scrape manager" scrape_pool=kubernetes-nodes-exporter
> target=http://10.0.1.76:9100/metrics msg="Scrape failed" err="Get
> http://10.0.1.76:9100/metrics: dial tcp 10.0.1.76:9100: getsockopt: no
> route to host"
> level=debug ts=2018-04-05T00:49:27.506758234Z caller=scrape.go:676
> component="scrape manager" scrape_pool=kubernetes-nodes-exporter
> target=http://10.0.1.65:9100/metrics msg="Scrape failed" err="Get
> http://10.0.1.65:9100/metrics: dial tcp 10.0.1.65:9100: getsockopt: no
> route to host"
> ...
>
> Using the update_firewall.yml playbook from
> https://github.com/wkulhanek/openshift-prometheus/tree/master/node-exporter
> fixes the problem.
>
> Version-Release number of selected component (if applicable):
>
>
> How reproducible:
>
> Always in AWS

Just to clarify: the problem isn't specific to AWS, right? It's true that different infrastructure providers will require some specific network settings (see e.g. https://github.com/openshift/openshift-ansible/pull/6920), but the node exporter port will still need to be opened at the node level.

> Expected results:
>
> Installer configures an iptables rule for 9100; scraping works.

Submitted https://github.com/openshift/openshift-ansible/pull/7860 with a suggested fix.

> Additional info:
> level=debug ts=2018-04-05T01:05:27.626622466Z caller=scrape.go:676
> component="scrape manager" scrape_pool=kubernetes-service-endpoints
> target=http://10.0.1.152:1936/metrics msg="Scrape failed" err="server
> returned HTTP status 403 Forbidden"

As you mentioned, the auth issue with the router metrics is unrelated to firewall ports; filed bug 1565095 to track that.
I do not believe it is specific to AWS, since the issue is with the node firewall/iptables rules and not AWS security groups.
*** Bug 1589023 has been marked as a duplicate of this bug. ***
There's a proposed fix in this PR: https://github.com/openshift/openshift-ansible/pull/7860
https://github.com/openshift/openshift-ansible/pull/9072 has been merged, which opens up the 9000-10000 port range (including the 9100 port for node_exporter).
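To confirm the rule is present on a node after running the updated installer (assuming the default OS_FIREWALL_ALLOW chain), a quick check along these lines should show the ACCEPT rule for the range:

# iptables -L OS_FIREWALL_ALLOW -n | grep 9000:10000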
Depends on Bug 1608288; the node-exporter port has changed to 9101.
Depends on Bug 1608288; the node-exporter port has changed to 9102.
The prometheus-node-exporter target could be accessed; the 9000:10000 port range has been added in iptables:

-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT

prometheus-node-exporter-v3.11.0-0.24.0.0

openshift-ansible version:
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
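For reference, a node-exporter target can also be probed directly with curl, substituting a real node IP and whichever port node-exporter is listening on in your build (9100 in the original report; see the port-change comments above):

# curl -s http://10.0.1.76:9100/metrics | head

A healthy target returns Prometheus text-format metrics rather than "no route to host".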
Created attachment 1479197 [details] prometheus-node-exporter target
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652
Looks like the provided solution is not implemented during the upgrade from OCP 3.10.59 to 3.11.59. After manually adding the following iptables rule on all nodes, the node-exporter target endpoints are available:

# iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT

However, as this rule is not saved to /etc/sysconfig/iptables, it will be lost after restarting iptables.

To be sure, we removed the old openshift-metrics project (the Prometheus project for OCP 3.10) and manually removed the openshift-monitoring project:

# ansible-playbook -i [ inventory ] /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.yml -e openshift_cluster_monitoring_operator_install=true

Afterwards, we reinstalled the project:

# ansible-playbook -i [ inventory ] /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.yml

The port range 9000:10000 is open (due to our manual action?) but not saved to /etc/sysconfig/iptables.

The needed patch is in /usr/share/ansible/openshift-ansible/roles/openshift_node/defaults/main.yml:

----
- service: Prometheus monitoring
  port: 9000-10000/tcp
----

But perhaps this configuration is not applied when running /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.yml?
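One possible way to persist the manually added rule across iptables restarts on RHEL 7 (assuming the iptables service is in use rather than firewalld) is to save the running ruleset to the config file:

# iptables-save > /etc/sysconfig/iptables

or, with the iptables-services package installed, "# service iptables save" should do the same.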
@trumbaut I'm not that familiar with openshift-ansible and the upgrade process. To be clear, the openshift-monitoring playbook doesn't modify any firewall rules; it just assumes that the firewall configuration is OK. It is likely that the upgrade playbook didn't apply the updated firewall configuration.
The original issue reported in this ticket was that Prometheus couldn't scrape node_exporter metrics on fresh OpenShift installations, and this has been fixed. The ticket has been reopened because the same error happened for users upgrading from 3.x to 3.11, but I think that is better tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1659441.
I'm closing this ticket again. After discussing offline with Saurabh Sadhale, we concluded that the issue that triggered the re-opening of this ticket was in fact tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1659441.