Bug 1563888 - Installing prometheus should update iptable rules for node-exporter
Summary: Installing prometheus should update iptable rules for node-exporter
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 3.11.0
Assignee: Simon Pasquier
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1600562 1603144
Reported: 2018-04-05 01:09 UTC by Gerald Nunn
Modified: 2019-02-28 13:17 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Port 9100 is blocked on all nodes by default. Consequence: Prometheus can't scrape the node_exporter service, which runs on the other nodes and listens on port 9100. Fix: The firewall configuration is modified to allow incoming TCP traffic for the 9000-10000 port range. Result: Prometheus can scrape the node_exporter services.
Clone Of:
: 1600562 1603144
Environment:
Last Closed: 2019-02-21 14:32:38 UTC
Target Upstream Version:


Attachments
prometheus-node-exporter target (132.80 KB, image/png)
2018-08-28 09:12 UTC, Junqi Zhao


Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift openshift-ansible issues 7999 | None | closed | Prometheus - kubernetes-nodes-exporter endpoints down | 2021-01-19 05:48:43 UTC
Github | openshift openshift-ansible pull 9072 | None | closed | Allow the 9k-10k port range for Prometheus | 2021-01-19 05:48:44 UTC
Red Hat Knowledge Base (Solution) | 3750891 | None | None | None | 2018-12-12 14:09:52 UTC
Red Hat Product Errata | RHBA-2018:2652 | None | None | None | 2018-10-11 07:20:11 UTC

Description Gerald Nunn 2018-04-05 01:09:51 UTC
Description of problem:

In OCP 3.9, installing Prometheus sets up node-exporter as a daemonset listening on host port 9100. The problem is that the iptables rules are not configured to allow port 9100, so scraping fails with "no route to host". For example, this is what I see in Prometheus with debug logging enabled:

level=debug ts=2018-04-05T00:49:22.480744133Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-nodes-exporter target=http://10.0.1.76:9100/metrics  msg="Scrape failed" err="Get http://10.0.1.76:9100/metrics:  dial tcp 10.0.1.76:9100: getsockopt: no route to host"
level=debug ts=2018-04-05T00:49:27.506758234Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-nodes-exporter target=http://10.0.1.65:9100/metrics  msg="Scrape failed" err="Get http://10.0.1.65:9100/metrics:  dial tcp 10.0.1.65:9100: getsockopt: no route to host"
...

Using the update_firewall.yml playbook from https://github.com/wkulhanek/openshift-prometheus/tree/master/node-exporter fixes the problem.
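
To see at a glance which node-exporter targets are affected, the debug log lines above can be filtered by scrape pool. The following is a minimal sketch, assuming the key=value log format shown in this report; `parse_logfmt` and `failed_targets` are hypothetical helper names, not part of Prometheus, and escaped quotes inside values are not handled.

```python
import re

# Matches key=value and key="quoted value" pairs in a logfmt-style line.
# Escaped quotes inside quoted values are not handled; this is only a sketch.
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_logfmt(line):
    """Return a dict of the key/value pairs found in one log line."""
    return {key: value.strip('"') for key, value in PAIR_RE.findall(line)}


def failed_targets(lines, scrape_pool="kubernetes-nodes-exporter"):
    """Collect, in order and without repeats, targets whose scrape failed."""
    targets = []
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("msg") != "Scrape failed":
            continue
        if fields.get("scrape_pool") != scrape_pool:
            continue
        target = fields.get("target")
        if target and target not in targets:
            targets.append(target)
    return targets


# Sample line copied from the log excerpt in this report.
SAMPLE = (
    'level=debug ts=2018-04-05T00:49:22.480744133Z caller=scrape.go:676 '
    'component="scrape manager" scrape_pool=kubernetes-nodes-exporter '
    'target=http://10.0.1.76:9100/metrics  msg="Scrape failed" '
    'err="Get http://10.0.1.76:9100/metrics:  dial tcp 10.0.1.76:9100: '
    'getsockopt: no route to host"'
)

print(failed_targets([SAMPLE]))
```

The same function can be pointed at another pool, for example `failed_targets(lines, scrape_pool="kubernetes-service-endpoints")`, to separate the unrelated port 1936 failures from the node-exporter ones.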

Version-Release number of selected component (if applicable):


How reproducible:

Always in AWS

Steps to Reproduce:
1. Install prometheus using advanced installer with openshift_hosted_prometheus_deploy=true in inventory

Actual results:

Scraping fails due to lack of iptable rule for 9100

Expected results:

Installer configures iptable rule for 9100, scraping works

Additional info:

I see other errors for scraping on port 1936, not sure if it's related:

level=debug ts=2018-04-05T01:04:27.626451171Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://10.0.1.152:1936/metrics  msg="Scrape failed" err="server returned HTTP status 403 Forbidden"
level=debug ts=2018-04-05T01:05:27.626622466Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://10.0.1.152:1936/metrics  msg="Scrape failed" err="server returned HTTP status 403 Forbidden"
...

Comment 1 Gerald Nunn 2018-04-05 01:14:47 UTC
I'm a Prometheus newbie. I see now that there is a status page for the targets, and that port 1936 is for the haproxy router, which is already being discussed on sme-openshift.

Comment 2 Josep 'Pep' Turro Mauri 2018-04-09 15:27:18 UTC
(In reply to Gerald Nunn from comment #0)
> How reproducible:
> 
> Always in AWS

Just to clarify: the problem isn't specific to AWS, right? 

It's true that different infrastructure providers will require some specific network settings (see e.g. https://github.com/openshift/openshift-ansible/pull/6920 ) but the node exporter port will still need to be opened at the node level.

> Expected results:
> 
> Installer configures iptable rule for 9100, scraping works

Submitted https://github.com/openshift/openshift-ansible/pull/7860 with a suggested fix.

> Additional info:
> level=debug ts=2018-04-05T01:05:27.626622466Z caller=scrape.go:676
> component="scrape manager" scrape_pool=kubernetes-service-endpoints
> target=http://10.0.1.152:1936/metrics  msg="Scrape failed" err="server
> returned HTTP status 403 Forbidden"

As you mentioned, the auth issue with the router metrics is unrelated to firewall ports; filed bug 1565095 to track that.

Comment 3 Gerald Nunn 2018-04-12 12:42:03 UTC
I do not believe it is specific to AWS, since the issue is with the node firewall/iptables and not with AWS security groups.
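
One way to tell a node-level firewall block apart from other failures is to attempt a plain TCP connection to the node-exporter port from the host running Prometheus. The "no route to host" text in the scrape logs corresponds to the `EHOSTUNREACH` error code. The following is a minimal sketch; `probe_tcp` is a hypothetical helper, and the default timeout is an arbitrary assumption.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Try a TCP connection and return a short description of the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout (for example, packets dropped).
        return "timeout"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            return "no route to host"
        if exc.errno == errno.ECONNREFUSED:
            # The host answered, but nothing is listening on the port.
            return "refused"
        return "error: %s" % exc
```

For example, `probe_tcp("10.0.1.76", 9100)` uses one of the node addresses from the log excerpt in the description. A result of "no route to host" would be consistent with the firewall explanation given in this report, while "refused" would suggest that node-exporter is not listening at all.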

Comment 8 Junqi Zhao 2018-06-10 23:56:14 UTC
*** Bug 1589023 has been marked as a duplicate of this bug. ***

Comment 9 Scott Dodson 2018-06-13 15:16:38 UTC
There's a proposed fix in this PR https://github.com/openshift/openshift-ansible/pull/7860

Comment 10 Simon Pasquier 2018-07-12 13:30:16 UTC
https://github.com/openshift/openshift-ansible/pull/9072 has been merged; it opens up the 9000-10000 port range (which includes port 9100 for node_exporter).

Comment 12 Junqi Zhao 2018-08-23 08:23:56 UTC
Depends on Bug 1608288, node-exporter port has changed to 9101

Comment 13 Junqi Zhao 2018-08-27 00:45:05 UTC
Depends on Bug 1608288, node-exporter port has changed to 9102

Comment 14 Junqi Zhao 2018-08-28 09:12:27 UTC
The prometheus-node-exporter target can be accessed, and
the 9000:10000 port range has been added to iptables:
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT
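
To confirm that a rule like the one above covers a given node-exporter port (9100, or the 9101 and 9102 values mentioned in earlier comments), the `--dport` range can be checked programmatically. This is a minimal sketch, assuming the simple `port` or `low:high` form of `--dport`; negated matches and open-ended ranges are deliberately not handled, and both function names are hypothetical.

```python
import re


def dport_range(rule):
    """Return (low, high) from the --dport option of a rule, or None."""
    match = re.search(r"--dport\s+(\d+)(?::(\d+))?", rule)
    if match is None:
        return None
    low = int(match.group(1))
    # A single port is treated as a range of one.
    high = int(match.group(2)) if match.group(2) else low
    return low, high


def rule_accepts_port(rule, port):
    """True if the rule jumps to ACCEPT and its --dport range covers port."""
    bounds = dport_range(rule)
    if bounds is None or not re.search(r"-j\s+ACCEPT\b", rule):
        return False
    low, high = bounds
    return low <= port <= high


# Rule copied from the verification comment in this report.
RULE = (
    "-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW "
    "-m tcp --dport 9000:10000 -j ACCEPT"
)

for port in (9100, 9101, 9102):
    print(port, rule_accepts_port(RULE, port))
```

Opening a range rather than a single port means the rule keeps working when the node-exporter port changes within 9000-10000.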


prometheus-node-exporter-v3.11.0-0.24.0.0

openshift-ansible version
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm

Comment 15 Junqi Zhao 2018-08-28 09:12:58 UTC
Created attachment 1479197
prometheus-node-exporter target

Comment 17 errata-xmlrpc 2018-10-11 07:19:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

Comment 20 trumbaut 2019-02-04 13:29:52 UTC
It looks like the provided fix is not applied during an OCP upgrade (3.10.59 to 3.11.59). After manually adding the following iptables rule to all nodes, the node-exporter target endpoints are available:

# iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT

However, as this rule is not saved to /etc/sysconfig/iptables it will be lost after restarting iptables.
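
Whether the rule would survive a restart can be checked by looking for it in the saved rules file. This is a minimal sketch, assuming the saved file stores the rule in the same textual form as the command above, which may not hold if options are written in a different order; `is_rule_persisted` is a hypothetical helper.

```python
def normalize(rule):
    """Collapse runs of whitespace so spacing differences do not matter."""
    return " ".join(rule.split())


def is_rule_persisted(saved_rules_text, rule):
    """True if the rule appears as a whole line in the saved rules text."""
    wanted = normalize(rule)
    return any(
        normalize(line) == wanted for line in saved_rules_text.splitlines()
    )


# Rule from the command above, without the leading "iptables".
RULE = (
    "-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW "
    "-m tcp --dport 9000:10000 -j ACCEPT"
)

# Made-up file contents for illustration only; not taken from a real node.
SAVED_WITHOUT_RULE = "*filter\n:OS_FIREWALL_ALLOW - [0:0]\nCOMMIT\n"
SAVED_WITH_RULE = "*filter\n:OS_FIREWALL_ALLOW - [0:0]\n" + RULE + "\nCOMMIT\n"

print(is_rule_persisted(SAVED_WITHOUT_RULE, RULE))
print(is_rule_persisted(SAVED_WITH_RULE, RULE))
```

On a node, the text would come from reading /etc/sysconfig/iptables, which typically requires root privileges.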

To be sure, we removed the old openshift-metrics project (Prometheus project for OCP 3.10) and manually removed the openshift-monitoring project:

# ansible-playbook -i [ inventory ] /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.ym -e openshift_cluster_monitoring_operator_install=true

Afterwards, we reinstalled the project:

# ansible-playbook -i [ inventory ] /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.ym

Port range 9000:10000 is open (due to our manual action?) but not saved to /etc/sysconfig/iptables.

The needed patch is in /usr/share/ansible/openshift-ansible/roles/openshift_node/defaults/main.yml:

----
- service: Prometheus monitoring
  port: 9000-10000/tcp
----

But this playbook might not have been executed using /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.ym?

Comment 21 trumbaut 2019-02-04 13:31:33 UTC
Some typos in the reply above. All mentions to /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.ym should have been /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.yml.

Comment 22 Simon Pasquier 2019-02-04 16:18:23 UTC
@trumbaut I'm not that familiar with openshift-ansible and the upgrade process. To be clear, the openshift-monitoring playbook doesn't modify any firewall rules, it just assumes that the firewall configuration is ok. It is likely that the upgrade playbook didn't apply the updated firewall configuration.

Comment 23 Simon Pasquier 2019-02-13 16:35:48 UTC
The original issue reported in this ticket was that Prometheus couldn't scrape node_exporter metrics on fresh OpenShift installations, and this has been fixed. The ticket was reopened because the same error happened for users upgrading from 3.x to 3.11, but I think that it is better tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1659441.

Comment 24 Simon Pasquier 2019-02-21 14:32:38 UTC
I'm closing this ticket again. After discussing offline with Saurabh Sadhale, we concluded that the issue that triggered the re-opening of this ticket is in fact tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1659441.

