Bug 1563888

Summary: Installing Prometheus should update iptables rules for node-exporter
Product: OpenShift Container Platform
Reporter: Gerald Nunn <gnunn>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED CURRENTRELEASE
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 3.9.0
CC: aos-bugs, ckoep, cstark, dapark, jmalde, jokerman, juzhao, knakayam, mmccomas, pdwyer, pep, spasquie, ssadhale, thomas.rumbaut
Target Milestone: ---
Keywords: Reopened
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: port 9100 is blocked on all nodes by default. Consequence: Prometheus cannot scrape the node_exporter services, which listen on port 9100 on the other nodes. Fix: the firewall configuration is modified to allow incoming TCP traffic for the 9000-10000 port range. Result: Prometheus can scrape the node_exporter services.
Story Points: ---
Clone Of:
Clones: 1600562, 1603144
Environment:
Last Closed: 2019-02-21 14:32:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1600562, 1603144    
Attachments:
prometheus-node-exporter target (flags: none)

Description Gerald Nunn 2018-04-05 01:09:51 UTC
Description of problem:

In OCP 3.9, when you install Prometheus it sets up the node-exporter as a daemonset listening on hostPort 9100. The problem is that the iptables rules are not configured to allow traffic on port 9100, so scraping fails with "No route to host". For example, this is what I see with debug logging enabled in Prometheus:

level=debug ts=2018-04-05T00:49:22.480744133Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-nodes-exporter target=http://10.0.1.76:9100/metrics  msg="Scrape failed" err="Get http://10.0.1.76:9100/metrics:  dial tcp 10.0.1.76:9100: getsockopt: no route to host"
level=debug ts=2018-04-05T00:49:27.506758234Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-nodes-exporter target=http://10.0.1.65:9100/metrics  msg="Scrape failed" err="Get http://10.0.1.65:9100/metrics:  dial tcp 10.0.1.65:9100: getsockopt: no route to host"
...

Using the update_firewall.yml playbook from https://github.com/wkulhanek/openshift-prometheus/tree/master/node-exporter fixes the problem.
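
For reference, a rough sketch of the kind of firewall task such a playbook would run on every node (illustrative only and based on assumptions about its contents, not the actual update_firewall.yml):

----
- hosts: nodes
  become: true
  tasks:
    # Open the node-exporter port in the chain OpenShift uses for node firewall rules.
    - name: Allow Prometheus scrapes of node-exporter on TCP/9100
      iptables:
        chain: OS_FIREWALL_ALLOW
        protocol: tcp
        destination_port: "9100"
        ctstate: NEW
        jump: ACCEPT
----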

Version-Release number of selected component (if applicable):


How reproducible:

Always in AWS

Steps to Reproduce:
1. Install Prometheus using the advanced installer with openshift_hosted_prometheus_deploy=true in the inventory (see the example inventory fragment below)
2.
3.
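
An inventory fragment for step 1 might look like this (illustrative; all other required inventory variables are omitted):

----
[OSEv3:vars]
openshift_hosted_prometheus_deploy=true
----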

Actual results:

Scraping fails due to the missing iptables rule for port 9100

Expected results:

The installer configures an iptables rule for port 9100 and scraping works

Additional info:

I also see scrape errors on port 1936; I'm not sure whether they are related:

level=debug ts=2018-04-05T01:04:27.626451171Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://10.0.1.152:1936/metrics  msg="Scrape failed" err="server returned HTTP status 403 Forbidden"
level=debug ts=2018-04-05T01:05:27.626622466Z caller=scrape.go:676 component="scrape manager" scrape_pool=kubernetes-service-endpoints target=http://10.0.1.152:1936/metrics  msg="Scrape failed" err="server returned HTTP status 403 Forbidden"
...

Comment 1 Gerald Nunn 2018-04-05 01:14:47 UTC
I'm a Prometheus newbie; I see now that there is a status page for the targets, and port 1936 is for the HAProxy router, which is already being discussed on sme-openshift.

Comment 2 Josep 'Pep' Turro Mauri 2018-04-09 15:27:18 UTC
(In reply to Gerald Nunn from comment #0)
> Description of problem:
> 
> In OCP 3.9, when you install prometheus it sets up the node-exporter as a
> daemonset listening on hostport 9100. The problem is that the iptable rules
> are not configured to allow 9100 and thus scraping fails with "No route to
> host". For example, this is what I see with debug logging on in prometheus:
> 
> level=debug ts=2018-04-05T00:49:22.480744133Z caller=scrape.go:676
> component="scrape manager" scrape_pool=kubernetes-nodes-exporter
> target=http://10.0.1.76:9100/metrics  msg="Scrape failed" err="Get
> http://10.0.1.76:9100/metrics:  dial tcp 10.0.1.76:9100: getsockopt: no
> route to host"
> level=debug ts=2018-04-05T00:49:27.506758234Z caller=scrape.go:676
> component="scrape manager" scrape_pool=kubernetes-nodes-exporter
> target=http://10.0.1.65:9100/metrics  msg="Scrape failed" err="Get
> http://10.0.1.65:9100/metrics:  dial tcp 10.0.1.65:9100: getsockopt: no
> route to host"
> ...
> 
> Using the update_firewall.yml playbook from
> https://github.com/wkulhanek/openshift-prometheus/tree/master/node-exporter
> fixes the problem.
> 
> Version-Release number of selected component (if applicable):
> 
> 
> How reproducible:
> 
> Always in AWS

Just to clarify: the problem isn't specific to AWS, right? 

It's true that different infrastructure providers will require some specific network settings (see e.g. https://github.com/openshift/openshift-ansible/pull/6920), but the node-exporter port will still need to be opened at the node level.

> Expected results:
> 
> Installer configures iptable rule for 9100, scraping works

Submitted https://github.com/openshift/openshift-ansible/pull/7860 with a suggested fix.

> Additional info:
> level=debug ts=2018-04-05T01:05:27.626622466Z caller=scrape.go:676
> component="scrape manager" scrape_pool=kubernetes-service-endpoints
> target=http://10.0.1.152:1936/metrics  msg="Scrape failed" err="server
> returned HTTP status 403 Forbidden"

As you mentioned, the auth issue with the router metrics is unrelated to firewall ports; filed bug 1565095 to track that.

Comment 3 Gerald Nunn 2018-04-12 12:42:03 UTC
I do not believe it is specific to AWS, since the issue is with the node firewall/iptables rules and not with AWS security groups.

Comment 8 Junqi Zhao 2018-06-10 23:56:14 UTC
*** Bug 1589023 has been marked as a duplicate of this bug. ***

Comment 9 Scott Dodson 2018-06-13 15:16:38 UTC
There's a proposed fix in this PR: https://github.com/openshift/openshift-ansible/pull/7860

Comment 10 Simon Pasquier 2018-07-12 13:30:16 UTC
https://github.com/openshift/openshift-ansible/pull/9072 has been merged, which opens up the 9000-10000 port range (including port 9100 for node_exporter).

Comment 12 Junqi Zhao 2018-08-23 08:23:56 UTC
Depends on Bug 1608288; the node-exporter port has changed to 9101.

Comment 13 Junqi Zhao 2018-08-27 00:45:05 UTC
Depends on Bug 1608288; the node-exporter port has changed to 9102.

Comment 14 Junqi Zhao 2018-08-28 09:12:27 UTC
The prometheus-node-exporter target can be accessed; the 9000:10000 port range is added in iptables:
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT
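
The presence of the rule on a node can be confirmed with something like (illustrative):

# iptables -S OS_FIREWALL_ALLOW | grep 9000:10000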


prometheus-node-exporter-v3.11.0-0.24.0.0

openshift-ansible version
openshift-ansible-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-docs-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-playbooks-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm
openshift-ansible-roles-3.11.0-0.24.0.git.0.3cd1597None.noarch.rpm

Comment 15 Junqi Zhao 2018-08-28 09:12:58 UTC
Created attachment 1479197 [details]
prometheus-node-exporter target

Comment 17 errata-xmlrpc 2018-10-11 07:19:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652

Comment 20 trumbaut 2019-02-04 13:29:52 UTC
It looks like the provided solution is not applied during an OCP upgrade (from 3.10.59 to 3.11.59). After manually adding the following iptables rule on all nodes, the node-exporter target endpoints become available:

# iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT

However, as this rule is not saved to /etc/sysconfig/iptables, it will be lost after restarting iptables.
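
Assuming the node uses the iptables-services setup that loads /etc/sysconfig/iptables at boot, the rule could be persisted manually with something like:

# iptables-save > /etc/sysconfig/iptables

(or "# service iptables save" on RHEL 7 with iptables-services installed).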

To be sure, we removed the old openshift-metrics project (Prometheus project for OCP 3.10) and manually removed the openshift-monitoring project:

# ansible-playbook -i [ inventory ] /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.ym -e openshift_cluster_monitoring_operator_install=true

Afterwards, we reinstalled the project:

# ansible-playbook -i [ inventory ] /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.ym

Port range 9000:10000 is open (due to our manual action?) but not saved to /etc/sysconfig/iptables.

The needed patch is in /usr/share/ansible/openshift-ansible/roles/openshift_node/defaults/main.yml:

----
- service: Prometheus monitoring
  port: 9000-10000/tcp
----

But perhaps this role is not executed by /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.ym?

Comment 21 trumbaut 2019-02-04 13:31:33 UTC
Some typos in the reply above: all mentions of /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.ym should have been /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.yml.

Comment 22 Simon Pasquier 2019-02-04 16:18:23 UTC
@trumbaut I'm not that familiar with openshift-ansible and the upgrade process. To be clear, the openshift-monitoring playbook doesn't modify any firewall rules; it just assumes that the firewall configuration is already correct. It is likely that the upgrade playbook didn't apply the updated firewall configuration.

Comment 23 Simon Pasquier 2019-02-13 16:35:48 UTC
The original issue reported in this ticket was that Prometheus couldn't scrape node_exporter metrics on fresh OpenShift installations, and this has been fixed. The ticket was reopened because the same error happened for users upgrading from 3.x to 3.11, but I think that is better tracked at https://bugzilla.redhat.com/show_bug.cgi?id=1659441.

Comment 24 Simon Pasquier 2019-02-21 14:32:38 UTC
I'm closing this ticket again. After discussing offline with Saurabh Sadhale, we concluded that the issue that triggered the reopening of this ticket is in fact tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1659441.