Bug 1552235

Summary: Prometheus is unable to scrape hosted router components due to iptables rules from openshift-ansible
Product: OpenShift Container Platform
Reporter: David H <david_hocky>
Component: Monitoring
Assignee: Paul Gier <pgier>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.7.0
CC: aivaraslaimikis, aos-bugs, cstark, jmalde, jokerman, juzhao, klaas, ksalunkh, mfojtik, minden, Miranda_Shutt, mmccomas, pasik, sauchter, spasquie, surbania, travi
Target Milestone: ---
Target Release: 3.10.z
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The default firewall settings block the router stats/metrics port.
Consequence: This prevents Prometheus from collecting metrics from the OpenShift router.
Fix: Open the firewall to allow connections to the router stats port.
Result: Prometheus can now collect metrics from the router.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-02-20 10:11:10 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
router targets are up (Flags: none)

Description David H 2018-03-06 18:50:45 UTC
Description of problem:

After deploying Prometheus (the same issue appears to be present in master), https://prometheus-openshift-metrics.<cluster fqdn>/targets shows "getsockopt: no route to host" when Prometheus tries to scrape the /metrics endpoint on the OpenShift hosted routers.

Version-Release number of selected component (if applicable):
Seen in release-3.7; I found no evident changes in master that would alter this behavior. The iptables rules on nodes where the hosted router is deployed are not updated to expose this port.

How reproducible:
Consistent

Steps to Reproduce:
1. Deploy openshift-ansible with prometheus and hosted router
2. Check prometheus target status

Actual results:

http://<node ip>:1936/metrics DOWN	instance="<node ip>:1936" ... Get http://<node ip>:1936/metrics: dial tcp <node ip>:1936: getsockopt: no route to host
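
To separate a firewall problem from a Prometheus configuration problem, the same TCP connection can be attempted by hand from the node or pod where Prometheus runs. The following is a minimal Python sketch, not part of openshift-ansible or Prometheus; the function name probe_tcp and the 3-second timeout are illustrative assumptions. A "No route to host" failure from it is consistent with the scrape error above, which is what a firewall rule that rejects the connection typically produces.

```python
import socket


def probe_tcp(host: str, port: int = 1936, timeout: float = 3.0) -> str:
    """Return "open" if a TCP connection succeeds, else a short failure reason.

    probe_tcp is a hypothetical helper for illustration only. Port 1936 is
    the router stats/metrics port named in this report.
    """
    try:
        # create_connection performs the same kind of TCP dial a scraper does.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all within the timeout (for example, packets dropped).
        return "timeout"
    except OSError as exc:
        # Covers "No route to host", "Connection refused", DNS failures, etc.
        return f"failed: {exc.strerror or exc}"
```

For example, probe_tcp("<node ip>") would be called with the address of each node that hosts a router; the placeholder is left as in the report.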

Expected results:

Expect that all "kubernetes-service-endpoints" scrape targets are Green

Additional info:

Initial PR proposed: https://github.com/openshift/openshift-ansible/pull/6636/files
Some concerns were raised about how the firewall module interacts with the hosts where the hosted router runs. This needs feedback from the deployment team on how they want this implemented, or on any proposed alternative.

Comment 1 Junqi Zhao 2018-06-10 23:58:23 UTC
*** Bug 1589023 has been marked as a duplicate of this bug. ***

Comment 3 Frederic Branczyk 2018-09-07 08:02:55 UTC
*** Bug 1625510 has been marked as a duplicate of this bug. ***

Comment 6 Klaas Demter 2018-11-05 08:39:44 UTC
The upstream issue was closed, but that is not correct. I still can't access all routers in a multi-infrastructure-node setup. Prometheus can only access one router; my guess is that it's the one running on the same node as Prometheus.

Comment 12 kedar 2019-01-15 05:29:02 UTC
Hello Team,

Any updates on this issue?

Regards,
Kedar

Comment 13 Paul Gier 2019-01-22 21:47:31 UTC
I don't see an easy way to open the router metrics port (1936) during install for only the router nodes, since the node firewall configuration mostly takes place before anything is done with the routers. Even if we could do that, I'm not sure how it would work post-install: if, for example, you wanted to move a router to a different node, you'd still need to open that port manually. So I've created a PR against 3.10 to optionally open that port for all nodes during install.
https://github.com/openshift/openshift-ansible/pull/11052

Comment 15 Junqi Zhao 2019-02-12 05:50:05 UTC
Tested with 
openshift-ansible-3.10.110-1.git.0.1e03ab3.el7.noarch.rpm
openshift-ansible-docs-3.10.110-1.git.0.1e03ab3.el7.noarch.rpm
openshift-ansible-playbooks-3.10.110-1.git.0.1e03ab3.el7.noarch.rpm
openshift-ansible-roles-3.10.110-1.git.0.1e03ab3.el7.noarch.rpm
openshift-ansible-test-3.10.110-1.git.0.1e03ab3.el7.noarch.rpm

Port 1936 is open on all nodes after install:
# iptables-save | grep 1936
-A KUBE-SEP-DFSWOTRTOQBRAYA4 -s 10.0.77.74/32 -m comment --comment "default/router:1936-tcp" -j KUBE-MARK-MASQ
-A KUBE-SEP-DFSWOTRTOQBRAYA4 -p tcp -m comment --comment "default/router:1936-tcp" -m tcp -j DNAT --to-destination 10.0.77.74:1936
-A KUBE-SERVICES ! -s 10.128.0.0/14 -d 172.30.139.20/32 -p tcp -m comment --comment "default/router:1936-tcp cluster IP" -m tcp --dport 1936 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 172.30.139.20/32 -p tcp -m comment --comment "default/router:1936-tcp cluster IP" -m tcp --dport 1936 -j KUBE-SVC-4JCRTMMYZAAYMIJ2
-A KUBE-SVC-4JCRTMMYZAAYMIJ2 -m comment --comment "default/router:1936-tcp" -j KUBE-SEP-DFSWOTRTOQBRAYA4
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 1936 -j ACCEPT
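
Note that `grep 1936` also matches the kube-proxy NAT rules (the KUBE-* chains) for the router service; the line that shows the firewall is open is the ACCEPT rule in OS_FIREWALL_ALLOW. A minimal Python sketch of a stricter check follows. The helper names port_allowed and _value_after are hypothetical, and the sketch only inspects rule text in the form shown above; it does not evaluate rule ordering or other chains.

```python
import shlex
from typing import List, Optional


def _value_after(tokens: List[str], flag: str) -> Optional[str]:
    """Return the token following `flag`, or None if absent (hypothetical helper)."""
    if flag in tokens:
        i = tokens.index(flag) + 1
        if i < len(tokens):
            return tokens[i]
    return None


def port_allowed(rules: str, port: int = 1936,
                 chain: str = "OS_FIREWALL_ALLOW") -> bool:
    """Return True if `rules` (iptables-save text) appends an ACCEPT rule
    for destination `port` to `chain`."""
    for line in rules.splitlines():
        try:
            # shlex keeps quoted --comment values together as one token.
            tokens = shlex.split(line)
        except ValueError:
            continue  # skip lines with unbalanced quotes
        if tokens[:2] != ["-A", chain]:
            continue  # not an append to the chain of interest
        if (_value_after(tokens, "--dport") == str(port)
                and _value_after(tokens, "-j") == "ACCEPT"):
            return True
    return False
```

The text passed in would be the output of iptables-save captured on each node (which requires root); with the output above, only the last line satisfies the check.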

Comment 16 Junqi Zhao 2019-02-12 05:50:40 UTC
Created attachment 1533903
router targets are up

Comment 18 errata-xmlrpc 2019-02-20 10:11:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0328