Bug 1659441

Summary: Port range 9000-10000 needs to be opened in iptables explicitly after upgrading from 3.10 to 3.11
Product: OpenShift Container Platform Reporter: Saurabh Sadhale <ssadhale>
Component: Monitoring Assignee: Simon Pasquier <spasquie>
Status: CLOSED ERRATA QA Contact: Junqi Zhao <juzhao>
Severity: high Docs Contact:
Priority: high    
Version: 3.11.0 CC: adeshpan, aos-bugs, bbennett, cdc, ddelcian, fbranczy, gferrazs, jokerman, juzhao, lmeyer, minden, mmccomas, Nikolaus.Eppinger, pkanthal, spasquie, surbania, travi, vrutkovs, wmeng
Target Milestone: --- Keywords: Reopened
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: the 9000-10000 port range isn't opened after upgrading from 3.10 to 3.11. Consequence: Prometheus can't scrape the infrastructure services such as node_exporter. Fix: the upgrade playbook is modified to apply the updated firewall rules. Result: Prometheus can scrape the infrastructure services such as node_exporter.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-02-20 14:11:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Saurabh Sadhale 2018-12-14 11:37:47 UTC

Comment 1 Saurabh Sadhale 2018-12-14 11:47:31 UTC
Description of problem:

After upgrading from 3.10 to 3.11, port 9100 was not opened in iptables for cluster monitoring; it had to be opened explicitly to make monitoring work.

Version-Release number of selected component (if applicable):

rpm -qa | grep openshift-ansible

openshift-ansible-playbooks-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-docs-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-3.11.43-1.git.0.fa69a02.el7.noarch
openshift-ansible-roles-3.11.43-1.git.0.fa69a02.el7.noarch

How reproducible: 


Steps to Reproduce:
1. Upgrade from 3.10 to 3.11.

Actual results:
Port 9100 was not opened.

Expected results:
The cluster monitoring should work without the need to open the port explicitly.

Additional info:

Comment 2 Ben Bennett 2018-12-17 19:39:23 UTC
This is by design... we do not want to manipulate the node's iptables rules to allow this.  The cluster admin has to choose what traffic they want to allow in to the cluster and make the appropriate changes.

Comment 3 Nikolaus Eppinger 2018-12-18 11:03:32 UTC
The node-exporter pods don't work without this port, and Prometheus needs it to be functional.

Comment 4 Nikolaus Eppinger 2018-12-19 09:30:01 UTC
I found BZ 1563888, but in this case these ports were not opened by the update.

Comment 8 Ben Bennett 2019-01-17 19:13:39 UTC
Moving to the install team since if the port is supposed to be opened when Prometheus is installed, then that would be the team to do it.

Comment 9 Vadim Rutkovsky 2019-01-18 13:42:27 UTC
(In reply to Ben Bennett from comment #8)
> Moving to the install team since if the port is supposed to be opened when
> Prometheus is installed, then that would be the team to do it.

Shouldn't Prometheus operator take care of that?

Comment 21 Nikolaus Eppinger 2019-01-31 09:22:52 UTC
We upgraded from 3.7 to 3.11.

Comment 31 Simon Pasquier 2019-02-13 09:58:35 UTC
PR to the openshift-ansible repository to run the firewall task during upgrade: https://github.com/openshift/openshift-ansible/pull/11186

Comment 32 Simon Pasquier 2019-02-13 16:30:31 UTC
The upstream PR has been merged. @Junqi Zhao, can it be moved to ON_QA?

Comment 34 Junqi Zhao 2019-02-14 01:06:21 UTC
(In reply to Simon Pasquier from comment #32)
> The upstream PR has been merged. @Junqi Zhao, can it be moved to ON_QA?

ON_QA now; I will verify it today. However, we never had the original error in our environment, and the 9000-10000 port range was already open before the fix. So all I can do for this bug is upgrade OCP from 3.10 to 3.11, install cluster monitoring, and then check whether the 9000-10000 port range is open.

Do you have some suggestions on how to verify it?

Comment 35 Simon Pasquier 2019-02-14 11:51:53 UTC
To test locally, I manually removed the iptables rule that opened ports 9000-10000, then ran the upgrade playbook and checked that the rule was added back.
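
The manual procedure above could be written down as a checklist-style script. This is only a sketch under assumptions: the rule specification is a guess at what the firewall task installs in the OS_FIREWALL_ALLOW chain, the playbook path and inventory location are assumed defaults for an RPM-based install, and `run` and `verify_steps` are hypothetical helper names. By default it only prints the commands instead of executing them. In a real cluster the iptables commands would be run as root on a node and the playbook from the Ansible host.

```shell
#!/bin/sh
# Sketch of the verification procedure: remove the rule, re-run the upgrade
# playbook, then check that the rule is back. Prints commands unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# Assumed rule specification (chain name followed by the match and target).
RULE='OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT'

# Assumed locations; override through the environment if they differ.
INVENTORY="${INVENTORY:-/etc/ansible/hosts}"
PLAYBOOK="${PLAYBOOK:-/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.yml}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

verify_steps() {
    # 1. Delete the rule that opens ports 9000-10000.
    #    $RULE is intentionally unquoted so it splits into separate arguments.
    run iptables -D $RULE
    # 2. Apply the upgrade playbook again.
    run ansible-playbook -i "$INVENTORY" "$PLAYBOOK"
    # 3. Check that the rule exists again (iptables -C exits non-zero if not).
    run iptables -C $RULE
}
```

Sourcing the file and calling `verify_steps` prints the three commands; setting `DRY_RUN=0` would execute them, which only makes sense where both iptables and the playbook are reachable from the same shell.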

Comment 37 Junqi Zhao 2019-02-15 09:59:51 UTC
Tested with the ose-ansible:3.11.82-5 image, which packages openshift-ansible-3.11.82-3.git.0.9718d0a.el7. The issue is fixed: ports 9000-10000 are opened in iptables.

# docker run -u root --rm -it --entrypoint=/bin/rpm brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/openshift3/ose-ansible:v3.11.82-5 -qa | grep ansible
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
ansible-2.6.13-1.el7ae.noarch
openshift-ansible-roles-3.11.82-3.git.0.9718d0a.el7.noarch
openshift-ansible-3.11.82-3.git.0.9718d0a.el7.noarch
openshift-ansible-playbooks-3.11.82-3.git.0.9718d0a.el7.noarch
openshift-ansible-docs-3.11.82-3.git.0.9718d0a.el7.noarch

Steps:
1. Install a 3.10 environment.
2. Manually remove the iptables rule.
3. Upgrade OCP from 3.10 to 3.11.
4. Check the ports:
# iptables-save | grep 9000:10000
-A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 9000:10000 -j ACCEPT

# netstat -nalp | grep '9101'
tcp        0      0 127.0.0.1:9101          0.0.0.0:*               LISTEN      15582/node_exporter 
tcp        0      0 127.0.0.1:52168         127.0.0.1:9101          ESTABLISHED 15703/kube-rbac-pro 
tcp        0      0 127.0.0.1:9101          127.0.0.1:52168         ESTABLISHED 15582/node_exporter
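
For a scripted version of the iptables-save check above, a small helper can test whether a specific port (for example 9100, the port from the original report) falls inside an allowed --dport range, rather than grepping for the literal string 9000:10000. This is a minimal sketch: the function name `port_in_allowed_range` is hypothetical, and it assumes the rule lives in the OS_FIREWALL_ALLOW chain in the same form as the output above.

```shell
#!/bin/sh
# port_in_allowed_range PORT
# Reads iptables-save output on stdin. Succeeds (exit 0) if an ACCEPT rule in
# the OS_FIREWALL_ALLOW chain has a --dport value or range that covers PORT.
port_in_allowed_range() {
    awk -v port="$1" '
        $1 == "-A" && $2 == "OS_FIREWALL_ALLOW" && /-j ACCEPT/ {
            for (i = 1; i < NF; i++) {
                if ($i == "--dport") {
                    # --dport is either a single port or a LOW:HIGH range.
                    n = split($(i + 1), r, ":")
                    lo = r[1] + 0
                    hi = (n > 1 ? r[2] : r[1]) + 0
                    if (port + 0 >= lo && port + 0 <= hi) found = 1
                }
            }
        }
        END { exit (found ? 0 : 1) }
    '
}
```

On a node it might be used as `iptables-save | port_in_allowed_range 9100 && echo "9100 is allowed"`. It only inspects the one chain, so an ACCEPT or DROP elsewhere in the ruleset is not taken into account.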

Comment 39 errata-xmlrpc 2019-02-20 14:11:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0326