Bug 1746144 - Port 10250 incorrectly configured in OSP Security Group Heat template
Summary: Port 10250 incorrectly configured in OSP Security Group Heat template
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Luis Tomas Bolivar
QA Contact: Jon Uriarte
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-27 18:26 UTC by Robert Bost
Modified: 2020-07-27 13:49 UTC
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-27 13:49:10 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/openshift-ansible pull 11865 (closed): Ensure port 10250 is accessible from nodes/pods (last updated 2020-12-24 14:48:11 UTC)
Red Hat Knowledge Base Solution 4376341 (last updated 2019-08-27 18:54:17 UTC)
Red Hat Product Errata RHBA-2020:2990 (last updated 2020-07-27 13:49:23 UTC)

Description Robert Bost 2019-08-27 18:26:37 UTC
Description of problem: OpenShift on OpenStack installation fails in the task "Approve node certificates when bootstrapping" because master -> node calls over port 10250 [1] time out with this message:

  dial tcp 10.125.66.66:10250: i/o timeout'\nTrying to reach: 'https://master-0.openshift.example.com:10250/healthz

[1] release-3.11 https://github.com/openshift/openshift-ansible/blob/a7e91cd9bb5e852a6f80cd71b4a181d7b9064a76/roles/lib_openshift/library/oc_csr_approve.py#L206-L220

How reproducible: Always for the customer.

Actual results:
Installation fails. The Security Group had to be modified by hand during installation for the Ansible task to succeed.

Expected results:
I would expect the node-secgrp rules to specify the master-secgrp as the remote_group_id.

release-3.11: https://github.com/openshift/openshift-ansible/blob/a7e91cd9bb5e852a6f80cd71b4a181d7b9064a76/roles/openshift_openstack/defaults/main.yml#L166-L185

Additional info:
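For illustration, the expected rule might look something like this sketch (key names follow my reading of the rule format in the role's defaults, and "master-secgrp" is a placeholder, so treat the whole entry as an assumption rather than the shipped configuration):

  # Hypothetical node-secgrp entry: allow kubelet traffic (10250/tcp)
  # into nodes only from members of the master security group.
  # "master-secgrp" is a placeholder name, not a value from the role.
  - direction: ingress
    protocol: tcp
    port_range_min: 10250
    port_range_max: 10250
    remote_mode: remote_group_id
    remote_group_id: master-secgrp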

Comment 1 Luis Tomas Bolivar 2019-08-28 09:01:54 UTC
Hello,

Can we get a bit more information? Is the blocked communication from a pod on a master to a node, or from the master node to another node? I'm asking because all the nodes (master, infra, and app-nodes) should have the node SG added (can you please check?), which allows traffic on port 10250 from any port that has that SG attached.

If this is from a pod, this might be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1607724. With namespace isolation it should already work, but for no isolation, I think just dropping this line should be enough: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/defaults/main.yml#L180.
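The rule in question looks roughly like this sketch (approximate, not the literal defaults; see the link above for the exact text), with the line to drop marked:

  # Approximate shape of the 10250 entry in
  # openshift_openstack_node_secgroup_rules:
  - direction: ingress
    protocol: tcp
    port_range_min: 10250
    port_range_max: 10250
    remote_mode: remote_group_id   # <- the line to drop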

Let me create a fix for it.

Comment 3 Robert Bost 2019-08-28 20:27:50 UTC
> Can we get a bit more information? Is the blocked communication from a pod on a master to a node, or from the master node to another node?

From what I could tell given the error, the blocked communication was master-to-node (more clarification on this below):

  dial tcp 10.125.66.66:10250: i/o timeout'\nTrying to reach: 'https://master-0.openshift.example.com:10250/healthz

> I'm asking because all the nodes (master, infra, and app-nodes) should have the node SG added (can you please check?), which allows traffic on port 10250 from any port that has that SG attached.

I'll attach the heat template from user having this issue.

> If this is from a pod, this might be a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1607724. With namespace isolation it should already work, but for no isolation,

The failing command is executed from here:

  release-3.11 branch: https://github.com/openshift/openshift-ansible/blob/0d71d0a7495fe23b2a5bb934f7b996d905a2c8ac/roles/lib_openshift/library/oc_csr_approve.py#L206-L220

That would essentially be `oc get --raw /api/v1/nodes/<NODE-NAME>/proxy/healthz`, which goes through the master-api pod, but that pod is part of the host network, so I don't think there is any dependency on the Kuryr CNI.

> I think just dropping this line should be enough: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/defaults/main.yml#L180.

I've not been able to test this (no quick access to an OSP cluster), but I guess your intent is that removing "remote_mode: remote_group_id" will cause the default remote_mode value of "remote_ip_prefix" to be used. And the default remote_ip_prefix value is "0.0.0.0/0", right?
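If so, the resulting rule would look roughly like this sketch (assuming the same rule format as above; the 0.0.0.0/0 default is exactly the assumption I am asking about):

  # Approximate shape of the 10250 entry once remote_mode is dropped:
  - direction: ingress
    protocol: tcp
    port_range_min: 10250
    port_range_max: 10250
    # remote_mode falls back to remote_ip_prefix; if its default is
    # 0.0.0.0/0 (assumption), port 10250 is reachable from any source IP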

Comment 5 Luis Tomas Bolivar 2019-08-29 06:59:55 UTC
(In reply to Robert Bost from comment #3)
> I guess your intent is that removing "remote_mode: remote_group_id" will cause the default remote_mode value of "remote_ip_prefix" to be used. And the default remote_ip_prefix value is "0.0.0.0/0", right?

Yes, and that fix has already been merged upstream.

Comment 6 Jon Uriarte 2020-07-09 10:40:23 UTC
Verified in openshift-ansible-3.11.235 on top of OSP 13.0.12 2020-07-01.1 puddle.

Installation completes successfully both with and without the namespace isolation feature enabled.

INSTALLER STATUS ********************************************************************************************************************************************
Initialization               : Complete (0:01:12)
Health Check                 : Complete (0:00:05)
Node Bootstrap Preparation   : Complete (0:11:15)
etcd Install                 : Complete (0:01:12)
Master Install               : Complete (0:10:00)
Master Additional Install    : Complete (0:01:52)
Node Join                    : Complete (0:02:09)
Hosted Install               : Complete (0:01:04)
Cluster Monitoring Operator  : Complete (0:01:46)
Web Console Install          : Complete (0:00:35)
Console Install              : Complete (0:00:56)
Service Catalog Install      : Complete (0:00:00)


The previously failing task now succeeds:
TASK [Approve node certificates when bootstrapping] ****************************
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
changed: [master-1.openshift.example.com]

Comment 9 errata-xmlrpc 2020-07-27 13:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990

