Bug 1746144

Summary: Port 10250 incorrectly configured in OSP Security Group Heat template
Product: OpenShift Container Platform Reporter: Robert Bost <rbost>
Component: Installer    Assignee: Luis Tomas Bolivar <ltomasbo>
Installer sub component: openshift-ansible QA Contact: Jon Uriarte <juriarte>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: bmilne, ltomasbo
Version: 3.11.0    Keywords: Triaged
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-07-27 13:49:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Robert Bost 2019-08-27 18:26:37 UTC
Description of problem: OpenShift on OpenStack installation fails in the task "Approve node certificates when bootstrapping" because Master -> Node communication over port 10250 [1] fails with this message:

  dial tcp 10.125.66.66:10250: i/o timeout'\nTrying to reach: 'https://master-0.openshift.example.com:10250/healthz

[1] release-3.11 https://github.com/openshift/openshift-ansible/blob/a7e91cd9bb5e852a6f80cd71b4a181d7b9064a76/roles/lib_openshift/library/oc_csr_approve.py#L206-L220

How reproducible: Always for the customer.

Actual results:
Installation fails. Custom modifications to the Security Group were required during installation for the Ansible task to succeed.

Expected results:
I would expect the node-secgrp rules to specify master-secgrp as the remote_group_id.
release-3.11: https://github.com/openshift/openshift-ansible/blob/a7e91cd9bb5e852a6f80cd71b4a181d7b9064a76/roles/openshift_openstack/defaults/main.yml#L166-L185

Additional info:
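For illustration, the rule that gates port 10250 in the node security group defaults has roughly this shape (a sketch based on the linked defaults/main.yml; the variable name and exact layout may differ from the file):

  openshift_openstack_node_secgroup_rules:
    # kubelet port 10250: with remote_mode set to remote_group_id and no
    # explicit remote group, the rendered rule only admits traffic from
    # ports that carry this same (node) security group.
    - direction: ingress
      protocol: tcp
      port_range_min: 10250
      port_range_max: 10250
      remote_mode: remote_group_id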

Comment 1 Luis Tomas Bolivar 2019-08-28 09:01:54 UTC
Hello,

Can we get a bit more information? Is the blocked communication from a pod on the master to a node, or from the master node to another node? I'm asking because all the nodes (master, infra, and app nodes) should have the node SG attached (can you please check?), and that SG allows traffic on port 10250 from any port carrying the same SG.

If this is from a pod, this is perhaps a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1607724. With namespace isolation enabled it should already work, but without isolation I think just dropping this line should be enough: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/defaults/main.yml#L180.

Let me create a fix for it.
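For reference, a rough sketch of what that change amounts to (illustrative only, not the merged patch): with the remote_mode line dropped, the 10250 rule falls back to the remote_ip_prefix behavior.

  # After the change (sketch): remote_mode is omitted, so the rendered rule
  # uses remote_ip_prefix (effectively 0.0.0.0/0) and traffic to 10250 is no
  # longer limited to ports carrying the node security group.
  - direction: ingress
    protocol: tcp
    port_range_min: 10250
    port_range_max: 10250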

Comment 3 Robert Bost 2019-08-28 20:27:50 UTC
> Can we get a bit more information. The blocked communication is from pod on master to a node, or from master node to another node?

As far as I could tell from the error, the blocked communication was master-to-node (more clarification on this later):

  dial tcp 10.125.66.66:10250: i/o timeout'\nTrying to reach: 'https://master-0.openshift.example.com:10250/healthz

> I'm asking since all the nodes (master, infra, and app-nodes) should have the node SG group added (can you please check?) which actually allows traffic from the ports that have that SG in port 10250.

I'll attach the heat template from user having this issue.

> If this is from a pod, perhaps this is a duplicate of this other bug https://bugzilla.redhat.com/show_bug.cgi?id=1607724. If using namespace isolation it should already work, but for no isolation, 

The failing command is executed from here:

  release-3.11 branch: https://github.com/openshift/openshift-ansible/blob/0d71d0a7495fe23b2a5bb934f7b996d905a2c8ac/roles/lib_openshift/library/oc_csr_approve.py#L206-L220

That is essentially `oc get --raw /api/v1/nodes/<NODE-NAME>/proxy/healthz`, which goes through the master-api pod, but that pod is part of the host network, so I don't think there is a dependency on the Kuryr CNI.

> I think perhaps just dropping this line should be ok: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/defaults/main.yml#L180.

I haven't been able to test this (no quick access to an OSP cluster), but I assume your intent is that removing "remote_mode: remote_group_id" causes the default remote_mode value of "remote_ip_prefix" to be used. And the default remote_ip_prefix value is "0.0.0.0/0", right?
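For context, this is how that choice plays out in the rendered Heat security group; a hedged sketch of an OS::Neutron::SecurityGroup resource (the resource name and surrounding structure are illustrative, not the actual template):

  node-secgrp:
    type: OS::Neutron::SecurityGroup
    properties:
      rules:
        # With remote_mode: remote_group_id and no remote_group_id given, Heat
        # self-references the group being defined. Omitting remote_mode falls
        # back to remote_ip_prefix, which behaves as 0.0.0.0/0 when unset.
        - direction: ingress
          protocol: tcp
          port_range_min: 10250
          port_range_max: 10250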

Comment 5 Luis Tomas Bolivar 2019-08-29 06:59:55 UTC
(In reply to Robert Bost from comment #3)
> I've not been able to test this or anything (no quick access to an OSP
> cluster), but I guess your intent is that removing the "remote_mode:
> remote_group_id" will cause the default remote_mode value of
> "remote_ip_prefix" to be used. And, the default remote_ip_prefix value is
> "0.0.0.0/0", right?

Yes, and that has already been merged upstream.

Comment 6 Jon Uriarte 2020-07-09 10:40:23 UTC
Verified in openshift-ansible-3.11.235 on top of OSP 13.0.12 2020-07-01.1 puddle.

Installation completes successfully both with and without the namespace isolation feature enabled.

INSTALLER STATUS ********************************************************************************************************************************************
Initialization               : Complete (0:01:12)
Health Check                 : Complete (0:00:05)
Node Bootstrap Preparation   : Complete (0:11:15)
etcd Install                 : Complete (0:01:12)
Master Install               : Complete (0:10:00)
Master Additional Install    : Complete (0:01:52)
Node Join                    : Complete (0:02:09)
Hosted Install               : Complete (0:01:04)
Cluster Monitoring Operator  : Complete (0:01:46)
Web Console Install          : Complete (0:00:35)
Console Install              : Complete (0:00:56)
Service Catalog Install      : Complete (0:00:00)


The previously failing task now succeeds:
TASK [Approve node certificates when bootstrapping] ****************************
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
changed: [master-1.openshift.example.com]

Comment 9 errata-xmlrpc 2020-07-27 13:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990