Bug 1746144
Summary: | Port 10250 incorrectly configured in OSP Security Group Heat template | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Robert Bost <rbost>
Component: | Installer | Assignee: | Luis Tomas Bolivar <ltomasbo>
Installer sub component: | openshift-ansible | QA Contact: | Jon Uriarte <juriarte>
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | high | |
Priority: | high | CC: | bmilne, ltomasbo
Version: | 3.11.0 | Keywords: | Triaged
Target Milestone: | --- | |
Target Release: | 3.11.z | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-07-27 13:49:10 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Robert Bost
2019-08-27 18:26:37 UTC

---

Hello, can we get a bit more information? Is the blocked communication from a pod on a master to a node, or from a master node to another node? I'm asking because all the nodes (master, infra, and app nodes) should have the node security group attached (can you please check?), which allows traffic to port 10250 from any port that carries that same security group. If the traffic comes from a pod, this may be a duplicate of this other bug: https://bugzilla.redhat.com/show_bug.cgi?id=1607724. If namespace isolation is used it should already work, but for the no-isolation case I think just dropping this line should be enough: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/defaults/main.yml#L180. Let me create a fix for it.

---

Robert Bost (comment #3):

> Can we get a bit more information. The blocked communication is from pod on master to a node, or from master node to another node?

The blocked communication was from master-to-node as far as I could tell from the error (more clarification on this later):

    dial tcp 10.125.66.66:10250: i/o timeout'\nTrying to reach: 'https://master-0.openshift.example.com:10250/healthz

> I'm asking since all the nodes (master, infra, and app-nodes) should have the node SG group added (can you please check?) which actually allows traffic from the ports that have that SG in port 10250.

I'll attach the heat template from the user hitting this issue.

> If this is from a pod, perhaps this is a duplicate of this other bug https://bugzilla.redhat.com/show_bug.cgi?id=1607724. If using namespace isolation it should already work, but for no isolation,

The failing command is executed from here (release-3.11 branch):
https://github.com/openshift/openshift-ansible/blob/0d71d0a7495fe23b2a5bb934f7b996d905a2c8ac/roles/lib_openshift/library/oc_csr_approve.py#L206-L220

That is essentially `oc get --raw /api/v1/nodes/<NODE-NAME>/proxy/healthz`, which goes through the master-api pod; since that pod is part of the host network, I don't think there is any dependency on the Kuryr CNI.

> I think perhaps just dropping this line should be ok: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/defaults/main.yml#L180.

I've not been able to test this (no quick access to an OSP cluster), but I take it the intent is that removing "remote_mode: remote_group_id" will cause the default remote_mode value of "remote_ip_prefix" to be used, and the default remote_ip_prefix value is "0.0.0.0/0", right?

---
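For reference, the entry at that line of roles/openshift_openstack/defaults/main.yml is one of the node security-group rules. A minimal sketch of what such a rule looks like, assuming the variable name and key layout used by that role; the exact surrounding entries are not quoted in this bug and are shown here only for illustration:

```yaml
# Sketch of the node security-group rule for the kubelet port (10250).
# The variable name and the shape of neighbouring rules are assumptions.
openshift_openstack_node_secgroup_rules:
  - direction: ingress
    protocol: tcp
    port_range_min: 10250
    port_range_max: 10250
    # The line proposed for removal: it restricts the rule to traffic coming
    # from ports that carry this same security group.
    remote_mode: remote_group_id
```

Dropping `remote_mode: remote_group_id` makes the rule fall back to the default remote_mode of `remote_ip_prefix`, which, per the discussion above, defaults to `0.0.0.0/0`, so the kubelet port stays reachable from the masters even when the source port does not carry the node security group.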
(In reply to Robert Bost from comment #3)

> I've not been able to test this or anything (no quick access to an OSP cluster), but I guess your intent is that removing the "remote_mode: remote_group_id" will cause the default remote_mode value of "remote_ip_prefix" to be used. And, the default remote_ip_prefix value is "0.0.0.0/0", right?

Yes, and that has already been merged upstream.

---

Verified in openshift-ansible-3.11.235 on top of the OSP 13.0.12 2020-07-01.1 puddle. Installation completes successfully both with and without the namespace isolation feature enabled.

    INSTALLER STATUS ********************************************************************************************************************************************
    Initialization               : Complete (0:01:12)
    Health Check                 : Complete (0:00:05)
    Node Bootstrap Preparation   : Complete (0:11:15)
    etcd Install                 : Complete (0:01:12)
    Master Install               : Complete (0:10:00)
    Master Additional Install    : Complete (0:01:52)
    Node Join                    : Complete (0:02:09)
    Hosted Install               : Complete (0:01:04)
    Cluster Monitoring Operator  : Complete (0:01:46)
    Web Console Install          : Complete (0:00:35)
    Console Install              : Complete (0:00:56)
    Service Catalog Install      : Complete (0:00:00)

The previously failing task now succeeds:

    TASK [Approve node certificates when bootstrapping] ****************************
    FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
    FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
    changed: [master-1.openshift.example.com]

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990
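As a closing illustration, this is roughly how a rule of that shape ends up in a Heat stack once the remote-group restriction is gone. The resource name, group name, and surrounding properties below are assumptions for illustration (this is not the template attached to this bug); only the rule keys follow the standard OS::Neutron::SecurityGroup schema:

```yaml
# Illustrative sketch: an "allow kubelet port 10250" rule expressed through
# OS::Neutron::SecurityGroup in a Heat template. Resource and group names are
# assumptions, not taken from the attached template.
resources:
  node-secgrp:
    type: OS::Neutron::SecurityGroup
    properties:
      name: example-node-secgrp
      description: Security group for OpenShift cluster nodes
      rules:
        - direction: ingress
          protocol: tcp
          port_range_min: 10250
          port_range_max: 10250
          # With remote_mode left at its default (remote_ip_prefix) and the
          # prefix set to 0.0.0.0/0, the kubelet port is reachable from the
          # masters regardless of which security group their ports carry.
          remote_ip_prefix: 0.0.0.0/0
```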