Description of problem:

OpenShift on OpenStack installation fails in the task "Approve node certificates when bootstrapping" because the master -> node calls over port 10250 [1] fail with this message:

dial tcp 10.125.66.66:10250: i/o timeout'\nTrying to reach: 'https://master-0.openshift.example.com:10250/healthz

[1] release-3.11: https://github.com/openshift/openshift-ansible/blob/a7e91cd9bb5e852a6f80cd71b4a181d7b9064a76/roles/lib_openshift/library/oc_csr_approve.py#L206-L220

How reproducible:

Always for the customer.

Actual results:

Failure during installation. Custom modifications to the Security Group were required during installation so that the Ansible task would succeed.

Expected results:

I would expect the node-secgrp rules to specify the master-secgrp as remote_group_id.

release-3.11: https://github.com/openshift/openshift-ansible/blob/a7e91cd9bb5e852a6f80cd71b4a181d7b9064a76/roles/openshift_openstack/defaults/main.yml#L166-L185

Additional info:
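For illustration only, here is a minimal openstacksdk sketch of the kind of manual security-group workaround described above (opening tcp/10250 on the node security group to members of the master security group). The cloud entry and security-group names are assumptions, not values taken from the installer:

# Hypothetical workaround sketch, not part of openshift-ansible.
import openstack

conn = openstack.connect(cloud="openstack")  # assumed clouds.yaml entry

node_sg = conn.network.find_security_group("openshift-node-secgrp")      # assumed name
master_sg = conn.network.find_security_group("openshift-master-secgrp")  # assumed name

# Ingress rule on the node SG: allow kubelet traffic (tcp/10250) from the master SG.
conn.network.create_security_group_rule(
    security_group_id=node_sg.id,
    direction="ingress",
    protocol="tcp",
    port_range_min=10250,
    port_range_max=10250,
    remote_group_id=master_sg.id,
)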
Hello,

Can we get a bit more information? Is the blocked communication from a pod on a master to a node, or from a master node to another node?

I'm asking because all the nodes (master, infra, and app nodes) should have the node SG added (can you please check?), which allows traffic on port 10250 from any port that has that SG attached.

If this is from a pod, perhaps this is a duplicate of this other bug: https://bugzilla.redhat.com/show_bug.cgi?id=1607724. If using namespace isolation it should already work, but for no isolation I think just dropping this line should be enough: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/defaults/main.yml#L180.

Let me create a fix for it.
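To make the two remote modes under discussion concrete, here is an openstacksdk sketch of the equivalent Neutron rules; this is illustrative only, not the role's actual implementation, and the security-group ID is a placeholder. A rule keyed on remote_group_id only admits traffic from ports carrying that group, while a remote_ip_prefix of 0.0.0.0/0 admits the traffic from any source address:

# Illustrative comparison of the two rule forms (openstacksdk); placeholder IDs.
import openstack

conn = openstack.connect(cloud="openstack")   # assumed clouds.yaml entry
node_sg_id = "REPLACE-WITH-NODE-SECGROUP-ID"  # placeholder

# Variant 1: only instances whose ports carry the node SG can reach tcp/10250.
conn.network.create_security_group_rule(
    security_group_id=node_sg_id,
    direction="ingress",
    protocol="tcp",
    port_range_min=10250,
    port_range_max=10250,
    remote_group_id=node_sg_id,
)

# Variant 2 (the remote_ip_prefix default of 0.0.0.0/0): any source IP can
# reach tcp/10250 on instances that carry this security group.
conn.network.create_security_group_rule(
    security_group_id=node_sg_id,
    direction="ingress",
    protocol="tcp",
    port_range_min=10250,
    port_range_max=10250,
    remote_ip_prefix="0.0.0.0/0",
)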
> Can we get a bit more information. The blocked communication is from pod on master to a node, or from master node to another node?

The blocked communication was from master-to-node from what I could tell, given the error (more clarification on this later):

dial tcp 10.125.66.66:10250: i/o timeout'\nTrying to reach: 'https://master-0.openshift.example.com:10250/healthz

> I'm asking since all the nodes (master, infra, and app-nodes) should have the node SG group added (can you please check?) which actually allows traffic from the ports that have that SG in port 10250.

I'll attach the heat template from the user having this issue.

> If this is from a pod, perhaps this is a duplicate of this other bug https://bugzilla.redhat.com/show_bug.cgi?id=1607724. If using namespace isolation it should already work, but for no isolation,

The failing command is executed from here (release-3.11 branch):

https://github.com/openshift/openshift-ansible/blob/0d71d0a7495fe23b2a5bb934f7b996d905a2c8ac/roles/lib_openshift/library/oc_csr_approve.py#L206-L220

That would essentially be `oc get --raw /api/v1/nodes/<NODE-NAME>/proxy/healthz`, which goes to the master-api pod, but that is part of the host network, so I don't think there is a dependency on the Kuryr CNI (a rough sketch of that check follows at the end of this comment).

> I think perhaps just dropping this line should be ok: https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_openstack/defaults/main.yml#L180.

I've not been able to test this (no quick access to an OSP cluster), but I guess your intent is that removing "remote_mode: remote_group_id" will cause the default remote_mode value of "remote_ip_prefix" to be used. And the default remote_ip_prefix value is "0.0.0.0/0", right?
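To show what that check boils down to, here is a rough Python sketch of the probe (not the module's actual code; the node name is a placeholder):

# Approximation of the health probe described above; the real logic lives in
# roles/lib_openshift/library/oc_csr_approve.py.
import subprocess

node_name = "master-0.openshift.example.com"  # placeholder

result = subprocess.run(
    ["oc", "get", "--raw", f"/api/v1/nodes/{node_name}/proxy/healthz"],
    capture_output=True,
    text=True,
)

# The API server proxies this request to the node's kubelet on port 10250, so it
# times out if the security groups block master -> node traffic on that port.
print(result.returncode, result.stdout.strip(), result.stderr.strip())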
(In reply to Robert Bost from comment #3)
> I've not been able to test this (no quick access to an OSP cluster), but I
> guess your intent is that removing "remote_mode: remote_group_id" will cause
> the default remote_mode value of "remote_ip_prefix" to be used. And the
> default remote_ip_prefix value is "0.0.0.0/0", right?

Yes, and that has already been merged upstream.
Verified in openshift-ansible-3.11.235 on top of the OSP 13.0.12 2020-07-01.1 puddle. Installation completes successfully both with and without the namespace isolation feature enabled.

INSTALLER STATUS ********************************************************************************************************************************************
Initialization               : Complete (0:01:12)
Health Check                 : Complete (0:00:05)
Node Bootstrap Preparation   : Complete (0:11:15)
etcd Install                 : Complete (0:01:12)
Master Install               : Complete (0:10:00)
Master Additional Install    : Complete (0:01:52)
Node Join                    : Complete (0:02:09)
Hosted Install               : Complete (0:01:04)
Cluster Monitoring Operator  : Complete (0:01:46)
Web Console Install          : Complete (0:00:35)
Console Install              : Complete (0:00:56)
Service Catalog Install      : Complete (0:00:00)

The previously failing task now succeeds:

TASK [Approve node certificates when bootstrapping] ****************************
FAILED - RETRYING: Approve node certificates when bootstrapping (30 retries left).
FAILED - RETRYING: Approve node certificates when bootstrapping (29 retries left).
changed: [master-1.openshift.example.com]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990