Bug 1792822 - Adding RHEL compute node requires port 22623 to be opened on external LB
Summary: Adding RHEL compute node requires port 22623 to be opened on external LB
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.5.0
Assignee: Russell Teague
QA Contact: Gaoyun Pei
Depends On:
Blocks: 1807032
Reported: 2020-01-20 05:08 UTC by Rutvik
Modified: 2020-07-13 17:13 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The bootstrap server endpoint used the 'api' endpoint, which goes through the external load balancer.
Consequence: An additional port (22623) had to be opened on the external load balancer to bootstrap nodes.
Fix: The bootstrap server endpoint was switched to the internal endpoint 'api-int'.
Result: The additional port no longer needs to be opened on the external load balancer.
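A minimal sketch of the endpoint change described in the Doc Text, assuming the hostname pattern visible in the URLs in this report ('api.<cluster_name>.<domain>' before the fix, 'api-int.<cluster_name>.<domain>' after it) and the Machine Config Server port 22623. The helper `mcs_config_url` is hypothetical and is not taken from openshift-ansible.

```python
# Hypothetical helper (not from openshift-ansible): builds the Machine Config
# Server URL a new node fetches its Ignition config from. The "api"/"api-int"
# hostname pattern is assumed from the URLs quoted in this bug report.
MCS_PORT = 22623


def mcs_config_url(cluster_name: str, base_domain: str,
                   role: str = "worker", internal: bool = True) -> str:
    """Return the MCS Ignition endpoint for a node role.

    internal=True  -> api-int.<cluster>.<domain> (after the fix, internal LB)
    internal=False -> api.<cluster>.<domain>     (before the fix, external LB)
    """
    prefix = "api-int" if internal else "api"
    return f"https://{prefix}.{cluster_name}.{base_domain}:{MCS_PORT}/config/{role}"


if __name__ == "__main__":
    # Old behavior: needs 22623 open on the external LB.
    print(mcs_config_url("mycluster", "example.com", internal=False))
    # New behavior: served through the internal LB.
    print(mcs_config_url("mycluster", "example.com"))
```

With the QE cluster from the verification comment, `mcs_config_url("gpei-45g", "qe.gcp.devcluster.openshift.com")` yields the same 'api-int' URL that returned status 200.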
Clone Of:
Clones: 1807032
Last Closed: 2020-07-13 17:13:17 UTC
Target Upstream Version:


System | ID | Priority | Status | Summary | Last Updated
Github openshift openshift-ansible pull | 12107 | None | closed | Bug 1792822: Change bootstrap server to use 'api-int' for ignition config | 2020-09-17 08:53:29 UTC
Red Hat Product Errata | RHBA-2020:2409 | None | None | None | 2020-07-13 17:13:43 UTC

Description Rutvik 2020-01-20 05:08:24 UTC
Description of problem:

In a disconnected UPI environment, while adding a RHEL worker node to an existing cluster, the scale-up playbook failed with the following error:

TASK [openshift_node : Wait for bootstrap endpoint to show up] *******************************************************************************************
FAILED - RETRYING: Wait for bootstrap endpoint to show up (60 retries left).
FAILED - RETRYING: Wait for bootstrap endpoint to show up (59 retries left).
FAILED - RETRYING: Wait for bootstrap endpoint to show up (58 retries left).
FAILED - RETRYING: Wait for bootstrap endpoint to show up (57 retries left).
FAILED - RETRYING: Wait for bootstrap endpoint to show up (1 retries left).
fatal: [worker2.xy.com]: FAILED! => {"attempts": 60, "changed": false, "content": "", "elapsed": 0, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "https://api.<cluster_name>.<domain>.com:22623/config/worker"}

[+] https://github.com/openshift/openshift-ansible/blob/release-4.2/roles/openshift_node/tasks/config.yml#L42
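The failing task polls the bootstrap endpoint until it answers with HTTP 200. The following Python sketch approximates that loop for manual troubleshooting from the new RHEL host; it is not the playbook's code. `wait_for_endpoint` and `fetch_status` are hypothetical names, the 10-second delay is an assumption, and certificate verification is disabled on the assumption that the endpoint presents a certificate signed by the cluster's internal CA. A refused connection is reported as status -1, like the `"status": -1` in the log above.

```python
import ssl
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or -1 if the connection fails."""
    # Assumption: the endpoint's certificate is signed by the cluster's
    # internal CA, so verification is skipped for this connectivity check.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Server answered, but not with 2xx/3xx.
        return err.code
    except (urllib.error.URLError, OSError):
        # e.g. "[Errno 111] Connection refused" when 22623 is closed on the LB.
        return -1


def wait_for_endpoint(url: str, retries: int = 60, delay: float = 10.0,
                      fetch: Optional[Callable[[str], int]] = None,
                      sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll url until it returns 200; return the attempt number or raise."""
    fetch = fetch or fetch_status
    for attempt in range(1, retries + 1):
        if fetch(url) == 200:
            return attempt
        if attempt < retries:
            sleep(delay)
    raise RuntimeError(f"{url} did not return 200 after {retries} attempts")
```

For example, `wait_for_endpoint("https://api-int.<cluster_name>.<domain>.com:22623/config/worker", retries=3)` (with the placeholders filled in) checks the internal endpoint; the `fetch` and `sleep` parameters exist only so the loop can be exercised without a network.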

This environment had two LBs (external and internal). Both are HAProxy-based, and, as per the installation docs, the Machine Config Server (port 22623) is not defined on the external LB: https://docs.openshift.com/container-platform/4.2/installing/installing_bare_metal/installing-bare-metal.html

Actual results:

The node scale-up playbook looks for the bootstrap endpoint on port 22623 but tries to reach it through the external API address "https://api.<cluster_name>.<domain>.com:22623/config/worker". The connection is refused because, per the documentation[1], port 22623 does not need to be opened on the external LB.

Expected results:

Either the worker node should be added without opening port 22623 on the external LB, or the documentation should state clearly which ports are required on the external LB.

[1] https://docs.openshift.com/container-platform/4.2/installing/installing_bare_metal/installing-bare-metal.html

Additional Info:

** As a workaround, we edited the external haproxy.conf and added 22623 to the frontend section, just as it is in the internal LB configuration. This raises the question: what purpose does it serve to have two LBs with identical configurations? Could we not use a single LB with two NICs instead?
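To confirm which LB actually exposes port 22623 before and after such a change, a plain TCP connect test is enough. A minimal sketch; `port_open` is a hypothetical helper, not part of any OpenShift tooling:

```python
import socket


def port_open(host: str, port: int = 22623, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts, and DNS resolution failures.
        return False
```

Run from the new RHEL host, `port_open("api.<cluster_name>.<domain>.com")` would be expected to return False while 22623 is not defined on the external LB, and `port_open("api-int.<cluster_name>.<domain>.com")` to return True if the host can reach the internal LB.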

** Also, based on the Bugzilla referenced below, we did not recommend enabling firewalld on RHEL worker nodes. However, that BZ is closed with an errata, which states the issue is fixed in
openshift-ansible-4.2.0-201908142219.git.188.7254b39.el7. Has worker node scale-up been tested with firewalld enabled?

[+] https://github.com/openshift/openshift-ansible/blob/release-4.2/roles/openshift_node/tasks/config.yml#L19
# A base RHEL OS installed with the "Minimal" option enables the
# firewalld service by default, which denies traffic on port 10250 (kubelet).
# Reference: https://bugzilla.redhat.com/show_bug.cgi?id=1740439           <---- closed with an errata
- name: Disable firewalld service
  systemd:
    name: "firewalld.service"
    enabled: false
  register: service_status
  failed_when:
  - service_status is failed
  - not ('Could not find the requested service' in service_status.msg)
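The two `failed_when` conditions in the firewalld task are ANDed by Ansible, so the task fails only when disabling the unit errored for a reason other than the unit being absent. A minimal Python sketch of that decision, assuming a result dict with Ansible-style `failed` and `msg` keys; `firewalld_task_failed` is a hypothetical name:

```python
from typing import Any, Mapping

# Substring of the error the systemd module reports for a missing unit.
NOT_FOUND = "Could not find the requested service"


def firewalld_task_failed(result: Mapping[str, Any]) -> bool:
    """Mirror the task's failed_when list (items are ANDed by Ansible).

    A host without firewalld installed is not treated as a failure; any
    other error while disabling the unit is.
    """
    errored = bool(result.get("failed", False))
    unit_missing = NOT_FOUND in str(result.get("msg", ""))
    return errored and not unit_missing
```

This is why the task is safe on hosts where firewalld was never installed, while still surfacing real errors such as permission problems.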

Comment 1 Stephen Cuppett 2020-01-20 13:01:14 UTC
Setting to current development branch (4.4). For fixes, if any, required/requested for prior versions, clones of this BZ will be created targeting those z-streams.

Comment 3 Lukas Bednar 2020-02-25 10:51:56 UTC
Hello, I am hitting this issue with openshift-ansible 4.4.0 on a cluster provisioned by IPI on RHOS.
I can see that the target version has changed to 4.5. Is there an easier workaround than adding an external load balancer?

Comment 4 Russell Teague 2020-02-25 13:41:42 UTC
This bug was fixed in https://github.com/openshift/openshift-ansible/pull/12099

Comment 7 Gaoyun Pei 2020-03-09 09:35:29 UTC
Verified this bug with openshift-ansible-4.5.0-202003062301.git.0.dc37bae.el7.noarch.rpm

The scale-up playbook now uses the internal API LB address to fetch bootstrap.ign, so port 22623 no longer needs to be opened on the external LB.

TASK [openshift_node : Wait for bootstrap endpoint to show up] *****************

"redirected": false, "status": 200, "url": "https://api-int.gpei-45g.qe.gcp.devcluster.openshift.com:22623/config/worker"}

Comment 9 errata-xmlrpc 2020-07-13 17:13:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

