Bug 1813123

Summary: [UPI on OSP] RHEL worker scaleup fails because the master security group is limited to machineNetwork
Product: OpenShift Container Platform Reporter: weiwei jiang <wjiang>
Component: Installer Assignee: Adolfo Duarte <adduarte>
Installer sub component: OpenShift on OpenStack QA Contact: David Sanz <dsanzmor>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: high CC: adduarte, eduen, m.andre, wsun
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-04-09 14:39:06 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1804083    
Bug Blocks:    

Description weiwei jiang 2020-03-13 02:11:51 UTC
This bug was initially created as a copy of Bug #1804083

Description of problem:
When trying to scale up a RHEL worker into an existing UPI on OSP cluster,
the Ansible playbook gets stuck retrying `TASK [openshift_node : Wait for bootstrap endpoint to show up]`.

My scenario is:
1. I created the RHEL worker on the same subnet as the RHCOS workers.
2. I also created a floating IP for the RHEL worker so that it can be reached over SSH from outside, for use as a Jenkins slave.


After I added new security group rules for the external_network subnet range (which the RHEL worker's floating IP belongs to), openshift-ansible worked again.
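
To illustrate why the extra rule helps, here is a minimal sketch of the source-address check that a CIDR-restricted security group rule effectively performs. It assumes the worker's requests arrive with its floating IP as the source address, which is what the workaround above suggests. All addresses and ranges are hypothetical placeholders, not values from this cluster.

```python
import ipaddress


def is_allowed(source_ip, allowed_cidrs):
    """Return True if source_ip falls inside any CIDR the rules allow."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Hypothetical values for illustration only (not taken from this report).
MACHINE_NETWORK = "10.0.0.0/16"       # stands in for machineNetwork
EXTERNAL_NETWORK = "203.0.113.0/24"   # stands in for the external_network subnet
WORKER_FLOATING_IP = "203.0.113.25"   # stands in for the RHEL worker floating IP

# Only machineNetwork allowed: the floating IP is outside the range.
before = is_allowed(WORKER_FLOATING_IP, [MACHINE_NETWORK])

# After adding a rule for the external network range, the same source passes.
after = is_allowed(WORKER_FLOATING_IP, [MACHINE_NETWORK, EXTERNAL_NETWORK])

print(before, after)
```

With these placeholder ranges the check fails before the extra rule and passes after it, which matches the behavior described above.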


Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1. Set up a UPI on OSP cluster according to https://github.com/openshift/installer/blob/release-4.4/docs/user/openstack/install_upi.md
2. Scale up a RHEL worker according to https://github.com/openshift/openshift-ansible/blob/release-4.4/README.md

Actual results:

TASK [openshift_node : Wait for bootstrap endpoint to show up] *****************
Tuesday 18 February 2020  10:56:10 +0800 (0:00:00.406)       0:03:09.745 ****** 
FAILED - RETRYING: Wait for bootstrap endpoint to show up (60 retries left).
FAILED - RETRYING: Wait for bootstrap endpoint to show up (59 retries left).
...
FAILED - RETRYING: Wait for bootstrap endpoint to show up (2 retries left).
FAILED - RETRYING: Wait for bootstrap endpoint to show up (1 retries left).
fatal: [wjuos442181-5sf8q-rhel-0.wjuos442181.qe.devcluster.openshift.com]: FAILED! => {"attempts": 60, "changed": false, "content": "", "elapsed": 30, "msg": "Status code was -1 and not [200]: Request failed: <urlopen error timed out>", "redirected": false, "status": -1, "url": "https://api.wjuos442181.qe.devcluster.openshift.com:22623/config/worker"}

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0   
wjuos442181-5sf8q-rhel-0.wjuos442181.qe.devcluster.openshift.com : ok=15   changed=9    unreachable=0    failed=1    skipped=2    rescued=0    ignored=0   

Tuesday 18 February 2020  11:37:05 +0800 (0:40:54.601)       0:44:04.346 ****** 
=============================================================================== 
openshift_node : Wait for bootstrap endpoint to show up -------------- 2454.60s
openshift_node : Install openshift support packages ------------------- 121.96s
openshift_node : Install openshift packages ---------------------------- 60.98s
openshift_node : Get cluster nodes -------------------------------------- 1.20s
openshift_node : Setting sebool container_manage_cgroup ----------------- 1.13s
openshift_node : Enable the CRI-O service ------------------------------- 0.75s
openshift_node : Get kubernetes server version -------------------------- 0.63s
openshift_node : Enable IP Forwarding ----------------------------------- 0.43s
openshift_node : Enable persistent storage on journal ------------------- 0.43s
openshift_node : Create temp directory ---------------------------------- 0.41s
openshift_node : Disable swap ------------------------------------------- 0.40s
openshift_node : Get cluster version ------------------------------------ 0.36s
openshift_node : Disable firewalld service ------------------------------ 0.32s
openshift_node : include_tasks ------------------------------------------ 0.12s
openshift_node : Fail if new_workers group contains active nodes -------- 0.08s
openshift_node : Set fact l_kubernetes_version -------------------------- 0.08s
openshift_node : include_tasks ------------------------------------------ 0.08s
openshift_node : Set fact l_cluster_version ----------------------------- 0.07s
openshift_node : Override kubernetes version when running CI ------------ 0.07s
openshift_node : Override cluster version when running CI --------------- 0.07s
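
The failing task in the log above amounts to polling the `/config/worker` URL on port 22623 until it returns 200. The sketch below reproduces that loop so the endpoint can be probed from the worker outside of Ansible. It is not the openshift-ansible implementation: the 60 retries come from the log, while the 30 second timeout and 10 second delay are assumptions, and the function names are hypothetical. TLS verification is left at the Python default, so a certificate error is reported as -1, the same as a timeout.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=30):
    """Return the HTTP status code, or -1 if the request fails or times out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, TLS failure, DNS failure, ...
        return -1


def wait_for_endpoint(url, retries=60, delay=10, fetch=fetch_status, sleep=time.sleep):
    """Poll url until it returns 200; return the attempt number, or None."""
    for attempt in range(1, retries + 1):
        if fetch(url) == 200:
            return attempt
        if attempt < retries:
            sleep(delay)
    return None


# Dry run with a stand-in fetch that always fails, as in the blocked case.
result = wait_for_endpoint(
    "https://api.example.invalid:22623/config/worker",  # placeholder URL
    retries=3,
    fetch=lambda url: -1,
    sleep=lambda seconds: None,
)
print(result)
```

To probe a real cluster, call `wait_for_endpoint` with the actual API URL and the default `fetch`. A result of `None` after all retries corresponds to the `Status code was -1` failure shown above.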

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 9 Adolfo Duarte 2020-03-25 09:01:00 UTC
weiwei

Is this system still available? 

Also, to recap what you did to the system:

Did you add a security group rule to allow the floating IP of the RHEL worker?

Or did you somehow point the RHEL worker to the internal DNS?

Thanks. There are a couple of ways of solving this problem, and I want to document the one you tested.

Comment 12 Martin André 2020-04-09 14:39:06 UTC

*** This bug has been marked as a duplicate of bug 1804083 ***