Bug 1802820

Summary: [UPI on Azure] the UPI templates do not let us ssh to the bootstrap node by default
Product: OpenShift Container Platform
Reporter: Etienne Simard <esimard>
Component: Installer
Assignee: Fabiano Franz <ffranz>
Installer sub component: openshift-installer
QA Contact: Etienne Simard <esimard>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: unspecified
Version: 4.4
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-05-04 11:36:24 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Etienne Simard 2020-02-13 23:39:43 UTC
Description of problem:

The current UPI templates for 4.4 available at https://github.com/openshift/installer/tree/master/upi/azure and the docs available at https://github.com/openshift/installer/blob/master/docs/user/azure/install_upi.md do not provide ssh access to the bootstrap node for troubleshooting purposes.

Version-Release number of the following components:

https://github.com/openshift/installer/commit/f058552712fbe65ec6f25f192a827dab82f70f9d

How reproducible: Easy

Steps to Reproduce:
1. Follow the UPI on Azure steps to create a new OCP cluster
2. Once the Bootstrap node is up, try to ssh

Actual results:

You can't ssh to the bootstrap node

~~~
$ ssh core@1.2.3.4
^C
~~~

Expected results:

We should be able to ssh to the bootstrap node and gather bootstrap logs 

~~~
$ ssh core@1.2.3.4
The authenticity of host '1.2.3.4 (1.2.3.4)' can't be established.
ECDSA key fingerprint is SHA256:JRPa0+ufibRP8lwOtb6cXaEx28EQrpWOufbeJWuVsR4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '1.2.3.4' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 44.81.202002071430-0
  Part of OpenShift 4.4, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.4/architecture/architecture-rhcos.html

---
This is the bootstrap node; it will be destroyed when the master is fully up.

The primary service is "bootkube.service". To watch its status, run e.g.

  journalctl -b -f -u bootkube.service
~~~


Additional info:

I was able to ssh to the bootstrap node after adding NAT and firewall rules that forward SSH to the bootstrap node.

Comment 3 Etienne Simard 2020-02-18 22:00:07 UTC
Code Version:

Installer master including https://github.com/openshift/installer/pull/3118

Installer Version:

DEBUG OpenShift Installer 4.4.0-0.nightly-2020-02-18-132334 
DEBUG Built from commit 43bed121efd7d9b3353e7ef5bd85dae07e0cc97e 



Failed verification: while the ssh NSG rule is present in the updated bootstrap template, there are no LB or Inbound NAT rules so the connection is not accepted unless you add one of those to the public-lb.

Example

~~~

[After the bootstrap is up and running]

[esimard@elaptop essboot2]$ ssh core@13.89.107.79
^C
[esimard@elaptop essboot2]$ telnet 13.89.107.79 22
Trying 13.89.107.79...
^C

[After Successfully saved load balancer inbound NAT rule 'natssh']

[esimard@elaptop essboot2]$ telnet 13.89.107.79 22
Trying 13.89.107.79...
Connected to 13.89.107.79.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.0
^]
telnet> ^CConnection closed.
[esimard@elaptop essboot2]$ ssh core@13.89.107.79
The authenticity of host '13.89.107.79 (13.89.107.79)' can't be established.
ECDSA key fingerprint is SHA256:DUh0QqZc5OLbp/bclpuRnoeZtWurotXyqT+yX5iHBUg.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '13.89.107.79' (ECDSA) to the list of known hosts.
Red Hat Enterprise Linux CoreOS 44.81.202001241431.0
  Part of OpenShift 4.4, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).

WARNING: Direct SSH access to machines is not recommended; instead,
make configuration changes via `machineconfig` objects:
  https://docs.openshift.com/container-platform/4.4/architecture/architecture-rhcos.html

---
This is the bootstrap node; it will be destroyed when the master is fully up.

The primary service is "bootkube.service". To watch its status, run e.g.

  journalctl -b -f -u bootkube.service
[systemd]
Failed Units: 1
  sssd.service
~~~
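
The telnet probe in the example above can also be scripted. This is only a minimal sketch, assuming `bash` and the coreutils `timeout` command are available on the workstation; `port_open` is a hypothetical helper name, not part of the installer.

```shell
# Hypothetical helper: succeed only if a TCP connection to host:port opens in time.
# $1 = host, $2 = port, $3 = timeout in seconds (defaults to 5).
port_open() {
  # bash's /dev/tcp pseudo-device attempts the TCP connection; `timeout`
  # aborts the attempt when packets are silently dropped instead of refused.
  timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}
```

For example, `port_open 13.89.107.79 22 && echo reachable` would be expected to print `reachable` only once something forwards port 22 to the bootstrap node.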

Comment 4 Fabiano Franz 2020-02-19 17:33:17 UTC
Are you trying to connect using the same public IP attached to the public load balancer (the same that serves api.clustername)?

The strategy I proposed in https://github.com/openshift/installer/pull/3118 is to use _another_ public IP specifically for SSH access, pointing directly to the network interface that belongs to the bootstrap VM, and that public IP gets deleted along with the NSG rule once the bootstrap process is finished. WDYT about this approach? 

I understand, though, that it's not clear, and maybe we need a mention of this in the docs, with a command to print the connection string with the ssh-specific public IP.
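
A command of that kind could look like the following minimal sketch. It assumes the Azure CLI (`az`) is installed and logged in, and that the caller passes in the resource group and the name of the ssh-specific public IP resource. `bootstrap_ssh_target` is a hypothetical helper name, not something the installer ships.

```shell
# Hypothetical helper: print "core@<ip>" for the bootstrap SSH public IP.
# $1 = resource group, $2 = public IP resource name (both supplied by the caller).
bootstrap_ssh_target() {
  local ip
  # `az network public-ip show` returns the resource; --query picks the address.
  ip=$(az network public-ip show -g "$1" -n "$2" --query ipAddress -o tsv) || return 1
  # An unallocated IP yields an empty string; treat that as a failure.
  [ -n "$ip" ] || return 1
  printf 'core@%s\n' "$ip"
}
```

It could then be used as `ssh "$(bootstrap_ssh_target "$RESOURCE_GROUP" "$PIP_NAME")"`, where both variables are placeholders for values from your own deployment.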

Comment 5 Etienne Simard 2020-02-19 18:24:48 UTC
Hello Fabiano,

No, I copied the `Public IP address` directly from the bootstrap Virtual Machine. I don't mind the approach as long as it's documented but it might not be the standard way to do it.
My other tests are done with the `Public IP address` (Azure IPI and possibly other platforms as well).

I just re-tested with the `bootstrap-ssh-pip` address and confirmed that it works. If you decide to continue with this approach, please re-send it to ON_QA so I can add the verification logs.

Comment 6 Etienne Simard 2020-02-20 23:16:04 UTC
Verified 

Installer master including https://github.com/openshift/installer/pull/3118

Installer Version:

DEBUG OpenShift Installer 4.4.0-0.nightly-2020-02-18-132334 
DEBUG Built from commit 43bed121efd7d9b3353e7ef5bd85dae07e0cc97e 


Notes: My documentation on that subject was not right: if you click connect, it's the IP dedicated to the bootstrap. In the bootstrap NIC, the NIC public IP shows the IP dedicated to the bootstrap (same as IPI on Azure). In the bootstrap VM overview, the public IP is the one for the public LB. 


Confirmed SSH and gather bootstrap:

~~~
$ ./openshift-install gather bootstrap --bootstrap 52.154.233.131 --master "10.0.0.8"
INFO Pulling debug logs from the bootstrap machine 
INFO Bootstrap gather logs captured here "log-bundle-20200219131723.tar.gz" 
~~~

Comment 8 errata-xmlrpc 2020-05-04 11:36:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581