Bug 1685508

Summary: have no way for troubleshooting the Bootstrap Node
Product: OpenShift Container Platform Reporter: Johnny Liu <jialiu>
Component: Installer Assignee: Alex Crawford <crawford>
Installer sub component: openshift-installer QA Contact: Johnny Liu <jialiu>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: medium    
Priority: medium CC: dsanzmor, jialiu, vlaad, wking
Version: 4.1.0 Keywords: Regression
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-08 22:52:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Johnny Liu 2019-03-05 11:05:13 UTC
Description of problem:

Version-Release number of the following components:
extracted installer from registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-05-065158

# ./openshift-install version
./openshift-install v4.0.15-1-dirty


How reproducible:
Always

Steps to Reproduce:
1. When cluster install is not going well, user should be able to follow https://github.com/openshift/installer/blob/master/docs/user/troubleshooting.md#troubleshooting-the-bootstrap-node for troubleshooting the Bootstrap Node

Actual results:
The ssh connection timed out, and the curl request to the bootstrap service failed as well. I am sure the bootstrap node was running when I ran the following command.

[root@preserve-jialiu-ansible demo2]# curl -vvv --insecure --cert ./tls/journal-gatewayd.crt --key ./tls/journal-gatewayd.key 'https://ec2-18-223-241-12.us-east-2.compute.amazonaws.com:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'
* About to connect() to ec2-18-223-241-12.us-east-2.compute.amazonaws.com port 19531 (#0)
*   Trying 18.223.241.12...
* Connection timed out
* Failed connect to ec2-18-223-241-12.us-east-2.compute.amazonaws.com:19531; Connection timed out
* Closing connection 0
curl: (7) Failed connect to ec2-18-223-241-12.us-east-2.compute.amazonaws.com:19531; Connection timed out

It seems like a security group is blocking the connection from my local host?
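A plain TCP probe can help separate "filtered or unroutable" from "service not listening": a timeout, as in the curl output above, usually means packets are being dropped somewhere on the path, whereas a refused connection means the host was reached but nothing is listening on the port. The following is only a sketch; the function name `probe_tcp` is made up for illustration and is not part of the installer.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all: typical of a dropped packet (filtering, no route).
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing is listening.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"
```

For the case in this report, one would call `probe_tcp(host, 22)` and `probe_tcp(host, 19531)` with the bootstrap node's public address as `host`.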

Expected results:
The user should be able to connect to the bootstrap node for troubleshooting.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 W. Trevor King 2019-03-05 13:16:54 UTC
> ... ssh connection failed with timeout...

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-05-065158 | grep installer
  installer                                     https://github.com/openshift/installer                                     c8b3b5532694c7713efe300a636108174d623c52
$ git log --first-parent --format='%ad %h %d %s' --date=iso c8b3b553^..origin/master | cat
2019-03-05 02:38:46 -0800 18fca42a0  (HEAD -> master, wking/master, origin/master, origin/HEAD) Merge pull request #1351 from flaper87/default-dns
2019-03-04 23:46:12 -0800 395e1a5b2  Merge pull request #1243 from flaper87/multi-run-per-tenant
2019-03-04 18:46:43 -0800 0dd936708  Merge pull request #1361 from eparis/explicit-network-interface
2019-03-04 14:23:46 -0800 83d7fcab4  Merge pull request #1358 from ashcrow/rhcos-name-update
2019-03-04 12:29:18 -0800 7e8ee7d0d  Merge pull request #1359 from ashcrow/update-code-references-to-rhcos
2019-03-04 11:17:27 -0800 04651f0da  Merge pull request #1334 from squeed/allow-udp
2019-03-04 09:51:07 -0800 0ddac4193  Merge pull request #1331 from abhinavdahiya/docs-custom-mc
2019-03-04 06:18:57 -0800 27b9175c6  Merge pull request #1355 from wking/drop-uuidgen-gzip
2019-03-01 19:06:31 -0800 a98d4d75c  Merge pull request #1348 from abhinavdahiya/fix_bootstrap_subnet
2019-03-01 13:56:19 -0800 3fd354f75  Merge pull request #1292 from umohnani8/infra-image
2019-03-01 12:06:44 -0800 f41af0dab  Merge pull request #1347 from wking/wrapf-to-wrap
2019-03-01 08:25:43 -0800 3a5193cdd  Merge pull request #1346 from staebler/delete_snapshot
2019-03-01 05:38:28 -0800 c8b3b5532  (origin/release-4.0) Merge pull request #1338 from flaper87/sec-groups-update

Fix for the SSH issue was #1348.
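The range `c8b3b553^..origin/master` lists the build's own commit plus everything merged after it, newest first, so any pull request listed above the build commit is not yet part of that build. A small sketch of that reading, using a few lines copied from the log above (the function name `prs_missing_from_build` is hypothetical):

```python
import re

# Matches lines produced by: git log --first-parent --format='%ad %h %d %s' --date=iso
MERGE_RE = re.compile(
    r"^\S+ \S+ \S+ (?P<sha>[0-9a-f]+)\s+(?:\([^)]*\)\s+)?"
    r"Merge pull request #(?P<pr>\d+)"
)

# Excerpt of the log shown in this comment.
LOG = """
2019-03-01 19:06:31 -0800 a98d4d75c  Merge pull request #1348 from abhinavdahiya/fix_bootstrap_subnet
2019-03-01 13:56:19 -0800 3fd354f75  Merge pull request #1292 from umohnani8/infra-image
2019-03-01 12:06:44 -0800 f41af0dab  Merge pull request #1347 from wking/wrapf-to-wrap
2019-03-01 08:25:43 -0800 3a5193cdd  Merge pull request #1346 from staebler/delete_snapshot
2019-03-01 05:38:28 -0800 c8b3b5532  (origin/release-4.0) Merge pull request #1338 from flaper87/sec-groups-update
"""


def prs_missing_from_build(log_text: str, build_sha: str) -> list:
    """Return PR numbers merged after build_sha in a newest-first log."""
    missing = []
    for line in log_text.splitlines():
        match = MERGE_RE.match(line.strip())
        if not match:
            continue
        # The log prints abbreviated hashes, so compare by prefix.
        if build_sha.startswith(match.group("sha")):
            return missing
        missing.append(int(match.group("pr")))
    raise ValueError("build commit not found in log")
```

With the excerpt above and the commit reported by `oc adm release info`, #1348 is among the pull requests the nightly build does not contain.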

Comment 3 Johnny Liu 2019-03-07 09:37:38 UTC
I can still reproduce this bug with the v4.0.16-1-dirty installer.

# oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 | grep installer
  installer                                     https://github.com/openshift/installer                                     c8b3b5532694c7713efe300a636108174d623c52

Extract the installer from 4.0.0-0.nightly-2019-03-06-074438.
Run the install.
Once the bootstrap node is running, run the ssh and curl commands against it; both fail with a timeout error.

[root@preserve-jialiu-ansible 20190307]# ssh -i libra-new.pem core@ec2-35-180-187-123.eu-west-3.compute.amazonaws.com -vvv
Warning: Identity file libra-new.pem not accessible: No such file or directory.
OpenSSH_7.4p1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug2: resolving "ec2-35-180-187-123.eu-west-3.compute.amazonaws.com" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com [35.180.187.123] port 22.
debug1: connect to address 35.180.187.123 port 22: Connection timed out
ssh: connect to host ec2-35-180-187-123.eu-west-3.compute.amazonaws.com port 22: Connection timed out

[root@preserve-jialiu-ansible 20190307]# curl -vvv --insecure --cert ./bz1674034/tls/journal-gatewayd.crt --key ./bz1674034/tls/journal-gatewayd.key 'https://ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'
* About to connect() to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com port 19531 (#0)
*   Trying 35.180.187.123...
* Connection timed out
* Failed connect to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531; Connection timed out
* Closing connection 0
curl: (7) Failed connect to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531; Connection timed out

Comment 4 W. Trevor King 2019-03-07 20:41:33 UTC
> [root@preserve-jialiu-ansible 20190307]# curl -vvv --insecure --cert ./bz1674034/tls/journal-gatewayd.crt --key ./bz1674034/tls/journal-gatewayd.key 'https://ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'
> * About to connect() to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com port 19531 (#0)
> *   Trying 35.180.187.123...
> * Connection timed out
> * Failed connect to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531; Connection timed out
> * Closing connection 0
> curl: (7) Failed connect to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531; Connection timed out

Hmm, it's working in CI [1] with [2].  Looks like you're using DNS instead of the documented bootstrap IP [3].  You can get the bootstrap IP from terraform.tfstate [4] (until the bootstrap machine has been torn down), which isn't very convenient, but it works.  You can also retrieve the public IP with the AWS CLI or web console.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5394/artifacts/release-e2e-aws/bootstrap/bootkube.service
[2]: https://github.com/openshift/release/blob/1614bde2cfe28fa1dc2e242b8feda66db933ef99/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L361-L368
[3]: https://github.com/openshift/installer/blob/v0.14.0/docs/user/troubleshooting.md#troubleshooting-the-bootstrap-node
[4]: https://github.com/openshift/release/blob/1614bde2cfe28fa1dc2e242b8feda66db933ef99/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L352-L354
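As a rough illustration of pulling the bootstrap address out of terraform.tfstate: the state file is JSON, but its layout differs between Terraform versions, so this sketch does not assume a particular path. It only assumes that the address is stored under a key named `public_ip`, which is an assumption rather than something shown in this thread. The helper names are hypothetical.

```python
import json
from typing import Any, Iterator, List


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in a JSON structure."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def public_ips(state_text: str) -> List[str]:
    """Return the distinct non-empty `public_ip` strings found in a state file."""
    state = json.loads(state_text)
    found = {v for v in find_key(state, "public_ip") if isinstance(v, str) and v}
    return sorted(found)
```

If the state contains more than one public address, the result still needs to be matched to the bootstrap resource by hand.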

Comment 5 Johnny Liu 2019-03-08 03:35:34 UTC
Hmm, it seems this has nothing to do with using the public IP or DNS; it looks more related to the security group rules.

Today I ran a new install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509; both the ssh connection and the curl request to the bootstrap service succeeded.

The following are the bootstrap node's security group rules:
Ports	Protocol	Source	terraform-20190308023923833100000005	terraform-20190308023923607100000003
22	tcp	0.0.0.0/0	✔	 
19531	tcp	0.0.0.0/0	✔	 
6443	tcp	10.0.0.0/16	 	✔
2379-2380	tcp	sg-0f57ce794823c9200	 	✔
9000-9999	udp	sg-0580f087a471058ed, sg-0f57ce794823c9200	 	✔
22623	tcp	10.0.0.0/16	 	✔
12379-12380	tcp	sg-0f57ce794823c9200	 	✔
4789	udp	sg-0580f087a471058ed, sg-0f57ce794823c9200	 	✔
10252	tcp	sg-0580f087a471058ed, sg-0f57ce794823c9200	 	✔
30000-32767	tcp	sg-0f57ce794823c9200	 	✔
0	icmp	10.0.0.0/16	 	✔
22	tcp	10.0.0.0/16	 	✔
10251	tcp	sg-0580f087a471058ed, sg-0f57ce794823c9200	 	✔
9000-9999	tcp	sg-0580f087a471058ed, sg-0f57ce794823c9200	 	✔
10250	tcp	sg-0580f087a471058ed, sg-0f57ce794823c9200	 	✔

I also tried another install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509; both the ssh connection and the curl request to the bootstrap service still failed with a timeout error, as before.
Ports	Protocol	Source	terraform-20190308025600338000000002	terraform-20190308025601505100000004
30000-32767	tcp	sg-08896a9217945d18e	✔	 
6443	tcp	10.0.0.0/16	✔	 
0	icmp	10.0.0.0/16	✔	 
2379-2380	tcp	sg-08896a9217945d18e	✔	 
22	tcp	10.0.0.0/16	✔	 
10251	tcp	sg-08896a9217945d18e, sg-0cece65e7aa8dda65	✔	 
22623	tcp	10.0.0.0/16	✔	 
9000-9999	tcp	sg-08896a9217945d18e, sg-0cece65e7aa8dda65	✔	 
10250	tcp	sg-08896a9217945d18e, sg-0cece65e7aa8dda65	✔	 
12379-12380	tcp	sg-08896a9217945d18e	✔	 
4789	udp	sg-08896a9217945d18e, sg-0cece65e7aa8dda65	✔	 
10252	tcp	sg-08896a9217945d18e, sg-0cece65e7aa8dda65	✔	 
22	tcp	0.0.0.0/0	 	✔
19531	tcp	0.0.0.0/0	 	✔

Compared with the rules above, it seems the rules allowing traffic to port 22 from 0.0.0.0/0 and to port 19531 from 0.0.0.0/0 are placed in a different order. I think that is the root cause.

Comment 6 W. Trevor King 2019-03-08 03:41:02 UTC
> Compared with the rules above, it seems the rules allowing traffic to port 22 from 0.0.0.0/0 and to port 19531 from 0.0.0.0/0 are placed in a different order. I think that is the root cause.

Ah, that is probably what's happening, thanks :).  Although now I don't understand why this is working in CI :p.  But I'll patch to avoid the overlap.

Comment 7 W. Trevor King 2019-03-08 08:47:57 UTC
Actually, looks like the overlap may not be a problem.  From [1]:

> If there is more than one rule for a specific port, we apply the most permissive rule. For example, if you have a rule that allows access to TCP port 22 (SSH) from IP address 203.0.113.1 and another rule that allows access to TCP port 22 from everyone, everyone has access to TCP port 22.

Taking a closer look at your comment, I don't see a difference between your two runs:

> Today I ran a new install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509; both the ssh connection and the curl request to the bootstrap service succeeded.
> ...
> I also tried another install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509; both the ssh connection and the curl request to the bootstrap service still failed with a timeout error, as before.

Are you saying that your failures are non-deterministic?  And what are you running to generate those security-group lists?

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#security-group-rules

Comment 8 Johnny Liu 2019-03-11 09:15:49 UTC
(In reply to W. Trevor King from comment #7)
> Actually, looks like the overlap may not be a problem.  From [1]:
> 
> 
> Taking a closer look at your comment, I don't see a difference between your
> two runs:
> 
> > Today I ran a new install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509; both the ssh connection and the curl request to the bootstrap service succeeded.
> > ...
> > I also tried another install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509; both the ssh connection and the curl request to the bootstrap service still failed with a timeout error, as before.

Sorry, my fault; it must have been a copy/paste mistake, but the security group rules were exactly what I saw at that moment.

Because 4.0.0-0.nightly-2019-03-07-085509 no longer exists, I re-ran some tests with different builds:
1. Released v0.14.0 installer + default RHCOS and release payload image
Both the ssh connection and the curl request to the bootstrap service failed with a timeout error, as before.

Ports	Protocol	Source	terraform-20190311072902200900000005	terraform-20190311072900731800000002
22	tcp	0.0.0.0/0	✔	 
19531	tcp	0.0.0.0/0	✔	 
30000-32767	tcp	sg-0f23e2ffbd48992dd	 	✔
6443	tcp	10.0.0.0/16	 	✔
0	icmp	10.0.0.0/16	 	✔
2379-2380	tcp	sg-0f23e2ffbd48992dd	 	✔
22	tcp	10.0.0.0/16	 	✔
10251	tcp	sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd	 	✔
22623	tcp	10.0.0.0/16	 	✔
9000-9999	tcp	sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd	 	✔
10250	tcp	sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd	 	✔
12379-12380	tcp	sg-0f23e2ffbd48992dd	 	✔
4789	udp	sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd	 	✔
10252	tcp	sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd	 	✔


2. v4.0.21-1-dirty installer extracted from 4.0.0-0.nightly-2019-03-10-151536 + 47.330 rhcos + 4.0.0-0.nightly-2019-03-06-074438 release payload image
Both the ssh connection and the curl request to the bootstrap service succeeded.

Ports	Protocol	Source	terraform-20190311072920099000000004	terraform-20190311072917634500000003
22	tcp	0.0.0.0/0	✔	 
19531	tcp	0.0.0.0/0	✔	 
6443	tcp	10.0.0.0/16	 	✔
2379-2380	tcp	sg-0f7ce2a0fdfc11b94	 	✔
9000-9999	udp	sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94	 	✔
22623	tcp	10.0.0.0/16	 	✔
12379-12380	tcp	sg-0f7ce2a0fdfc11b94	 	✔
4789	udp	sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94	 	✔
10252	tcp	sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94	 	✔
30000-32767	tcp	sg-0f7ce2a0fdfc11b94	 	✔
0	icmp	10.0.0.0/16	 	✔
22	tcp	10.0.0.0/16	 	✔
10251	tcp	sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94	 	✔
9000-9999	tcp	sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94	 	✔
10250	tcp	sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94	 	✔

3. v4.0.16-1-dirty installer extracted from 4.0.0-0.nightly-2019-03-06-074438 + 47.330 rhcos + 4.0.0-0.nightly-2019-03-06-074438 release payload image
Ports	Protocol	Source	terraform-20190311085916028700000002	terraform-20190311085916029800000003
30000-32767	tcp	sg-0bb60cb3faf6a778c	✔	 
6443	tcp	10.0.0.0/16	✔	 
0	icmp	10.0.0.0/16	✔	 
2379-2380	tcp	sg-0bb60cb3faf6a778c	✔	 
22	tcp	10.0.0.0/16	✔	 
10251	tcp	sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07	✔	 
22623	tcp	10.0.0.0/16	✔	 
9000-9999	tcp	sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07	✔	 
10250	tcp	sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07	✔	 
12379-12380	tcp	sg-0bb60cb3faf6a778c	✔	 
4789	udp	sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07	✔	 
10252	tcp	sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07	✔	 
22	tcp	0.0.0.0/0	 	✔
19531	tcp	0.0.0.0/0	 	✔

> 
> Are you saying that your failures are non-deterministic?  And what are you
> running to generate those security-group lists?
The above security group rules were copied directly from the AWS web console.

> If there is more than one rule for a specific port, we apply the most permissive rule. For example, if you have a rule that allows access to TCP port 22 (SSH) from IP address 203.0.113.1 and another rule that allows access to TCP port 22 from everyone, everyone has access to TCP port 22.
Good to know. In comment 5, I thought security group rules would work like RHEL iptables rules, where the order of the rules decides which traffic is allowed or denied. It seems I was wrong.
According to today's testing, the rules allowing traffic to port 22 from 0.0.0.0/0 and to port 19531 from 0.0.0.0/0 are placed in the same order in scenarios 1 and 2, yet the results differ, so I think you are right that security groups apply the most permissive rule.
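The "most permissive rule" behavior quoted above can be sketched as a union of allow rules: security groups contain only allow rules, so a packet is admitted if any rule matches, whatever its position in the list. The model below covers only CIDR sources (rules whose source is another security group are left out), and the `Rule` type and function names are made up for illustration.

```python
import ipaddress
from typing import List, NamedTuple


class Rule(NamedTuple):
    protocol: str   # "tcp", "udp", ...
    from_port: int
    to_port: int
    cidr: str       # source range, e.g. "0.0.0.0/0"


def rule_matches(rule: Rule, protocol: str, port: int, src: str) -> bool:
    """True if one allow rule covers this protocol, port, and source address."""
    return (
        rule.protocol == protocol
        and rule.from_port <= port <= rule.to_port
        and ipaddress.ip_address(src) in ipaddress.ip_network(rule.cidr)
    )


def security_group_allows(rules: List[Rule], protocol: str, port: int, src: str) -> bool:
    """Allow if ANY rule matches; the order of the rules plays no role."""
    return any(rule_matches(r, protocol, port, src) for r in rules)


# CIDR-based rules taken from the tables in this report, in two different orders.
RULES_A = [
    Rule("tcp", 22, 22, "0.0.0.0/0"),
    Rule("tcp", 19531, 19531, "0.0.0.0/0"),
    Rule("tcp", 6443, 6443, "10.0.0.0/16"),
    Rule("tcp", 22, 22, "10.0.0.0/16"),
]
RULES_B = list(reversed(RULES_A))
```

Under this model both orderings admit ssh from an outside address such as 203.0.113.1, which is consistent with rule order not explaining the difference between the passing and failing runs.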

Summary of the test results so far:
v0.14.0: FAIL
v4.0.16-1-dirty: FAIL
v4.0.18-1-dirty: PASS
v4.0.21-1-dirty: PASS

It seems this issue no longer happens with recent installers. I do not know why the behavior differs if it has nothing to do with the order of the security group rules. I will keep an eye on this issue over the next few days.

> 
> [1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#security-group-rules

Comment 9 W. Trevor King 2019-03-11 17:00:13 UTC
> I do not know why the behavior differs if it has nothing to do with the order of the security group rules.

#1348 (referenced in comment 1) fixed this by moving the bootstrap machine back into a public subnet.  It landed after 0.14.0, as documented in the change-log [1].  That explains "broken in 0.14.0" and "fixed with recent builds".  I'm not sure how it maps into v4.0.x builds.

[1]: https://github.com/openshift/installer/blame/v0.14.0/CHANGELOG.md#L26-L34
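For readers wondering how identical security group rules can still end in timeouts: in AWS, a subnet is public when its route table has a route to an internet gateway, so an instance in a subnet without such a route is not reachable from outside regardless of its security groups. A minimal sketch of that check, assuming route-table data shaped like the EC2 DescribeRouteTables response; the function name is hypothetical, and no route tables from the affected clusters appear in this thread.

```python
from typing import Any, Dict


def routes_to_internet_gateway(route_table: Dict[str, Any]) -> bool:
    """True if the table has a default IPv4 route targeting an internet gateway."""
    for route in route_table.get("Routes", []):
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        # Internet gateway IDs start with "igw-"; NAT gateways use a separate field.
        via_igw = str(route.get("GatewayId", "")).startswith("igw-")
        if is_default and via_igw:
            return True
    return False
```

A table whose default route points at a NAT gateway, or that has only the local route, would return `False` here, matching a private subnet.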

Comment 11 W. Trevor King 2019-03-13 06:10:40 UTC
> I can still reproduce this bug with the v4.0.16-1-dirty installer.
> 
> # oc adm release info --commits
> registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 | grep installer
>   installer                                    https://github.com/openshift/installer                                    c8b3b5532694c7713efe300a636108174d623c52

Ah, this commit was a bit before installer#1348 landed with the fix.  So I think that's a pretty solid explanation.

Comment 12 David Sanz 2019-03-14 11:31:24 UTC
Verified on 4.0.0-0.nightly-2019-03-13-233958

Access to the bootstrap node works using both ssh and curl.

Comment 13 Johnny Liu 2019-04-08 03:03:41 UTC
Per comment 12, moving this bug to VERIFIED.