Description of problem:

Version-Release number of the following components:

Extracted the installer from registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-05-065158

# ./openshift-install version
./openshift-install v4.0.15-1-dirty

How reproducible:
Always

Steps to Reproduce:
1. When a cluster install is not going well, the user should be able to follow https://github.com/openshift/installer/blob/master/docs/user/troubleshooting.md#troubleshooting-the-bootstrap-node to troubleshoot the bootstrap node.

Actual results:
The ssh connection failed with a timeout, and curl against the bootstrap service failed as well. I am sure the bootstrap node was running when I ran the following command.

[root@preserve-jialiu-ansible demo2]# curl -vvv --insecure --cert ./tls/journal-gatewayd.crt --key ./tls/journal-gatewayd.key 'https://ec2-18-223-241-12.us-east-2.compute.amazonaws.com:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'
* About to connect() to ec2-18-223-241-12.us-east-2.compute.amazonaws.com port 19531 (#0)
*   Trying 18.223.241.12...
* Connection timed out
* Failed connect to ec2-18-223-241-12.us-east-2.compute.amazonaws.com:19531; Connection timed out
* Closing connection 0
curl: (7) Failed connect to ec2-18-223-241-12.us-east-2.compute.amazonaws.com:19531; Connection timed out

It seems like a security group is blocking the connection from my localhost?

Expected results:
Allow the user to connect to the bootstrap node for troubleshooting.
> ... ssh connection failed with timeout...

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-05-065158 | grep installer
  installer https://github.com/openshift/installer c8b3b5532694c7713efe300a636108174d623c52
$ git log --first-parent --format='%ad %h %d %s' --date=iso c8b3b553^..origin/master | cat
2019-03-05 02:38:46 -0800 18fca42a0  (HEAD -> master, wking/master, origin/master, origin/HEAD) Merge pull request #1351 from flaper87/default-dns
2019-03-04 23:46:12 -0800 395e1a5b2  Merge pull request #1243 from flaper87/multi-run-per-tenant
2019-03-04 18:46:43 -0800 0dd936708  Merge pull request #1361 from eparis/explicit-network-interface
2019-03-04 14:23:46 -0800 83d7fcab4  Merge pull request #1358 from ashcrow/rhcos-name-update
2019-03-04 12:29:18 -0800 7e8ee7d0d  Merge pull request #1359 from ashcrow/update-code-references-to-rhcos
2019-03-04 11:17:27 -0800 04651f0da  Merge pull request #1334 from squeed/allow-udp
2019-03-04 09:51:07 -0800 0ddac4193  Merge pull request #1331 from abhinavdahiya/docs-custom-mc
2019-03-04 06:18:57 -0800 27b9175c6  Merge pull request #1355 from wking/drop-uuidgen-gzip
2019-03-01 19:06:31 -0800 a98d4d75c  Merge pull request #1348 from abhinavdahiya/fix_bootstrap_subnet
2019-03-01 13:56:19 -0800 3fd354f75  Merge pull request #1292 from umohnani8/infra-image
2019-03-01 12:06:44 -0800 f41af0dab  Merge pull request #1347 from wking/wrapf-to-wrap
2019-03-01 08:25:43 -0800 3a5193cdd  Merge pull request #1346 from staebler/delete_snapshot
2019-03-01 05:38:28 -0800 c8b3b5532  (origin/release-4.0) Merge pull request #1338 from flaper87/sec-groups-update

The fix for the SSH issue was #1348.
Still reproducible with the v4.0.16-1-dirty installer.

# oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 | grep installer
  installer https://github.com/openshift/installer c8b3b5532694c7713efe300a636108174d623c52

1. Extract the installer from 4.0.0-0.nightly-2019-03-06-074438
2. Run the install
3. Once the bootstrap node is running, run the ssh and curl commands against it; both failed with a timeout error.

[root@preserve-jialiu-ansible 20190307]# ssh -i libra-new.pem core.compute.amazonaws.com -vvv
Warning: Identity file libra-new.pem not accessible: No such file or directory.
OpenSSH_7.4p1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 58: Applying options for *
debug2: resolving "ec2-35-180-187-123.eu-west-3.compute.amazonaws.com" port 22
debug2: ssh_connect_direct: needpriv 0
debug1: Connecting to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com [35.180.187.123] port 22.
debug1: connect to address 35.180.187.123 port 22: Connection timed out
ssh: connect to host ec2-35-180-187-123.eu-west-3.compute.amazonaws.com port 22: Connection timed out

[root@preserve-jialiu-ansible 20190307]# curl -vvv --insecure --cert ./bz1674034/tls/journal-gatewayd.crt --key ./bz1674034/tls/journal-gatewayd.key 'https://ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'
* About to connect() to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com port 19531 (#0)
*   Trying 35.180.187.123...
* Connection timed out
* Failed connect to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531; Connection timed out
* Closing connection 0
curl: (7) Failed connect to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531; Connection timed out
> [root@preserve-jialiu-ansible 20190307]# curl -vvv --insecure --cert ./bz1674034/tls/journal-gatewayd.crt --key ./bz1674034/tls/journal-gatewayd.key 'https://ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531/entries?follow&_SYSTEMD_UNIT=bootkube.service'
> * About to connect() to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com port 19531 (#0)
> *   Trying 35.180.187.123...
> * Connection timed out
> * Failed connect to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531; Connection timed out
> * Closing connection 0
> curl: (7) Failed connect to ec2-35-180-187-123.eu-west-3.compute.amazonaws.com:19531; Connection timed out

Hmm, it's working in CI [1] with [2]. Looks like you're using DNS instead of the documented bootstrap IP [3]. You can get the bootstrap IP from terraform.tfstate [4] (until the bootstrap machine has been torn down), which isn't very convenient, but it works. You can also retrieve the public IP with the AWS CLI or web console.

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/5394/artifacts/release-e2e-aws/bootstrap/bootkube.service
[2]: https://github.com/openshift/release/blob/1614bde2cfe28fa1dc2e242b8feda66db933ef99/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L361-L368
[3]: https://github.com/openshift/installer/blob/v0.14.0/docs/user/troubleshooting.md#troubleshooting-the-bootstrap-node
[4]: https://github.com/openshift/release/blob/1614bde2cfe28fa1dc2e242b8feda66db933ef99/ci-operator/templates/openshift/installer/cluster-launch-installer-e2e.yaml#L352-L354
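For what it's worth, the terraform.tfstate route can be scripted. The sketch below is a guess at the Terraform-0.11-era state layout (a top-level "modules" list, resources keyed by "type.name", attributes under "primary"); the resource name aws_instance.bootstrap and the layout are assumptions, not verified against the installer's actual state file, and the sample state is fabricated for illustration.

```python
import json

def bootstrap_public_ip(tfstate_path):
    """Return the public_ip of the first aws_instance resource whose name
    mentions 'bootstrap'. Assumes a Terraform-0.11-style state layout."""
    with open(tfstate_path) as f:
        state = json.load(f)
    for module in state.get("modules", []):
        for name, resource in module.get("resources", {}).items():
            if resource.get("type") == "aws_instance" and "bootstrap" in name:
                return resource["primary"]["attributes"].get("public_ip")
    return None

# Minimal fabricated state file for illustration only; a real
# terraform.tfstate contains many more resources and attributes.
sample_state = {
    "modules": [{
        "resources": {
            "aws_instance.bootstrap": {
                "type": "aws_instance",
                "primary": {"attributes": {"public_ip": "18.223.241.12"}},
            },
        },
    }],
}
with open("/tmp/terraform.tfstate", "w") as f:
    json.dump(sample_state, f)

print(bootstrap_public_ip("/tmp/terraform.tfstate"))  # → 18.223.241.12
```

Equivalently, `aws ec2 describe-instances` (filtered on the bootstrap instance, e.g. by its public DNS name) can report the public IP from the CLI.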
Hmm, it seems this has nothing to do with using the public IP vs. DNS; it looks more related to the security group rules.

Today I ran a new install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509; both the ssh connection and curl against the bootstrap service succeeded. The following are the bootstrap node's security group rules (terraform-20190308023923833100000005 / terraform-20190308023923607100000003):

Ports        Protocol  Source
22           tcp       0.0.0.0/0                                   ✔
19531        tcp       0.0.0.0/0                                   ✔
6443         tcp       10.0.0.0/16                                 ✔
2379-2380    tcp       sg-0f57ce794823c9200                        ✔
9000-9999    udp       sg-0580f087a471058ed, sg-0f57ce794823c9200  ✔
22623        tcp       10.0.0.0/16                                 ✔
12379-12380  tcp       sg-0f57ce794823c9200                        ✔
4789         udp       sg-0580f087a471058ed, sg-0f57ce794823c9200  ✔
10252        tcp       sg-0580f087a471058ed, sg-0f57ce794823c9200  ✔
30000-32767  tcp       sg-0f57ce794823c9200                        ✔
0            icmp      10.0.0.0/16                                 ✔
22           tcp       10.0.0.0/16                                 ✔
10251        tcp       sg-0580f087a471058ed, sg-0f57ce794823c9200  ✔
9000-9999    tcp       sg-0580f087a471058ed, sg-0f57ce794823c9200  ✔
10250        tcp       sg-0580f087a471058ed, sg-0f57ce794823c9200  ✔

I also tried another install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509; both the ssh connection and curl against the bootstrap service still failed with a timeout error like before (terraform-20190308025600338000000002 / terraform-20190308025601505100000004):

Ports        Protocol  Source
30000-32767  tcp       sg-08896a9217945d18e                        ✔
6443         tcp       10.0.0.0/16                                 ✔
0            icmp      10.0.0.0/16                                 ✔
2379-2380    tcp       sg-08896a9217945d18e                        ✔
22           tcp       10.0.0.0/16                                 ✔
10251        tcp       sg-08896a9217945d18e, sg-0cece65e7aa8dda65  ✔
22623        tcp       10.0.0.0/16                                 ✔
9000-9999    tcp       sg-08896a9217945d18e, sg-0cece65e7aa8dda65  ✔
10250        tcp       sg-08896a9217945d18e, sg-0cece65e7aa8dda65  ✔
12379-12380  tcp       sg-08896a9217945d18e                        ✔
4789         udp       sg-08896a9217945d18e, sg-0cece65e7aa8dda65  ✔
10252        tcp       sg-08896a9217945d18e, sg-0cece65e7aa8dda65  ✔
22           tcp       0.0.0.0/0                                   ✔
19531        tcp       0.0.0.0/0                                   ✔

Comparing with the rules above, the rules allowing traffic to port 22 from 0.0.0.0/0 and to port 19531 from 0.0.0.0/0 appear in a different order. I think that is the root cause.
> Comparing with the rules above, the rules allowing traffic to port 22 from 0.0.0.0/0 and to port 19531 from 0.0.0.0/0 appear in a different order. I think that is the root cause.

Ah, that is probably what's happening, thanks :). Although now I don't understand why this is working in CI :p. But I'll patch to avoid the overlap.
Actually, looks like the overlap may not be a problem. From [1]:

> If there is more than one rule for a specific port, we apply the most permissive rule. For example, if you have a rule that allows access to TCP port 22 (SSH) from IP address 203.0.113.1 and another rule that allows access to TCP port 22 from everyone, everyone has access to TCP port 22.

Taking a closer look at your comment, I don't see a difference between your two runs:

> Today I run a new install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509, both ssh connection and curl bootstrap service succeed.
> ...
> I also tried another install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509, both ssh connection and curl bootstrap service still failed with timeout error like before.

Are you saying that your failures are non-deterministic? And what are you running to generate those security-group lists?

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#security-group-rules
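As an aside on the semantics quoted above: security-group evaluation is effectively an order-independent "allow if any rule matches" union, unlike the first-match-wins evaluation of iptables chains. A toy model of that behavior (not AWS code; the rule tuples below are made up for illustration):

```python
from ipaddress import ip_address, ip_network

# Hypothetical ingress rules: (protocol, low port, high port, source CIDR).
rules = [
    ("tcp", 22, 22, "10.0.0.0/16"),
    ("tcp", 22, 22, "0.0.0.0/0"),
]

def sg_allows(rules, proto, port, src):
    """Security-group style: traffic is allowed if ANY rule matches,
    so rule order never matters (there are only allow rules)."""
    return any(
        p == proto and lo <= port <= hi and ip_address(src) in ip_network(cidr)
        for p, lo, hi, cidr in rules
    )

# Same verdict regardless of the order the rules appear in:
print(sg_allows(rules, "tcp", 22, "203.0.113.1"))                  # True
print(sg_allows(list(reversed(rules)), "tcp", 22, "203.0.113.1"))  # True
```

Under this model the 22/19531-from-0.0.0.0/0 rules appearing earlier or later in the console listing cannot change the outcome, which matches the AWS documentation quoted above.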
(In reply to W. Trevor King from comment #7)
> Actually, looks like the overlap may not be a problem. From [1]:
>
> Taking a closer look at your comment, I don't see a difference between your two runs:
>
> > Today I run a new install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509, both ssh connection and curl bootstrap service succeed.
> > ...
> > I also tried another install using 47.330 rhcos + v4.0.18-1-dirty + 4.0.0-0.nightly-2019-03-07-085509, both ssh connection and curl bootstrap service still failed with timeout error like before.

Sorry, my fault; that must have been a copy/paste mistake, but the security group rules were exactly what I saw at that moment. I cannot double-check now because 4.0.0-0.nightly-2019-03-07-085509 no longer exists. I re-ran some testing with different builds:

1. The released v0.14.0 installer + the default RHCOS and release payload image

Both the ssh connection and curl against the bootstrap service failed with a timeout error like before (terraform-20190311072902200900000005 / terraform-20190311072900731800000002):

Ports        Protocol  Source
22           tcp       0.0.0.0/0                                   ✔
19531        tcp       0.0.0.0/0                                   ✔
30000-32767  tcp       sg-0f23e2ffbd48992dd                        ✔
6443         tcp       10.0.0.0/16                                 ✔
0            icmp      10.0.0.0/16                                 ✔
2379-2380    tcp       sg-0f23e2ffbd48992dd                        ✔
22           tcp       10.0.0.0/16                                 ✔
10251        tcp       sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd  ✔
22623        tcp       10.0.0.0/16                                 ✔
9000-9999    tcp       sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd  ✔
10250        tcp       sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd  ✔
12379-12380  tcp       sg-0f23e2ffbd48992dd                        ✔
4789         udp       sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd  ✔
10252        tcp       sg-095464dd6572ed30f, sg-0f23e2ffbd48992dd  ✔

2. The v4.0.21-1-dirty installer extracted from 4.0.0-0.nightly-2019-03-10-151536 + 47.330 rhcos + the 4.0.0-0.nightly-2019-03-06-074438 release payload image

Both the ssh connection and curl against the bootstrap service succeeded.
(terraform-20190311072920099000000004 / terraform-20190311072917634500000003)

Ports        Protocol  Source
22           tcp       0.0.0.0/0                                   ✔
19531        tcp       0.0.0.0/0                                   ✔
6443         tcp       10.0.0.0/16                                 ✔
2379-2380    tcp       sg-0f7ce2a0fdfc11b94                        ✔
9000-9999    udp       sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94  ✔
22623        tcp       10.0.0.0/16                                 ✔
12379-12380  tcp       sg-0f7ce2a0fdfc11b94                        ✔
4789         udp       sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94  ✔
10252        tcp       sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94  ✔
30000-32767  tcp       sg-0f7ce2a0fdfc11b94                        ✔
0            icmp      10.0.0.0/16                                 ✔
22           tcp       10.0.0.0/16                                 ✔
10251        tcp       sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94  ✔
9000-9999    tcp       sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94  ✔
10250        tcp       sg-03b50cb420df33fc2, sg-0f7ce2a0fdfc11b94  ✔

3. The v4.0.16-1-dirty installer extracted from 4.0.0-0.nightly-2019-03-06-074438 + 47.330 rhcos + the 4.0.0-0.nightly-2019-03-06-074438 release payload image (terraform-20190311085916028700000002 / terraform-20190311085916029800000003)

Ports        Protocol  Source
30000-32767  tcp       sg-0bb60cb3faf6a778c                        ✔
6443         tcp       10.0.0.0/16                                 ✔
0            icmp      10.0.0.0/16                                 ✔
2379-2380    tcp       sg-0bb60cb3faf6a778c                        ✔
22           tcp       10.0.0.0/16                                 ✔
10251        tcp       sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07  ✔
22623        tcp       10.0.0.0/16                                 ✔
9000-9999    tcp       sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07  ✔
10250        tcp       sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07  ✔
12379-12380  tcp       sg-0bb60cb3faf6a778c                        ✔
4789         udp       sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07  ✔
10252        tcp       sg-0bb60cb3faf6a778c, sg-0ceab163f4daa0f07  ✔
22           tcp       0.0.0.0/0                                   ✔
19531        tcp       0.0.0.0/0                                   ✔

> Are you saying that your failures are non-deterministic? And what are you running to generate those security-group lists?

The security group rules above were copied directly from the AWS web console.

> If there is more than one rule for a specific port, we apply the most permissive rule. For example, if you have a rule that allows access to TCP port 22 (SSH) from IP address 203.0.113.1 and another rule that allows access to TCP port 22 from everyone, everyone has access to TCP port 22.

Good to know.
In comment 5, I thought security group rules would work like RHEL iptables rules, where the order of the rules decides which traffic is allowed or denied. It seems I was wrong. In today's testing, the rules allowing traffic to port 22 from 0.0.0.0/0 and to port 19531 from 0.0.0.0/0 are placed in the same order in scenarios 1 and 2 yet give different results, so I think you are right that security groups apply the most permissive rule.

Summary of the test results so far:

v0.14.0: FAIL
v4.0.16-1-dirty: FAIL
v4.0.18-1-dirty: PASS
v4.0.21-1-dirty: PASS

It seems this issue no longer happens with recent installers. I have no idea why the builds behave differently if this has nothing to do with security-group rule order. I will keep an eye on this issue over the following days.
> I have no idea why they have different behavior if this is nothing with security group rules order.

#1348 (referenced in comment 1) fixed this by moving the bootstrap machine back into a public subnet. It landed after 0.14.0, as documented in the change-log [1]. That explains "broken in 0.14.0" and "fixed with recent builds". I'm not sure how it maps into v4.0.x builds.

[1]: https://github.com/openshift/installer/blame/v0.14.0/CHANGELOG.md#L26-L34
> Still reproduce this bug with v4.0.16-1-dirty installer.
>
> # oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 | grep installer
>   installer https://github.com/openshift/installer c8b3b5532694c7713efe300a636108174d623c52

Ah, this commit is from a bit before installer#1348 landed with the fix, so I think that's a pretty solid explanation.
Verified on 4.0.0-0.nightly-2019-03-13-233958. Access to the bootstrap node now works via both ssh and curl.
Per comment 12, move this bug to VERIFIED.