Description of problem:
Version-Release number of the following components:
Steps to Reproduce:
1. Follow https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#configure-router-for-upi-dns to set up a UPI environment on AWS.
2. Follow https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#option-1-dynamic-compute-using-machine-api to scale up a worker node.
The machine is provisioned, but it is not registered as a worker node.
# oc get machine -n openshift-machine-api
NAME                                       INSTANCE              STATE     TYPE       REGION      ZONE         AGE
jialiuuuu1-xbhm2-worker-us-east-2b-6fhst   i-0d59d7e418af95b93   running   m4.large   us-east-2   us-east-2b   12m
# oc get machine -n openshift-machine-api jialiuuuu1-xbhm2-worker-us-east-2b-6fhst -o yaml
- apiVersion: machine.openshift.io/v1beta1
- name: tag:Name
- name: tag:Name
- name: kubernetes.io/cluster/jialiuuuu1-xbhm2
- address: 10.0.64.164
- address: ""
- address: ip-10-0-64-164.us-east-2.compute.internal
- lastProbeTime: "2019-04-09T11:08:57Z"
message: machine successfully created
# oc get node
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-51-169.us-east-2.compute.internal   Ready    master   31h   v1.12.4+509916ce1
ip-10-0-61-5.us-east-2.compute.internal     Ready    worker   31h   v1.12.4+509916ce1
ip-10-0-73-255.us-east-2.compute.internal   Ready    master   31h   v1.12.4+509916ce1
ip-10-0-92-244.us-east-2.compute.internal   Ready    master   31h   v1.12.4+509916ce1
The provisioned machine should register itself as a compute node.
https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#option-2-manually-launching-worker-instances works for me.
Did several trials. Only `oc edit machineset --namespace openshift-machine-api` does not work: the user has to dump an existing MachineSet to a YAML file, update the desired subnet filter, target security group, RHEL CoreOS AMI, and EC2 instance profile in that file, replace every occurrence of 'clustername-infraID' with 'clustername', and create a new MachineSet from the updated file. Then the provisioned machine registers itself with the cluster as a worker.
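The dump-edit-recreate workflow described above can be sketched roughly as follows. This is a hypothetical sketch, not the documented procedure: the MachineSet name and 'clustername-infraID' values are taken from this cluster's output, the YAML stand-in is heavily trimmed for illustration, and the real `oc` calls are left as comments because they need cluster access.

```shell
# Dump an existing MachineSet (requires cluster access):
#   oc get machineset -n openshift-machine-api jialiuuuu1-xbhm2-worker-us-east-2b \
#       -o yaml > machineset.yaml

# For illustration, a trimmed stand-in for the dumped YAML:
cat > machineset.yaml <<'EOF'
metadata:
  name: jialiuuuu1-xbhm2-worker-us-east-2b
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: jialiuuuu1-xbhm2-worker-us-east-2b
EOF

# Replace every 'clustername-infraID' occurrence with the bare cluster name:
sed 's/jialiuuuu1-xbhm2/jialiuuuu1/g' machineset.yaml > machineset-new.yaml

# After also updating the subnet filter, security group, RHCOS AMI, and
# instance profile in machineset-new.yaml, create the new MachineSet:
#   oc create -f machineset-new.yaml
```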
If scaling up nodes via the machine API is removed from the beta release, the customer docs also need to remove the related section. @Kathryn WDYT?
@Jianlin, yes, it does. I'll pull the references to scaling workers via machine api from the AWS UPI doc PR.
@Brenton, @Trevor, if this functionality is supported in a future release, make sure to tag the JIRA for me.
With that in place, the CloudFormation environment has all the subnet, security group, etc. tags that the stock installer Machine(Set)s expect, so they work out of the box. Of course, control-plane Machines created by the machine API won't actually work until we have an etcd operator or some such to wire them up, but I've tested scaling compute solely by tweaking 'replicas', and that worked. Stephen has made some good points about things folks can get wrong by forgetting to take various steps, so I've put up some more background on why I took the approach I did for the installer docs. Obviously, openshift-docs is free to take another path (and try to talk the installer into following along) if my argument doesn't seem to hold water ;). Moving to POST because this bug targets the installer docs. I dunno whether *this* bug needs to get cloned into a Documentation bug as well, or if bug 1698207 (now about porting installer#1649 into openshift-docs) is sufficient. Kathryn?
Yes, from docs perspective, this is a separate issue. I can and will patch your method in installer#1649 about getting 0 machinesets to work for beta 4, but supporting the choice between using or not using the cluster-controlled workers is a different user story and doc task at this point.
Agreed, I also think this is a separate issue from 1698207.
This bug is talking about "launch workers that are backed by MachineSets", while 1698207 is talking about "0 control plane machines".
1. Ran testing against https://github.com/openshift/installer/pull/1649; the updated CF template introduced a critical issue.
+ ./openshift-install wait-for install-complete --dir ./upi_2019-04-25-02-31-11
INFO Waiting up to 30m0s for the cluster at https://api.jialiu-upi2.qe1.devcluster.openshift.com:6443 to initialize...
FATAL failed to initialize the cluster: timed out waiting for the condition
On the bootstrap node, I got the following failure message:
[core@ip-10-0-6-128 ~]$ journalctl -b -f -u bootkube.service
-- Logs begin at Thu 2019-04-25 06:47:33 UTC. --
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh: https://etcd-2.jialiu-upi2.qe1.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.94.134:2379: connect: connection refused
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh: https://etcd-1.jialiu-upi2.qe1.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.75.207:2379: connect: connection refused
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh: https://etcd-0.jialiu-upi2.qe1.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.58.216:2379: connect: connection refused
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh: Error: unhealthy cluster
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh: etcdctl failed. Retrying in 5 seconds...
There is no way to log in to the masters, because the MCO has not booted up yet, so no SSH key has been injected into the master nodes yet. Luckily, the AWS web console provides a menu to get the system log.
From the master system log, I saw the following error:
[ 8.506438] ignition: GET https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: attempt #3
[ 8.523044] ignition: GET error: Get https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: dial tcp: lookup api.jialiu-upi2.qe1.devcluster.openshift.com on 10.0.0.2:53: no such host
[ 9.323533] ignition: GET https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: attempt #4
[ 9.338622] ignition: GET error: Get https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: dial tcp: lookup api.jialiu-upi2.qe1.devcluster.openshift.com on 10.0.0.2:53: no such host
[ 10.939125] ignition: GET https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: attempt #5
[ 10.953984] ignition: GET error: Get https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: dial tcp: lookup api.jialiu-upi2.qe1.devcluster.openshift.com on 10.0.0.2:53: no such host
[  *   ] A start job is running for Ignition (disks) (9s / no limit)
[ **   ] A start job is running for Ignition (disks) (10s / no limit)
[***   ] A start job is running for Ignition (disks) (10s / no limit)
[ 14.154469] ignition: GET https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: attempt #6
[ 14.165692] ignition: GET error: Get https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: dial tcp: lookup api.jialiu-upi2.qe1.devcluster.openshift.com on 10.0.0.2:53: no such host
Going to Route 53, I confirmed that no 'api' record was added; only 'api-int' was. Comparing with a successful IPI install, both 'api' and 'api-int' are created there. After adding the 'api' record by hand in Route 53, the error message disappeared. So I think the root cause is https://github.com/openshift/installer/pull/1649#discussion_r278510894.
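The "add the 'api' record by hand" workaround could look roughly like the sketch below. This is a hypothetical stop-gap, not a documented fix: the hosted-zone ID is a placeholder, and pointing 'api' at 'api-int' via CNAME is only an assumption that is good enough for in-VPC resolution (the proper record should target the external load balancer, as IPI does). The `aws` call is left as a comment because it needs credentials.

```shell
# Assumed cluster domain, taken from the logs above:
CLUSTER_DOMAIN="jialiu-upi2.qe1.devcluster.openshift.com"

# Build a Route 53 change batch that creates the missing 'api' record.
# (CNAME to api-int is an assumption that suffices for in-VPC lookups.)
cat > add-api-record.json <<EOF
{
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "api.${CLUSTER_DOMAIN}",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [{"Value": "api-int.${CLUSTER_DOMAIN}"}]
      }
    }
  ]
}
EOF

# Apply it (Z123EXAMPLE is a placeholder hosted-zone ID; needs AWS credentials):
#   aws route53 change-resource-record-sets \
#       --hosted-zone-id Z123EXAMPLE \
#       --change-batch file://add-api-record.json
```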
2. After working around the 1st issue, I hit a 2nd issue, copied from the PR:
If you want to select target subnets (or instance profile, security group, etc.) by ID or by using a different tag, adjust that as well.
This sentence looks ambiguous: as a user, even if I strictly follow the doc and the example CF templates, I would not know how to edit them to make things work.
I also checked the PR; it seems the PR is trying to make the example more accurate, so the user only needs to edit 'replicas' to '1' with no extra steps. So I think we could make some further enhancements here to smooth this out for the user.
Here is the default machineset after install:
- name: tag:Name
- name: tag:Name
- name: kubernetes.io/cluster/jialiu-upi2-rlv4l
As of my latest test results:
1. For iamInstanceProfile.id, no further change is needed. The PR makes a good improvement here.
2. For subnet, no further change is needed. The PR makes a good improvement here.
3. For securityGroups, the user has to manually add a 'Name=jialiu-upi2-rlv4l-worker-sg' tag to the 'worker-sg-jialiu-upi2-rlv4l' security group created by 03_cluster_security.yaml. I think the CF template could add the 'Name' tag upon SG creation, just like the iamInstanceProfile enhancement.
I also tried to select the SG by ID like this:
- name: tag:id
It does not work; it seems the securityGroups filters only work with tag:Name.
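A hypothetical sketch of the manual tagging step from point 3, using the AWS CLI. The infra ID and SG names come from this cluster's output; the group-name lookup is an assumption about how 03_cluster_security.yaml names the SG, and the `aws` calls are left as comments because they need credentials.

```shell
INFRA_ID="jialiu-upi2-rlv4l"   # this cluster's infra ID, from the MachineSet above

# Look up the worker SG created by 03_cluster_security.yaml and give it the
# 'Name' tag the stock MachineSet securityGroups filter matches on:
#   SG_ID=$(aws ec2 describe-security-groups \
#       --filters "Name=group-name,Values=worker-sg-${INFRA_ID}" \
#       --query 'SecurityGroups[0].GroupId' --output text)
#   aws ec2 create-tags --resources "$SG_ID" \
#       --tags "Key=Name,Value=${INFRA_ID}-worker-sg"

# Record the tag value the filter expects, for reference:
echo "${INFRA_ID}-worker-sg" > expected-sg-name.txt
```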
4. From a documentation perspective, ami.id should be mentioned so the user knows it is editable and can make sure it matches the RHCOS version of the other machines launched via the CF template parameter.
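A hypothetical check for point 4: confirm the MachineSet's ami.id matches the RHCOS AMI passed to the CloudFormation stacks. The AMI ID below is a placeholder, and the `oc` lookup is left as a comment because it needs cluster access.

```shell
RHCOS_AMI="ami-0123456789abcdef0"   # placeholder: the RhcosAmi CF parameter you used

# Pull the AMI the MachineSet would launch and compare (requires cluster access):
#   MS_AMI=$(oc get machineset -n openshift-machine-api \
#       -o jsonpath='{.items[0].spec.template.spec.providerSpec.value.ami.id}')
#   [ "$MS_AMI" = "$RHCOS_AMI" ] || echo "MachineSet ami.id differs; edit it to match"

# Record the expected value, for reference:
echo "$RHCOS_AMI" > expected-ami.txt
```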
Withdrawing issue #1 mentioned in comment 11; my fault, I forgot to change the IgnitionLocation URL accordingly. I will re-run my testing and update here with any new findings.
@Kathryn, leave https://github.com/openshift/openshift-docs/pull/14241#pullrequestreview-230620974 for doc update.
Now mainly focus on my issue #2 in comment 11.
api-int is more appropriate for this use. We should never need to serve Ignition configs outside of the cluster.
@Jianlin, I made more updates based on your #11 issue 2.
Is the update for the worker security group ID correct now? https://github.com/openshift/openshift-docs/pull/14241/files#diff-93324577ed8dddfc44da93c402766bfcR95
Before addressing my issue 2 from comment 11, the more important thing is comment 13: is the api-int x509 issue fixed now? If not, users cannot even get a successful install.
I've updated my in-flight installer PR to suggest removing the compute MachineSets as well. That gets us down to a single track (CloudFormation-launched compute nodes) in the docs and simplifies things (e.g. we no longer need the hairy sed for zeroing compute replicas). Folks are still free to opt-in to the machine API if they want, but that will be up to docs outside the installer repository (e.g. those in flight with  or in openshift/openshift-docs). I'll leave this on the installer team until installer#1649 lands, but then I think it should transition to the Documentation or Cloud Compute components to sort out any remaining issues between the machine-API operator and UPI environments. Possibly in a new bug, if that helps condense any findings from earlier in this bug.
Trevor, thanks for the update on the worker MachineSets. Do you have any comments on the "api-int x509" issue?
Moving to POST, because installer#1649 is in flight.
And I used those templates to launch a successful cluster this morning, so there is at least one path that avoids X.509 errors (the templates have also changed since the 30th, so that may have been it). If MachineSets on UPI are still an issue, I think we want to open a new bug in the Documentation and/or Cloud Compute components to sort them out, because now the installer is out of that business ;).
Version-Release number of the following components:
Steps to verify:
1. Go through the steps listed in the 
2. Verify that there is no guidance for Machine(set)s in the documentation.
3. Following the docs, install a UPI-on-AWS cluster.
Hence, moving this bug to VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.