Bug 1697968

Summary: [upi-on-aws] Dynamic Compute created using Machine API can not register itself into cluster

Product: OpenShift Container Platform
Component: Installer
Sub component: openshift-installer
Version: 4.1.0
Target Release: 4.1.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Severity: medium
Priority: medium
Status: CLOSED ERRATA
Reporter: Johnny Liu <jialiu>
Assignee: W. Trevor King <wking>
QA Contact: Siva Reddy <schituku>
CC: adahiya, bleanhar, crawford, decarr, gpei, kalexand, scuppett, sponnaga, wking, wmeng
Doc Type: No Doc Update
Last Closed: 2019-06-04 10:47:18 UTC

Description Johnny Liu 2019-04-09 11:23:57 UTC
Description of problem:

Version-Release number of the following components:
4.0.0-0.nightly-2019-04-05-165550

How reproducible:
Always

Steps to Reproduce:
1. Follow https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#configure-router-for-upi-dns to set up a UPI environment on AWS.
2. Follow https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#option-1-dynamic-compute-using-machine-api to scale up a worker node (e.g. as sketched below).
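For reference, a minimal sketch of the scale-up step in step 2 as it is typically done with the machine API; the MachineSet name is taken from this environment, and the command form is standard oc usage rather than a quote from the linked doc:

# oc scale machineset jialiuuuu1-xbhm2-worker-us-east-2b --replicas=1 -n openshift-machine-api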

Actual results:
The machine is provisioned, but it is not registered as a worker node.
# oc get machine -n openshift-machine-api
NAME                                       INSTANCE              STATE     TYPE       REGION      ZONE         AGE
jialiuuuu1-xbhm2-worker-us-east-2b-6fhst   i-0d59d7e418af95b93   running   m4.large   us-east-2   us-east-2b   12m

# oc get machine -n openshift-machine-api jialiuuuu1-xbhm2-worker-us-east-2b-6fhst -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  creationTimestamp: "2019-04-09T11:08:56Z"
  finalizers:
  - machine.machine.openshift.io
  generateName: jialiuuuu1-xbhm2-worker-us-east-2b-
  generation: 1
  labels:
    machine.openshift.io/cluster-api-cluster: jialiuuuu1-xbhm2
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/cluster-api-machineset: jialiuuuu1-xbhm2-worker-us-east-2b
  name: jialiuuuu1-xbhm2-worker-us-east-2b-6fhst
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: machine.openshift.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: MachineSet
    name: jialiuuuu1-xbhm2-worker-us-east-2b
    uid: 7405933a-59ac-11e9-a7b3-025808c4e0c0
  resourceVersion: "784350"
  selfLink: /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/jialiuuuu1-xbhm2-worker-us-east-2b-6fhst
  uid: de16bc23-5ab7-11e9-b6f5-06d19159fdbe
spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      ami:
        id: ami-0eef624367320ec26
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      blockDevices:
      - ebs:
          iops: 0
          volumeSize: 120
          volumeType: gp2
      credentialsSecret:
        name: aws-cloud-credentials
      deviceIndex: 0
      iamInstanceProfile:
        id: jialiuuuu1-sg-iam-WorkerInstanceProfile-1B4GDOZU48MAK
      instanceType: m4.large
      kind: AWSMachineProviderConfig
      metadata:
        creationTimestamp: null
      placement:
        availabilityZone: us-east-2b
        region: us-east-2
      publicIp: null
      securityGroups:
      - filters:
        - name: tag:Name
          values:
          - worker-sg-jialiuuuu1
      subnet:
        filters:
        - name: tag:Name
          values:
          - jialiuuuu1-xbhm2-private-us-east-2b
      tags:
      - name: kubernetes.io/cluster/jialiuuuu1-xbhm2
        value: owned
      userDataSecret:
        name: worker-user-data
  versions:
    kubelet: ""
status:
  addresses:
  - address: 10.0.64.164
    type: InternalIP
  - address: ""
    type: ExternalDNS
  - address: ip-10-0-64-164.us-east-2.compute.internal
    type: InternalDNS
  lastUpdated: "2019-04-09T11:09:17Z"
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
    - lastProbeTime: "2019-04-09T11:08:57Z"
      lastTransitionTime: "2019-04-09T11:08:57Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreation
    instanceId: i-0d59d7e418af95b93
    instanceState: running
    kind: AWSMachineProviderStatus

# oc get node
NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-0-51-169.us-east-2.compute.internal   Ready    master   31h   v1.12.4+509916ce1
ip-10-0-61-5.us-east-2.compute.internal     Ready    worker   31h   v1.12.4+509916ce1
ip-10-0-73-255.us-east-2.compute.internal   Ready    master   31h   v1.12.4+509916ce1
ip-10-0-92-244.us-east-2.compute.internal   Ready    master   31h   v1.12.4+509916ce1

Expected results:
The provisioned machine should register itself with the cluster as a compute node.


Comment 2 Johnny Liu 2019-04-10 09:06:04 UTC
After several trials: `oc edit machineset --namespace openshift-machine-api` alone does not work. The user has to dump an existing MachineSet to a YAML file, update the desired subnet filter, target security group, RHEL CoreOS AMI, and EC2 instance profile in that file, replace every occurrence of 'clustername-infraID' with 'clustername', and create a new MachineSet from the updated file; then the provisioned machine registers itself with the cluster as a worker (see the sketch below).
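For illustration, a minimal sketch of that workaround; the MachineSet name is from this environment, and the fields to edit are those listed above:

# oc get machineset -n openshift-machine-api jialiuuuu1-xbhm2-worker-us-east-2b -o yaml > new-machineset.yaml
  (edit new-machineset.yaml: update the subnet filter, security group filter,
   RHCOS AMI ID, and IAM instance profile; replace 'clustername-infraID'
   occurrences with 'clustername'; drop the status, uid, and resourceVersion
   fields and give the MachineSet a new name)
# oc create -f new-machineset.yaml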

Comment 5 Johnny Liu 2019-04-16 05:57:58 UTC
If scaling up nodes via the machine API is removed from the beta release, the customer doc also needs to remove the related section. @Kathryn, WDYT?

Comment 6 Kathryn Alexander 2019-04-16 12:15:21 UTC
@Jianlin, yes, it does. I'll pull the references to scaling workers via machine api from the AWS UPI doc PR.

@Brenton, @Trevor, if this functionality is supported in a future release, make sure to tag the JIRA for me.

Comment 9 W. Trevor King 2019-04-24 07:26:34 UTC
With [1] the CloudFormation environment has all the subnet, security group, etc. tags that the stock installer Machine(Set)s expect, so they work out of the box.  Of course, control-plane Machines created by the machine-API won't actually work until we have an etcd operator or some such to wire them up, but I've tested scaling compute solely by tweaking 'replicas' and that worked.  Stephen has made some good points about things folks can get wrong by forgetting to take various steps, so I've put up [2] with some more background on why I took the approach I did for the installer docs.  Obviously, openshift-docs is free to take another path (and try and talk the installer into following along), if my argument doesn't seem to hold water ;).  Moving to POST because this bug targets the installer docs.  I dunno whether *this* bug needs to get cloned into a Documentation bug as well, or if bug 1698207 (now about porting installer#1649 into openshift-docs) is sufficient.  Kathryn?

[1]: https://github.com/openshift/installer/pull/1649
[2]: https://github.com/openshift/installer/pull/1649#issuecomment-486006440

Comment 10 Kathryn Alexander 2019-04-24 12:59:17 UTC
Yes, from the docs perspective, this is a separate issue. I can and will document your method from installer#1649 for getting 0 MachineSets to work for beta 4, but supporting the choice between using or not using the cluster-controlled workers is a different user story and doc task at this point.

Comment 11 Johnny Liu 2019-04-25 11:50:34 UTC
Agreed, I also think this is a separate issue from bug 1698207.

This bug is about "launch workers that are backed by MachineSets", while bug 1698207 is about "0 control plane machines".

1. Ran testing against https://github.com/openshift/installer/pull/1649; the updated CF template introduced a critical issue.

+ ./openshift-install wait-for install-complete --dir ./upi_2019-04-25-02-31-11
INFO Waiting up to 30m0s for the cluster at https://api.jialiu-upi2.qe1.devcluster.openshift.com:6443 to initialize... 
FATAL failed to initialize the cluster: timed out waiting for the condition 


On the bootstrap node, I got this failure message:
[core@ip-10-0-6-128 ~]$ journalctl -b -f -u bootkube.service
-- Logs begin at Thu 2019-04-25 06:47:33 UTC. --
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh[1335]: https://etcd-2.jialiu-upi2.qe1.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.94.134:2379: connect: connection refused
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh[1335]: https://etcd-1.jialiu-upi2.qe1.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.75.207:2379: connect: connection refused
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh[1335]: https://etcd-0.jialiu-upi2.qe1.devcluster.openshift.com:2379 is unhealthy: failed to connect: dial tcp 10.0.58.216:2379: connect: connection refused
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh[1335]: Error: unhealthy cluster
Apr 25 07:39:54 ip-10-0-6-128 bootkube.sh[1335]: etcdctl failed. Retrying in 5 seconds...


There is no way to log in to the master, because the MCO has not booted up yet, so no SSH key has been injected into the master node. Luckily, the AWS web console provides a menu to get the system log.
The master system log shows the following errors:
[    8.506438] ignition[574]: GET https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: attempt #3
[    8.523044] ignition[574]: GET error: Get https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: dial tcp: lookup api.jialiu-upi2.qe1.devcluster.openshift.com on 10.0.0.2:53: no such host
[    9.323533] ignition[574]: GET https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: attempt #4
[    9.338622] ignition[574]: GET error: Get https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: dial tcp: lookup api.jialiu-upi2.qe1.devcluster.openshift.com on 10.0.0.2:53: no such host
[   10.939125] ignition[574]: GET https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: attempt #5
[   10.953984] ignition[574]: GET error: Get https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: dial tcp: lookup api.jialiu-upi2.qe1.devcluster.openshift.com on 10.0.0.2:53: no such host
[*     ] A start job is running for Ignition (disks) (9s / no limit)
[**    ] A start job is running for Ignition (disks) (10s / no limit)
[***   ] A start job is running for Ignition (disks) (10s / no limit)
[   14.154469] ignition[574]: GET https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: attempt #6
[   14.165692] ignition[574]: GET error: Get https://api.jialiu-upi2.qe1.devcluster.openshift.com:22623/config/master: dial tcp: lookup api.jialiu-upi2.qe1.devcluster.openshift.com on 10.0.0.2:53: no such host

Going to Route 53, I confirmed that no 'api' record was added, only 'api-int'. Comparing with a successful IPI install, both 'api' and 'api-int' are created there. After adding the 'api' record by hand in Route 53, the error message disappeared. So I think the root cause is https://github.com/openshift/installer/pull/1649#discussion_r278510894.
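For reference, a rough sketch of adding the missing 'api' record by hand with the AWS CLI; the hosted zone ID and target value are placeholders (an alias record pointing at the API load balancer would also work):

# aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch '{"Changes": [{"Action": "CREATE", "ResourceRecordSet": {"Name": "api.jialiu-upi2.qe1.devcluster.openshift.com.", "Type": "CNAME", "TTL": 60, "ResourceRecords": [{"Value": "<api-elb-dns-name>"}]}}]}'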


2. After working around the 1st issue, I hit a 2nd issue. Copied from the PR:
'''
If you want to select target subnets (or instance profile, security group, etc.) by ID or by using a different tag, adjust that as well.
'''
This sentence looks ambiguous: as a user, even if I strictly follow the doc and the example CF templates, I would not know how to edit these fields to make things work.
I also checked the PR; it seems to be trying to make the example more accurate, so that the user only needs to edit 'replicas' to '1' with no extra steps. So I think we could make some further enhancements here to smooth this out for the user.

Here is the default machineset after install:
      providerSpec:
        value:
          ami:
            id: ami-07e0e0e0035b5a3fe
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
          - ebs:
              iops: 0
              volumeSize: 120
              volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: jialiu-upi2-rlv4l-worker-profile
          instanceType: m4.large
          kind: AWSMachineProviderConfig
          metadata:
            creationTimestamp: null
          placement:
            availabilityZone: us-east-2a
            region: us-east-2
          publicIp: null
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - jialiu-upi2-rlv4l-worker-sg
          subnet:
            filters:
            - name: tag:Name
              values:
              - jialiu-upi2-rlv4l-private-us-east-2a
          tags:
          - name: kubernetes.io/cluster/jialiu-upi2-rlv4l
            value: owned
          userDataSecret:
            name: worker-user-data


As of my latest test results:
1. For iamInstanceProfile.id, no change is needed anymore. The PR makes a good improvement here.
2. For subnet, no change is needed anymore. The PR makes a good improvement here.
3. For securityGroups, the user has to manually add a 'Name=jialiu-upi2-rlv4l-worker-sg' tag to 'worker-sg-jialiu-upi2-rlv4l' (the security group name) created in 03_cluster_security.yaml. I think the CF template could add the 'Name' tag upon security group creation, just like the iamInstanceProfile enhancement (see the sketch after this list).
I also tried to select the security group by ID like this:
          securityGroups:
          - filters:
            - name: tag:id
              values:
              - sg-02139eb077cda3624
It does not work; it seems the securityGroups filters only work with tag:Name.
4. From a documentation perspective, ami.id should be called out as editable, so that the user can make sure it matches the RHCOS version of the other machines launched via the CF template parameter.
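As an illustration of item 3, here is a minimal sketch of how the security group resource in 03_cluster_security.yaml could get a 'Name' tag at creation time; the resource and parameter names are assumptions modeled on the templates under discussion, not quotes from the upstream template:

  WorkerSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Cluster Worker Security Group
      VpcId: !Ref VpcId
      Tags:
      - Key: Name
        Value: !Join ["-", [!Ref InfrastructureName, "worker-sg"]]

And for ID-based selection, a sketch of what may work if the awsproviderconfig v1beta1 resource reference supports a direct 'id' field instead of a tag filter (an assumption worth verifying against the provider schema):

          securityGroups:
          - id: sg-02139eb077cda3624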

Comment 12 Johnny Liu 2019-04-25 13:12:46 UTC
Withdrawing issue #1 mentioned in comment 11; my fault, I forgot to change the IgnitionLocation URL accordingly. I will re-run my testing later and update here with any new findings.

@Kathryn, leaving https://github.com/openshift/openshift-docs/pull/14241#pullrequestreview-230620974 for the doc update.

Now mainly focusing on my issue #2 in comment 11.

Comment 15 Alex Crawford 2019-04-25 22:57:52 UTC
api-int is more appropriate for this use. We should never need to serve Ignition configs outside of the cluster.

Comment 16 Kathryn Alexander 2019-04-26 15:53:02 UTC
@Jianlin, I made more updates based on your #11 issue 2.

Is the update for the worker security group ID correct now? https://github.com/openshift/openshift-docs/pull/14241/files#diff-93324577ed8dddfc44da93c402766bfcR95

Comment 17 Johnny Liu 2019-04-28 00:43:39 UTC
Before addressing my comment 11 issue #2, the more important thing is comment 13: is the api-int x509 issue fixed now? If it is not fixed, the user cannot even get a successful install now.

Comment 18 W. Trevor King 2019-04-30 08:17:08 UTC
I've updated my in-flight installer PR [1] to suggest removing the compute MachineSets as well.  That gets us down to a single track (CloudFormation-launched compute nodes) in the docs and simplifies things (e.g. we no longer need the hairy sed for zeroing compute replicas; see the sketch below).  Folks are still free to opt in to the machine API if they want, but that will be up to docs outside the installer repository (e.g. those in flight with [2] or in openshift/openshift-docs).  I'll leave this on the installer team until installer#1649 lands, but then I think it should transition to the Documentation or Cloud Compute components to sort out any remaining issues between the machine-API operator and UPI environments.  Possibly in a new bug, if that helps condense any findings from earlier in this bug.

[1]: https://github.com/openshift/installer/pull/1649#issuecomment-487856424
[2]: https://github.com/openshift/machine-api-operator/pull/306
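For context, the "hairy sed" refers to zeroing the compute replicas in the installer-generated MachineSet manifests before bootstrapping; a rough sketch of the kind of commands involved (the install-dir path and file name pattern are illustrative of the installer's manifest layout, not quoted from the docs):

$ sed -i 's/replicas: [0-9]*/replicas: 0/' <install-dir>/openshift/99_openshift-cluster-api_worker-machineset-*.yaml

With the suggested change, those MachineSet manifests can instead simply be removed:

$ rm -f <install-dir>/openshift/99_openshift-cluster-api_worker-machineset-*.yaml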

Comment 20 Kathryn Alexander 2019-04-30 13:19:30 UTC
Trevor, thanks for the update on the worker MachineSets. Do you have any comments on the "the api-int x509" issue?

Comment 22 W. Trevor King 2019-05-01 17:36:18 UTC
Moving to POST, because installer#1649 is in flight.

Comment 23 W. Trevor King 2019-05-02 22:10:45 UTC
Landed: https://github.com/openshift/installer/pull/1649#event-2315821084

And I used those templates to launch a successful cluster this morning, so there is at least one path that avoids X.509 errors (the templates have also changed since the 30th, so that may have been it).  If MachineSets on UPI are still an issue, I think we want to open a new bug in the Documentation and/or Cloud Compute components to sort them out, because now the installer is out of that business ;).

Comment 25 Siva Reddy 2019-05-07 21:17:22 UTC
Version-Release number of the following components:
4.1.0-0.nightly-2019-05-07-183643

Steps to verify:
1. Go through the steps listed in [1].
2. Verify that there is no guidance for Machine(Set)s in the documentation.
3. Following the docs, install a UPI-on-AWS cluster.

Hence moving this bug to verified state.

Comment 27 errata-xmlrpc 2019-06-04 10:47:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758