Description of problem: IHAC installing OCP 4.5.3 on an AWS private VPC via the IPI installation method. Bootstrapping of the temporary control plane works fine, and only three masters are present at that point. When the installation moves ahead and the machine-api-operator comes up, three additional masters are created along with the three worker nodes.

Version-Release number of selected component (if applicable): 4.5.3

How reproducible:

Steps to Reproduce:
1. Install OCP 4.5.3 on an AWS private VPC with user tags.
2. Observe that the installation shows 6 masters when the machine-api-operator comes up.

Actual results:
Six master machines exist once the machine-api-operator comes up.

Expected results:
The machine-controller should adopt the already-created masters; instead it does not see them, and new masters are created in their place.

Additional info:
Details of the tags, VPC, and instances are in a private note to avoid exposing sensitive data.
The issue is that the user modified the Machine objects prior to install to add custom instance tags. One of the tags was the `Name` field, which resulted in the machine-controller not finding the instance:

items:
- apiVersion: machine.openshift.io/v1beta1
  kind: Machine
  metadata:
    ...
    labels:
      machine.openshift.io/cluster-api-cluster: cluster-id-xyz
    ...
    name: cluster-id-xyz-master-0
    namespace: openshift-machine-api
  spec:
    metadata: {}
    ...
    tags:
    - name: kubernetes.io/cluster/valid-cluster-id-here
      value: owned
    - name: CustomTagOk
      value: CustomValueOk
    - name: Name
      value: custom-value-put-here
    ...

We should protect against this problem in the machine-api. It's unclear to me how these tags got placed on the objects, since the tags do appear in AWS, so perhaps the installer is honoring them from elsewhere? I'm going to assign this to the installer team to investigate how this might happen. I see the tfstate file shows instances with a Name tag carrying values that shouldn't be there.
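For illustration only, here is a minimal Go sketch (not the actual cluster-api-provider-aws code) of a tag-based instance lookup, under the assumption that existing instances are matched by a Name tag equal to the Machine object's name plus the cluster ownership tag. The function name findMachineInstance and the exact filter keys are assumptions made for the example; the point is that overriding the Name tag makes such a query return nothing, so the controller would go on to create a duplicate instance.

// Illustration only: a sketch of a tag-based EC2 instance lookup. The real
// machine-api provider logic is more involved; the filter keys used here are
// assumptions for the example.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// findMachineInstance (hypothetical name) looks up EC2 instances carrying both
// the cluster ownership tag and a Name tag equal to the Machine object's name.
// If a user tag overrides Name (e.g. "custom-value-put-here"), this query
// returns nothing, and a controller relying on it would create a new instance.
func findMachineInstance(svc *ec2.EC2, clusterID, machineName string) ([]*ec2.Instance, error) {
	out, err := svc.DescribeInstances(&ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			{Name: aws.String("tag:kubernetes.io/cluster/" + clusterID), Values: []*string{aws.String("owned")}},
			{Name: aws.String("tag:Name"), Values: []*string{aws.String(machineName)}},
		},
	})
	if err != nil {
		return nil, err
	}
	var instances []*ec2.Instance
	for _, r := range out.Reservations {
		instances = append(instances, r.Instances...)
	}
	return instances, nil
}

func main() {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)
	instances, err := findMachineInstance(svc, "cluster-id-xyz", "cluster-id-xyz-master-0")
	if err != nil {
		fmt.Println("describe failed:", err)
		return
	}
	fmt.Printf("found %d matching instance(s)\n", len(instances))
}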
Moving this to installer. machine-api should also protect against this, but the root of this particular case is the installer IMO.
Jira ticket for tracking machine-api work: https://issues.redhat.com/browse/OCPCLOUD-934
Based on discussions with Michael, he recommends that we add validations like the following (a minimal sketch of such a check follows below):
1. platform.aws.userTags should not allow the `Name` key, as that can prevent the master machines from getting adopted.
2. The same field should also not allow any `kubernetes.io/clustername/*` keys.
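A minimal sketch of what such a userTags validation could look like, assuming the tags arrive as a simple map of strings; the function name validateUserTags and the error wording are illustrative only, not the installer's actual implementation. The sketch uses the `kubernetes.io/cluster/` prefix, which is the prefix exercised by the verification attempt later in this bug; the exact prefix is questioned in the next comment.

// Illustration only: an install-config style check (not the installer's actual
// code) that rejects user tags which interfere with machine adoption.
package main

import (
	"fmt"
	"strings"
)

// validateUserTags (hypothetical name) returns one error per disallowed key:
// the reserved "Name" key and any key under the kubernetes.io/cluster/ prefix.
func validateUserTags(userTags map[string]string) []error {
	var errs []error
	for key := range userTags {
		if key == "Name" {
			errs = append(errs, fmt.Errorf("platform.aws.userTags[%s]: Name is reserved and may not be set as a user tag", key))
		}
		if strings.HasPrefix(key, "kubernetes.io/cluster/") {
			errs = append(errs, fmt.Errorf("platform.aws.userTags[%s]: keys with the kubernetes.io/cluster/ prefix are reserved for the cluster", key))
		}
	}
	return errs
}

func main() {
	tags := map[string]string{
		"CustomTagOk":                    "CustomValueOk",
		"Name":                           "custom-value-put-here",
		"kubernetes.io/cluster/yunjiang": "yunjiang",
	}
	for _, err := range validateUserTags(tags) {
		fmt.Println(err)
	}
}

Running this prints one error for the Name tag and one for the cluster-prefixed tag, which is the behavior the two recommendations above ask for.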
Hello Abhinav, from the PR and your comment 7 above, it seems like keys with the prefix `kubernetes.io/clustername/` were blocked. I'm not sure whether it should instead be `kubernetes.io/cluster/`; just double-checking. Thanks.
verified. FAILED.

>> error:
time="2020-08-18T11:40:02Z" level=info msg="API v1.19.0-rc.2+99cb93a-dirty up"
time="2020-08-18T11:40:02Z" level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
time="2020-08-18T12:10:02Z" level=info msg="Pulling debug logs from the bootstrap machine"
time="2020-08-18T12:10:10Z" level=debug msg="error: error executing jsonpath \"{range .items[*]}{.metadata.name}{\\\"\\\\n\\\"}{end}\": Error executing template: not in range, nothing to end. Printing more information for debugging the template:"
time="2020-08-18T12:10:12Z" level=debug msg="error: error executing jsonpath \"{range .items[*]}{.metadata.name}{\\\"\\\\n\\\"}{end}\": Error executing template: not in range, nothing to end. Printing more information for debugging the template:"
time="2020-08-18T12:10:13Z" level=debug msg="Collecting info from 10.0.87.51"
time="2020-08-18T12:10:13Z" level=debug msg="Collecting info from 10.0.63.41"
time="2020-08-18T12:10:13Z" level=debug msg="Collecting info from 10.0.76.245"
time="2020-08-18T12:10:14Z" level=info msg="Bootstrap gather logs captured here \"/home/ec2-user/46/yunjiang-bz209fix6/log-bundle-20200818121002.tar.gz\""

>> install-config
<--snip-->
platform:
  aws:
    region: us-east-2
    userTags:
      kubernetes.io/cluster/yunjiang: yunjiang
    subnets:
    - subnet-0e96ec3d5f40e7afc
    - subnet-02ae90227c72b06fb
    - subnet-03e313b800882f9e7
<--snip-->

Attached install log and bootstrap logs.
Created attachment 1711799 [details] install log and log-bundle
verified. PASS. version: 4.6.0-0.nightly-2020-08-25-234625
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days