Bug 1862209 - master machines are newly created even when 3 masters are already created [NEEDINFO]
Summary: master machines are newly created even when 3 masters are already created
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.5
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Abhinav Dahiya
QA Contact: Yunfei Jiang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-30 18:10 UTC by Jatan Malde
Modified: 2021-01-20 12:42 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Using platform.aws.userTags to add Name or kubernetes.io/cluster/ tags to resources created by the installer caused machine-api to fail to identify existing control plane machines.
Consequence: Failing to identify the existing control plane machines caused machine-api to create another set of control plane hosts, which created problems with etcd cluster membership.
Fix: The installer no longer allows users to set error-prone tags in platform.aws.userTags.
Result: Users are prevented from adding tags that cause their clusters to have multiple control plane hosts and possibly broken etcd clusters.
Clone Of:
Environment:
Last Closed: 2020-10-27 16:21:22 UTC
Target Upstream Version:
yunjiang: needinfo? (adahiya)


Attachments (Terms of Use)
install log and log-bundle (1.58 MB, application/gzip)
2020-08-19 05:58 UTC, Yunfei Jiang


Links
System ID Private Priority Status Summary Last Updated
Github openshift installer pull 4008 0 None closed Bug 1862209: types/aws: validate Name and kubernetes.io/clustername/* keys are not allowed 2021-01-19 06:40:26 UTC
Github openshift installer pull 4081 0 None closed Bug 1862209: aws: fix validation for user tags 2021-01-19 06:40:26 UTC
Red Hat Knowledge Base (Solution) 5280691 0 None None None 2020-07-31 19:10:05 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:21:49 UTC

Description Jatan Malde 2020-07-30 18:10:54 UTC
Description of problem:

I have a customer installing OCP 4.5.3 in an AWS private VPC using an IPI installation.

Bootstrapping the temporary control plane works fine, and only three masters are seen at that point.

When the installation moves ahead and the machine-api-operator comes up, three additional masters are created along with the three worker nodes.


Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Set up OCP 4.5.3 in an AWS private VPC with user tags
2. The installation shows 6 masters once the machine-api-operator comes up.

Actual results:

Expected results:

The machine-controller is expected to adopt the already created masters. That does not happen here, so new masters are created in their place.

Additional info:

Additional details about the tags, VPC, and instances are in a private note to avoid exposing sensitive data.

Comment 3 Michael Gugino 2020-07-30 18:39:42 UTC
The issue is the user modified the machine objects prior to install to add custom instance tags.  One of the tags was the name field.  This resulted in the machine-controller not finding the instance.

items:
- apiVersion: machine.openshift.io/v1beta1
  kind: Machine
  metadata:
    ...
    labels:
      machine.openshift.io/cluster-api-cluster: cluster-id-xyz
      ...
    name: cluster-id-xyz-master-0
    namespace: openshift-machine-api
  spec:
    metadata: {}
        ...
        tags:
        - name: kubernetes.io/cluster/valid-cluster-id-here
          value: owned
        - name: CustomTagOk
          value: CustomValueOk
        - name: Name
          value: custom-value-put-here
...

We should protect against this problem in the machine-api.  It's unclear to me how these tags got placed on the objects as the tags do appear in AWS, so perhaps the installer is honoring them from elsewhere?

I'm going to assign this to the installer team for investigation on how this might happen.  I see the tfstate file shows instances whose Name tag has values that shouldn't be there.
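
The failure mode described in this comment can be modelled roughly as follows. This is a minimal sketch and not the machine-api code: it assumes the controller locates an existing instance by matching the instance's `Name` tag against the Machine name, and that user-supplied tags override the default ones. The helpers `merge_tags` and `find_instance`, the dictionary layout, and the instance IDs are all hypothetical.

```python
# Illustrative model only. Assumption: an instance is located by its Name tag,
# and user-supplied tags override the defaults when both set the same key.

def merge_tags(default_tags, user_tags):
    """Return the default tags with user tags applied on top (assumed behaviour)."""
    merged = dict(default_tags)
    merged.update(user_tags)
    return merged


def find_instance(instances, machine_name):
    """Return the instances whose Name tag equals the Machine name."""
    return [i for i in instances if i["tags"].get("Name") == machine_name]


machine_name = "cluster-id-xyz-master-0"
default_tags = {
    "kubernetes.io/cluster/valid-cluster-id-here": "owned",
    "Name": machine_name,
}

# Only a harmless custom tag was added: the Name tag still matches.
ok_instance = {
    "id": "i-hypothetical-1",
    "tags": merge_tags(default_tags, {"CustomTagOk": "CustomValueOk"}),
}

# A custom Name tag was added: the Name tag no longer matches the Machine.
bad_instance = {
    "id": "i-hypothetical-2",
    "tags": merge_tags(default_tags, {"Name": "custom-value-put-here"}),
}

found_ok = find_instance([ok_instance], machine_name)
found_bad = find_instance([bad_instance], machine_name)
```

Under this model `found_bad` is empty, so a controller would conclude that the Machine has no backing instance and create a new one, which is consistent with the extra masters reported above.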

Comment 4 Michael Gugino 2020-07-30 18:40:44 UTC
Moving this to installer.  machine-api should also protect against this, but the root of this particular case is the installer IMO.

Comment 5 Michael Gugino 2020-07-30 18:48:56 UTC
Jira ticket for tracking machine-api work: https://issues.redhat.com/browse/OCPCLOUD-934

Comment 7 Abhinav Dahiya 2020-07-31 16:57:39 UTC
Based on discussions with Michael, he recommends that we add validations such as:

1. platform.aws.userTags should not allow the `Name` key, as that can prevent the master machines from getting adopted.
2. the same field should also not allow any `kubernetes.io/clustername/*` keys.
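
The two validations above could be sketched as below. This is an illustration in Python, not the installer's implementation: `validate_user_tags`, its parameters, and the error message wording are hypothetical, and the forbidden prefix is passed in by the caller, here copied from this comment as written.

```python
def validate_user_tags(user_tags, forbidden_keys, forbidden_prefixes):
    """Return a list of error strings for disallowed userTags keys.

    Hypothetical helper: the forbidden keys and prefixes are supplied by the
    caller, so this sketch does not decide which values are the right ones.
    """
    errors = []
    for key in sorted(user_tags):
        if key in forbidden_keys:
            errors.append(f"userTags[{key}]: key is not allowed")
        elif any(key.startswith(prefix) for prefix in forbidden_prefixes):
            errors.append(f"userTags[{key}]: keys with this prefix are not allowed")
    return errors


# Values taken from this comment; treat them as inputs, not as confirmed rules.
FORBIDDEN_KEYS = {"Name"}
FORBIDDEN_PREFIXES = ("kubernetes.io/clustername/",)

errors = validate_user_tags(
    {"Name": "custom-value-put-here", "CustomTagOk": "CustomValueOk"},
    FORBIDDEN_KEYS,
    FORBIDDEN_PREFIXES,
)
no_errors = validate_user_tags(
    {"CustomTagOk": "CustomValueOk"},
    FORBIDDEN_KEYS,
    FORBIDDEN_PREFIXES,
)
```

Returning every error instead of stopping at the first lets a user fix all offending tags in one pass over the install-config.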

Comment 10 Yunfei Jiang 2020-08-18 11:46:55 UTC
Hello Abhinav,

From the PR and your comment 7 above, it seems that keys with the prefix `kubernetes.io/clustername/` are blocked. I'm not sure whether it should be `kubernetes.io/cluster/` instead, so I'd like to double-check.

Thanks.

Comment 11 Yunfei Jiang 2020-08-19 05:57:49 UTC
verified. FAILED.


>> error:

time="2020-08-18T11:40:02Z" level=info msg="API v1.19.0-rc.2+99cb93a-dirty up"
time="2020-08-18T11:40:02Z" level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
time="2020-08-18T12:10:02Z" level=info msg="Pulling debug logs from the bootstrap machine"
time="2020-08-18T12:10:10Z" level=debug msg="error: error executing jsonpath \"{range .items[*]}{.metadata.name}{\\\"\\\\n\\\"}{end}\": Error executing template: not in range, nothing to end. Printing more information for debugging the template:"
time="2020-08-18T12:10:12Z" level=debug msg="error: error executing jsonpath \"{range .items[*]}{.metadata.name}{\\\"\\\\n\\\"}{end}\": Error executing template: not in range, nothing to end. Printing more information for debugging the template:"
time="2020-08-18T12:10:13Z" level=debug msg="Collecting info from 10.0.87.51"
time="2020-08-18T12:10:13Z" level=debug msg="Collecting info from 10.0.63.41"
time="2020-08-18T12:10:13Z" level=debug msg="Collecting info from 10.0.76.245"
time="2020-08-18T12:10:14Z" level=info msg="Bootstrap gather logs captured here \"/home/ec2-user/46/yunjiang-bz209fix6/log-bundle-20200818121002.tar.gz\""

>> install-config

<--snip-->
platform:
  aws:
    region: us-east-2
    userTags:
      kubernetes.io/cluster/yunjiang: yunjiang
    subnets:
    - subnet-0e96ec3d5f40e7afc
    - subnet-02ae90227c72b06fb
    - subnet-03e313b800882f9e7
<--snip-->

attached install log and bootstrap logs.
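
As a small check related to the question in Comment 10, the tag key from the install-config above can be compared against both prefixes. This only shows how plain string-prefix matching treats the key. It assumes the validation is a simple prefix test, which the logs here do not confirm, and it does not establish the cause of the bootstrap timeout. The variable names are hypothetical.

```python
# Tag key taken from the install-config in this comment.
tag_key = "kubernetes.io/cluster/yunjiang"

# Prefix as written in comment 7 and in the title of pull 4008.
prefix_clustername = "kubernetes.io/clustername/"
# Prefix suggested in comment 10.
prefix_cluster = "kubernetes.io/cluster/"

# A prefix test against the first spelling does not match the key...
matches_clustername = tag_key.startswith(prefix_clustername)  # False
# ...while a prefix test against the second spelling does.
matches_cluster = tag_key.startswith(prefix_cluster)  # True
```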

Comment 12 Yunfei Jiang 2020-08-19 05:58:55 UTC
Created attachment 1711799 [details]
install log and log-bundle

Comment 14 Yunfei Jiang 2020-08-26 06:41:07 UTC
verified. PASS.
version: 4.6.0-0.nightly-2020-08-25-234625

Comment 16 errata-xmlrpc 2020-10-27 16:21:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

