Description of problem:
All OpenShift cluster resources should carry some sort of Kubernetes cluster ID on AWS. Unfortunately, this is not documented in detail either upstream or in OpenShift.
The basic expectation of the AWS cloud provider is that each node has a tag with the key "KubernetesCluster" and a value that represents the cluster ID. In addition, the same cluster ID is also configured in the cloud provider config file (aws.conf).
Once the above expectation is met, all AWS resources (such as volumes, security groups, and load balancers) created on demand by OpenShift automatically get the above KubernetesClusterTag tag.
Untagged clusters can result in broken storage and load balancer features if more than one cluster is present in the same region, so tagging the cluster using the aforementioned mechanism is strongly advised. There is a discussion upstream about making this mandatory and exiting with a panic error if KubernetesClusterTag is not configured.
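To make the tagging expectation concrete, here is a minimal Python sketch of how one might check whether a node's tags carry a cluster ID. The "KubernetesCluster" key is the one described above; the "kubernetes.io/cluster/" key prefix is the form named in the controller log messages quoted further down in this bug. The function name, the input shape (a plain dict of tag key to tag value), and the order in which the two forms are checked are my own assumptions; this is not the cloud provider's actual lookup code.

```python
LEGACY_TAG = "KubernetesCluster"
NEW_TAG_PREFIX = "kubernetes.io/cluster/"


def find_cluster_id(tags):
    """Return the cluster ID found in a node's tags, or None if untagged.

    `tags` is a plain dict of tag key -> tag value (hypothetical input shape).
    """
    # Prefix form: the cluster ID is the suffix of the tag key itself.
    for key in sorted(tags):
        if key.startswith(NEW_TAG_PREFIX) and len(key) > len(NEW_TAG_PREFIX):
            return key[len(NEW_TAG_PREFIX):]
    # Legacy form: the cluster ID is the value of the "KubernetesCluster" tag.
    value = tags.get(LEGACY_TAG)
    return value if value else None
```

A node for which this returns None is what this report calls untagged.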
Currently, clusters deployed using openshift-ansible-ops are untagged. In a few cases, we even have more than one cluster running in the same region, and occasionally we get PV creation bugs too.
This bug has two parts:
1. Any new cluster created using openshift-ansible-ops should have the tag. We need to update the Ansible recipe to do that.
2. For existing clusters, or clusters that are going through an upgrade, we need to retroactively tag the nodes and update aws.conf. The only problem with retroactively tagging nodes is that AWS resources already created on demand (such as volumes, load balancers, and security groups) remain untagged. We need to find all AWS resources created by OpenShift and tag them as well:
- EBS volumes should be tagged. This is relatively easy to do: we can get all PVs and tag the EBS volumes using an Ansible script.
- Load balancers should be tagged.
- Security Groups should be tagged.
I am not very familiar with the networking area, and I need some help figuring out how to find all the networking AWS resources that need to be tagged.
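For the EBS part, a rough sketch of the retroactive tagging could look like the following. Note that it uses Python with boto3 rather than the Ansible script proposed above, purely for illustration. It assumes the PV list is the JSON printed by `oc get pv -o json`, that EBS-backed PVs carry their volume under spec.awsElasticBlockStore.volumeID (either a bare "vol-..." ID or the "aws://<zone>/vol-..." form), and that boto3 is installed with working credentials. The function names are hypothetical, and load balancers and security groups are not covered.

```python
import json


def ebs_volume_ids(pv_list_json):
    """Extract EBS volume IDs from the JSON printed by `oc get pv -o json`."""
    volume_ids = []
    for pv in json.loads(pv_list_json).get("items", []):
        ebs = pv.get("spec", {}).get("awsElasticBlockStore")
        if not ebs:
            continue  # not an EBS-backed PV
        # Keep only the last path segment, which is the "vol-..." ID.
        volume_ids.append(ebs["volumeID"].rsplit("/", 1)[-1])
    return volume_ids


def cluster_tag(cluster_id):
    """Build the tag list in the shape the EC2 API expects."""
    return [{"Key": "KubernetesCluster", "Value": cluster_id}]


def tag_volumes(volume_ids, cluster_id):
    """Apply the cluster tag to the given volumes (assumes boto3 is set up)."""
    if not volume_ids:
        return  # nothing to tag
    import boto3  # imported lazily so the helpers above work without boto3

    ec2 = boto3.client("ec2")
    ec2.create_tags(Resources=volume_ids, Tags=cluster_tag(cluster_id))
```

Something like `tag_volumes(ebs_volume_ids(pv_json), "my-cluster-id")` would then tag every EBS volume referenced by a PV, where "my-cluster-id" is a placeholder for the value already used on the nodes.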
See also: https://bugzilla.redhat.com/show_bug.cgi?id=1365398
This bug needs to be verified on AWS with version 128. The latest version in the mirror repo is still 127, so we need to wait for the mirror repo to sync.
Verified on openshift v3.7.0-0.146.0
[root@ip-172-18-8-9 ~]# openshift version
Steps to verify:
1. Set up a cluster environment in AWS and enable the cloud provider
2. Remove the "KubernetesCluster" tag from the instance, then restart atomic-openshift-master-controllers
# systemctl restart atomic-openshift-master-controllers.service
3. Check atomic-openshift-master-controllers logs
Oct 10 05:07:30 ip-172-18-8-9.ec2.internal atomic-openshift-master-controllers: E1010 05:07:30.537355 69790 tags.go:94] Tag "KubernetesCluster" nor "kubernetes.io/cluster/..." not found; Kubernetes may behave unexpectedly.
Oct 10 05:07:30 ip-172-18-8-9.ec2.internal atomic-openshift-master-controllers: W1010 05:07:30.537373 69790 tags.go:78] AWS cloud - no clusterID filtering applied for shared resources; do not run multiple clusters in this AZ.
Oct 10 05:07:30 ip-172-18-8-9.ec2.internal atomic-openshift-master-controllers: F1010 05:07:30.537423 69790 controllermanager.go:179] error building controller context: no ClusterID Found. A ClusterID is required for the cloud
Oct 10 05:07:30 ip-172-18-8-9.ec2.internal systemd: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a
Oct 10 05:07:30 ip-172-18-8-9.ec2.internal systemd: Unit atomic-openshift-master-controllers.service entered failed state.
Oct 10 05:07:30 ip-172-18-8-9.ec2.internal systemd: atomic-openshift-master-controllers.service failed.
4. Set allow-untagged-cloud=true in /etc/origin/master/master-config.yaml
5. Check the controller log again: there are some warnings, but the controller starts successfully
Oct 10 05:12:30 ip-172-18-8-9 atomic-openshift-master-controllers: E1010 05:12:30.541450 70438 tags.go:94] Tag "KubernetesCluster" nor "kubernetes.io/cluster/..." not found; Kubernetes may behave unexpectedly.
Oct 10 05:12:30 ip-172-18-8-9 atomic-openshift-master-controllers: W1010 05:12:30.541471 70438 tags.go:78] AWS cloud - no clusterID filtering applied for shared resources; do not run multiple clusters in this AZ.
Oct 10 05:12:30 ip-172-18-8-9 atomic-openshift-master-controllers: W1010 05:12:30.541522 70438 controllermanager.go:422] detected a cluster without a ClusterID. A ClusterID will be required in the future. Please tag your cluster to avoid any future issues
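If this check needs to be repeated on other builds, the two outcomes above can be told apart by the message text alone. The following is a small Python sketch of that idea; the marker strings are copied from the log lines in steps 3 and 5, while the function name and the returned labels are made up for illustration.

```python
# Marker substrings taken from the controller log lines quoted above.
FATAL_MARKER = "no ClusterID Found"
WARNING_MARKER = "detected a cluster without a ClusterID"


def classify_controller_log(lines):
    """Classify controller log lines by how the missing ClusterID was handled.

    Returns "fatal" if the controller refused to start, "untagged-allowed" if
    it only warned (the behaviour seen with allow-untagged-cloud=true), and
    "ok" if neither message appears.
    """
    text = "\n".join(lines)
    if FATAL_MARKER in text:
        return "fatal"
    if WARNING_MARKER in text:
        return "untagged-allowed"
    return "ok"
```

The lines could come from the journal for the atomic-openshift-master-controllers unit, split on newlines before being passed in.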
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.