Bug 1697236 - [upi-on-aws] subnet created by cloudformation is missing "kubernetes.io/cluster/cluster_name" tag
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.1.0
Assignee: Stephen Cuppett
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-04-08 07:25 UTC by Johnny Liu
Modified: 2019-06-04 10:47 UTC (History)
7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:47:08 UTC
Target Upstream Version:


Attachments
subnets tags (11.75 KB, text/plain)
2019-04-15 08:00 UTC, Johnny Liu
no flags Details


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 None None None 2019-06-04 10:47:15 UTC

Description Johnny Liu 2019-04-08 07:25:54 UTC
Description of problem:
Subnets created by 01_vpc.yaml are missing the "kubernetes.io/cluster/cluster_name" tag, which means the ELB for the router service is not created.

Version-Release number of the following components:
4.0.0-0.nightly-2019-04-05-165550

How reproducible:
Always

Steps to Reproduce:
1. Follow https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#configure-router-for-upi-dns to create UPI on aws.
2.
3.

Actual results:
ELB for router is not created.
# oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.110.19   <pending>     80:31992/TCP,443:31684/TCP   3h42m
router-internal-default   ClusterIP      172.30.22.23    <none>        80/TCP,443/TCP,1936/TCP      3h42m

# oc get event -n openshift-ingress
LAST SEEN   TYPE      REASON                       OBJECT                   MESSAGE
3m33s       Normal    EnsuringLoadBalancer         service/router-default   Ensuring load balancer
153m        Warning   CreatingLoadBalancerFailed   service/router-default   Error creating load balancer (will retry): failed to ensure load balancer for service openshift-ingress/route

Expected results:
ELB for router should be created automatically

Additional info:
After manually adding the "kubernetes.io/cluster/cluster_name=owned" tag to the public subnet created by 01_vpc.yaml, the ELB is created automatically.
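For reference, the manual workaround described above can be applied with the AWS CLI. The subnet ID and cluster name below are placeholders, not values from this bug:

```shell
# Tag the public subnet so the cloud provider considers it when
# provisioning the router ELB.
# subnet-0abc123 and mycluster are placeholder values; substitute your own.
aws ec2 create-tags \
  --resources subnet-0abc123 \
  --tags Key=kubernetes.io/cluster/mycluster,Value=owned
```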

Comment 1 Stephen Cuppett 2019-04-09 11:35:04 UTC
In all the CloudFormation templates, the infrastructure ID is not passed in as a parameter (e.g. "jialiuuuu1-xbhm2": cluster=jialiuuuu1, infrastructure_id=xbhm2). The UPI instructions indicate a few places [1] where the security group, subnet, etc. search tags need to be updated for UPI. We cannot guarantee the customer VPC is set up *exactly* like IPI and should try to avoid requiring it. We need to make sure we call out how to update the operator to find the right things if something like the infrastructure ID is not represented. The instructions and templates omitted this intentionally for this purpose.

[1]: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#option-1-dynamic-compute-using-machine-api

Comment 2 Stephen Cuppett 2019-04-09 11:35:42 UTC
See related BZ1697262.

Comment 3 Johnny Liu 2019-04-09 12:13:14 UTC
(In reply to Stephen Cuppett from comment #1)
> In all the CloudFormation, the infrastructure ID is not passed in as a
> parameter (e.g. "jialiuuuu1-xbhm2", cluster=jialiuuuu1,
> infrastructure_id=xbhm2). The UPI instructions indicate a few places [1]
> where security group, subnet, etc. search tags need updated for UPI. We
> cannot guarantee the customer VPC *exactly* set like IPI and should try to
> avoid requiring it. We need to make sure we call out how to update the
> operator to find the right things if something like the infrastructure ID is
> not represented. The instructions & templates left omitted this
> intentionally for this purpose.
> 
> [1]:
> https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.
> md#option-1-dynamic-compute-using-machine-api

I totally understand that the infrastructure_id is not easy to pass in upon initial CloudFormation stack creation. But from a customer's view, I do not think a user could get a working cluster by following the steps in this doc. The doc never asks the user to update the subnet tags to include the infrastructure_id, and never explains where to get the infrastructure_id. And I do not think the ELB for the ingress router would be provisioned successfully even if the user updates the security group, subnet, etc. search tags for UPI following [1] (and that is only one option; the user may also take option 2).

And worse, [1] does not work for me; refer to BZ#1697968. In my testing, I had to follow option 2 [2].

[1]: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#option-1-dynamic-compute-using-machine-api
[2]: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#option-2-manually-launching-worker-instances

> We need to make sure we call out how to update the
> operator to find the right things if something like the infrastructure ID is
> not represented.
Yeah, agreed. The operator should be wiser and take UPI installs into consideration, not only IPI.
At a minimum, users need a thorough doc guiding them step by step through updating every resource, especially those tags. Of course, if I were a customer, I would insist the operator do everything automatically without human interaction.

Comment 5 Stephen Cuppett 2019-04-09 16:40:40 UTC
Looking at the original defect text more closely, the expected/experienced behavior is around ingress. We do not add tags to the subnets in the current VPC template (because we want to be able to reuse the VPC and subnets or leverage existing ones). We need to document in [1] how we make the filter match and find the appropriate subnets. For multiple clusters, can multiple tags of this form be created? The default template need not be used (an existing VPC may be), so we can add a tag, but we'd still need to document the different ways these can be made to match up.

[1]: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#configure-router-for-upi-dns

Comment 6 W. Trevor King 2019-04-09 16:57:19 UTC
> We do not add tags to the subnets in the current vpc template...

Easiest way to handle this for UPI is probably configuring the router/ingress ELB's subnet by ID.  But searching through openshift/api/config/v1, I don't see any subnet knobs.  Should we grow some?  Where is the code creating the router/ingress ELB anyway?

Comment 7 W. Trevor King 2019-04-09 17:52:38 UTC
> Where is the code creating the router/ingress ELB anyway?

Abhinav pointed out that this is core Kubernetes itself fulfilling the Route (with [1], or the version of that code that still lives in k8s.io/kubernetes).  So yeah, you're going to need to tag your subnet with 'kubernetes.io/cluster/{}' to match however you tag your instances (I think the instance tags are how the cloud provider discovers the cluster name to use).  I'm going to make it easier to get the installer-generated infrastructure name out of our asset store, but folks who want to use a different infrastructure name will probably need to run a multi-step install and adjust their Machine(Set)s and other resources to fill in their alternative infrastructure name.

[1]: https://github.com/kubernetes/cloud-provider-aws/blob/1448c509b4fbc7b7ac6d666225eca7fb951cf644/pkg/cloudprovider/providers/aws/tags.go#L30-L34
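On the point above about getting the installer-generated infrastructure name out of the asset store: one way to recover it, assuming the asset-directory layout current at the time (an `infraID` field in metadata.json) and that `jq` is available, is:

```shell
# Print the installer-generated infrastructure name
# (e.g. "jialiuuuu1-xbhm2") from the installer's asset directory.
jq -r .infraID metadata.json
```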

Comment 13 W. Trevor King 2019-04-11 18:54:29 UTC
Here's how I'm handling this in the WIP CI work [1,2,3]. Passing '--tags' tags both the stack and the resources created while fulfilling the stack, so everything is set up for the cluster to consume and for 'openshift-install destroy cluster' to remove (once [4] lands).

[1]: https://github.com/openshift/release/commit/799fe4a7ba0388befed7bc8fd6a106e37fc8a7c8#diff-2b1b845b92f8062711789a2bfdb27290R287
[2]: https://github.com/openshift/release/commit/799fe4a7ba0388befed7bc8fd6a106e37fc8a7c8#diff-2b1b845b92f8062711789a2bfdb27290R298
[3]: https://github.com/openshift/release/pull/3440
[4]: https://github.com/openshift/installer/pull/1595
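The '--tags' approach looks roughly like the following sketch; the stack name, template path, and infrastructure name are placeholders, not the exact CI invocation:

```shell
# Create the VPC stack with a cluster tag. CloudFormation propagates
# stack-level tags to the resources it creates, including the subnets.
# "jialiuuuu1-xbhm2" is a placeholder infrastructure name.
aws cloudformation create-stack \
  --stack-name jialiuuuu1-xbhm2-vpc \
  --template-body file://01_vpc.yaml \
  --tags Key=kubernetes.io/cluster/jialiuuuu1-xbhm2,Value=owned
```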

Comment 14 Stephen Cuppett 2019-04-11 19:00:44 UTC
Have added needed subnet tags and doc hit commit to: https://github.com/openshift/installer/pull/1590/

Comment 15 Johnny Liu 2019-04-12 08:08:13 UTC
(In reply to Stephen Cuppett from comment #14)
> Have added needed subnet tags and doc hit commit to:
> https://github.com/openshift/installer/pull/1590/

In this PR, for the subnet tag issue, the private subnets created by the CF stack are tagged with 'kubernetes.io/role/internal-elb=""'. I have some questions about it.
1. Will the newly added tag resolve the router/ingress ELB provisioning issue on its own, with no need for the "kubernetes.io/cluster/<cluster_name>=owned" tag?
2. According to my initial test report, the "kubernetes.io/cluster/<cluster_name>=owned" tag needs to be added to the *public* subnet, not the *private* subnet.
3. Or does the newly added 'kubernetes.io/role/internal-elb=""' tag not resolve anything, so the user has to use the '--tags' parameter with "kubernetes.io/cluster/<InfrastructureName>=owned" to tag both the stack and the resources created while fulfilling it, to meet the router/ingress operator's ELB provisioning prerequisite?

Comment 16 Stephen Cuppett 2019-04-12 10:02:06 UTC
There is a change in 02_cluster_infra.yaml as well to add the "kubernetes.io/cluster/<cluster_name>=shared" tag. Together, these changes make the subnet tagging similar to IPI.
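The resulting subnet tags can be checked with something like the following; the VPC ID is a placeholder:

```shell
# List the tags on each subnet in the cluster's VPC to confirm the
# kubernetes.io/cluster/<InfrastructureName> and role tags are present.
# vpc-0abc123 is a placeholder VPC ID.
aws ec2 describe-subnets \
  --filters Name=vpc-id,Values=vpc-0abc123 \
  --query 'Subnets[].{Subnet:SubnetId,Tags:Tags}'
```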

Comment 17 Johnny Liu 2019-04-15 07:58:39 UTC
The PR is merged now. I re-tested this bug with 4.0.0-0.nightly-2019-04-10-182914, and it still does not work.


Followed the doc to create UPI-on-AWS (note: I did not add the '--tags' parameter with "kubernetes.io/cluster/<InfrastructureName>=owned" when creating the CF stack).
After installation, the ingress router ELB is still not provisioned.
# oc get event -n openshift-ingress
LAST SEEN   TYPE      REASON                       OBJECT                   MESSAGE
12s         Normal    EnsuringLoadBalancer         service/router-default   Ensuring load balancer
115m        Warning   CreatingLoadBalancerFailed   service/router-default   Error creating load balancer (will retry): failed to ensure load balancer for service openshift-ingress/router-default: could not find any suitable subnets for creating the ELB


Got the subnet tags; found no "kubernetes.io/cluster/<InfrastructureName>=owned" tag on the public subnets, so the initial issue still reproduces.

Comment 18 Johnny Liu 2019-04-15 08:00:39 UTC
Created attachment 1555156 [details]
subnets tags

Comment 19 Johnny Liu 2019-04-15 08:11:09 UTC
And one more thing needs to be highlighted: the instance created by the CF template is tagged with "kubernetes.io/cluster/jialiu-upi1=owned", but the created subnet is tagged with "kubernetes.io/cluster/jialiu-upi1-pgkgh=shared", so the tag key names are inconsistent.

Comment 20 Stephen Cuppett 2019-04-15 11:36:02 UTC
You should see "kubernetes.io/cluster/<InfrastructureName>=shared", not owned. I do see it in the attachment.  The instance is "owned", the subnet is "shared", and these tags represent the correct state.

If the functionality does not work with this tag, can you open a separate bug? Thanks!

Comment 21 Stephen Cuppett 2019-04-15 14:03:47 UTC
Also submitted https://github.com/openshift/installer/pull/1620 to ensure the InfrastructureName tag is used on instances as well, which should improve consistency and perhaps fix some of these failures. Will leave the bug in MODIFIED for the moment; please retest after this pull lands.

Comment 23 Johnny Liu 2019-04-16 03:02:15 UTC
(In reply to Stephen Cuppett from comment #20)
> You should see "kubernates.io/cluster/<InfrastructureName>=shared", not
> owned. I do see it in the attachment.  The instance is "owned", the subnet
> is "shared" and these tags represent the correct state.
> 
> If the functionality does not work with this tag, can you open a separate
> bug? Thanks!

Actually, per my test results, I think ingress ELB provisioning expects the public subnet to be tagged with "kubernetes.io/cluster/<clustername>=<owned|shared>", because the instances the router is running on are tagged with "kubernetes.io/cluster/<clustername>". The key point here is the tag name, not the tag value; I think the tag-name inconsistency between the instances and the subnets created by the CF templates causes the ELB provisioning failure.

(In reply to Stephen Cuppett from comment #21)
> Also submitted https://github.com/openshift/installer/pull/1620 to ensure
> the InfrastructureName tag is used on instances as well to hopefully improve
> consistency and perhaps some of these features. Will leave bug in MODIFIED
> for the moment, please retest after this pull lands.

Yeah, this should be the root fix for this kind of failure.

Comment 24 Johnny Liu 2019-04-16 07:26:31 UTC
The PR is merged. I ran verification using the latest CF templates against 4.0.0-0.nightly-2019-04-10-182914, and it PASSED.


instances are tagged with "kubernetes.io/cluster/jialiu-upi2-ml7bx=owned".
private subnets are tagged with "kubernetes.io/cluster/jialiu-upi2-ml7bx=shared" and "kubernetes.io/role/internal-elb=''".
public subnets are tagged with "kubernetes.io/cluster/jialiu-upi2-ml7bx=shared".


In such env, ELB for ingress router is provisioned successfully.

Comment 26 errata-xmlrpc 2019-06-04 10:47:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

