Description of problem:
Subnets created by 01_vpc.yaml are missing the "kubernetes.io/cluster/<cluster_name>" tag, which prevents the ELB for the router service from being created.

Version-Release number of the following components:
4.0.0-0.nightly-2019-04-05-165550

How reproducible: Always

Steps to Reproduce:
1. Follow https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#configure-router-for-upi-dns to create a UPI cluster on AWS.
2.
3.

Actual results:
The ELB for the router is not created.

# oc get svc -n openshift-ingress
NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
router-default            LoadBalancer   172.30.110.19   <pending>     80:31992/TCP,443:31684/TCP   3h42m
router-internal-default   ClusterIP      172.30.22.23    <none>        80/TCP,443/TCP,1936/TCP      3h42m

# oc get event -n openshift-ingress
LAST SEEN   TYPE      REASON                       OBJECT                   MESSAGE
3m33s       Normal    EnsuringLoadBalancer         service/router-default   Ensuring load balancer
153m        Warning   CreatingLoadBalancerFailed   service/router-default   Error creating load balancer (will retry): failed to ensure load balancer for service openshift-ingress/route

Expected results:
The ELB for the router should be created automatically.

Additional info:
After manually adding the "kubernetes.io/cluster/<cluster_name>=owned" tag to the public subnets created by 01_vpc.yaml, the ELB is created automatically.
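For illustration, a minimal sketch of where such a tag could live in a 01_vpc.yaml-style CloudFormation template. The resource and parameter names here are hypothetical, not necessarily the template's actual ones:

```yaml
# Hypothetical excerpt from a 01_vpc.yaml-style template; names are illustrative.
Parameters:
  ClusterName:
    Type: String
    Description: Cluster name used in the kubernetes.io/cluster/ tag key.
Resources:
  PublicSubnet:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC          # assumes a VPC resource defined elsewhere in the template
      CidrBlock: 10.0.0.0/20
      Tags:
        # The AWS cloud provider discovers subnets for ELBs via this tag key.
        - Key: !Sub kubernetes.io/cluster/${ClusterName}
          Value: owned
```

With a tag like this on the public subnets, the cloud provider can find suitable subnets when provisioning the router's load balancer.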
In all of the CloudFormation templates, the infrastructure ID is not passed in as a parameter (e.g. for "jialiuuuu1-xbhm2", cluster=jialiuuuu1, infrastructure_id=xbhm2). The UPI instructions indicate a few places [1] where security group, subnet, etc. search tags need to be updated for UPI. We cannot guarantee the customer VPC is set up *exactly* like IPI, and we should try to avoid requiring that. We need to make sure we call out how to update the operator to find the right things if something like the infrastructure ID is not represented. The instructions & templates omitted this intentionally for this purpose.

[1]: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#option-1-dynamic-compute-using-machine-api
See related BZ1697262.
(In reply to Stephen Cuppett from comment #1)
> In all the CloudFormation, the infrastructure ID is not passed in as a
> parameter (e.g. "jialiuuuu1-xbhm2", cluster=jialiuuuu1,
> infrastructure_id=xbhm2). The UPI instructions indicate a few places [1]
> where security group, subnet, etc. search tags need updated for UPI.

I totally understand that the infrastructure_id is not easy to pass in upon the initial CloudFormation stack creation. But from the customer's point of view, I do not think a user can get a working cluster by following the steps in this doc. I did not find any place in the doc that asks the user to update the subnet tags to include the infrastructure_id, or even any place that explains where to get the infrastructure_id. And I do not think the ELB for the ingress router would be provisioned successfully even if the user updates the security group, subnet, etc. search tags for UPI following [1] (and this is only one option; a user may also take option 2). Worse, [1] does not work for me; refer to BZ#1697968. In my testing, I had to follow option 2 [2].

[1]: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#option-1-dynamic-compute-using-machine-api
[2]: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#option-2-manually-launching-worker-instances

> We need to make sure we call out how to update the
> operator to find the right things if something like the infrastructure ID is
> not represented.

Yeah, agree.
The operator should be wiser and take UPI installs into consideration, not only IPI. At a minimum, users need a thorough doc that guides them step by step through updating every resource, especially the tags. Of course, if I were a customer, I would insist that the operator do everything automatically, without human interaction.
Looking at the original defect text more closely, the expected/experienced behavior is around ingress. We do not add tags to the subnets in the current vpc template (because we want to be able to reuse VPC and subnets or leverage existing ones). We need to document in [1] how we make the filter match and find the appropriate subnets.

For multiple clusters, can multiple tags of this form be created? The default template need not be used (existing VPC), so we can add a tag, but we'd still need to document the different ways these can be made to match up.

[1]: https://github.com/openshift/installer/blob/master/docs/user/aws/install_upi.md#configure-router-for-upi-dns
> We do not add tags to the subnets in the current vpc template...

Easiest way to handle this for UPI is probably configuring the router/ingress ELB's subnet by ID. But searching through openshift/api/config/v1, I don't see any subnet knobs. Should we grow some? Where is the code creating the router/ingress ELB anyway?
> Where is the code creating the router/ingress ELB anyway?

Abhinav pointed out that this is core Kubernetes itself fulfilling the Service (with [1], or the version of that code that still lives in k8s.io/kubernetes). So yeah, you're going to need to tag your subnet with 'kubernetes.io/cluster/{}' to match however you tag your instances (I think the instance tags are how the cloud provider discovers the cluster name to use). I'm going to make it easier to get the installer-generated infrastructure name out of our asset store, but folks who want to use a different infrastructure name will probably need to run a multi-step install and adjust their Machine(Set)s and other resources to fill in their alternative infrastructure name.

[1]: https://github.com/kubernetes/cloud-provider-aws/blob/1448c509b4fbc7b7ac6d666225eca7fb951cf644/pkg/cloudprovider/providers/aws/tags.go#L30-L34
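To illustrate the matching requirement, using the infrastructure name that appears elsewhere in this thread as an example: it is the tag *key*, not the value, that must line up between the instances and the subnets so the cloud provider can associate them with the same cluster:

```yaml
# Illustrative tag layout only; jialiu-upi1-pgkgh is an example infrastructure name.
instance_tags:
  kubernetes.io/cluster/jialiu-upi1-pgkgh: owned   # instance belongs to this cluster
subnet_tags:
  kubernetes.io/cluster/jialiu-upi1-pgkgh: shared  # subnet may be shared across clusters
```

If the instances were tagged with the bare cluster name while the subnets used the infrastructure name (or vice versa), the keys would differ and subnet discovery would fail.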
Here's how I'm handling this in the WIP CI work [1,2,3]. The '--tags' parameter tags both the stack and the resources created while fulfilling the stack, so everything is all set up for the cluster to consume and for 'openshift-install destroy cluster' to remove (once [4] lands).

[1]: https://github.com/openshift/release/commit/799fe4a7ba0388befed7bc8fd6a106e37fc8a7c8#diff-2b1b845b92f8062711789a2bfdb27290R287
[2]: https://github.com/openshift/release/commit/799fe4a7ba0388befed7bc8fd6a106e37fc8a7c8#diff-2b1b845b92f8062711789a2bfdb27290R298
[3]: https://github.com/openshift/release/pull/3440
[4]: https://github.com/openshift/installer/pull/1595
Added the needed subnet tags and a doc commit in: https://github.com/openshift/installer/pull/1590/
(In reply to Stephen Cuppett from comment #14)
> Have added needed subnet tags and doc hit commit to:
> https://github.com/openshift/installer/pull/1590/

In this PR, for the subnet tag issue, the private subnets created by the CF stack are tagged with 'kubernetes.io/role/internal-elb=""'. I have some questions about this:

1. Will the newly added tag resolve the router/ingress ELB provisioning issue? Is the "kubernetes.io/cluster/<cluster_name>=owned" tag no longer needed?
2. According to my initial test report, the "kubernetes.io/cluster/<cluster_name>=owned" tag needs to be added to the *public* subnets, not the *private* subnets.
3. Or does the newly added 'kubernetes.io/role/internal-elb=""' tag not resolve anything, so the user still has to use the '--tags' parameter with "kubernetes.io/cluster/<InfrastructureName>=owned" to tag both the stack and the resources created while fulfilling the stack, in order to meet the router/ingress operator's ELB provisioning prerequisite?
There is a change in 02_cluster_infra.yaml as well to add the "kubernetes.io/cluster/<cluster_name>=shared" tag. Both of these changes together make the subnet tagging similar to IPI.
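Sketched together, the two changes leave the CF-created subnets tagged roughly as below. The template structure is simplified, and `InfrastructureName` is assumed to be the parameter carrying the installer-generated name; the actual templates may differ:

```yaml
# Simplified sketch of the subnet tagging after both changes (not the literal templates).
PrivateSubnet:
  Type: AWS::EC2::Subnet
  Properties:
    Tags:
      - Key: !Sub kubernetes.io/cluster/${InfrastructureName}
        Value: shared
      - Key: kubernetes.io/role/internal-elb   # marks the subnet as eligible for internal ELBs
        Value: ""
PublicSubnet:
  Type: AWS::EC2::Subnet
  Properties:
    Tags:
      - Key: !Sub kubernetes.io/cluster/${InfrastructureName}
        Value: shared
```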
The PR is merged now. I re-tested this bug with 4.0.0-0.nightly-2019-04-10-182914, and it still does not work.

I followed the doc to create UPI-on-AWS (note: I did not add the '--tags' parameter with "kubernetes.io/cluster/<InfrastructureName>=owned" when creating the CF stacks). After installation, the ingress router ELB is still not provisioned.

# oc get event -n openshift-ingress
LAST SEEN   TYPE      REASON                       OBJECT                   MESSAGE
12s         Normal    EnsuringLoadBalancer         service/router-default   Ensuring load balancer
115m        Warning   CreatingLoadBalancerFailed   service/router-default   Error creating load balancer (will retry): failed to ensure load balancer for service openshift-ingress/router-default: could not find any suitable subnets for creating the ELB

Checking the subnet tags, there is no "kubernetes.io/cluster/<InfrastructureName>=owned" tag on the public subnets; the initial issue is still reproduced.
Created attachment 1555156 [details] subnets tags
One more thing needs to be highlighted: the instances created by the CF templates are tagged with "kubernetes.io/cluster/jialiu-upi1=owned", but the created subnets are tagged with "kubernetes.io/cluster/jialiu-upi1-pgkgh=shared", so the tag key names are inconsistent.
You should see "kubernetes.io/cluster/<InfrastructureName>=shared", not owned, and I do see it in the attachment. The instance is "owned", the subnet is "shared", and these tags represent the correct state. If the functionality does not work with this tag, can you open a separate bug? Thanks!
Also submitted https://github.com/openshift/installer/pull/1620 to ensure the InfrastructureName tag is used on instances as well, to hopefully improve consistency and perhaps fix some of these features. Will leave the bug in MODIFIED for the moment; please retest after this pull lands.
(In reply to Stephen Cuppett from comment #20)
> You should see "kubernates.io/cluster/<InfrastructureName>=shared", not
> owned. I do see it in the attachment. The instance is "owned", the subnet
> is "shared" and these tags represent the correct state.
>
> If the functionality does not work with this tag, can you open a separate
> bug? Thanks!

Actually, per my test results, I think ingress ELB provisioning expects the public subnets to be tagged with "kubernetes.io/cluster/<clustername>=<owned|shared>", because the instances where the router is running are tagged with "kubernetes.io/cluster/<clustername>". The key point here is the tag name, not the tag value; I think the tag-name inconsistency between the instances and the subnets created by the CF templates is what causes the ELB provisioning failure.

(In reply to Stephen Cuppett from comment #21)
> Also submitted https://github.com/openshift/installer/pull/1620 to ensure
> the InfrastructureName tag is used on instances as well to hopefully improve
> consistency and perhaps some of these features. Will leave bug in MODIFIED
> for the moment, please retest after this pull lands.

Yeah, this should be the root fix for such failures.
The PR is merged. I ran verification using the latest CF templates against 4.0.0-0.nightly-2019-04-10-182914, and it PASSES.

Instances are tagged with "kubernetes.io/cluster/jialiu-upi2-ml7bx=owned".
Private subnets are tagged with "kubernetes.io/cluster/jialiu-upi2-ml7bx=shared" and "kubernetes.io/role/internal-elb=''".
Public subnets are tagged with "kubernetes.io/cluster/jialiu-upi2-ml7bx=shared".

In this environment, the ELB for the ingress router is provisioned successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758