Bug 1913932
Summary: gcp-vpc-move-route, gcp-vpc-move-vip: Failures if the instance is in a shared service project [RHEL 8]

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 8 |
| Component | resource-agents |
| Version | 8.3 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Reporter | Reid Wahl <nwahl> |
| Assignee | Oyvind Albrigtsen <oalbrigt> |
| QA Contact | Brandon Perkins <bperkins> |
| CC | agk, bperkins, cfeist, cluster-maint, fdinitto, jwboyer, michael.varun |
| Keywords | Triaged |
| Target Milestone | rc |
| Target Release | 8.4 |
| Hardware | All |
| OS | Linux |
| Fixed In Version | resource-agents-4.1.1-87.el8 |
| Clones | 1913936 (view as bug list) |
| Type | Bug |
| Deadline | 2021-02-15 |
| Last Closed | 2021-05-18 15:12:05 UTC |
Description (Reid Wahl, 2021-01-07 19:56:54 UTC)
It should be pretty much straightforward if there is a resource attribute "project" and an option to source the service account for that "project". A similar issue was reported for the GCE fence agents in BZ1704348. There have been changes to use the OAuth client; however, we would also request honoring an OS environment variable along with the OAuth client, if there are considerations on how to process authentication for the route move. Let me know if any help, validation, or testing is required; I would be happy to help.

Additional patch to fix the stop action and an incorrect call of a function: https://github.com/ClusterLabs/resource-agents/pull/1609

I'm going to need more information here (especially error messages). I just set up an environment using the instructions at https://cloud.google.com/vpc/docs/provisioning-shared-vpc with the 8.3.0 GA version of the RA (resource-agents-4.1.1-68.el8.x86_64), and I was unable to reproduce any failure. There must be some configuration difference that I'm missing or that isn't captured in the description or comments. My configuration is simply a host project and a service project, where the host project has two VPCs (with a subnet in each), and each instance has two NICs connected to the two subnets. All fencing and IPv4 tests pass with this setup. Any help would be greatly appreciated.
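For context, the shared-VPC provisioning guide linked above essentially amounts to enabling shared VPC on the host project, attaching the service project, and creating the cluster nodes in the service project against subnets that live in the host project's VPC. A minimal sketch of that flow, reusing names that appear later in this bug; the exact commands used for this environment are not recorded here, so treat this as illustrative:

# Illustrative only: enable shared VPC on the host project and attach the
# service project (requires Shared VPC Admin rights at the organization level).
gcloud compute shared-vpc enable rhel-ha-host-project
gcloud compute shared-vpc associated-projects add rhel-ha-service-project \
    --host-project rhel-ha-host-project

# Cluster nodes are then created in the service project, attached to a subnet
# that lives in the host project's VPC (the subnet path is a placeholder).
gcloud compute instances create bperkins84svc-nodea-vm \
    --project rhel-ha-service-project \
    --zone us-east1-b \
    --subnet projects/rhel-ha-host-project/regions/us-east1/subnetworks/<shared-subnet>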
I'm clearing the needinfo flag from myself and leaving it set for Michael, who reported the issue. I never reproduced this one (nothing has changed since comment 0 in that regard). I still need some feedback on this; without the failure case, I can't be sure this is fixed. This also affects the delivery of the fix in bug 1913936, bug 1919343, and bug 1919344.

@Brandon Perkins: I assume you have the VPCs in the host project set up as shared VPCs, and that you have created the Compute Engine instances in the service project that the host project's VPCs are shared with. Also, the gcp-vpc-move-vip resource uses an IP address from a subnet that is not visible to the VPC; this is a technical limitation in GCP, in that you can't have more than two IP addresses from the same subnet on the host.

I assume the use case here considers use of a secondary IP address for the resource agent gcp-vpc-move-vip, while a floating IP address CIDR range is used for gcp-vpc-move-route and gcp-vpc-move-ip. We are more interested in the latter, not the former; the distinction between VIP and IP is confusing, and there are no real scenarios where we run secondary IP ranges unless it's in the container world.

So, technically, let's say you have:

host project: project-network
    VPC: HANA DB with subnet 10.0.0.1/26 and secondary IP address range 192.0.0.1/26

service project: project-db
    Instance A: node001 10.0.0.1/32
    Instance B: node002 10.0.0.2/32
    VIP: 192.0.0.1/32 (floating IP address)

I have a dedicated service account that can create/delete network routes[1]. In the case of the resource agent gcp-vpc-move-route, network routes can be administered only in the host project, since the VPC is in the host project, so my service account will carry the host project details.

Currently, in the script[3], I don't see how the object "ctx.conn" at line 215 is created through authentication (I infer it is OAuth 2.0). The documentation doesn't provide detailed information on the supported authentication/authorization mechanism[2]. A parameterized option service_account would be helpful in case users use service accounts instead of OAuth. The same applies to the script[4], where the project details are determined based on metadata, with no details on the gcloud authentication/authorization.

Let me know if we should hop on a call for a discussion.

[1] https://cloud.google.com/vpc/docs/routes
[2] https://cloud.google.com/docs/authentication
[3] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/gcp-vpc-move-vip.in
[4] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/gcp-vpc-move-ip.in
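To make the example above concrete as a cluster configuration, the gcp-vpc-move-route resource for the floating address would look roughly like the following. This is only a sketch: the ip, vpc_network, and project parameters show up elsewhere in this bug, the route_name parameter and the resource name are assumptions to be checked against the agent's metadata, and whether project should point at the host or the service project is exactly the question this bug works through.

# Sketch only; verify parameter names with "pcs resource describe gcp-vpc-move-route".
pcs resource create hana-floating-route ocf:heartbeat:gcp-vpc-move-route \
    ip=192.0.0.1 \
    vpc_network=<host-project-vpc> \
    project=project-network \
    route_name=hana-floating-route \
    op monitor interval=60s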
(In reply to Michael Varun from comment #8) Correction: the link tagged as [3] refers to the script https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/gcp-vpc-move-route.in, not gcp-vpc-move-vip.in.

I believe I have the projects set up correctly now; at least I could get gcp-vpc-move-route to fail with the GA RA. However, when I upgraded to the Fixed In version for this bug, I noticed two different things.

If I set up the agent with project=rhel-ha-service-project (or without specifying a project argument), we get:

ERROR:gcp-vpc-move-route:VPC network not found

because it sets ctx.vpc_network_url to use rhel-ha-service-project instead of rhel-ha-host-project.

So I then flipped the logic to "solve the problem" and set up the agent with project=rhel-ha-host-project. In this situation the resource starts, and the IP can be pinged from the machine local to the running resource. However, nothing else can get to it.
In looking at the route that was created by the resource we see: =============================== [root@nodea ~]# gcloud-ra --project rhel-ha-host-project compute routes describe bperkins84svc-ip creationTimestamp: '2021-02-09T13:35:00.921-08:00' description: '' destRange: 172.16.132.99/32 id: '3952597249666498283' kind: compute#route name: bperkins84svc-ip network: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/networks/bperkinshost-vpc-protected nextHopInstance: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/zones/us-east1-b/instances/bperkins84svc-nodeb-vm priority: 1000 selfLink: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/routes/bperkins84svc-ip warnings: - code: NEXT_HOP_INSTANCE_NOT_FOUND data: - key: instance value: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/zones/us-east1-b/instances/bperkins84svc-nodeb-vm message: Next hop instance 'https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/zones/us-east1-b/instances/bperkins84svc-nodeb-vm' does not exist. =============================== Notice that the nextHopInstance is pointing at rhel-ha-host-project instead of rhel-ha-service-project indicating that ctx.instance_url is incorrect. Next, I decided to modify the RA to make the network and instance projects hard-coded with what should (presumably) be the correct values. When doing a debug-start of the resource at that point, we get the following: =============================== [root@nodea ~]# pcs resource debug-start gcp-vpc-move-route Operation start for gcp-vpc-move-route (ocf:heartbeat:gcp-vpc-move-route) returned: 'error' (1) INFO:gcp-vpc-move-route:Bringing up the floating IP 172.16.132.99 Traceback (most recent call last): File "/usr/lib/ocf/resource.d/heartbeat/gcp-vpc-move-route", line 464, in <module> main() File "/usr/lib/ocf/resource.d/heartbeat/gcp-vpc-move-route", line 451, in main ip_and_route_start(ctx) File "/usr/lib/ocf/resource.d/heartbeat/gcp-vpc-move-route", line 340, in ip_and_route_start wait_for_operation(ctx, request.execute()) File "/usr/lib/resource-agents/bundled/gcp/google-cloud-sdk/lib/third_party/googleapiclient/_helpers.py", line 131, in positional_wrapper return wrapped(*args, **kwargs) File "/usr/lib/resource-agents/bundled/gcp/google-cloud-sdk/lib/third_party/googleapiclient/http.py", line 852, in execute raise HttpError(resp, content, uri=self.uri) googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/routes?alt=json returned "Invalid value for field 'resource.nextHopInstance': 'https://www.googleapis.com/compute/v1/projects/rhel-ha-service-project/zones/us-east1-b/instances/bperkins84svc-nodea-vm'. Cross project referencing is not allowed for this resource."> =============================== which matches what happens if one tries to do this with the gcloud CLI: =============================== [root@nodea ~]# gcloud-ra compute routes create bperkins84svc-ip --destination-range=172.16.132.99/32 --next-hop-instance=bperkins84svc-nodeb-vm --network=projects/rhel-ha-host-project/global/networks/bperkinshost-vpc-protected ERROR: (gcloud-ra.compute.routes.create) Could not fetch resource: - Invalid value for field 'resource.network': 'https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/networks/bperkinshost-vpc-protected'. Cross project referencing is not allowed for this resource. 
This makes me wonder whether a) this can work at all, and b) I still have something configured wrong.

Michael, this doesn't even seem to work with the gcloud command (see Comment 10). We also tested it with an updated gcloud command (via the web console) without any luck. Are there any other commands that need to be run via gcloud to get it working?

@Brandon Perkins: it will not work with the instance name. Google routes have three possible options for defining static routes. Moreover, the intent of the service project vs. host project design is primarily to separate the network components and to provide separate access controls through IAM. With that in mind, instances will not reside in the host project, so we need to build the route with a next hop to an IP instead of a next hop to an instance. The next-hop IP should be one of the participating nodes in the cluster, since the IP address will be available in the host project; this needs to be determined dynamically.

All this while, we have been managing the GCP resource through a custom OCF provider where we define the next hop as an IP address. Here is a snippet from our custom OCF script; we modified the script according to our needs, sourcing it from [1]:

$GCLOUD compute routes create $OCF_RESKEY_name \
    --network=${OCF_RESKEY_network} \
    --destination-range=${OCF_RESKEY_prefix}/${OCF_RESKEY_prefix_length} \
    --next-hop-address=$(get_instance_ip)

[1] https://github.com/torchbox/pcmk-gcproute/blob/master/gcproute.sh

Regards,
Michael

I'm making an educated guess here, but the fact that we are using 'nextHopInstance' on lines 336 and 382, instead of 'nextHopIp', may be what's getting us here. If I create the route via the CLI, things seem to be correct:

[root@nodea ~]# gcloud-ra --project rhel-ha-host-project compute routes describe bperkins84svc-ip
creationTimestamp: '2021-02-24T06:39:45.420-08:00'
description: ''
destRange: 172.16.132.99/32
id: '7690883616705600478'
kind: compute#route
name: bperkins84svc-ip
network: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/networks/bperkinshost-vpc-protected
nextHopIp: 172.16.66.2
priority: 1000
selfLink: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/routes/bperkins84svc-ip

but the resource agent isn't looking for this value.

Patch to make vpc_network optional (defaults to "default" when not set): https://github.com/ClusterLabs/resource-agents/pull/1615

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1736