Bug 1913932

Summary: gcp-vpc-move-route, gcp-vpc-move-vip: Failures if the instance is in a shared service project [RHEL 8]
Product: Red Hat Enterprise Linux 8
Reporter: Reid Wahl <nwahl>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: Brandon Perkins <bperkins>
Severity: high
Priority: high
Version: 8.3
CC: agk, bperkins, cfeist, cluster-maint, fdinitto, jwboyer, michael.varun
Target Milestone: rc
Keywords: Triaged
Target Release: 8.4
Hardware: All
OS: Linux
Fixed In Version: resource-agents-4.1.1-87.el8
: 1913936 (view as bug list)
Last Closed: 2021-05-18 15:12:05 UTC
Type: Bug
Deadline: 2021-02-15

Description Reid Wahl 2021-01-07 19:56:54 UTC
Description of problem:

From customer (SAP):
"Groking through the code I see the project is determined using the metadata services of the Instance, this will not work in the case of a shared service project 
There needs to be a feature enabled to support a project to be a user-defined input within the context of the VPC since the VPC and its routes may reside in the host project, while the instance may reside in the service project so companies who leverages shared service project will not benefit by using this resource agent since it will not work as expected, it will fail to determine the VPC and the routes since it will not exist"

Seems to be tied conceptually to this: https://cloud.google.com/vpc/docs/shared-vpc

I have not constructed a test environment for a shared service project (I'm not sure whether I even have the necessary privileges to do so). But it should be straightforward to add a resource attribute for the project and sanity-test it, given that such a fix would keep the resources from breaking a valid customer use case.
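
As a rough sketch (not the shipped fix), the agent could prefer a user-supplied project over the one reported by the metadata server. The parameter name and helpers below are hypothetical:

    import os
    import urllib.request

    METADATA_URL = 'http://metadata.google.internal/computeMetadata/v1/project/project-id'

    def get_metadata_project():
        # Ask the GCE metadata server which project this instance lives in.
        req = urllib.request.Request(METADATA_URL, headers={'Metadata-Flavor': 'Google'})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode()

    def get_project():
        # Prefer an explicit "project" resource parameter (OCF passes parameters
        # as OCF_RESKEY_* environment variables); fall back to instance metadata.
        return os.environ.get('OCF_RESKEY_project') or get_metadata_project()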

-----

Version-Release number of selected component (if applicable):

resource-agents-gcp-4.1.1-68
master

-----

How reproducible:

Always (based on description)

-----

Steps to Reproduce:

Create an instance so that "the VPC and its routes may reside in the host project, while the instance may reside in the service project."
  - https://cloud.google.com/vpc/docs/shared-vpc

Attempt to start a gcp-vpc-move-route or gcp-vpc-move-vip resource.

-----

Actual results:

Operation fails. Will request failure details from reporter.

-----

Expected results:

Operation succeeds.

Comment 1 Michael Varun 2021-01-08 04:44:54 UTC
It should be pretty much straightforward if there is a resource attribute "project" and an option to source the service account for that project.
A similar issue was reported for the GCE fence agents in BZ1704348. There have been changes to use the OAuth client; however, we would also request honouring the OS environment variable along with the OAuth client, if there are considerations on how to process authentication for the route move.

Let me know if any help / validation / testing is required; I would be happy to help.

Comment 4 Oyvind Albrigtsen 2021-01-19 10:06:04 UTC
Additional patch to fix the stop action and an incorrect call of a function: https://github.com/ClusterLabs/resource-agents/pull/1609

Comment 5 Brandon Perkins 2021-01-26 15:11:33 UTC
I'm going to need more information here (especially error messages).  I just set up an environment using the instructions at:

https://cloud.google.com/vpc/docs/provisioning-shared-vpc

and using the 8.3.0 GA version of RA: resource-agents-4.1.1-68.el8.x86_64

and I was unable to reproduce any failure.  There must be some configuration difference that I'm missing or isn't captured in the description or comments.

My configuration is simply a host project and a service project where the host project has two VPCs (with a subnet in each), and each instance has two NICs connected to each of the two subnets.  All fencing and IPv4 tests are passing with this setup.

Any help would be greatly appreciated.

Comment 6 Reid Wahl 2021-01-26 17:13:03 UTC
I'm clearing the needinfo flag from myself and leaving it set for Michael, who reported the issue. I never reproduced this one (nothing's changed since comment 0 in that regard).

Comment 7 Brandon Perkins 2021-02-01 15:56:49 UTC
I still need some feedback on this.  Without the failure case, I can't be sure whether this is fixed.  This will also impact the delivery of the fix in bug 1913936, bug 1919343, and bug 1919344.

Comment 8 Michael Varun 2021-02-02 07:45:05 UTC
@Brandon Perkins

I assume you have VPCs in the host project configured as shared VPCs, and that you have created Compute Engine instances in the service project that the host project's VPCs are shared with.

Also, the gcp-vpc-move-vip resource uses an IP address from a subnet that is not visible to the VPC. This is a technical limitation in GCP: you can't have more than two IP addresses from the same subnet on the host.

I assume the use case here considers the use of a secondary IP address for the gcp-vpc-move-vip resource agent, while a floating IP address CIDR range is used for gcp-vpc-move-route & gcp-vpc-move-ip.

We are more interested in the latter, not the former. The definition of VIP vs. IP is confusing, and there are no proper scenarios where we run secondary IP ranges unless it's in the container world.


So, technically, let's say you have:

host project: project-network
VPC: HANA DB with subnet 10.0.0.1/26, secondary IP address range 192.0.0.1/26

service project: project-db
Instance A: node001 10.0.0.1/32
Instance B: node002 10.0.0.2/32
VIP: 192.0.0.1/32 (floating IP address)

I have a dedicated service account which can create/delete network advanced routes [1]. In the case of the gcp-vpc-move-route resource agent, network routes can be administered only in the host project, since the VPC is in the host project, so my service account will bear the host project's details.

Currently, in the script [3], I don't see how the "ctx.conn" object on line 215 is created with respect to authentication (I infer it is OAuth 2.0). The documentation doesn't provide detailed information on the supported authentication/authorisation mechanism [2]. A parameterised service_account option would be helpful in case users use service accounts instead of OAuth.

The same applies to the script [4], where the project details are determined from metadata, again with no details on the gcloud authentication/authorization.
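
A minimal sketch of what such a service_account option could look like, assuming the agent builds its compute client with googleapiclient (the parameter name and fallback behaviour here are assumptions, not the agent's current code):

    import os
    import google.auth
    import googleapiclient.discovery
    from google.oauth2 import service_account

    def make_compute_client():
        # Hypothetical OCF parameter pointing at a service account key file.
        key_file = os.environ.get('OCF_RESKEY_serviceaccount')
        if key_file:
            credentials = service_account.Credentials.from_service_account_file(key_file)
        else:
            # google.auth.default() honours GOOGLE_APPLICATION_CREDENTIALS and
            # otherwise falls back to the instance's metadata-server credentials.
            credentials, _project = google.auth.default()
        return googleapiclient.discovery.build('compute', 'v1', credentials=credentials)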


Let me know if we should hop on a call for a discussion.

[1] https://cloud.google.com/vpc/docs/routes
[2] https://cloud.google.com/docs/authentication
[3] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/gcp-vpc-move-vip.in
[4] https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/gcp-vpc-move-ip.in

Comment 9 Michael Varun 2021-02-02 07:47:20 UTC
(In reply to Michael Varun from comment #8)

Correction: the link tagged as [3] refers to the script https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/gcp-vpc-move-route.in, not gcp-vpc-move-vip.in.

Comment 10 Brandon Perkins 2021-02-09 21:58:32 UTC
I believe I have the projects set up correctly now; at least, I could get gcp-vpc-move-route to fail with the GA RA.  However, when I upgraded to the fixed-in version for this bug, I noticed two different things.  If I set up the agent with project=rhel-ha-service-project (or don't specify a project argument), we get:

ERROR:gcp-vpc-move-route:VPC network not found

because it's setting ctx.vpc_network_url to use rhel-ha-service-project instead of rhel-ha-host-project.

So, then I flipped the logic to "solve the problem" and set up the agent with project=rhel-ha-host-project.  In this situation the resource starts and the IP can be pinged from the machine local to the running resource.  However, nothing else can get to it.  Looking at the route that was created by the resource, we see:

===============================
[root@nodea ~]# gcloud-ra --project rhel-ha-host-project compute routes describe bperkins84svc-ip
creationTimestamp: '2021-02-09T13:35:00.921-08:00'
description: ''
destRange: 172.16.132.99/32
id: '3952597249666498283'
kind: compute#route
name: bperkins84svc-ip
network: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/networks/bperkinshost-vpc-protected
nextHopInstance: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/zones/us-east1-b/instances/bperkins84svc-nodeb-vm
priority: 1000
selfLink: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/routes/bperkins84svc-ip
warnings:
- code: NEXT_HOP_INSTANCE_NOT_FOUND
  data:
  - key: instance
    value: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/zones/us-east1-b/instances/bperkins84svc-nodeb-vm
  message: Next hop instance 'https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/zones/us-east1-b/instances/bperkins84svc-nodeb-vm'
    does not exist.
===============================

Notice that the nextHopInstance is pointing at rhel-ha-host-project instead of rhel-ha-service-project, indicating that ctx.instance_url is incorrect.

Next, I decided to modify the RA to hard-code the network and instance projects with what should (presumably) be the correct values.  When doing a debug-start of the resource at that point, we get the following:

===============================
[root@nodea ~]# pcs resource debug-start gcp-vpc-move-route 
Operation start for gcp-vpc-move-route (ocf:heartbeat:gcp-vpc-move-route) returned: 'error' (1)
INFO:gcp-vpc-move-route:Bringing up the floating IP 172.16.132.99
Traceback (most recent call last):
  File "/usr/lib/ocf/resource.d/heartbeat/gcp-vpc-move-route", line 464, in <module>
    main()
  File "/usr/lib/ocf/resource.d/heartbeat/gcp-vpc-move-route", line 451, in main
    ip_and_route_start(ctx)
  File "/usr/lib/ocf/resource.d/heartbeat/gcp-vpc-move-route", line 340, in ip_and_route_start
    wait_for_operation(ctx, request.execute())
  File "/usr/lib/resource-agents/bundled/gcp/google-cloud-sdk/lib/third_party/googleapiclient/_helpers.py", line 131, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/usr/lib/resource-agents/bundled/gcp/google-cloud-sdk/lib/third_party/googleapiclient/http.py", line 852, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 400 when requesting https://compute.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/routes?alt=json returned "Invalid value for field 'resource.nextHopInstance': 'https://www.googleapis.com/compute/v1/projects/rhel-ha-service-project/zones/us-east1-b/instances/bperkins84svc-nodea-vm'. Cross project referencing is not allowed for this resource.">
===============================

which matches what happens if one tries to do this with the gcloud CLI:

===============================
[root@nodea ~]# gcloud-ra compute routes create bperkins84svc-ip --destination-range=172.16.132.99/32 --next-hop-instance=bperkins84svc-nodeb-vm --network=projects/rhel-ha-host-project/global/networks/bperkinshost-vpc-protected
ERROR: (gcloud-ra.compute.routes.create) Could not fetch resource:
 - Invalid value for field 'resource.network': 'https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/networks/bperkinshost-vpc-protected'. Cross project referencing is not allowed for this resource.

===============================

This makes me wonder (a) whether this can possibly work at all, and (b) whether I still have something configured wrong.
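
For reference, the two URLs in the errors above differ only in which project they embed. A rough illustration of how a single project value ends up in both of them (names are illustrative, not the agent's literal code):

    GCE_API = 'https://www.googleapis.com/compute/v1'

    def build_urls(project, zone, instance, vpc_network):
        # With only one project value available, both URLs inherit it, but in a
        # shared-VPC setup the network lives in the host project while the
        # next-hop instance lives in the service project.
        vpc_network_url = '%s/projects/%s/global/networks/%s' % (GCE_API, project, vpc_network)
        instance_url = '%s/projects/%s/zones/%s/instances/%s' % (GCE_API, project, zone, instance)
        return vpc_network_url, instance_url

    # project=rhel-ha-service-project -> ERROR: VPC network not found
    # project=rhel-ha-host-project    -> warning: NEXT_HOP_INSTANCE_NOT_FOUND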

Comment 12 Oyvind Albrigtsen 2021-02-12 14:18:15 UTC
Michael, this doesn't even seem to work with the gcloud command (see Comment 10). We also tested it with an updated gcloud command (via the web console) without any luck.

Are there any other commands that need to be run via gcloud to get it working?

Comment 14 Michael Varun 2021-02-15 06:27:19 UTC
@Brandon Perkins

It will not work with the instance name.

Google routes have three possible options for defining static routes. Moreover, the intent of the service project vs. host project design primarily focuses on separating the network components as well as providing separate access controls through IAM. Keeping this in mind, instances will not reside in the host project, hence we need to complement the route with a next hop to an IP instead of a next hop to an instance.

The next-hop IP should be that of a participating node in the cluster, since the IP address will be available in the host project. This needs to be determined dynamically.

All this while, we have been managing the GCP resource through a custom OCF provider where we define the next hop as an IP address.

Snippet from our custom OCF script; we modified the script according to our needs, sourcing it from [1]:



	$GCLOUD compute routes create $OCF_RESKEY_name \
		--network=${OCF_RESKEY_network} \
		--destination-range=${OCF_RESKEY_prefix}/${OCF_RESKEY_prefix_length} \
		--next-hop-address=$(get_instance_ip)



[1] https://github.com/torchbox/pcmk-gcproute/blob/master/gcproute.sh
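
The get_instance_ip helper is not shown in the snippet above; a minimal Python equivalent, assuming it simply returns the node's primary internal IP from the GCE metadata server, would be:

    import urllib.request

    def get_instance_ip():
        # Primary internal IP of NIC 0, as reported by the GCE metadata server.
        url = ('http://metadata.google.internal/computeMetadata/v1/'
               'instance/network-interfaces/0/ip')
        req = urllib.request.Request(url, headers={'Metadata-Flavor': 'Google'})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode()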


Regards
Michael

Comment 17 Brandon Perkins 2021-02-24 15:00:15 UTC
I'm making an educated guess here, but it seems the fact that we are using 'nextHopInstance' on lines 336 and 382, instead of 'nextHopIp', may be what's getting us here.  If I create the route via the CLI, things seem to be correct:

[root@nodea ~]# gcloud-ra --project rhel-ha-host-project compute routes describe bperkins84svc-ip
creationTimestamp: '2021-02-24T06:39:45.420-08:00'
description: ''
destRange: 172.16.132.99/32
id: '7690883616705600478'
kind: compute#route
name: bperkins84svc-ip
network: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/networks/bperkinshost-vpc-protected
nextHopIp: 172.16.66.2
priority: 1000
selfLink: https://www.googleapis.com/compute/v1/projects/rhel-ha-host-project/global/routes/bperkins84svc-ip

but the resource agent isn't looking for this value.
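
For illustration, a route body using nextHopIp (the field names match the describe output above; building it this way via googleapiclient is a sketch, not the agent's shipped code) would look roughly like:

    def create_route_with_next_hop_ip(conn, host_project, route_name, dest_cidr,
                                      vpc_network_url, next_hop_ip):
        # Insert the route into the host project (where the shared VPC lives) and
        # point it at the node's IP address rather than at an instance URL, so no
        # cross-project instance reference is needed.
        body = {
            'name': route_name,
            'network': vpc_network_url,
            'destRange': dest_cidr,      # e.g. '172.16.132.99/32'
            'nextHopIp': next_hop_ip,    # e.g. '172.16.66.2'
            'priority': 1000,
        }
        return conn.routes().insert(project=host_project, body=body).execute()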

Comment 18 Oyvind Albrigtsen 2021-02-25 15:53:19 UTC
Patch to make vpc_network optional (defaults to "default" when not set): https://github.com/ClusterLabs/resource-agents/pull/1615

Comment 22 errata-xmlrpc 2021-05-18 15:12:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (resource-agents bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:1736