Bug 1846732

Summary: gcp-vpc-move-vip: An existing alias IP range is removed when a second alias IP range is added [rhel-7.9.z]
Product: Red Hat Enterprise Linux 7
Reporter: Reid Wahl <nwahl>
Component: resource-agents
Assignee: Oyvind Albrigtsen <oalbrigt>
Status: CLOSED ERRATA
QA Contact: Brandon Perkins <bperkins>
Severity: high
Docs Contact:
Priority: high
Version: 7.9
CC: agk, akaiser, bfrank, bperkins, cfeist, cluster-maint, fdinitto, jreznik, oalbrigt
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: resource-agents-4.1.1-61.el7_9.1
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1846733 1862987 (view as bug list)
Environment:
Last Closed: 2020-11-10 12:56:47 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1846733, 1862987

Description Reid Wahl 2020-06-13 08:14:02 UTC
Description of problem:

If a cluster contains two gcp-vpc-move-vip resources, only one can run on a particular node at a given time. If a second gcp-vpc-move-vip resource starts up on a node where one is already running, the existing alias IP range is removed before the new one is added.

This places unnecessary limits on functionality. Per the GCP documentation: "A VM instance virtual interface can have up to 10 alias IP ranges assigned to it."
  - Configuring alias IP ranges (https://cloud.google.com/vpc/docs/configure-alias-ip-ranges)

I consider this a bug rather than an RFE. This behavior prevents one node from being able to effectively host two floating IP addresses simultaneously (unless they are in a contiguous range and can be managed as a single unit, which is uncommon).

This impacts users running SAP NetWeaver, in which an ASCS instance and an ERS instance each have their own floating IP address. I believe ASCS and ERS prefer to run on separate nodes; however, if one node fails, they should be able to run together.

This may also impact certain users running SAP HANA in Pacemaker clusters. Only one floating IP is required per SAP HANA instance; however, I believe it is possible to run multiple HANA instances in a single cluster.

This is in addition to any other assorted use cases that may require multiple cluster-managed floating IP addresses, which may end up on the same node.

In the code below, note that the set_alias function rebuilds the aliasIpRanges list from scratch, removing any alias IP ranges that are already present. Instead, it should get the existing list and update it to include the range specified in OCF_RESKEY_alias_ip.

~~~
def gcp_alias_start(alias):
...
  # If I already have the IP, exit. If it has an alias IP that isn't the VIP,
  # then remove it
  if my_alias == alias:
    logger.info(
        '%s already has %s attached. No action required' % (THIS_VM, alias))
    sys.exit(OCF_SUCCESS)
  elif my_alias:
    logger.info('Removing %s from %s' % (my_alias, THIS_VM))
    set_alias(project, my_zone, THIS_VM, '')


def set_alias(project, zone, instance, alias, alias_range_name=None):
  fingerprint = get_network_ifaces(project, zone, instance)[0]['fingerprint']
  body = {
      'aliasIpRanges': [],
      'fingerprint': fingerprint
  }
  if alias:
    obj = {'ipCidrRange': alias}
    if alias_range_name:
      obj['subnetworkRangeName'] = alias_range_name
    body['aliasIpRanges'].append(obj)

  request = CONN.instances().updateNetworkInterface(
      instance=instance, networkInterface='nic0', project=project, zone=zone,
      body=body)
  operation = request.execute()
  wait_for_operation(project, zone, operation)
~~~
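
A minimal sketch of the merging behavior the agent could use instead. The helper name (merge_alias_ranges) is hypothetical, and the plain dicts stand in for the 'aliasIpRanges' entries returned by the Compute Engine API; the actual fix would fetch the current list via get_network_ifaces and pass the merged list in the updateNetworkInterface body rather than an empty one:

~~~python
def merge_alias_ranges(existing_ranges, alias, alias_range_name=None):
  """Return a copy of existing_ranges with `alias` added or updated.

  existing_ranges: list of dicts in the shape of a network interface's
  'aliasIpRanges' field, e.g. [{'ipCidrRange': '10.138.0.30/32'}].
  Ranges already attached to the instance are preserved, so a second
  gcp-vpc-move-vip resource no longer removes the first one's range.
  """
  # Drop any stale entry for this CIDR, keep everything else
  merged = [r for r in existing_ranges if r.get('ipCidrRange') != alias]
  obj = {'ipCidrRange': alias}
  if alias_range_name:
    obj['subnetworkRangeName'] = alias_range_name
  merged.append(obj)
  return merged
~~~

set_alias would then build the request body as {'aliasIpRanges': merge_alias_ranges(iface.get('aliasIpRanges', []), alias), 'fingerprint': fingerprint}, and gcp_alias_stop would remove only its own range instead of passing ''.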


Demonstration:
~~~
[root@nwahl-rhel7-node1 ~]# pcs resource show alias_ip1 
 Resource: alias_ip1 (class=ocf provider=heartbeat type=gcp-vpc-move-vip)
  Attributes: alias_ip=10.138.0.30/32
  Operations: monitor interval=60s timeout=15s (alias_ip1-monitor-interval-60s)
              start interval=0s timeout=300s (alias_ip1-start-interval-0s)
              stop interval=0s timeout=15s (alias_ip1-stop-interval-0s)

[root@nwahl-rhel7-node1 ~]# pcs resource show | grep alias_ip1
     alias_ip1	(ocf::heartbeat:gcp-vpc-move-vip):	Started nwahl-rhel7-node2

[root@nwahl-rhel7-node1 ~]# date && pcs resource create alias_ip2 gcp-vpc-move-vip alias_ip=10.138.0.31/32 --group vipgrp2
Sat Jun 13 08:05:55 UTC 2020
Assumed agent name 'ocf:heartbeat:gcp-vpc-move-vip' (deduced from 'gcp-vpc-move-vip')


# # alias_ip2 starts successfully on node 2
Jun 13 08:06:10 nwahl-rhel7-node2 crmd[1601]:  notice: Result of start operation for alias_ip2 on nwahl-rhel7-node2: 0 (ok)

# # alias_ip1 then fails its next monitor operation on node 2, because its alias has been removed by alias_ip2
# # The two gcp-vpc-move-vip resources then start fighting with each other.
# # One will fail its monitor operation and restart, thus removing the other one's alias IP.
# # Then the other will fail as a result, and the cycle repeats.
Jun 13 08:06:46 nwahl-rhel7-node2 pengine[1600]: warning: Processing failed monitor of alias_ip1 on nwahl-rhel7-node2: not running
Jun 13 08:06:46 nwahl-rhel7-node2 pengine[1600]:  notice:  * Recover    alias_ip1     (                      nwahl-rhel7-node2 )
Jun 13 08:06:46 nwahl-rhel7-node2 pengine[1600]:  notice:  * Restart    vip1          (                      nwahl-rhel7-node2 )   due to required alias_ip1 start
Jun 13 08:06:46 nwahl-rhel7-node2 pengine[1600]:  notice: Calculated transition 30, saving inputs in /var/lib/pacemaker/pengine/pe-input-24.bz2
Jun 13 08:06:46 nwahl-rhel7-node2 crmd[1601]:  notice: Initiating stop operation vip1_stop_0 locally on nwahl-rhel7-node2
...
Jun 13 08:07:00 nwahl-rhel7-node2 crmd[1601]:  notice: Result of start operation for alias_ip1 on nwahl-rhel7-node2: 0 (ok)
...
Jun 13 08:07:11 nwahl-rhel7-node2 crmd[1601]:  notice: Initiating stop operation alias_ip2_stop_0 locally on nwahl-rhel7-node2
~~~

-----

Version-Release number of selected component (if applicable):

resource-agents-gcp-4.1.1-46.el7_8.1.x86_64

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Create a two-node GCP cluster and place one node in standby so that all resources will run on the same node.
2. Create two gcp-vpc-move-vip resources, each with a different value for the alias_ip attribute.

-----

Actual results:

The two gcp-vpc-move-vip resources enter a cycle of:
  - rsc2 starts and removes rsc1's alias IP range
  - rsc1 fails its monitor operation
  - rsc1 restarts and removes rsc2's alias IP range
  - rsc2 fails its monitor operation
  - rsc2 restarts and removes rsc1's alias IP range

-----

Expected results:

Up to 10 gcp-vpc-move-vip resources can coexist on a single node.

-----

Additional info:

See also:
  - Alias IP ranges overview (https://cloud.google.com/vpc/docs/alias-ip)
  - Method: instances.updateNetworkInterface (https://cloud.google.com/compute/docs/reference/rest/v1/instances/updateNetworkInterface)

Comment 21 errata-xmlrpc 2020-11-10 12:56:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Low: resource-agents security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5004