Bug 1846732 - gcp-vpc-move-vip: An existing alias IP range is removed when a second alias IP range is added [rhel-7.9.z]
Summary: gcp-vpc-move-vip: An existing alias IP range is removed when a second alias IP range is added [rhel-7.9.z]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: resource-agents
Version: 7.9
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Oyvind Albrigtsen
QA Contact: Brandon Perkins
URL:
Whiteboard:
Depends On:
Blocks: 1846733 1862987
 
Reported: 2020-06-13 08:14 UTC by Reid Wahl
Modified: 2024-03-25 16:03 UTC
CC List: 9 users

Fixed In Version: resource-agents-4.1.1-61.el7_9.1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1846733 1862987
Environment:
Last Closed: 2020-11-10 12:56:47 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links:
  - GitHub: ClusterLabs/resource-agents pull request 1524 (closed): "[gcp-vpc-move-vip] Support for multiple alias IPs, and metadata improvements" (last updated 2020-11-25 15:59:24 UTC)
  - Red Hat Knowledge Base Solution 5154961 (last updated 2020-06-13 08:27:51 UTC)

Description Reid Wahl 2020-06-13 08:14:02 UTC
Description of problem:

If a cluster contains two gcp-vpc-move-vip resources, only one can run on a particular node at a given time. If a second gcp-vpc-move-vip resource starts up on a node where one is already running, the existing alias IP range is removed before the new one is added.

This places unnecessary limits on functionality. Per the GCP documentation: "A VM instance virtual interface can have up to 10 alias IP ranges assigned to it."
  - Configuring alias IP ranges (https://cloud.google.com/vpc/docs/configure-alias-ip-ranges)

I consider this a bug rather than an RFE. This behavior prevents one node from effectively hosting two floating IP addresses simultaneously (unless they are in a contiguous range and can be managed as a single unit, which is uncommon).

This impacts users running SAP NetWeaver, in which an ASCS instance and an ERS instance each has its own floating IP address. I believe ASCS and ERS prefer to run on separate nodes; however, if one node fails, they should be able to run together.

This may also impact certain users running SAP HANA in Pacemaker clusters. Only one floating IP is required per SAP HANA instance; however, I believe it is possible to run multiple HANA instances in a single cluster.

This is in addition to any other assorted use cases that may require multiple cluster-managed floating IP addresses, which may end up on the same node.

In the code below, note that the set_alias function rebuilds the aliasIpRanges list from scratch, removing any alias IP ranges that are already present. Instead, it should get the existing list and update it to include the range specified in OCF_RESKEY_alias_ip (see the sketch after the excerpt).

~~~
def gcp_alias_start(alias):
...
  # If I already have the IP, exit. If it has an alias IP that isn't the VIP,
  # then remove it
  if my_alias == alias:
    logger.info(
        '%s already has %s attached. No action required' % (THIS_VM, alias))
    sys.exit(OCF_SUCCESS)
  elif my_alias:
    logger.info('Removing %s from %s' % (my_alias, THIS_VM))
    set_alias(project, my_zone, THIS_VM, '')


def set_alias(project, zone, instance, alias, alias_range_name=None):
  fingerprint = get_network_ifaces(project, zone, instance)[0]['fingerprint']
  body = {
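      # NOTE (bug): the list is rebuilt from scratch here, so any alias IP
      # ranges already assigned to the interface are silently dropped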
      'aliasIpRanges': [],
      'fingerprint': fingerprint
  }
  if alias:
    obj = {'ipCidrRange': alias}
    if alias_range_name:
      obj['subnetworkRangeName'] = alias_range_name
    body['aliasIpRanges'].append(obj)

  request = CONN.instances().updateNetworkInterface(
      instance=instance, networkInterface='nic0', project=project, zone=zone,
      body=body)
  operation = request.execute()
  wait_for_operation(project, zone, operation)
~~~
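
For illustration, here is a minimal sketch of the suggested approach, reusing the helpers from the excerpt above. The add_alias name is hypothetical; the change ultimately adopted upstream is the pull request linked in this bug.

~~~
def add_alias(project, zone, instance, alias, alias_range_name=None):
  # Hypothetical variant of set_alias(): fetch the interface once and keep
  # whatever alias IP ranges are already assigned, instead of rebuilding
  # the list from scratch.
  iface = get_network_ifaces(project, zone, instance)[0]
  body = {
      'aliasIpRanges': iface.get('aliasIpRanges', []),
      'fingerprint': iface['fingerprint']
  }

  obj = {'ipCidrRange': alias}
  if alias_range_name:
    obj['subnetworkRangeName'] = alias_range_name
  body['aliasIpRanges'].append(obj)

  request = CONN.instances().updateNetworkInterface(
      instance=instance, networkInterface='nic0', project=project, zone=zone,
      body=body)
  operation = request.execute()
  wait_for_operation(project, zone, operation)
~~~

The elif my_alias: branch of gcp_alias_start() would likewise need to stop removing ranges it does not own; otherwise the agents would keep stealing each other's aliases.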


Demonstration:
~~~
[root@nwahl-rhel7-node1 ~]# pcs resource show alias_ip1 
 Resource: alias_ip1 (class=ocf provider=heartbeat type=gcp-vpc-move-vip)
  Attributes: alias_ip=10.138.0.30/32
  Operations: monitor interval=60s timeout=15s (alias_ip1-monitor-interval-60s)
              start interval=0s timeout=300s (alias_ip1-start-interval-0s)
              stop interval=0s timeout=15s (alias_ip1-stop-interval-0s)

[root@nwahl-rhel7-node1 ~]# pcs resource show | grep alias_ip1
     alias_ip1	(ocf::heartbeat:gcp-vpc-move-vip):	Started nwahl-rhel7-node2

[root@nwahl-rhel7-node1 ~]# date && pcs resource create alias_ip2 gcp-vpc-move-vip alias_ip=10.138.0.31/32 --group vipgrp2
Sat Jun 13 08:05:55 UTC 2020
Assumed agent name 'ocf:heartbeat:gcp-vpc-move-vip' (deduced from 'gcp-vpc-move-vip')


# # alias_ip2 starts successfully on node 2
Jun 13 08:06:10 nwahl-rhel7-node2 crmd[1601]:  notice: Result of start operation for alias_ip2 on nwahl-rhel7-node2: 0 (ok)

# # alias_ip1 then fails its next monitor operation on node 2, because its alias has been removed by alias_ip2
# # The two gcp-vpc-move-vip resources then start fighting with each other.
# # One will fail its monitor operation and restart, thus removing the other one's alias IP.
# # Then the other will fail as a result, and the cycle repeats.
Jun 13 08:06:46 nwahl-rhel7-node2 pengine[1600]: warning: Processing failed monitor of alias_ip1 on nwahl-rhel7-node2: not running
Jun 13 08:06:46 nwahl-rhel7-node2 pengine[1600]:  notice:  * Recover    alias_ip1     (                      nwahl-rhel7-node2 )
Jun 13 08:06:46 nwahl-rhel7-node2 pengine[1600]:  notice:  * Restart    vip1          (                      nwahl-rhel7-node2 )   due to required alias_ip1 start
Jun 13 08:06:46 nwahl-rhel7-node2 pengine[1600]:  notice: Calculated transition 30, saving inputs in /var/lib/pacemaker/pengine/pe-input-24.bz2
Jun 13 08:06:46 nwahl-rhel7-node2 crmd[1601]:  notice: Initiating stop operation vip1_stop_0 locally on nwahl-rhel7-node2
...
Jun 13 08:07:00 nwahl-rhel7-node2 crmd[1601]:  notice: Result of start operation for alias_ip1 on nwahl-rhel7-node2: 0 (ok)
...
Jun 13 08:07:11 nwahl-rhel7-node2 crmd[1601]:  notice: Initiating stop operation alias_ip2_stop_0 locally on nwahl-rhel7-node2
~~~

-----

Version-Release number of selected component (if applicable):

resource-agents-gcp-4.1.1-46.el7_8.1.x86_64

-----

How reproducible:

Always

-----

Steps to Reproduce:
1. Create a two-node GCP cluster and place one node in standby so that all resources will run on the same node.
2. Create two gcp-vpc-move-vip resources, each with a different value for the alias_ip attribute.

-----

Actual results:

The two gcp-vpc-move-vip resources enter a cycle of:
  - rsc2 starts and removes rsc1's alias IP range
  - rsc1 fails its monitor operation
  - rsc1 restarts and removes rsc2's alias IP range
  - rsc2 fails its monitor operation
  - rsc2 restarts and removes rsc1's alias IP range

-----

Expected results:

Up to 10 gcp-vpc-move-vip resources can coexist on a single node.

-----

Additional info:

See also:
  - Alias IP ranges overview (https://cloud.google.com/vpc/docs/alias-ip)
  - Method: instances.updateNetworkInterface (https://cloud.google.com/compute/docs/reference/rest/v1/instances/updateNetworkInterface)

Comment 21 errata-xmlrpc 2020-11-10 12:56:47 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Low: resource-agents security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5004

