Bug 1961162

Summary: ARP request flooding for 172.24.0.1 gateway
Product: Red Hat OpenStack    Reporter: anil venkata <vkommadi>
Component: tripleo-ansible    Assignee: Gregory Thiemonge <gthiemon>
Status: CLOSED ERRATA    QA Contact: Omer Schwartz <oschwart>
Severity: high    Docs Contact:
Priority: high
Version: 16.1 (Train)    CC: beagles, cgoncalves, gregraka, gthiemon, igallagh, jelynch, joflynn, lpeer, majopela, michjohn, oschwart, scohen
Target Milestone: z9    Keywords: Triaged
Target Release: 16.1 (Train on RHEL 8.2)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tripleo-ansible-0.5.1-1.20220906163309.902c3c8.el8ost    Doc Type: Bug Fix
Doc Text:
Before this update, a nonexistent gateway address was configured on the load-balancing management network, which caused excessive Address Resolution Protocol (ARP) requests on that network.
Last Closed: 2022-12-07 20:24:45 UTC Type: Bug

Description anil venkata 2021-05-17 11:50:16 UTC
Description of problem:
During Octavia scale testing, we observed all the VMs continuously sending ARP requests for the 172.24.0.1 IP address, because the VMs are configured with this IP as the default gateway. As all the VMs are in the same broadcast domain, all of them also receive these ARP request packets.

As Octavia does not use the default gateway, the deployment should not create the lb-mgmt network with a default gateway option.

[cloud-user@amphora-f7749858-5649-466e-9cac-876210d61e7a ~]$ ip r
default via 172.24.0.1 dev eth0 proto dhcp metric 100 
169.254.169.254 via 172.24.0.2 dev eth0 proto dhcp metric 100 
172.24.0.0/16 dev eth0 proto kernel scope link src 172.24.19.165 metric 100 
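Because nothing on lb-mgmt-net answers for 172.24.0.1, these requests are retried continuously by every amphora. For illustration only (the interface names are simply the ones visible elsewhere in this report), the flood can be observed from inside an amphora or from a controller:

# inside an amphora: the gateway neighbour entry never resolves to a MAC address
ip neigh show 172.24.0.1
# on a controller, via the Octavia health-manager port: a steady stream of who-has requests
sudo tcpdump -nn -i o-hm0 arp and host 172.24.0.1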

(overcloud) [stack@undercloud ~]$ neutron subnet-show 764269e8-bf7f-46e0-8ce2-188b0790b6cb
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+-------------------+--------------------------------------------------+
| Field             | Value                                            |
+-------------------+--------------------------------------------------+
| allocation_pools  | {"start": "172.24.0.2", "end": "172.24.255.254"} |
| cidr              | 172.24.0.0/16                                    |
| created_at        | 2021-05-05T20:08:26Z                             |
| description       |                                                  |
| dns_nameservers   |                                                  |
| enable_dhcp       | True                                             |
| gateway_ip        | 172.24.0.1                                       |
| host_routes       |                                                  |
| id                | 764269e8-bf7f-46e0-8ce2-188b0790b6cb             |
| ip_version        | 4                                                |
| ipv6_address_mode |                                                  |
| ipv6_ra_mode      |                                                  |
| name              | lb-mgmt-subnet                                   |
| network_id        | 8ce010d1-fc76-4008-a5b0-5294ce1b9415             |
| project_id        | d9dc980fa43f4c64998a7889cf458d8f                 |
| revision_number   | 0                                                |
| segment_id        |                                                  |
| service_types     |                                                  |
| subnetpool_id     |                                                  |
| tags              |                                                  |
| tenant_id         | d9dc980fa43f4c64998a7889cf458d8f                 |
| updated_at        | 2021-05-05T20:08:26Z                             |
+-------------------+--------------------------------------------------+

ovs-vswitchd CPU usage on the compute node dropped drastically, from 40%-90% down to 5%, after we created a VM on lb-mgmt-net with the 172.24.0.1 address to temporarily work around this ARP issue (disabling the gateway with "neutron subnet-update" does not help either).
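For reference, the workaround amounts to handing 172.24.0.1 to a real port so the ARP requests finally get answered; something along these lines (the port, server, flavor and image names here are made up for illustration):

# reserve the gateway address on a port and boot any small VM on it
openstack port create --network lb-mgmt-net \
    --fixed-ip subnet=lb-mgmt-subnet,ip-address=172.24.0.1 lb-mgmt-arp-sink
openstack server create --flavor m1.tiny --image cirros \
    --port lb-mgmt-arp-sink lb-mgmt-arp-sink-vm
# clearing the gateway on the existing subnet did not help, presumably because the
# running amphorae keep the default route from their current DHCP lease
neutron subnet-update lb-mgmt-subnet --no-gateway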

Comment 1 Michael Johnson 2021-05-17 16:30:34 UTC
lol, well, this is kind of funny. It's probably a bug in OVS that the CPU load goes up for handling normal ARP traffic (which is super small and easy to process). I wouldn't expect 5,600 VMs to cause that much trouble in OVS.

On the Octavia side, we don't touch or create these ARPs. They are all handled directly by the kernel and the network stack of RHEL.

It is tripleo that is creating the subnet with the default gateway set. If the OSP role being used does not require routing for the lb-mgmt-net, it should not be configuring a gateway on that subnet, especially without a router listening on it.

The amphorae automatically pick the gateway up from the neutron subnet configuration at nova boot time.

I would agree, this is a tripleo bug for the role(s).
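To be explicit, the end state the role needs to produce is simply a management subnet with no gateway at all, i.e. roughly the equivalent of the following (illustrative CLI only, not the actual tripleo-ansible change):

openstack subnet create --network lb-mgmt-net \
    --subnet-range 172.24.0.0/16 \
    --allocation-pool start=172.24.0.2,end=172.24.255.254 \
    --gateway none \
    lb-mgmt-subnet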

Comment 6 Gregory Thiemonge 2022-02-22 09:23:30 UTC
Backport proposed to stable/train

Comment 11 Omer Schwartz 2022-11-14 12:24:29 UTC
After deploying the latest passed_phase2 compose:

(overcloud) [stack@undercloud-0 ~]$ cat core_puddle_version
RHOS-16.1-RHEL-8-20221108.n.1


The lb-mgmt-subnet does not have a gateway_ip:
(overcloud) [stack@undercloud-0 ~]$ openstack subnet show -c gateway_ip lb-mgmt-subnet
+------------+-------+
| Field      | Value |
+------------+-------+
| gateway_ip | None  |
+------------+-------+


No ARP requests were sent to 172.24.0.1 when I ran
[tripleo-admin@controller-0 ~]$ sudo tcpdump -nn -i o-hm0 arp

while creating an LB.
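For reference, creating a load balancer (which boots a new amphora on lb-mgmt-net) can be as simple as the following; the name and VIP subnet here are arbitrary examples:

openstack loadbalancer create --name bz1961162-check --vip-subnet-id private-subnet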


I am moving the status of this BZ to VERIFIED.

Comment 15 Michael Johnson 2022-12-05 16:54:20 UTC
Updating the doctext to more accurately reflect the issue resolved.

Comment 20 errata-xmlrpc 2022-12-07 20:24:45 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 16.1.9 bug fix and enhancement advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:8795