Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1992835

Summary: [OSP16.1] Rebooting controller nodes creates outages for instances on the tenant network (MAC addresses are regenerated)
Product: Red Hat OpenStack
Component: os-net-config
Version: 16.1 (Train)
Reporter: ggrimaux
Assignee: OSP Team <rhos-maint>
QA Contact: nlevinki <nlevinki>
CC: bfournie, hbrock, jslagle, mburns, sbaker
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2021-08-17 19:59:29 UTC

Description ggrimaux 2021-08-11 19:17:15 UTC
Description of problem:
In the customer's own words:
~~~
We did an OpenStack upgrade in a RHOSP16-based environment. After the reboot of the controller nodes (one-by-one) we had a network outage in the tenant network:

- Running instances were not reachable from outside via floating IP addresses
- Running instances were not able to reach anything outside of the tenant network via the Neutron router
- Newly created instances didn't get their IP addresses via DHCP service
~~~

The customer discovered that rebooting a controller node generates a new MAC address for the VLAN interface, which breaks traffic until the stale cache entry expires (its TTL elapses); after that, traffic works again.

How to reproduce the issue, again in the customer's own words:
~~~
Controller 1 which was active for the Neutron router was rebooted.
When the controller node came back online it had a completely new MAC address for the VLAN interface "vlan280" for the tenant network.
Randomly assigning a new MAC address to this VLAN interface seems to happen after every reboot on all three controller nodes.

We observed that controller 1 correctly announced its new MAC address via gARP to all other controller and compute nodes in the underlay network.

This gARP seems to have been ignored by the tunnel neighbor and ARP cache.

At this point, everything was still working fine.
Afterwards, we rebooted controller 2 which became the active Neutron router after controller 1 was rebooted.
When controller 2 was rebooted, controller 1 became active for the Neutron router again, and exactly at this point we had another network outage in the tenant network. This happened because the IP address of controller 1 still resolved to the old (pre-reboot) MAC address on all compute nodes, so all traffic in the tenant network was sent nowhere.
We waited more than 1 hour to see whether it would fix itself automatically, but it did not. We had to restart OVS or flush the two tables.

Now, we would like to know a) why the MAC addresses are re-generated after every controller node reboot, and b) why gARPs are ignored by the tunnel neighbor and ARP cache.
~~~
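The diagnosis and recovery steps the customer describes (checking the VLAN interface's MAC, then flushing the stale entries rather than waiting for them to expire) can be sketched as shell commands. This is a hypothetical sketch: the interface name `vlan280` comes from the report, while the bridge name `br-int` and the sample `ip link` output below are illustrative assumptions, not taken from the sosreports.

```shell
# Extract the current MAC of the tenant VLAN interface from `ip link`
# style output. The sample text here is illustrative only.
sample='14: vlan280@br-ex: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue
    link/ether fa:16:3e:aa:bb:cc brd ff:ff:ff:ff:ff:ff'
mac=$(printf '%s\n' "$sample" | awk '/link\/ether/ {print $2}')
echo "current MAC: $mac"

# On a live controller you would compare this value before and after a
# reboot, then flush the two stale tables (both require root):
#   ip neigh flush all            # kernel neighbour (ARP) cache
#   ovs-appctl fdb/flush br-int   # OVS forwarding table (assumed bridge name)
```

Flushing both caches should restore traffic immediately instead of waiting for the entries to age out, which matches the customer's observation that restarting OVS or flushing the two tables recovered the network.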

I will be sharing all the output in private comments below.

If you need anything please let me know.

We also have sosreports from all controller nodes.

Version-Release number of selected component (if applicable):
OSP16.1.6
os-net-config-11.3.2-1.20210406083710.f49ab16.el8ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Reboot controller nodes one by one
2. Observe loss of tenant network connectivity for the instances.

Actual results:
Loss of connectivity on the tenant network because of the MAC address change.

Expected results:
The same MAC address is kept on the VLAN interface across reboots, so routing is not impacted when controller nodes are restarted.
Alternatively, if the MAC address does change, the gARP is honored and the caches on the other nodes are updated with the new entry.
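Either expected behavior could be approximated by hand while a proper fix is pending. A minimal sketch, with loud caveats: the MAC and IP addresses below are placeholders invented for illustration, `vlan280` is the interface from the report, and whether os-net-config itself can persist a MAC for a VLAN device is not confirmed anywhere in this bug.

```shell
fixed_mac='fa:16:3e:00:00:01'   # illustrative address, not from the report
# Sanity-check the MAC format before attempting to apply it
if printf '%s' "$fixed_mac" | grep -Eq '^([0-9a-f]{2}:){5}[0-9a-f]{2}$'; then
  result=valid
else
  result=invalid
fi
echo "$result"

# On the controller (requires root), either pin the MAC after boot ...
#   ip link set dev vlan280 address "$fixed_mac"
# ... or re-announce the current address via gratuitous ARP, using the
# iputils arping utility (192.0.2.10 is a placeholder for the router IP):
#   arping -A -c 3 -I vlan280 192.0.2.10
```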

Additional info:
sosreports of the controller nodes are available.

Comment 4 Steve Baker 2021-08-17 19:59:29 UTC
Closing this as a duplicate of #1989057. A workaround will be provided, and 1989057 will become an RFE.

*** This bug has been marked as a duplicate of bug 1989057 ***