Bug 1567128 - [HA] Taking down cluster leader triggers router update event from mdsal which leads to FIPs becoming unreachable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 13.0 (Queens)
Assignee: Josh Hershberg
QA Contact: Tomas Jamrisko
URL:
Whiteboard: odl_netvirt, odl_ha
Depends On:
Blocks:
 
Reported: 2018-04-13 13:25 UTC by Tomas Jamrisko
Modified: 2018-10-18 07:18 UTC
CC List: 11 users

Fixed In Version: opendaylight-8.0.0-11.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2018-06-27 13:50:58 UTC
Target Upstream Version:
Flags: knylande: needinfo? (jhershbe)


Attachments
logs from controllers (11.24 MB, application/x-tar)
2018-04-13 13:50 UTC, Tomas Jamrisko
karaf logs from after the problem occurs, marked up with notes beginning with #JOSH (5.81 MB, text/plain)
2018-05-16 08:53 UTC, Josh Hershberg


Links
System ID Priority Status Summary Last Updated
OpenDaylight Bug NETVIRT-1279 None None None 2018-05-22 09:40:24 UTC
OpenDaylight gerrit 72307 None None None 2018-05-28 08:02:10 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:51:40 UTC

Description Tomas Jamrisko 2018-04-13 13:25:03 UTC
Description of problem:

VMs become unreachable after the leader of the cluster changes.

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-5

How reproducible:
100%

Steps to Reproduce:
Reboot the controllers in an HA deployment and make sure the leader changes.

Actual results:
Floating IPs become unreachable

Comment 2 Aswin Suryanarayanan 2018-04-13 13:38:13 UTC
From the logs it looks like controller-2 was initially the master and the OVS was connected to controller-2. That controller was therefore receiving the punted packets (the return traffic from the VM) used to resolve the undercloud IP, and the traffic was successful.

After the restart, controller-1 became the master, but the OVS still seems to be managed by controller-2. Now the punted packet (the ping response from the VM with the FIP) never reaches controller-1.

When controller-0 and controller-1 were shut down, the FIP started working again.

There seems to be an issue when the OVS is managed by a controller other than the master.
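
Purely for illustration, a hypothetical sketch of the mismatch described above; the types here (OwnershipView, PuntOwnershipCheckSketch, onPacketIn) are simplified placeholders, not real ODL APIs. The point it shows: punted packets arrive on whichever controller manages the OVS, so if that node is not the master for the device, the FIP return traffic is never handled.

// Hypothetical diagnostic sketch with placeholder types, not real OpenDaylight APIs.
public class PuntOwnershipCheckSketch {

    /** Assumed abstraction over clustering ownership for a given datapath. */
    interface OwnershipView {
        boolean isLocalNodeOwner(String datapathId);
    }

    private final OwnershipView ownershipView;

    PuntOwnershipCheckSketch(OwnershipView ownershipView) {
        this.ownershipView = ownershipView;
    }

    void onPacketIn(String datapathId, byte[] payload) {
        if (!ownershipView.isLocalNodeOwner(datapathId)) {
            // This is the situation described above: the switch punts to the node
            // that manages it (controller-2) while another node (controller-1)
            // is the master, so the FIP return traffic is never handled.
            System.err.println("Punted packet from " + datapathId
                    + " received on a node that is not the owner/master");
            return;
        }
        // ... normal handling of the punted return traffic would go here ...
    }
}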

Comment 3 Tomas Jamrisko 2018-04-13 13:50:05 UTC
Created attachment 1421399 [details]
logs from controllers

Comment 4 Itzik Brown 2018-04-15 07:29:33 UTC
I observed the same problem when restarting the opendaylight_api container on a non-HA bare-metal setup.

Comment 8 Josh Hershberg 2018-05-16 08:53:47 UTC
Created attachment 1437199 [details]
karaf logs from after the problem occurs, marked up with notes beginning with #JOSH

Comment 9 Josh Hershberg 2018-05-16 09:12:52 UTC
Well, it seems there are at least a few issues here.

The chain of events, as observed in the ODL logs, is as follows:

At some point in the log we find a call to SnatCentralizedSwitchChangeListener.update:
 
2018-05-16T05:49:20,246 | DEBUG | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | SnatCentralizedSwitchChangeListener | 356 - org.opendaylight.netvirt.natservice-impl - 0.6.0.redhat-9 | Updating old RouterToNaptSwitch{getPrimarySwitchId=0, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}} new RouterToNaptSwitch{getPrimarySwitchId=13665695449163, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}}

This function ends with a call to:

 snatServiceManger.notify(router, primarySwitchId, null, SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);

which results (further up the stack) in a call to AbstractSnatService#installDefaultFibRouteForSNAT, which removes this flow (this is what breaks SNAT):

 table=21, n_packets=2604, n_bytes=255192, priority=10,ip,metadata=0x33c36/0xfffffe actions=goto_table:26

So, ISSUE 1:
It would seem that the call to notify(..., SNAT_ALL_SWITCH_DISBL) should be followed by an identical call, but with SNAT_ALL_SWITCH_ENABLE.
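
Purely as an illustration of that suggestion (not the actual patch), here is a minimal self-contained sketch. The stand-in types and method names (SnatServiceManager interface, onNaptSwitchUpdated, routerName/switch-id parameters) are simplified placeholders rather than the real netvirt classes; only the two notify() actions come from the calls quoted above.

// Rough sketch of the shape of the suggested fix, using simplified stand-in types.
import java.math.BigInteger;

public class SnatSwitchoverSketch {

    /** Stand-in for SnatServiceManager and its Action enum. */
    interface SnatServiceManager {
        enum Action { SNAT_ALL_SWITCH_DISBL, SNAT_ALL_SWITCH_ENABLE }

        void notify(String routerName, BigInteger primarySwitchId, Action action);
    }

    private final SnatServiceManager snatServiceManager;

    SnatSwitchoverSketch(SnatServiceManager snatServiceManager) {
        this.snatServiceManager = snatServiceManager;
    }

    /** Invoked when the RouterToNaptSwitch mapping is updated, as in the log above. */
    void onNaptSwitchUpdated(String routerName,
                             BigInteger oldPrimarySwitchId,
                             BigInteger newPrimarySwitchId) {
        // What the code does today: tear down the SNAT flows, which deletes the
        // table=21 default FIB route and leaves SNAT broken.
        snatServiceManager.notify(routerName, oldPrimarySwitchId,
                SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);

        // The suggested follow-up: reinstall the flows against the new primary
        // (NAPT) switch so the default FIB route comes back immediately.
        snatServiceManager.notify(routerName, newPrimarySwitchId,
                SnatServiceManager.Action.SNAT_ALL_SWITCH_ENABLE);
    }
}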

However, it seems this is being caused by a deeper and more nefarious problem. Some time before all of the above, we find the following line printed from NeutronvpnNatManager#handleExternalNetworkForRouter:

2018-05-16T05:49:16,096 | TRACE | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronvpnNatManager             | 358 - org.opendaylight.netvirt.neutronvpn-impl - 0.6.0.redhat-9 | handleExternalNetwork for router Uuid [_value=5316dc24-48c3-456f-a0cd-e626b8adda96]

ISSUE 2:
This method is called from two places, and in both cases the call can only happen when a Neutron router is added or updated. After checking the Neutron logs, it is clear that networking-odl has not made any REST calls to modify a router for many hours prior to this. Since this happens right after shutting down the leader, I suspect that something is broken in HA/clustering and we are getting spurious notifications (I saw some others in the log as well that did not seem to break anything).
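
Purely as an illustration (not a proposed patch), a minimal sketch of the kind of guard that would keep a replayed router "update" from reaching handleExternalNetworkForRouter: only act when the external gateway actually changed. The Router and NatManager types below are simplified placeholders, not the yang-generated Neutron bindings.

// Minimal sketch with placeholder types; only reacts to a real external-gateway change.
import java.util.Objects;

public class RouterUpdateGuardSketch {

    /** Placeholder for the router's external gateway information. */
    static final class Router {
        final String uuid;
        final String externalNetworkId;

        Router(String uuid, String externalNetworkId) {
            this.uuid = uuid;
            this.externalNetworkId = externalNetworkId;
        }
    }

    /** Placeholder for the NAT handling entry point seen in the log above. */
    interface NatManager {
        void handleExternalNetworkForRouter(Router router);
    }

    private final NatManager natManager;

    RouterUpdateGuardSketch(NatManager natManager) {
        this.natManager = natManager;
    }

    void onRouterUpdated(Router before, Router after) {
        // A spurious notification replays unchanged data; skip it so the SNAT
        // flows are not torn down and re-driven for no reason.
        if (Objects.equals(before.externalNetworkId, after.externalNetworkId)) {
            return;
        }
        natManager.handleExternalNetworkForRouter(after);
    }
}

Of course, such a guard would only mask the symptom; the spurious notification right after the leader shutdown still points at a clustering problem that needs its own fix.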

Comment 12 Mike Kolesnik 2018-05-21 12:37:40 UTC
Stephen, can you please look at comment #9 and respond with your input?

Comment 13 Stephen Kitt 2018-05-22 09:40:24 UTC
I’ve created https://jira.opendaylight.org/browse/NETVIRT-1279 for issue 1, which is straightforward to fix (although a robust fix is more involved).

Comment 24 errata-xmlrpc 2018-06-27 13:50:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

