Bug 1567128

Summary: [HA] Taking down cluster leader triggers router update event from mdsal which leads to FIPs becoming unreachable
Product: Red Hat OpenStack Reporter: Tomas Jamrisko <tjamrisk>
Component: opendaylightAssignee: Josh Hershberg <jhershbe>
Status: CLOSED ERRATA QA Contact: Tomas Jamrisko <tjamrisk>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 13.0 (Queens)    CC: aadam, asuryana, jhershbe, jluhrsen, knylande, mkolesni, nyechiel, oblaut, sgaddam, skitt, tjamrisk
Target Milestone: rc    Keywords: Triaged
Target Release: 13.0 (Queens)    Flags: knylande: needinfo? (jhershbe)
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: odl_netvirt, odl_ha
Fixed In Version: opendaylight-8.0.0-11.el7ost    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 13:50:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Tomas Jamrisko 2018-04-13 13:25:03 UTC
Description of problem:

VMs become unreachable after leader of the cluster changes.

Version-Release number of selected component (if applicable):

How reproducible:
100 %

Steps to Reproduce:
Reboot controllers in HA deployment, make sure the leader changes

Actual results:
Floating IPs become unreachable

Comment 2 Aswin Suryanarayanan 2018-04-13 13:38:13 UTC
From the logs, it looks like controller-2 was initially the master and OVS connected to it. That controller was therefore receiving the punted packets (the return traffic from the VM) used to resolve the undercloud IP, and the traffic was successful.

After the restart, controller-1 became the master, but OVS still appears to be managed by controller-2. As a result, the punted packet (the ping response from the VM with the FIP) never reaches controller-1.

When controller-0 and controller-1 were shut down, the FIP started working again.

There seems to be an issue when OVS is managed by a controller other than the master.

Comment 3 Tomas Jamrisko 2018-04-13 13:50:05 UTC
Created attachment 1421399 [details]
logs from controllers

Comment 4 Itzik Brown 2018-04-15 07:29:33 UTC
I observed the same problem when restarting the opendaylight_api container on a non-HA bare-metal setup.

Comment 8 Josh Hershberg 2018-05-16 08:53:47 UTC
Created attachment 1437199 [details]
karaf logs from after problem occurs, marked up with nodes beginning with #JOSH

Comment 9 Josh Hershberg 2018-05-16 09:12:52 UTC
Well, it seems there are at least a few issues here.

The chain of events as observed in the odl logs is as follows:

At some point in the log we find a call to SnatCentralizedSwitchChangeListener.update:
2018-05-16T05:49:20,246 | DEBUG | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | SnatCentralizedSwitchChangeListener | 356 - org.opendaylight.netvirt.natservice-impl - 0.6.0.redhat-9 | Updating old RouterToNaptSwitch{getPrimarySwitchId=0, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}} new RouterToNaptSwitch{getPrimarySwitchId=13665695449163, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}}

this function ends with a call to:  

 snatServiceManger.notify(router, primarySwitchId, null, SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);

which results (further up the call stack) in a call to AbstractSnatService#installDefaultFibRouteForSNAT, which removes this flow (this is what breaks SNAT):

 table=21, n_packets=2604, n_bytes=255192, priority=10,ip,metadata=0x33c36/0xfffffe actions=goto_table:26

So, ISSUE 1:
 it would seem that the call to notify(...SNAT_ALL_SWITCH_DISBL) should be followed by an identical call but with SNAT_ALL_SWITCH_ENABLE.
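The fix suggested for ISSUE 1 can be sketched as follows. This is a minimal stand-alone model of the listener behavior, not the actual NetVirt code: the stub types, the `update()` signature, and the `SNAT_ALL_SWITCH_ENBL` constant (assumed counterpart of the `SNAT_ALL_SWITCH_DISBL` seen in the logs) are all illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal model of the proposed fix: when the NAPT switch for a
// router changes, the DISBL notification for the old switch must be followed
// by an ENBL notification for the new one, so the table=21 default FIB flow
// is reinstalled rather than just removed.
public class SnatSwitchUpdateSketch {
    enum Action { SNAT_ALL_SWITCH_DISBL, SNAT_ALL_SWITCH_ENBL }

    // Records notifications instead of programming flows (stand-in for
    // SnatServiceManager.notify in the real code).
    static final List<String> calls = new ArrayList<>();

    static void notifySnat(String router, long primarySwitchId, Action action) {
        calls.add(action + ":" + primarySwitchId);
    }

    // Models SnatCentralizedSwitchChangeListener.update: the reported bug is
    // that only the first (DISBL) notification was sent.
    static void update(String router, long oldSwitchId, long newSwitchId) {
        notifySnat(router, oldSwitchId, Action.SNAT_ALL_SWITCH_DISBL);
        // The fix: re-enable SNAT on the new primary switch.
        notifySnat(router, newSwitchId, Action.SNAT_ALL_SWITCH_ENBL);
    }

    public static void main(String[] args) {
        // Switch IDs taken from the RouterToNaptSwitch update in the log.
        update("5316dc24-48c3-456f-a0cd-e626b8adda96", 0L, 13665695449163L);
        System.out.println(calls);
    }
}
```

Running `main` shows the disable/enable pair emitted in order for the old and new primary switch IDs.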

However, it seems this is being caused by a deeper, more insidious problem. Some time before all of the above, we find the following line printed from NeutronvpnNatManager#handleExternalNetworkForRouter:

2018-05-16T05:49:16,096 | TRACE | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronvpnNatManager             | 358 - org.opendaylight.netvirt.neutronvpn-impl - 0.6.0.redhat-9 | handleExternalNetwork for router Uuid [_value=5316dc24-48c3-456f-a0cd-e626b8adda96]

This method is called from two places and can only be reached if a Neutron router is added or updated. After checking the Neutron logs, it is clear that networking-odl had not made any REST calls to modify a router for many hours prior to this. Since this happens right after shutting down the leader, I suspect that something is broken with HA/clustering and we are getting spurious notifications (I saw some others in the log as well that did not seem to break anything).

Comment 12 Mike Kolesnik 2018-05-21 12:37:40 UTC
Stephen, can you please look at comment #9 and respond with your inputs?

Comment 13 Stephen Kitt 2018-05-22 09:40:24 UTC
I’ve created https://jira.opendaylight.org/browse/NETVIRT-1279 for issue 1, which is straightforward to fix (although a robust fix is more involved).

Comment 24 errata-xmlrpc 2018-06-27 13:50:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.