Bug 1567128 - [HA] Taking down cluster leader triggers router update event from mdsal which leads to FIPs becoming unreachable [NEEDINFO]
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 13.0 (Queens)
Assigned To: Josh Hershberg
QA Contact: Tomas Jamrisko
Whiteboard: odl_netvirt, odl_ha
Keywords: Triaged
Depends On:
Blocks:
Reported: 2018-04-13 09:25 EDT by Tomas Jamrisko
Modified: 2018-10-18 03:18 EDT
CC List: 11 users

See Also:
Fixed In Version: opendaylight-8.0.0-11.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment: N/A
Last Closed: 2018-06-27 09:50:58 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Flags: knylande: needinfo? (jhershbe)


Attachments
logs from controllers (11.24 MB, application/x-tar), 2018-04-13 09:50 EDT, Tomas Jamrisko
karaf logs from after the problem occurs, marked up with notes beginning with #JOSH (5.81 MB, text/plain), 2018-05-16 04:53 EDT, Josh Hershberg


External Trackers
Tracker ID / Last Updated:
OpenDaylight Bug NETVIRT-1279, 2018-05-22 05:40 EDT
OpenDaylight gerrit 72307, 2018-05-28 04:02 EDT
Red Hat Product Errata RHEA-2018:2086, 2018-06-27 09:51 EDT

Description Tomas Jamrisko 2018-04-13 09:25:03 EDT
Description of problem:

VMs become unreachable after leader of the cluster changes.

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-5

How reproducible:
100%

Steps to Reproduce:
Reboot the controllers in an HA deployment and make sure the leader changes.

Actual results:
Floating IPs become unreachable
Comment 2 Aswin Suryanarayanan 2018-04-13 09:38:13 EDT
From the logs it looks like controller-2 was initially the master and OVS connected to controller-2. That controller received the punted packets (the return traffic from the VM) used to resolve the undercloud IP, and the traffic was successful.

After the restart, controller-1 became the master, but OVS still appears to be managed by controller-2. As a result, the punted packets (ping responses from the VM with the FIP) never reached controller-1.

When controller-0 and controller-1 were shut down, the FIP started working.

There seems to be an issue when OVS is managed by a controller other than the master.
Comment 3 Tomas Jamrisko 2018-04-13 09:50 EDT
Created attachment 1421399 [details]
logs from controllers
Comment 4 Itzik Brown 2018-04-15 03:29:33 EDT
I observed the same problem when restarting the opendaylight_api container on a non-HA bare-metal setup.
Comment 8 Josh Hershberg 2018-05-16 04:53 EDT
Created attachment 1437199 [details]
karaf logs from after the problem occurs, marked up with notes beginning with #JOSH
Comment 9 Josh Hershberg 2018-05-16 05:12:52 EDT
Well, it seems there are at least a few issues here.

The chain of events as observed in the ODL logs is as follows:

At some point in the log we find a call to SnatCentralizedSwitchChangeListener.update:
 
2018-05-16T05:49:20,246 | DEBUG | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | SnatCentralizedSwitchChangeListener | 356 - org.opendaylight.netvirt.natservice-impl - 0.6.0.redhat-9 | Updating old RouterToNaptSwitch{getPrimarySwitchId=0, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}} new RouterToNaptSwitch{getPrimarySwitchId=13665695449163, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}}

This function ends with a call to:

 snatServiceManger.notify(router, primarySwitchId, null, SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);

which results (further up the stack) in a call to AbstractSnatService#installDefaultFibRouteForSNAT, which removes this flow (this is what breaks SNAT):

 table=21, n_packets=2604, n_bytes=255192, priority=10,ip,metadata=0x33c36/0xfffffe actions=goto_table:26
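
For context: table 21 is the L3 FIB table, the metadata match selects the router's VPN, and goto_table:26 hands matched IP traffic on to the SNAT pipeline. The sketch below is purely illustrative of the add/remove logic this method manages; FlowSpec and flowProgrammer are hypothetical stand-ins, not the actual netvirt/genius APIs:

 // Hypothetical sketch, not the real implementation.
 void installDefaultFibRouteForSNAT(BigInteger dpnId, long vpnId, boolean add) {
     // table=21 (L3 FIB), priority=10: match IPv4 traffic whose metadata
     // carries this router's VPN id, then hand it to table 26 (the SNAT
     // pipeline entry point).
     FlowSpec flow = FlowSpec.builder()
             .table((short) 21)
             .priority(10)
             .matchEthTypeIpv4()
             .matchMetadata(vpnId << 1, 0xfffffeL) // cf. metadata=0x33c36/0xfffffe above
             .gotoTable((short) 26)
             .build();
     if (add) {
         flowProgrammer.addFlow(dpnId, flow);
     } else {
         flowProgrammer.removeFlow(dpnId, flow); // this removal is what breaks SNAT
     }
 }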

So, ISSUE 1:
It would seem that the call to notify(...SNAT_ALL_SWITCH_DISBL) should be followed by an identical call, but with SNAT_ALL_SWITCH_ENABLE; a sketch of that fix follows below.
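
A minimal sketch of what that might look like in SnatCentralizedSwitchChangeListener.update, assuming the names visible in the log above (getRouter() is a hypothetical helper, and the real patch in gerrit 72307 may differ):

 @Override
 protected void update(InstanceIdentifier<RouterToNaptSwitch> key,
                       RouterToNaptSwitch origSwitch, RouterToNaptSwitch updatedSwitch) {
     Routers router = getRouter(updatedSwitch.getRouterName()); // hypothetical helper
     // Tear down the SNAT flows tied to the old primary switch...
     snatServiceManger.notify(router, origSwitch.getPrimarySwitchId(), null,
             SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);
     // ...then immediately rebuild them on the new primary switch, restoring
     // the default table=21 -> table=26 FIB route that was removed.
     snatServiceManger.notify(router, updatedSwitch.getPrimarySwitchId(), null,
             SnatServiceManager.Action.SNAT_ALL_SWITCH_ENABLE);
 }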

However, it seems this is being caused by a deeper, more nefarious problem. Some time before all of the above, we find the following line printed from NeutronvpnNatManager#handleExternalNetworkForRouter:

2018-05-16T05:49:16,096 | TRACE | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronvpnNatManager             | 358 - org.opendaylight.netvirt.neutronvpn-impl - 0.6.0.redhat-9 | handleExternalNetwork for router Uuid [_value=5316dc24-48c3-456f-a0cd-e626b8adda96]

ISSUE 2:
This method is called from two places and can only be reached when a neutron router is added or updated. After checking the neutron logs, it is clear that networking-odl had not made any REST calls to modify a router for many hours prior to this. Since this happens right after shutting down the leader, I suspect that something is broken with HA/clustering and we are getting some spurious notifications (I saw some others in the log as well that did not seem to break anything).
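
Until the clustering problem is understood, one common defensive pattern (again, just a sketch, not the actual fix) is for the listener to drop update notifications whose before and after data are identical, so that a spurious replay after a leader change becomes a no-op:

 @Override
 public void onDataTreeChanged(Collection<DataTreeModification<Router>> changes) {
     for (DataTreeModification<Router> change : changes) {
         Router before = change.getRootNode().getDataBefore();
         Router after = change.getRootNode().getDataAfter();
         if (Objects.equals(before, after)) {
             // Spurious replay (e.g. after a cluster leader change): skip it
             // instead of re-running handleExternalNetworkForRouter.
             continue;
         }
         handleRouterUpdate(before, after); // hypothetical dispatch to the real handler
     }
 }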
Comment 12 Mike Kolesnik 2018-05-21 08:37:40 EDT
Stephen, can you please look at comment #9 and respond with your input?
Comment 13 Stephen Kitt 2018-05-22 05:40:24 EDT
I’ve created https://jira.opendaylight.org/browse/NETVIRT-1279 for issue 1, which is straightforward to fix (although a robust fix is more involved).
Comment 24 errata-xmlrpc 2018-06-27 09:50:58 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086
