Bug 1567128 - [HA] Taking down cluster leader triggers router update event from mdsal which leads to FIPs becoming unreachable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 13.0 (Queens)
Assignee: Josh Hershberg
QA Contact: Tomas Jamrisko
URL:
Whiteboard: odl_netvirt, odl_ha
Depends On:
Blocks:
 
Reported: 2018-04-13 13:25 UTC by Tomas Jamrisko
Modified: 2018-10-18 07:18 UTC
CC List: 11 users

Fixed In Version: opendaylight-8.0.0-11.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2018-06-27 13:50:58 UTC
Target Upstream Version:
Flags: knylande: needinfo? (jhershbe)


Attachments
logs from controllers (11.24 MB, application/x-tar)
2018-04-13 13:50 UTC, Tomas Jamrisko
karaf logs from after the problem occurs, marked up with notes beginning with #JOSH (5.81 MB, text/plain)
2018-05-16 08:53 UTC, Josh Hershberg


Links
System ID Priority Status Summary Last Updated
OpenDaylight Bug NETVIRT-1279 None None None 2018-05-22 09:40:24 UTC
OpenDaylight gerrit 72307 None None None 2018-05-28 08:02:10 UTC
Red Hat Product Errata RHEA-2018:2086 None None None 2018-06-27 13:51:40 UTC

Description Tomas Jamrisko 2018-04-13 13:25:03 UTC
Description of problem:

VMs become unreachable after the leader of the cluster changes.

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-5

How reproducible:
100%

Steps to Reproduce:
Reboot the controllers in an HA deployment and make sure the leader changes.

Actual results:
Floating IPs become unreachable

Comment 2 Aswin Suryanarayanan 2018-04-13 13:38:13 UTC
From the logs it looks like controller-2 was initially the master and the OVS was connected to controller-2. That controller was therefore receiving the punted packets (the return traffic from the VM) used to resolve the undercloud IP, and the traffic was successful.

After the restart, controller-1 became the master, but the OVS still seems to be managed by controller-2. Now the punted packet (the ping response from the VM with the FIP) never reaches controller-1.

When controller-0 and controller-1 were shut down, the FIP started working again.

There seems to be an issue when the OVS is managed by a controller other than the master.
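
Purely for illustration, a hypothetical sketch of the mismatch described above; the types here (OwnershipView, PuntOwnershipCheckSketch, onPacketIn) are simplified placeholders, not real ODL APIs. The point it shows: punted packets arrive on whichever controller manages the OVS, so if that node is not the master for the device, the FIP return traffic is never handled.

// Hypothetical diagnostic sketch with placeholder types, not real OpenDaylight APIs.
public class PuntOwnershipCheckSketch {

    /** Assumed abstraction over clustering ownership for a given datapath. */
    interface OwnershipView {
        boolean isLocalNodeOwner(String datapathId);
    }

    private final OwnershipView ownershipView;

    PuntOwnershipCheckSketch(OwnershipView ownershipView) {
        this.ownershipView = ownershipView;
    }

    void onPacketIn(String datapathId, byte[] payload) {
        if (!ownershipView.isLocalNodeOwner(datapathId)) {
            // This is the situation described above: the switch punts to the node
            // that manages it (controller-2) while another node (controller-1)
            // is the master, so the FIP return traffic is never handled.
            System.err.println("Punted packet from " + datapathId
                    + " received on a node that is not the owner/master");
            return;
        }
        // ... normal handling of the punted return traffic would go here ...
    }
}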

Comment 3 Tomas Jamrisko 2018-04-13 13:50:05 UTC
Created attachment 1421399 [details]
logs from controllers

Comment 4 Itzik Brown 2018-04-15 07:29:33 UTC
I observed the same problem when restarting the opendaylight_api container on a non-HA bare-metal setup.

Comment 8 Josh Hershberg 2018-05-16 08:53:47 UTC
Created attachment 1437199 [details]
karaf logs from after the problem occurs, marked up with notes beginning with #JOSH

Comment 9 Josh Hershberg 2018-05-16 09:12:52 UTC
Well, it seems there are at least a few issues here.

The chain of events, as observed in the ODL logs, is as follows:

At some point in the log we find a call to SnatCentralizedSwitchChangeListener.update:
 
2018-05-16T05:49:20,246 | DEBUG | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | SnatCentralizedSwitchChangeListener | 356 - org.opendaylight.netvirt.natservice-impl - 0.6.0.redhat-9 | Updating old RouterToNaptSwitch{getPrimarySwitchId=0, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}} new RouterToNaptSwitch{getPrimarySwitchId=13665695449163, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}}

This function ends with a call to:

 snatServiceManger.notify(router, primarySwitchId, null, SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);

which results (further up the stack) in a call to AbstractSnatService#installDefaultFibRouteForSNAT, which removes this flow (this is what breaks SNAT):

 table=21, n_packets=2604, n_bytes=255192, priority=10,ip,metadata=0x33c36/0xfffffe actions=goto_table:26

So, ISSUE 1:
It would seem that the call to notify(..., SNAT_ALL_SWITCH_DISBL) should be followed by an identical call, but with SNAT_ALL_SWITCH_ENABLE.
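
Purely as an illustration of that suggestion (not the actual patch), here is a minimal self-contained sketch. The stand-in types and method names (SnatServiceManager interface, onNaptSwitchUpdated, routerName/switch-id parameters) are simplified placeholders rather than the real netvirt classes; only the two notify() actions come from the calls quoted above.

// Rough sketch of the shape of the suggested fix, using simplified stand-in types.
import java.math.BigInteger;

public class SnatSwitchoverSketch {

    /** Stand-in for SnatServiceManager and its Action enum. */
    interface SnatServiceManager {
        enum Action { SNAT_ALL_SWITCH_DISBL, SNAT_ALL_SWITCH_ENABLE }

        void notify(String routerName, BigInteger primarySwitchId, Action action);
    }

    private final SnatServiceManager snatServiceManager;

    SnatSwitchoverSketch(SnatServiceManager snatServiceManager) {
        this.snatServiceManager = snatServiceManager;
    }

    /** Invoked when the RouterToNaptSwitch mapping is updated, as in the log above. */
    void onNaptSwitchUpdated(String routerName,
                             BigInteger oldPrimarySwitchId,
                             BigInteger newPrimarySwitchId) {
        // What the code does today: tear down the SNAT flows, which deletes the
        // table=21 default FIB route and leaves SNAT broken.
        snatServiceManager.notify(routerName, oldPrimarySwitchId,
                SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);

        // The suggested follow-up: reinstall the flows against the new primary
        // (NAPT) switch so the default FIB route comes back immediately.
        snatServiceManager.notify(routerName, newPrimarySwitchId,
                SnatServiceManager.Action.SNAT_ALL_SWITCH_ENABLE);
    }
}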

However, it seems this is being caused by a deeper and more nefarious problem. Some time before all of the above, we find the following line printed from NeutronvpnNatManager#handleExternalNetworkForRouter:

2018-05-16T05:49:16,096 | TRACE | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronvpnNatManager             | 358 - org.opendaylight.netvirt.neutronvpn-impl - 0.6.0.redhat-9 | handleExternalNetwork for router Uuid [_value=5316dc24-48c3-456f-a0cd-e626b8adda96]

ISSUE 2:
This method is called from two places, and in both cases the call can only happen when a Neutron router is added or updated. After checking the Neutron logs, it is clear that networking-odl has not made any REST calls to modify a router for many hours prior to this. Since this happens right after shutting down the leader, I suspect that something is broken in HA/clustering and we are getting spurious notifications (I saw some others in the log as well that did not seem to break anything).
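
Purely as an illustration (not a proposed patch), a minimal sketch of the kind of guard that would keep a replayed router "update" from reaching handleExternalNetworkForRouter: only act when the external gateway actually changed. The Router and NatManager types below are simplified placeholders, not the yang-generated Neutron bindings.

// Minimal sketch with placeholder types; only reacts to a real external-gateway change.
import java.util.Objects;

public class RouterUpdateGuardSketch {

    /** Placeholder for the router's external gateway information. */
    static final class Router {
        final String uuid;
        final String externalNetworkId;

        Router(String uuid, String externalNetworkId) {
            this.uuid = uuid;
            this.externalNetworkId = externalNetworkId;
        }
    }

    /** Placeholder for the NAT handling entry point seen in the log above. */
    interface NatManager {
        void handleExternalNetworkForRouter(Router router);
    }

    private final NatManager natManager;

    RouterUpdateGuardSketch(NatManager natManager) {
        this.natManager = natManager;
    }

    void onRouterUpdated(Router before, Router after) {
        // A spurious notification replays unchanged data; skip it so the SNAT
        // flows are not torn down and re-driven for no reason.
        if (Objects.equals(before.externalNetworkId, after.externalNetworkId)) {
            return;
        }
        natManager.handleExternalNetworkForRouter(after);
    }
}

Of course, such a guard would only mask the symptom; the spurious notification right after the leader shutdown still points at a clustering problem that needs its own fix.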

Comment 12 Mike Kolesnik 2018-05-21 12:37:40 UTC
Stephen, can you please look at comment #9 and respond with your input?

Comment 13 Stephen Kitt 2018-05-22 09:40:24 UTC
I’ve created https://jira.opendaylight.org/browse/NETVIRT-1279 for issue 1, which is straightforward to fix (although a robust fix is more involved).

Comment 24 errata-xmlrpc 2018-06-27 13:50:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

