Bug 1567128 - [HA] Taking down cluster leader triggers router update event from mdsal which leads to FIPs becoming unreachable
Summary: [HA] Taking down cluster leader triggers router update event from mdsal which...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Severity: urgent
Priority: urgent
Target Milestone: rc
Target Release: 13.0 (Queens)
Assignee: Josh Hershberg
QA Contact: Tomas Jamrisko
URL:
Whiteboard: odl_netvirt, odl_ha
Depends On:
Blocks:
 
Reported: 2018-04-13 13:25 UTC by Tomas Jamrisko
Modified: 2023-09-14 04:26 UTC (History)
11 users

Fixed In Version: opendaylight-8.0.0-11.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2018-06-27 13:50:58 UTC
Target Upstream Version:
Embargoed:


Attachments
logs from controllers (11.24 MB, application/x-tar)
2018-04-13 13:50 UTC, Tomas Jamrisko
karaf logs from after the problem occurs, marked up with notes beginning with #JOSH (5.81 MB, text/plain)
2018-05-16 08:53 UTC, Josh Hershberg


Links
System ID Private Priority Status Summary Last Updated
OpenDaylight Bug NETVIRT-1279 0 None None None 2018-05-22 09:40:24 UTC
OpenDaylight gerrit 72307 0 None None None 2018-05-28 08:02:10 UTC
Red Hat Product Errata RHEA-2018:2086 0 None None None 2018-06-27 13:51:40 UTC

Description Tomas Jamrisko 2018-04-13 13:25:03 UTC
Description of problem:

VMs become unreachable after leader of the cluster changes.

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-5

How reproducible:
100 %

Steps to Reproduce:
Reboot controllers in HA deployment, make sure the leader changes

Actual results:
Floating IPs become unreachable

Comment 2 Aswin Suryanarayanan 2018-04-13 13:38:13 UTC
From the logs it looks like controller-2 was initially the master and OVS connected to controller-2. As a result, that controller was receiving the punted packets (the return traffic from the VM) used to resolve the undercloud IP, and traffic was successful.

After the restart, controller-1 became the master, but OVS still appears to be managed by controller-2, so the punted packet (the ping response from the VM with the FIP) never reaches controller-1.

When controller-0 and controller-1 were shut down, the FIP started working.

There seems to be an issue when OVS is managed by a controller other than the master.

Comment 3 Tomas Jamrisko 2018-04-13 13:50:05 UTC
Created attachment 1421399 [details]
logs from controllers

Comment 4 Itzik Brown 2018-04-15 07:29:33 UTC
I observed the same problem when restarting the opendaylight_api container on a non-HA bare-metal setup.

Comment 8 Josh Hershberg 2018-05-16 08:53:47 UTC
Created attachment 1437199 [details]
karaf logs from after the problem occurs, marked up with notes beginning with #JOSH

Comment 9 Josh Hershberg 2018-05-16 09:12:52 UTC
Well, it seems there are at least a few issues here.

The chain of events as observed in the odl logs is as follows:

At some point in the log we find a call to SnatCentralizedSwitchChangeListener.update:
 
2018-05-16T05:49:20,246 | DEBUG | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | SnatCentralizedSwitchChangeListener | 356 - org.opendaylight.netvirt.natservice-impl - 0.6.0.redhat-9 | Updating old RouterToNaptSwitch{getPrimarySwitchId=0, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}} new RouterToNaptSwitch{getPrimarySwitchId=13665695449163, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}}

This function ends with a call to:

 snatServiceManger.notify(router, primarySwitchId, null, SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);

which results (further up-stack) in a call to AbstractSnatService#installDefaultFibRouteForSNAT, which removes this flow (this is what breaks SNAT):

 table=21, n_packets=2604, n_bytes=255192, priority=10,ip,metadata=0x33c36/0xfffffe actions=goto_table:26

So, ISSUE 1:
 it would seem that the call to notify(...SNAT_ALL_SWITCH_DISBL) should be followed by an identical call but with SNAT_ALL_SWITCH_ENABLE.
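The suggested fix for ISSUE 1 can be sketched as follows. This is a minimal, self-contained illustration of the pattern (disable on the old NAPT switch, then re-enable on the new one), not the actual netvirt code: the stub types, the Action names, and the notify signature here are assumptions standing in for the real SnatServiceManager API.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Stub types standing in for the real netvirt classes; names are
// illustrative, not the actual ODL signatures.
public class SnatSwitchUpdateSketch {
    enum Action { SNAT_ALL_SWITCH_DISBL, SNAT_ALL_SWITCH_ENABLE }

    interface SnatServiceManager {
        void notify(String router, BigInteger switchId, Action action);
    }

    static final List<String> events = new ArrayList<>();

    // Fake manager that records the actions it receives, in order.
    static final SnatServiceManager mgr =
        (router, sw, action) -> events.add(action + "@" + sw);

    // Sketch of the listener's update path: instead of only disabling SNAT
    // (the observed bug), follow the disable on the old primary switch with
    // an identical call enabling it on the new primary switch.
    static void update(String router, BigInteger oldSwitch, BigInteger newSwitch) {
        mgr.notify(router, oldSwitch, Action.SNAT_ALL_SWITCH_DISBL);
        mgr.notify(router, newSwitch, Action.SNAT_ALL_SWITCH_ENABLE);
    }

    public static void main(String[] args) {
        // Mirrors the log above: primary switch moves from 0 to 13665695449163.
        update("5316dc24-48c3-456f-a0cd-e626b8adda96",
               BigInteger.ZERO, new BigInteger("13665695449163"));
        System.out.println(events);
    }
}
```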

However, it seems this is being caused by a deeper, more nefarious problem. Some time before all of the above, we find the following line printed from NeutronvpnNatManager#handleExternalNetworkForRouter:

2018-05-16T05:49:16,096 | TRACE | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronvpnNatManager             | 358 - org.opendaylight.netvirt.neutronvpn-impl - 0.6.0.redhat-9 | handleExternalNetwork for router Uuid [_value=5316dc24-48c3-456f-a0cd-e626b8adda96]

ISSUE 2:
This method is called from two places and should only be reached when a neutron router is added or updated. After checking the neutron logs, it is clear that networking-odl had not made any REST calls modifying a router for many hours prior to this. Since this happens right after shutting down the leader, I suspect something is broken with HA/clustering and we are getting spurious notifications (I saw some others in the log as well that did not seem to break anything).
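One common defensive pattern against spurious change notifications like these is to compare the before and after values in the listener and ignore no-op updates, since a replay after a leader switch can deliver an "update" whose data has not actually changed. The sketch below is a self-contained illustration of that guard only; the listener shape and names are assumptions, not the actual MD-SAL DataTreeChangeListener API.

```java
import java.util.Objects;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative guard against spurious update notifications; names and the
// listener shape are assumptions, not the real ODL listener interface.
public class SpuriousUpdateGuard {
    static final AtomicInteger handled = new AtomicInteger();

    // Only act when the data actually changed; a replayed notification
    // after a cluster leader change may carry before == after.
    static void onUpdate(Object before, Object after) {
        if (Objects.equals(before, after)) {
            return; // no-op update, ignore it
        }
        handled.incrementAndGet();
    }

    public static void main(String[] args) {
        onUpdate("routerA", "routerA"); // spurious replay: ignored
        onUpdate("routerA", "routerB"); // real change: handled
        System.out.println(handled.get());
    }
}
```

This only papers over replays whose payload is unchanged; it would not have prevented the RouterToNaptSwitch update above, where the primary switch ID genuinely differed, so the clustering-side cause still needs fixing.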

Comment 12 Mike Kolesnik 2018-05-21 12:37:40 UTC
Stephen, can you please look at comment #9 and respond with your inputs?

Comment 13 Stephen Kitt 2018-05-22 09:40:24 UTC
I’ve created https://jira.opendaylight.org/browse/NETVIRT-1279 for issue 1, which is straightforward to fix (although a robust fix is more involved).

Comment 24 errata-xmlrpc 2018-06-27 13:50:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086

Comment 27 Red Hat Bugzilla 2023-09-14 04:26:49 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

