Bug 1567128

Summary: [HA] Taking down cluster leader triggers router update event from mdsal which leads to FIPs becoming unreachable
Product: Red Hat OpenStack Reporter: Tomas Jamrisko <tjamrisk>
Component: opendaylightAssignee: Josh Hershberg <jhershbe>
Status: CLOSED ERRATA QA Contact: Tomas Jamrisko <tjamrisk>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 13.0 (Queens)    CC: aadam, asuryana, jhershbe, jluhrsen, knylande, mkolesni, nyechiel, oblaut, sgaddam, skitt, tjamrisk
Target Milestone: rc    Keywords: Triaged
Target Release: 13.0 (Queens)    Flags: knylande: needinfo? (jhershbe)
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: odl_netvirt, odl_ha
Fixed In Version: opendaylight-8.0.0-11.el7ost    Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-06-27 13:50:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Tomas Jamrisko 2018-04-13 13:25:03 UTC
Description of problem:

VMs become unreachable after leader of the cluster changes.

Version-Release number of selected component (if applicable):

How reproducible:
100 %

Steps to Reproduce:
Reboot controllers in HA deployment, make sure the leader changes

Actual results:
Floating IPs become unreachable

Comment 2 Aswin Suryanarayanan 2018-04-13 13:38:13 UTC
From the logs, it looks like controller-2 was initially the master and OVS connected to it. That controller was therefore receiving the punted packets (the return traffic from the VM) used to resolve the undercloud IP, and the traffic was successful.

After the restart, controller-1 became the master, but OVS still appears to be managed by controller-2. As a result, the punted packet (the ping response from the VM with the FIP) never reaches controller-1.

When controller-0 and controller-1 were shut down, the FIP started working again.

There seems to be an issue when OVS is managed by a controller other than the master.

Comment 3 Tomas Jamrisko 2018-04-13 13:50:05 UTC
Created attachment 1421399 [details]
logs from controllers

Comment 4 Itzik Brown 2018-04-15 07:29:33 UTC
I observed the same problem when restarting the opendaylight_api container on a non-HA bare-metal setup.

Comment 8 Josh Hershberg 2018-05-16 08:53:47 UTC
Created attachment 1437199 [details]
karaf logs from after problem occurs, marked up with nodes beginning with #JOSH

Comment 9 Josh Hershberg 2018-05-16 09:12:52 UTC
Well, it seems there are at least a few issues here.

The chain of events as observed in the odl logs is as follows:

At some point in the log we find a call to SnatCentralizedSwitchChangeListener.update:
2018-05-16T05:49:20,246 | DEBUG | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | SnatCentralizedSwitchChangeListener | 356 - org.opendaylight.netvirt.natservice-impl - 0.6.0.redhat-9 | Updating old RouterToNaptSwitch{getPrimarySwitchId=0, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}} new RouterToNaptSwitch{getPrimarySwitchId=13665695449163, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}}

this function ends with a call to:  

 snatServiceManger.notify(router, primarySwitchId, null, SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);

which results (further up the call stack) in a call to AbstractSnatService#installDefaultFibRouteForSNAT, which removes this flow (this is what breaks SNAT):

 table=21, n_packets=2604, n_bytes=255192, priority=10,ip,metadata=0x33c36/0xfffffe actions=goto_table:26

So, ISSUE 1:
 it would seem that the call to notify(...SNAT_ALL_SWITCH_DISBL) should be followed by an identical call but with SNAT_ALL_SWITCH_ENABLE.
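The fix suggested for ISSUE 1 can be sketched as follows. This is a minimal stand-alone model of the listener behavior, not the actual NetVirt code: the stub types, the `update()` signature, and the `SNAT_ALL_SWITCH_ENBL` constant (assumed counterpart of the `SNAT_ALL_SWITCH_DISBL` seen in the logs) are all illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal model of the proposed fix: when the NAPT switch for a
// router changes, the DISBL notification for the old switch must be followed
// by an ENBL notification for the new one, so the table=21 default FIB flow
// is reinstalled rather than just removed.
public class SnatSwitchUpdateSketch {
    enum Action { SNAT_ALL_SWITCH_DISBL, SNAT_ALL_SWITCH_ENBL }

    // Records notifications instead of programming flows (stand-in for
    // SnatServiceManager.notify in the real code).
    static final List<String> calls = new ArrayList<>();

    static void notifySnat(String router, long primarySwitchId, Action action) {
        calls.add(action + ":" + primarySwitchId);
    }

    // Models SnatCentralizedSwitchChangeListener.update: the reported bug is
    // that only the first (DISBL) notification was sent.
    static void update(String router, long oldSwitchId, long newSwitchId) {
        notifySnat(router, oldSwitchId, Action.SNAT_ALL_SWITCH_DISBL);
        // The fix: re-enable SNAT on the new primary switch.
        notifySnat(router, newSwitchId, Action.SNAT_ALL_SWITCH_ENBL);
    }

    public static void main(String[] args) {
        // Switch IDs taken from the RouterToNaptSwitch update in the log.
        update("5316dc24-48c3-456f-a0cd-e626b8adda96", 0L, 13665695449163L);
        System.out.println(calls);
    }
}
```

Running `main` shows the disable/enable pair emitted in order for the old and new primary switch IDs.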

However, it seems this is being caused by a deeper, more insidious problem. Some time before all of the above, we find the following line printed from NeutronvpnNatManager#handleExternalNetworkForRouter:

2018-05-16T05:49:16,096 | TRACE | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronvpnNatManager             | 358 - org.opendaylight.netvirt.neutronvpn-impl - 0.6.0.redhat-9 | handleExternalNetwork for router Uuid [_value=5316dc24-48c3-456f-a0cd-e626b8adda96]

This method is called from two places and can only be reached if a Neutron router is added or updated. After checking the Neutron logs, it is clear that networking-odl had not made any REST calls to modify a router for many hours prior to this. Since this happens right after shutting down the leader, I suspect that something is broken with HA/clustering and we are getting spurious notifications (I saw some others in the log as well that did not seem to break anything).

Comment 12 Mike Kolesnik 2018-05-21 12:37:40 UTC
Stephen, can you please look at comment #9 and respond with your inputs?

Comment 13 Stephen Kitt 2018-05-22 09:40:24 UTC
I’ve created https://jira.opendaylight.org/browse/NETVIRT-1279 for issue 1, which is straightforward to fix (although a robust fix is more involved).

Comment 24 errata-xmlrpc 2018-06-27 13:50:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.