Description of problem:
VMs become unreachable after the leader of the cluster changes.

Version-Release number of selected component (if applicable):
opendaylight-8.0.0-5

How reproducible:
100%

Steps to Reproduce:
Reboot controllers in an HA deployment and make sure the leader changes.

Actual results:
Floating IPs become unreachable.
From the logs it looks like controller-2 was initially the master and OVS connected to controller-2. That controller was therefore receiving the punted packets (the return traffic from the VM) used to resolve the undercloud IP, and the traffic was successful. After the restart, controller-1 became the master, but OVS still appears to be managed by controller-2, so the punted packet (the ping response from the VM with the FIP) never reaches controller-1. When controller-0 and controller-1 were shut down, the FIP started working again. There seems to be an issue when OVS is managed by a controller other than the master.
Created attachment 1421399 [details] logs from controllers
I observed the same problem when restarting the opendaylight_api container on a non-HA bare-metal setup.
Created attachment 1437199 [details] karaf logs from after the problem occurs, marked up with notes beginning with #JOSH
Well, it seems there are at least a few issues here. The chain of events as observed in the ODL logs is as follows.

At some point in the log we find a call to SnatCentralizedSwitchChangeListener.update:

2018-05-16T05:49:20,246 | DEBUG | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | SnatCentralizedSwitchChangeListener | 356 - org.opendaylight.netvirt.natservice-impl - 0.6.0.redhat-9 | Updating old RouterToNaptSwitch{getPrimarySwitchId=0, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}} new RouterToNaptSwitch{getPrimarySwitchId=13665695449163, getRouterName=5316dc24-48c3-456f-a0cd-e626b8adda96, augmentations={}}

This function ends with a call to:

snatServiceManger.notify(router, primarySwitchId, null, SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);

which results (further up the stack) in a call to AbstractSnatService#installDefaultFibRouteForSNAT that removes this flow (this is what breaks SNAT):

table=21, n_packets=2604, n_bytes=255192, priority=10,ip,metadata=0x33c36/0xfffffe actions=goto_table:26

So, ISSUE 1: it would seem that the call to notify(...SNAT_ALL_SWITCH_DISBL) should be followed by an identical call but with SNAT_ALL_SWITCH_ENABLE.

However, it seems this is being caused by a deeper, more nefarious problem. Some time before all of the above we find the following line printed from NeutronvpnNatManager#handleExternalNetworkForRouter:

2018-05-16T05:49:16,096 | TRACE | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronvpnNatManager | 358 - org.opendaylight.netvirt.neutronvpn-impl - 0.6.0.redhat-9 | handleExternalNetwork for router Uuid [_value=5316dc24-48c3-456f-a0cd-e626b8adda96]

ISSUE 2: This method is called from two places and can only be reached if a neutron router is added or updated. After checking the neutron logs, it is clear that networking-odl has not made any REST calls to modify a router for many hours prior to this. Since this happens right after shutting down the leader, I suspect that something is broken with HA/clustering and we are getting spurious notifications (I saw some others in the log as well that did not seem to break anything).
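As a side note on ISSUE 2: until the clustering behaviour is understood, one generic defensive pattern would be to short-circuit the router listener when a notification carries no actual change, so a spurious update cannot re-trigger handleExternalNetworkForRouter. The sketch below is only illustrative and follows the AsyncDataTreeChangeListenerBase update() pattern visible in the log; the exact signature, the nvpnNatManager field, and the handleExternalNetworkForRouter(original, update) call shape are assumptions, not the actual netvirt code.

// Illustrative guard only -- not a proposed fix for the clustering problem itself.
@Override
protected void update(InstanceIdentifier<Router> identifier, Router original, Router update) {
    // A spurious notification after a leader change may deliver identical
    // before/after router objects; treat that as a no-op instead of
    // re-running the external network handling.
    if (Objects.equals(original, update)) {
        LOG.debug("Ignoring no-op update for router {}", update.getUuid());
        return;
    }
    nvpnNatManager.handleExternalNetworkForRouter(original, update); // assumed call shape
}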
Stephen, can you please look at comment #9 and respond with your inputs?
I’ve created https://jira.opendaylight.org/browse/NETVIRT-1279 for issue 1, which is straightforward to fix (although a robust fix is more involved).
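For reference, here is a minimal sketch of what the straightforward part of the fix could look like in SnatCentralizedSwitchChangeListener.update(), based only on the calls quoted in the analysis above. The Routers lookup helper, the exact method signature, and the enable-side enum constant name (written here as SNAT_ALL_SWITCH_ENBL) are assumptions and may differ from the actual NETVIRT-1279 patch:

// Hypothetical sketch only -- not the actual NETVIRT-1279 change.
@Override
protected void update(InstanceIdentifier<RouterToNaptSwitch> key,
                      RouterToNaptSwitch origRouterToNapt, RouterToNaptSwitch updatedRouterToNapt) {
    Routers router = getRouterFromConfigDs(updatedRouterToNapt.getRouterName()); // assumed helper
    BigInteger oldPrimarySwitchId = origRouterToNapt.getPrimarySwitchId();
    BigInteger newPrimarySwitchId = updatedRouterToNapt.getPrimarySwitchId();

    // Tear down the SNAT flows programmed against the old centralized switch ...
    snatServiceManger.notify(router, oldPrimarySwitchId, null,
            SnatServiceManager.Action.SNAT_ALL_SWITCH_DISBL);
    // ... and immediately reprogram them against the new primary switch, so the
    // table=21 default FIB route (priority=10, goto_table:26) is reinstalled
    // rather than left missing.
    snatServiceManger.notify(router, newPrimarySwitchId, null,
            SnatServiceManager.Action.SNAT_ALL_SWITCH_ENBL);
}

A robust fix would presumably also need to cover the add/remove paths and the case where getPrimarySwitchId is 0 (as in the "old" object in the log above), which may be part of why a more robust fix is more involved.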
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2086
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days