Bug 1567507

Summary: [Infra] Thread terminated due to an uncaught exception
Product: Red Hat OpenStack Reporter: Sai Sindhur Malleni <smalleni>
Component: opendaylightAssignee: Vishal Thapar <vthapar>
Status: CLOSED ERRATA QA Contact: Tomas Jamrisko <tjamrisk>
Severity: high Docs Contact:
Priority: high    
Version: 13.0 (Queens) CC: aadam, jluhrsen, jschluet, mkolesni, nyechiel, sclewis, shague, smalleni, tjamrisk, vthapar, wznoinsk
Target Milestone: z1 Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: Infra
Fixed In Version: opendaylight-8.3.0-1.el7ost Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment: N/A
Last Closed: 2018-07-19 13:53:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1576414    
Attachments:
Description                         Flags
data store dumps                    none
datastore-latest-odltools-0.1.8     none
karaf logs                          none

Description Sai Sindhur Malleni 2018-04-14 14:08:35 UTC
Description of problem:
When running a Browbeat+Rally use case in my downstream scale CI which does:
Create a network
Create a subnet
Create a router
Attach router to subnet and public network
Boot VM with floating IP
Ping VM

With concurrency 8 and times set to 50, some VMs remain unpingable even after 300 seconds, and we see many instances of the following exception in the karaf logs:

2018-04-13T22:57:05,411 | ERROR | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | AsyncDataTreeChangeListenerBase  | 263 - org.opendaylight.genius.mdsalutil-api - 0.4.0.redhat-5 | Thread terminated due to uncaught exception: org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0
java.lang.IllegalArgumentException: Invalid range: -1, expected: [[0..18446744073709551615]].
        at org.opendaylight.yang.gen.v1.urn.opendaylight.model.match.types.rev131026.match.TunnelBuilder.checkTunnelIdRange(TunnelBuilder.java:76) [391:org.opendaylight.openflowplugin.model.flow-base:0.6.0.redhat-5]
        at org.opendaylight.yang.gen.v1.urn.opendaylight.model.match.types.rev131026.match.TunnelBuilder.setTunnelId(TunnelBuilder.java:81) [391:org.opendaylight.openflowplugin.model.flow-base:0.6.0.redhat-5]
        at org.opendaylight.genius.mdsalutil.matches.MatchTunnelId.populateBuilder(MatchTunnelId.java:40) [263:org.opendaylight.genius.mdsalutil-api:0.4.0.redhat-5]
        at org.opendaylight.genius.mdsalutil.matches.MatchTunnelId.populateBuilder(MatchTunnelId.java:18) [263:org.opendaylight.genius.mdsalutil-api:0.4.0.redhat-5]
        at org.opendaylight.genius.mdsalutil.matches.MatchInfoHelper.createInnerMatchBuilder(MatchInfoHelper.java:29) [263:org.opendaylight.genius.mdsalutil-api:0.4.0.redhat-5]
        at org.opendaylight.genius.mdsalutil.MDSALUtil.buildMatches(MDSALUtil.java:336) [263:org.opendaylight.genius.mdsalutil-api:0.4.0.redhat-5]
        at org.opendaylight.genius.mdsalutil.FlowEntity.getFlowBuilder(FlowEntity.java:87) [263:org.opendaylight.genius.mdsalutil-api:0.4.0.redhat-5]
        at org.opendaylight.genius.mdsalutil.internal.MDSALManager.writeFlowEntityInternal(MDSALManager.java:180) [264:org.opendaylight.genius.mdsalutil-impl:0.4.0.redhat-5]
        at org.opendaylight.genius.mdsalutil.internal.MDSALManager.installFlowInternal(MDSALManager.java:142) [264:org.opendaylight.genius.mdsalutil-impl:0.4.0.redhat-5]
        at org.opendaylight.genius.mdsalutil.internal.MDSALManager.installFlow(MDSALManager.java:609) [264:org.opendaylight.genius.mdsalutil-impl:0.4.0.redhat-5]
        at Proxya2c1fd24_dd19_4d37_b669_2a61174cba99.installFlow(Unknown Source) [?:?]
        at Proxy8f8e72d8_5b27_48d7_a9e4_a78c5736e1ae.installFlow(Unknown Source) [?:?]
        at org.opendaylight.netvirt.natservice.internal.AbstractSnatService.syncFlow(AbstractSnatService.java:329) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.natservice.internal.ConntrackBasedSnatService.installTerminatingServiceTblEntry(ConntrackBasedSnatService.java:148) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.natservice.internal.ConntrackBasedSnatService.installSnatSpecificEntriesForNaptSwitch(ConntrackBasedSnatService.java:73) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.natservice.internal.AbstractSnatService.handleSnat(AbstractSnatService.java:126) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.natservice.internal.FlatVlanConntrackBasedSnatService.handleSnat(FlatVlanConntrackBasedSnatService.java:49) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.natservice.internal.AbstractSnatService.handleSnatAllSwitch(AbstractSnatService.java:105) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.natservice.internal.FlatVlanConntrackBasedSnatService.handleSnatAllSwitch(FlatVlanConntrackBasedSnatService.java:39) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.natservice.internal.SnatServiceManagerImpl.notify(SnatServiceManagerImpl.java:57) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.natservice.ha.SnatCentralizedSwitchChangeListener.add(SnatCentralizedSwitchChangeListener.java:87) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.netvirt.natservice.ha.SnatCentralizedSwitchChangeListener.add(SnatCentralizedSwitchChangeListener.java:32) [360:org.opendaylight.netvirt.natservice-impl:0.6.0.redhat-5]
        at org.opendaylight.genius.datastoreutils.AsyncDataTreeChangeListenerBase$DataTreeChangeHandler.run(AsyncDataTreeChangeListenerBase.java:167) [263:org.opendaylight.genius.mdsalutil-api:0.4.0.redhat-5]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]


Version-Release number of selected component (if applicable):
OSP 13
opendaylight-8.0.0-5.el7ost.noarch

How reproducible: 100%


Steps to Reproduce:
1. Install OSP13 + ODL Oxygen
2. Run scale use case as mentioned above

Actual results:

Some VMs remain unpingable, with exceptions in the karaf logs

Expected results:
No exceptions

Additional info:

Comment 1 Ariel Adam 2018-04-22 09:08:13 UTC
The key point is that the VMs are losing connectivity, not the exception itself.

Comment 2 Vishal Thapar 2018-04-23 10:14:19 UTC
This is either a race condition or a failure at an earlier part of the code path. We need ODL datastore dumps to figure out the issue and any potential fix, specifically:

restconf/config/odl-l3vpn:vpn-instance-to-vpn-id/

Comment 4 jamo luhrsen 2018-04-24 03:05:07 UTC
(In reply to Vishal Thapar from comment #2)
> This is either a race condition or failure at an earlier part of code path.
> Need ODL Datastore dumps to figure out the issue and any potential fix,
> specifically:
> 
> restconf/config/odl-l3vpn:vpn-instance-to-vpn-id/

Not sure if it's related, but I see a similar thing (it's an NPE though)
in upstream 3node (aka clustered) CSIT. That is not a 'scale'/'perf' kind
of environment. we do collect the datastore dumps over there.

If we think it's related, I can dig up more info from the upstream jobs.

Comment 5 jamo luhrsen 2018-04-24 03:05:35 UTC
(In reply to jamo luhrsen from comment #4)
> (In reply to Vishal Thapar from comment #2)
> > This is either a race condition or failure at an earlier part of code path.
> > Need ODL Datastore dumps to figure out the issue and any potential fix,
> > specifically:
> > 
> > restconf/config/odl-l3vpn:vpn-instance-to-vpn-id/
> 
> Not sure if it's related, but I see a similar thing (it's an NPE though)
> in upstream 3node (aka clustered) CSIT. That is not a 'scale'/'perf' kind
> of environment. we do collect the datastore dumps over there.
> 
> If we think it's related, I can dig up more info from the upstream jobs.

Sorry, meant to paste the exception:

2018-04-20T08:02:02,469 | ERROR | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | AsyncDataTreeChangeListenerBase  | 258 - org.opendaylight.genius.mdsalutil-api - 0.4.1 | Thread terminated due to uncaught exception: org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0
java.lang.NullPointerException: null
	at org.opendaylight.netvirt.natservice.ha.NatDataUtil.addtoRouterMap(NatDataUtil.java:22) [354:org.opendaylight.netvirt.natservice-impl:0.6.1]
	at org.opendaylight.netvirt.natservice.ha.SnatCentralizedSwitchChangeListener.add(SnatCentralizedSwitchChangeListener.java:86) [354:org.opendaylight.netvirt.natservice-impl:0.6.1]
	at org.opendaylight.netvirt.natservice.ha.SnatCentralizedSwitchChangeListener.add(SnatCentralizedSwitchChangeListener.java:32) [354:org.opendaylight.netvirt.natservice-impl:0.6.1]
	at org.opendaylight.genius.datastoreutils.AsyncDataTreeChangeListenerBase$DataTreeChangeHandler.run(AsyncDataTreeChangeListenerBase.java:170) [258:org.opendaylight.genius.mdsalutil-api:0.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
	at java.lang.Thread.run(Thread.java:748) [?:?]

Comment 6 Vishal Thapar 2018-04-24 08:51:58 UTC
@Jamo: No, this one is different. The exception Sai got is because the API to get the VPN ID returned null/-1. A check would avoid the NPE but not solve the underlying problem: we got a router without a VPN ID. Since this occurs on a setup with higher concurrency, I do suspect a race condition.
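
To make that point concrete, here is a purely illustrative sketch (not the netvirt source; all names are hypothetical) of such a defensive check: the OpenFlow tunnel-id match is an unsigned 64-bit field, so the -1 sentinel from a failed VPN ID lookup has to be rejected before a flow is built, which is exactly what TunnelBuilder.checkTunnelIdRange() is complaining about in the trace above.

import java.math.BigInteger;

// Illustrative only: a guard like this would keep the listener thread alive,
// but the router would still be missing its VPN id and its flows.
public class VpnIdGuardExample {

    static final long INVALID_VPN_ID = -1L; // hypothetical sentinel for "lookup failed"

    // Valid tunnel ids are in [0 .. 2^64-1]; reject a negative sentinel before
    // it reaches the flow builder.
    static boolean isUsableAsTunnelId(long vpnId) {
        return vpnId >= 0;
    }

    public static void main(String[] args) {
        long vpnId = INVALID_VPN_ID; // what the failing code path effectively saw
        if (!isUsableAsTunnelId(vpnId)) {
            System.out.println("Skipping flow install: VPN id not allocated yet");
            return;
        }
        BigInteger tunnelId = BigInteger.valueOf(vpnId); // would feed the tunnel-id match
        System.out.println("Would install flow with tunnel id " + tunnelId);
    }
}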

Comment 8 Sai Sindhur Malleni 2018-05-01 20:20:48 UTC
This happened in a clustered setup.


 How do I get data store dumps?

Comment 9 Sam Hague 2018-05-01 20:28:22 UTC
(In reply to Sai Sindhur Malleni from comment #8)
> This happened in a clustered setup.
> 
> 
>  How do I get data store dumps?

Check which python you are running - make sure it is 2.7. It probably works with 3, but that is not tested yet.
pip install odltools
python -m odltools model dump -h

python -m odltools model dump --ip 127.0.0.1 --port 8181 --user admin --pw admin -p /tmp/models

That will make restconf calls to ip:port using admin/admin and dump the models pretty-printed in /tmp/models.
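
For reference, a minimal sketch of what one of those restconf calls looks like without odltools, fetching the vpn-instance-to-vpn-id model asked for in comment 2 with the same 127.0.0.1:8181 endpoint and admin/admin credentials; the class name is purely illustrative and this is not part of odltools:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative only: GET the odl-l3vpn vpn-instance-to-vpn-id model and print it.
public class DumpVpnInstanceToVpnId {
    public static void main(String[] args) throws Exception {
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8181/restconf/config/odl-l3vpn:vpn-instance-to-vpn-id/"))
                .header("Authorization", "Basic " + auth)
                .header("Accept", "application/json")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode()); // 200 if the model exists
        System.out.println(response.body());       // the JSON payload
    }
}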

Comment 11 Sai Sindhur Malleni 2018-05-03 14:48:59 UTC
Created attachment 1430744 [details]
data store dumps

Comment 12 Sai Sindhur Malleni 2018-05-03 16:28:12 UTC
Created attachment 1430781 [details]
datastore-latest-odltools-0.1.8

Comment 13 Sai Sindhur Malleni 2018-05-03 16:29:15 UTC
Created attachment 1430782 [details]
karaf logs

Comment 14 Vishal Thapar 2018-05-04 07:34:34 UTC
From the logs it is confirmed that this is a race condition: the notification for the GW add is coming in before the router add.

2018-05-03T13:22:28,967 | WARN  | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.ports.rev150712.ports.attributes.ports.Port_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronPortChangeListener        | 362 - org.opendaylight.netvirt.neutronvpn-impl - 0.6.0.redhat-8 | No router found for router GW port c69fd64f-4c6a-43a6-ae21-c72f11d29a72 for router 29ec1425-72df-4b11-9327-527cf41d58a9


2018-05-03T13:22:29,263 | INFO  | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncClusteredDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronRouterChangeListener      | 356 - org.opendaylight.netvirt.ipv6service-impl - 0.6.0.redhat-8 | Add Router notification handler is invoked Uuid [_value=29ec1425-72df-4b11-9327-527cf41d58a9].

2018-05-03T13:22:29,283 | INFO  | ForkJoinPool-2-worker-34 | VpnInstanceListener              | 370 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.redhat-8 | VPN-ADD: addVpnInstance: VPN Id 106159 generated for VpnInstanceName 29ec1425-72df-4b11-9327-527cf41d58a9

There can be two reasons for this:

1. n-odl is sending the notifications in the wrong order. We need neutron/n-odl logs to confirm or deny this.
2. Netvirt code is getting the notifications out of order; this can happen depending on whether the REST calls from n-odl land on the leader or a follower.

Even if 1 is the cause of this particular issue, netvirt should handle out-of-order notifications, as 2 is very much a possibility with MD-SAL.
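
A purely hypothetical sketch (not ODL code; all names invented) of the kind of handling meant in point 2: park work that depends on a router until the router add notification is actually seen, then replay it, instead of failing on the missing dependency.

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Illustrative only: defer router-dependent events until the router add arrives.
public class DeferredRouterEvents {

    private final Map<String, Object> knownRouters = new ConcurrentHashMap<>();
    private final Map<String, Queue<Runnable>> pendingWork = new ConcurrentHashMap<>();

    // Called for events (e.g. NAPT switch / GW port handling) that need the router.
    public void onDependentEvent(String routerId, Runnable handler) {
        pendingWork.computeIfAbsent(routerId, k -> new ConcurrentLinkedQueue<>()).add(handler);
        // If the router is already known (or becomes known concurrently), drain now.
        if (knownRouters.containsKey(routerId)) {
            drain(routerId);
        }
    }

    // Called when the router add notification finally arrives.
    public void onRouterAdded(String routerId, Object router) {
        knownRouters.put(routerId, router);
        drain(routerId);
    }

    private void drain(String routerId) {
        Queue<Runnable> queued = pendingWork.remove(routerId); // atomic, so work runs once
        if (queued != null) {
            queued.forEach(Runnable::run);
        }
    }
}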


ADDITIONAL INFO REQUIRED:
-------------------------
Neutron/n-odl logs.

Comment 15 Mike Kolesnik 2018-05-06 09:07:55 UTC
Vishal, based on your analysis would you say this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1550699 ?

Comment 16 Vishal Thapar 2018-05-07 03:57:56 UTC
Mike, yes: if we manage resource dependencies on the ODL side, this issue will not happen. Bug 1550699 would be the ideal fix for this, as it should eliminate such ordering issues. Without it, we can try duct-tape fixes, which will not solve all problems and will still be open to some race conditions, just with a narrower window.

Comment 18 Vishal Thapar 2018-05-07 13:58:22 UTC
Re-looked at the code and logs based on Josh's inputs. Josh mentioned his patch should have addressed the out-of-order issue, but it is not able to because of this part of the code:


        Router router = neutronvpnUtils.getNeutronRouter(routerId);
        if (router == null) {
            LOG.warn("No router found for router GW port {} for router {}", routerGwPort.getUuid().getValue(),
                    routerId.getValue());
            return;
        }
        gwMacResolver.sendArpRequestsToExtGateways(router);

        setExternalGwMac(routerGwPort, routerId);
    }

    private void setExternalGwMac(Port routerGwPort, Uuid routerId) {
        // During full-sync networking-odl syncs routers before ports. As such,
        // the MAC of the router's gw port is not available to be set when the
        // router is written. We catch that here.


We've seen the log entry above in the karaf logs. This means it is not an issue of the notifications themselves being out of order, but of getting the GW port notification before the router was created in MD-SAL. The most likely reason for this is n-odl sending them out of order, but we can't confirm that without neutron logs. There is a very small chance of this happening due to delayed writes, depending on which node the REST call ended up on.

So, this looks even more like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1550699

Comment 20 Waldemar Znoinski 2018-05-09 14:16:34 UTC
Possibly another occurrence of this issue is described in https://bugzilla.redhat.com/show_bug.cgi?id=1576414, with some logs, captures, etc.

Comment 21 Waldemar Znoinski 2018-05-09 15:00:10 UTC
Added neutron (networking-odl) logs to https://bugzilla.redhat.com/show_bug.cgi?id=1576414

Comment 22 Waldemar Znoinski 2018-05-14 10:55:44 UTC
*** Bug 1576414 has been marked as a duplicate of this bug. ***

Comment 23 Vishal Thapar 2018-05-31 12:15:01 UTC
Set requires_doc_text=-1 because with the fix this exception will never be seen, and the out-of-order events that are the root cause are handled much more gracefully. So no user intervention is required.

Comment 36 errata-xmlrpc 2018-07-19 13:53:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2215