Bug 1567507
Summary: | [Infra] Thread terminated due to an uncaught exception
---|---
Product: | Red Hat OpenStack
Reporter: | Sai Sindhur Malleni <smalleni>
Component: | opendaylight
Assignee: | Vishal Thapar <vthapar>
Status: | CLOSED ERRATA
QA Contact: | Tomas Jamrisko <tjamrisk>
Severity: | high
Priority: | high
Version: | 13.0 (Queens)
CC: | aadam, jluhrsen, jschluet, mkolesni, nyechiel, sclewis, shague, smalleni, tjamrisk, vthapar, wznoinsk
Target Milestone: | z1
Keywords: | Triaged, ZStream
Target Release: | 13.0 (Queens)
Hardware: | Unspecified
OS: | Unspecified
Whiteboard: | Infra
Fixed In Version: | opendaylight-8.3.0-1.el7ost
Doc Type: | No Doc Update
Environment: | N/A
Last Closed: | 2018-07-19 13:53:05 UTC
Type: | Bug
Bug Blocks: | 1576414
Description (Sai Sindhur Malleni, 2018-04-14 14:08:35 UTC)
Key point is that the VMs are losing connectivity, not the exception itself.

This is either a race condition or a failure at an earlier part of the code path. Need ODL Datastore dumps to figure out the issue and any potential fix, specifically:

restconf/config/odl-l3vpn:vpn-instance-to-vpn-id/

(In reply to Vishal Thapar from comment #2)
> This is either a race condition or failure at an earlier part of code path.
> Need ODL Datastore dumps to figure out the issue and any potential fix,
> specifically:
>
> restconf/config/odl-l3vpn:vpn-instance-to-vpn-id/

Not sure if it's related, but I see a similar thing (it's an NPE, though) in upstream 3-node (aka clustered) CSIT. That is not a 'scale'/'perf' kind of environment; we do collect the datastore dumps over there.

If we think it's related, I can dig up more info from the upstream jobs.

(In reply to jamo luhrsen from comment #4)
> (In reply to Vishal Thapar from comment #2)
> > This is either a race condition or failure at an earlier part of code path.
> > Need ODL Datastore dumps to figure out the issue and any potential fix,
> > specifically:
> >
> > restconf/config/odl-l3vpn:vpn-instance-to-vpn-id/
>
> Not sure if it's related, but I see a similar thing (it's an NPE though)
> in upstream 3node (aka clustered) CSIT. That is not a 'scale'/'perf' kind
> of environment. we do collect the datastore dumps over there.
>
> If we think it's related, I can dig up more info from the upstream jobs.
sorry, meant to paste the Exception:

```
2018-04-20T08:02:02,469 | ERROR | org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | AsyncDataTreeChangeListenerBase | 258 - org.opendaylight.genius.mdsalutil-api - 0.4.1 | Thread terminated due to uncaught exception: org.opendaylight.yang.gen.v1.urn.opendaylight.netvirt.natservice.rev160111.napt.switches.RouterToNaptSwitch_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0
java.lang.NullPointerException: null
    at org.opendaylight.netvirt.natservice.ha.NatDataUtil.addtoRouterMap(NatDataUtil.java:22) [354:org.opendaylight.netvirt.natservice-impl:0.6.1]
    at org.opendaylight.netvirt.natservice.ha.SnatCentralizedSwitchChangeListener.add(SnatCentralizedSwitchChangeListener.java:86) [354:org.opendaylight.netvirt.natservice-impl:0.6.1]
    at org.opendaylight.netvirt.natservice.ha.SnatCentralizedSwitchChangeListener.add(SnatCentralizedSwitchChangeListener.java:32) [354:org.opendaylight.netvirt.natservice-impl:0.6.1]
    at org.opendaylight.genius.datastoreutils.AsyncDataTreeChangeListenerBase$DataTreeChangeHandler.run(AsyncDataTreeChangeListenerBase.java:170) [258:org.opendaylight.genius.mdsalutil-api:0.4.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:?]
    at java.lang.Thread.run(Thread.java:748) [?:?]
```

@Jamo: No, this one is different. The exception Sai got is because the API to get the VPN ID returned null/-1. A check would avoid the NPE but not solve the underlying problem that we got a router without a VPN ID. Since this occurs on a setup with higher concurrency, I do suspect a race condition.

This happened in a clustered setup.

How do I get data store dumps?
(In reply to Sai Sindhur Malleni from comment #8)
> This happened in a clustered setup.
>
> How do I get data store dumps?

```shell
which python   # make sure you are running 2.7; probably works with 3, but not tested yet
pip install odltools
python -m odltools model dump -h
python -m odltools model dump --ip 127.0.0.1 --port 8181 --user admin --pw admin -p /tmp/models
```

That will make restconf calls to ip:port using admin/admin and dump the models pretty-printed in /tmp/models.

Created attachment 1430744 [details]
data store dumps
Created attachment 1430781 [details]
datastore-latest-odltools-0.1.8
Created attachment 1430782 [details]
karaf logs
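Regarding the null/-1 VPN ID behind the stack trace above: `ConcurrentHashMap` rejects null keys and values with a `NullPointerException`, which is consistent with the crash inside `NatDataUtil.addtoRouterMap`. The sketch below is hypothetical (not the actual netvirt code, and `RouterMapSketch` is an invented name); it only illustrates the kind of defensive check that would keep the listener thread alive, while, as noted above, still not solving the underlying problem of a router arriving without a VPN ID.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a put with a null key or value into a
// ConcurrentHashMap throws NPE, matching the crash in addtoRouterMap.
// A guard avoids terminating the DataTreeChangeHandler thread, but only
// masks the missing-VPN-ID race.
public class RouterMapSketch {
    private final Map<String, String> routerMap = new ConcurrentHashMap<>();

    // Returns true if the entry was stored, false if it was skipped.
    public boolean addToRouterMap(String routerId, String vpnId) {
        if (routerId == null || vpnId == null) {
            // Log and skip instead of letting an NPE kill the listener thread.
            System.err.println("Skipping router map update: routerId="
                + routerId + " vpnId=" + vpnId);
            return false;
        }
        routerMap.put(routerId, vpnId);
        return true;
    }

    public String get(String routerId) {
        return routerId == null ? null : routerMap.get(routerId);
    }
}
```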
From the logs it is confirmed this is a case of a race condition. The notification for the GW add is coming in before the router add.

```
2018-05-03T13:22:28,967 | WARN | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.ports.rev150712.ports.attributes.ports.Port_AsyncDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronPortChangeListener | 362 - org.opendaylight.netvirt.neutronvpn-impl - 0.6.0.redhat-8 | No router found for router GW port c69fd64f-4c6a-43a6-ae21-c72f11d29a72 for router 29ec1425-72df-4b11-9327-527cf41d58a9
2018-05-03T13:22:29,263 | INFO | org.opendaylight.yang.gen.v1.urn.opendaylight.neutron.l3.rev150712.routers.attributes.routers.Router_AsyncClusteredDataTreeChangeListenerBase-DataTreeChangeHandler-0 | NeutronRouterChangeListener | 356 - org.opendaylight.netvirt.ipv6service-impl - 0.6.0.redhat-8 | Add Router notification handler is invoked Uuid [_value=29ec1425-72df-4b11-9327-527cf41d58a9].
2018-05-03T13:22:29,283 | INFO | ForkJoinPool-2-worker-34 | VpnInstanceListener | 370 - org.opendaylight.netvirt.vpnmanager-impl - 0.6.0.redhat-8 | VPN-ADD: addVpnInstance: VPN Id 106159 generated for VpnInstanceName 29ec1425-72df-4b11-9327-527cf41d58a9
```

There can be 2 reasons for this:

1. n-odl is sending notifications in the wrong order. Need neutron/n-odl logs to confirm/deny.
2. Netvirt code is getting notifications out of order; this can happen depending on whether rest calls from n-odl land on the leader or a follower.

Even if 1 is the cause of this particular issue, netvirt should handle out-of-order notifications, as 2 is very much a possibility with mdsal.

ADDITIONAL INFO REQUIRED:
-------------------------
Neutron/n-odl logs.

Vishal, based on your analysis, would you say this is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1550699 ?

Mike, yes, if we manage resource dependencies on the ODL side this issue will not happen. 1550699 would be the ideal fix for this, as it should eliminate such ordering issues.
Without it, we can try duct-tape fixes which will not solve all problems and will still be open to some race conditions, just with a narrower window.

Re-looked at the code and logs based on Josh's inputs. Josh mentioned his patch should have addressed the out-of-order issue, but it is not able to, due to this part of the code:

```java
    Router router = neutronvpnUtils.getNeutronRouter(routerId);
    if (router == null) {
        LOG.warn("No router found for router GW port {} for router {}",
            routerGwPort.getUuid().getValue(), routerId.getValue());
        return;
    }
    gwMacResolver.sendArpRequestsToExtGateways(router);
    setExternalGwMac(routerGwPort, routerId);
}

private void setExternalGwMac(Port routerGwPort, Uuid routerId) {
    // During full-sync networking-odl syncs routers before ports. As such,
    // the MAC of the router's gw port is not available to be set when the
    // router is written. We catch that here.
```

We've seen the log entry above in the karaf logs. This means it is not an issue of notifications out of order, but of getting the gw port notification before the router was created in mdsal. The most likely reason for this is n-odl sending out of order, but we can't confirm without neutron logs. There is a very small chance of this happening due to delayed writes, depending on which node the rest call ended on. So this looks even more like a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1550699

Possibly another occurrence of this issue is described in https://bugzilla.redhat.com/show_bug.cgi?id=1576414, with some logs and captures etc.

Added neutron (networking-odl) logs to https://bugzilla.redhat.com/show_bug.cgi?id=1576414

*** Bug 1576414 has been marked as a duplicate of this bug. ***

Set requires_doc_text=-1 because with the fix this exception will never be seen, and we're handling the out-of-order events that are the root cause much more gracefully. So there is no user intervention required.
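For illustration only, the ordering-tolerant handling argued for above (netvirt coping with a gateway-port notification that arrives before its router exists) could be sketched roughly as below. All class and method names here are invented for the sketch, not the actual netvirt listener API: out-of-order port events are parked and replayed when the router-add arrives, instead of being dropped by an early return.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an ordering-tolerant listener: if a gateway-port
// notification arrives before its router is in the datastore, defer it and
// replay it on the router-add notification.
public class OrderTolerantListener {
    private final Map<String, String> routers = new HashMap<>();               // routerId -> router
    private final Map<String, List<String>> pendingGwPorts = new HashMap<>();  // routerId -> deferred ports
    private final List<String> processed = new ArrayList<>();                  // ports actually handled

    public void onRouterAdd(String routerId) {
        routers.put(routerId, routerId);
        // Replay any gateway ports that arrived before this router.
        List<String> pending = pendingGwPorts.remove(routerId);
        if (pending != null) {
            for (String port : pending) {
                handleGwPort(routerId, port);
            }
        }
    }

    public void onGwPortAdd(String routerId, String portId) {
        if (!routers.containsKey(routerId)) {
            // Router not seen yet: defer instead of dropping the event.
            pendingGwPorts.computeIfAbsent(routerId, k -> new ArrayList<>()).add(portId);
            return;
        }
        handleGwPort(routerId, portId);
    }

    private void handleGwPort(String routerId, String portId) {
        processed.add(routerId + ":" + portId);
    }

    public List<String> processed() {
        return processed;
    }
}
```

This narrows the window rather than eliminating it, in line with the comment above that only managing resource dependencies (bug 1550699) removes such ordering issues entirely.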
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2215