
Bug 1486060

Summary: ODL is up and ports are listening, however it becomes non-functional during longevity testing
Product: Red Hat OpenStack
Reporter: Sai Sindhur Malleni <smalleni>
Component: opendaylight
Assignee: Michael Vorburger <vorburger>
Status: CLOSED UPSTREAM
QA Contact: Itzik Brown <itbrown>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 12.0 (Pike)
CC: mkolesni, nyechiel, trozet, vorburger
Target Milestone: beta
Keywords: Triaged
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-09-13 12:37:30 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Sai Sindhur Malleni 2017-08-28 21:26:08 UTC
Description of problem: During longevity tests on a clustered ODL setup, one of the ODL instances appears to be up and running according to ps output, systemctl, and the listening ports shown by netstat, yet it is not functional. We could not even reach the Karaf shell with ssh -p 8101 karaf.0.16 until we restarted opendaylight. After a service restart we were able to get into the Karaf shell and ODL seemed to come back up.
Of the other two ODL instances, one was killed due to OOM and the other seemed to be running fine. This happens after about 42 hours of running the tests.
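
The "port is listening but the service is dead" state described above can be told apart from a healthy instance with a small probe. The following is a minimal sketch, not something used in this test run: it relies only on the fact that an SSH server (the Karaf console on port 8101 is one) sends an identification string starting with "SSH-" right after the TCP handshake. The function name and the returned status strings are made up for illustration.

```python
import socket


def probe_ssh_banner(host, port, timeout=5.0):
    """Classify a TCP service as 'closed', 'listening-but-silent' or 'responsive'.

    A hung process can still have connections accepted on its behalf by the
    kernel (listen backlog), so a successful connect alone proves little.
    Waiting for the SSH identification string checks that the process itself
    is still serving requests.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, unreachable, or connect timed out.
        return "closed"
    with sock:
        sock.settimeout(timeout)
        try:
            banner = sock.recv(64)
        except OSError:
            # Includes socket.timeout: connected, but nothing was sent.
            return "listening-but-silent"
    if banner.startswith(b"SSH-"):
        return "responsive"
    return "listening-but-silent"
```

For example, `probe_ssh_banner("controller-0", 8101)` (hostname assumed) would be expected to return "listening-but-silent" for an instance in the state reported here, while netstat would still show the port as listening.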
Setup:
3 ODLs
3 OpenStack Controllers
3 Compute nodes

Test:
Create 40 neutron resources (routers, networks, etc.), 2 at a time, using Rally and delete them, over and over again. This is a long-running, low-stress test.
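
For readers who want to approximate this workload, here is a rough sketch of a Rally task built in Python and printed as JSON. It is not the task file used in this run. It assumes that "40 resources, 2 at a time" maps to a constant runner with times=40 and concurrency=2, and it picks two NeutronNetworks scenarios as examples; the context and SLA values are placeholders.

```python
import json


def build_rally_task(scenario, times=40, concurrency=2):
    """Return a Rally task entry that runs `scenario` `times` times,
    with at most `concurrency` iterations in flight at once."""
    return {
        scenario: [
            {
                "args": {},  # scenario defaults; tune per deployment
                "runner": {
                    "type": "constant",
                    "times": times,
                    "concurrency": concurrency,
                },
                # Placeholder context and SLA, not taken from the bug report.
                "context": {"users": {"tenants": 1, "users_per_tenant": 1}},
                "sla": {"failure_rate": {"max": 0}},
            }
        ]
    }


# Scenario choice is an assumption; the report only says "routers, networks etc".
task = {}
for name in (
    "NeutronNetworks.create_and_delete_networks",
    "NeutronNetworks.create_and_delete_routers",
):
    task.update(build_rally_task(name))

print(json.dumps(task, indent=2))
```

Running the resulting task in a loop for many hours would give a comparable long-running, low-stress pattern.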





Version-Release number of selected component (if applicable):
OSP 12 Puddle from 2017-8-18.3
ODL RPM from upstream: python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP12 with ODL
2. Run low stress Rally tests for a long time
3.

Actual results:
ODL becomes non-functional on controller-0

Expected results:

ODL should keep running fine, as this is just a low-stress test over a long time

Additional info:
Entire Karaf Log: http://8.43.86.1:8088/smalleni/karaf-controller-0.log.tar.gz

Comment 1 Sai Sindhur Malleni 2017-08-28 21:52:55 UTC
ODL became non-functional around 10:44 UTC on 08/28/2017. This was confirmed by collectd, which talks to the Karaf JMX interface and suddenly stopped reporting heap size values at that time. Collectd was able to talk to Karaf JMX again after the service restart. The break can be clearly observed at: https://snapshot.raintank.io/dashboard/snapshot/nf6OWq7jNSeT6vwjM71jlUSWc31E9LdW
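
A break like this can also be located programmatically from the metric timestamps instead of by eye on the dashboard. The sketch below is only an illustration of the idea: the function name, the 10-second interval and the sample timestamps are invented, not taken from the collectd data.

```python
from datetime import datetime, timedelta


def first_reporting_gap(timestamps, expected_interval, tolerance=3):
    """Return (last_seen, next_seen) for the first gap longer than
    tolerance * expected_interval, or None if reporting never stalls."""
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > limit:
            return prev, cur
    return None


# Synthetic samples for illustration only: steady 10 s reporting,
# then silence, then reporting resumes after a restart.
start = datetime(2017, 8, 28, 10, 43, 0)
samples = [start + timedelta(seconds=10 * i) for i in range(6)]
resume = datetime(2017, 8, 28, 11, 30, 0)
samples += [resume + timedelta(seconds=10 * i) for i in range(3)]

gap = first_reporting_gap(samples, timedelta(seconds=10))
print(gap)
```

The first element of the returned pair is the last sample seen before the metrics stopped, which is a reasonable starting point when searching the Karaf log around the failure.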

Comment 2 Sai Sindhur Malleni 2017-08-29 11:18:29 UTC
If it helps, here is the Karaf thread count: https://snapshot.raintank.io/dashboard/snapshot/EgrJsRB7HJ6tl1pjLlSY4hb6wWvJS7nT

We can see that around 10:44 UTC the thread count suddenly spikes, and it falls back after the restart.

Comment 3 Sai Sindhur Malleni 2017-08-29 11:24:54 UTC
ODL RPM used was opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch

Comment 4 Michael Vorburger 2017-09-13 12:37:30 UTC
Closing this as the linked upstream ODL Bug 9063 has been closed (as dupe).

Sai (reporter), please re-open if you disagree.