Bug 1486060 - ODL is up and ports are listening, however it becomes non-functional during longevity testing
Summary: ODL is up and ports are listening, however it becomes non-functional during l...
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: opendaylight
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 12.0 (Pike)
Assignee: Michael Vorburger
QA Contact: Itzik Brown
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-08-28 21:26 UTC by Sai Sindhur Malleni
Modified: 2017-09-13 12:37 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-13 12:37:30 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenDaylight Bug 9063 0 None None None 2017-08-29 08:56:55 UTC

Description Sai Sindhur Malleni 2017-08-28 21:26:08 UTC
Description of problem: While running longevity tests against a clustered ODL setup, we see that one of the ODL instances appears to be up and running according to ps output, systemctl, and the listening ports shown by netstat, but it is not functional. We could not even ssh into the Karaf terminal using ssh -p 8101 karaf.0.16 until we restarted opendaylight. After a service restart we were able to get into the Karaf shell, and ODL seemed to come back up.
Of the other two ODL instances, one was killed due to OOM and the other seemed to be running fine. This happens after about 42 hours of running the tests.
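
The gap between "port is listening" and "service is functional" described above can be checked with a small probe. The following is only a sketch, not part of the original report: the function name probe_banner and the three result labels are made up for illustration, and it assumes the Karaf SSH endpoint sends an identification string starting with "SSH-" shortly after the TCP connection is accepted, as SSH servers normally do.

```python
import socket


def probe_banner(host, port, timeout=5.0):
    """Classify a TCP service as 'closed', 'silent' or 'responsive'.

    'closed'     : the TCP connection could not be established.
    'silent'     : the port accepted the connection but no SSH
                   identification string arrived within the timeout.
    'responsive' : the peer sent data starting with b"SSH-".
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable, or timed out while connecting.
        return "closed"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(64)
        except OSError:
            # Includes socket.timeout: connected, but nothing was sent.
            return "silent"
    return "responsive" if data.startswith(b"SSH-") else "silent"
```

A result of "silent" on port 8101 would match the state reported here: netstat shows the port listening, yet the shell cannot be reached. For example, probe_banner(host, 8101) could be run periodically from a monitoring node, with host set to the controller address.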
Setup:
3 ODLs
3 OpenStack Controllers
3 Compute nodes

Test:
Create 40 neutron resources (routers, networks, etc.), 2 at a time, using Rally and delete them, over and over again. This is a long-running, low-stress test.
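
For readers unfamiliar with this kind of workload, its shape can be sketched roughly as below. This is not the Rally task that was actually run: churn, create and delete are hypothetical names, the callables stand in for whatever Neutron operations the scenario performs, and the sketch assumes all 40 resources are created before any are deleted.

```python
from concurrent.futures import ThreadPoolExecutor


def churn(create, delete, total=40, concurrency=2, rounds=1):
    """Create `total` resources `concurrency` at a time, delete them, repeat.

    `create` takes an index and returns a resource handle; `delete`
    takes that handle. Both are placeholders supplied by the caller.
    Returns the number of resources created and deleted overall.
    """
    completed = 0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(rounds):
            # At most `concurrency` create calls are in flight at once.
            resources = list(pool.map(create, range(total)))
            # Tear everything down before the next round starts.
            list(pool.map(delete, resources))
            completed += len(resources)
    return completed
```

In a longevity run, rounds would be large enough to keep the loop going for days; the low concurrency is what makes it a low-stress test.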





Version-Release number of selected component (if applicable):
OSP 12 Puddle from 2017-8-18.3
ODL RPM from upstream: python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch


How reproducible:
100%

Steps to Reproduce:
1. Deploy OSP12 with ODL
2. Run low stress Rally tests for a long time
3.

Actual results:
ODL becomes non-functional on controller-0

Expected results:

ODL should be running fine, as this is just a low-stress test over a long time

Additional info:
Entire Karaf Log: http://8.43.86.1:8088/smalleni/karaf-controller-0.log.tar.gz

Comment 1 Sai Sindhur Malleni 2017-08-28 21:52:55 UTC
ODL became non-functional around 10:44 UTC 08/28/2017. This was confirmed by collectd, which talks to the Karaf JMX and suddenly stopped reporting values for heap size at that time. Collectd was able to talk to Karaf JMX again after the service restart. The break can be clearly observed at: https://snapshot.raintank.io/dashboard/snapshot/nf6OWq7jNSeT6vwjM71jlUSWc31E9LdW

Comment 2 Sai Sindhur Malleni 2017-08-29 11:18:29 UTC
If it helps, here is the Karaf thread count: https://snapshot.raintank.io/dashboard/snapshot/EgrJsRB7HJ6tl1pjLlSY4hb6wWvJS7nT

We can see that around 10:44 UTC the thread count suddenly spikes, and it falls back after the restart.

Comment 3 Sai Sindhur Malleni 2017-08-29 11:24:54 UTC
ODL RPM used was opendaylight-6.2.0-0.1.20170817rel1931.el7.noarch

Comment 4 Michael Vorburger 2017-09-13 12:37:30 UTC
Closing this as the linked upstream ODL Bug 9063 has been closed (as dupe).

Sai (reporter), please re-open if you disagree.

