Bug 1436799 - Cloud breaks down during Scale testing when creating Neutron Resources
Summary: Cloud breaks down during Scale testing when creating Neutron Resources
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-odl
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 12.0 (Pike)
Assignee: Mike Kolesnik
QA Contact: Itzik Brown
URL:
Whiteboard: scale_lab
Depends On:
Blocks:
 
Reported: 2017-03-28 17:30 UTC by Sai Sindhur Malleni
Modified: 2018-10-18 07:20 UTC (History)
8 users

Fixed In Version: python-networking-odl-11.0.1-0.20170831202719.81010b8.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2017-12-13 21:20:39 UTC
Target Upstream Version:
Embargoed:


Attachments
filtered mysql slow_log showing timestamp and query_time (3.02 MB, text/plain)
2017-03-29 02:43 UTC, Dave Wilson
mysql slow_query.log (12.82 MB, application/x-gzip)
2017-03-29 02:49 UTC, Dave Wilson


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 486606 0 None MERGED Move journal periodic processing to a worker 2020-09-28 15:05:33 UTC
Red Hat Product Errata RHEA-2017:3462 0 normal SHIPPED_LIVE Red Hat OpenStack Platform 12.0 Enhancement Advisory 2018-02-16 01:43:25 UTC

Description Sai Sindhur Malleni 2017-03-28 17:30:32 UTC
Description of problem:
We are trying to test the performance/scale of RHOP 10 with ML2/ODL as the backend. We are seeing that during the execution of the following scenario in Rally, the cloud just breaks down and becomes unusable (none of the OpenStack commands even work). However, the same issue is not seen with ML2/OVS in the same test. Rally reports that it couldn't create the requested neutron resources due to "NeutronClientException: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>"

Scenario:
Do the below actions in sequence 500 times (a rough python-neutronclient equivalent is sketched right after the list):

Create network1 and subnet1
Create network2 and subnet2
Create router
Add interfaces on subnet1 and subnet2 to router
List routers 
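
A minimal, hypothetical python-neutronclient sketch of the same sequence (the endpoint, credentials, and resource names are placeholders; the actual test drives this through the Browbeat/Rally task linked under "Steps to Reproduce"):

# Sketch of the scenario above using python-neutronclient.
# All credentials/endpoints below are hypothetical placeholders.
from keystoneauth1 import loading, session
from neutronclient.v2_0 import client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://overcloud.example:5000/v3',  # placeholder endpoint
    username='admin', password='secret', project_name='admin',
    user_domain_name='Default', project_domain_name='Default')
neutron = client.Client(session=session.Session(auth=auth))

for i in range(500):
    net1 = neutron.create_network({'network': {'name': 'rally-net1-%d' % i}})
    sub1 = neutron.create_subnet({'subnet': {
        'network_id': net1['network']['id'],
        'ip_version': 4, 'cidr': '10.1.%d.0/24' % (i % 256)}})
    net2 = neutron.create_network({'network': {'name': 'rally-net2-%d' % i}})
    sub2 = neutron.create_subnet({'subnet': {
        'network_id': net2['network']['id'],
        'ip_version': 4, 'cidr': '10.2.%d.0/24' % (i % 256)}})
    # CIDRs repeat after 256 iterations; Neutron allows overlapping CIDRs
    # across different routers by default.
    router = neutron.create_router({'router': {'name': 'rally-router-%d' % i}})
    neutron.add_interface_router(router['router']['id'],
                                 {'subnet_id': sub1['subnet']['id']})
    neutron.add_interface_router(router['router']['id'],
                                 {'subnet_id': sub2['subnet']['id']})
    neutron.list_routers()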


We see in the neutron-server logs that it is unable to get the state of its agents (https://gist.github.com/smalleni/8894bc7e6b583df0fc666ac51c95ca92), and we also see failures to get tokens, as in https://gist.github.com/smalleni/fc8783d8aafc4a3b336bffd3c8849164

We investigated this from a Keystone/RabbitMQ standpoint; a tcpdump on port 5672 verified that RabbitMQ did not cease communication, and there were no errors in the httpd/keystone logs that hinted at Keystone problems.

We then started investigating this from a MySQL/Galera point of view and verified that during the run of this test neither the max_connections limit of 4096 nor the open files limit was being reached. However, around the time of the failure we see several locks in the DB (https://snapshot.raintank.io/dashboard/snapshot/EGjZehSK5ZHG0SgYPcUtlGJHhzO4nInX) and read_rnd_next shoots up to 1.4 million (https://snapshot.raintank.io/dashboard/snapshot/tKmO5SjJfUuz6d8syEkmq6ZEu2sqJsDj).

All the system resources can be viewed at:

http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=opendaylight&var-NodeType=*&var-Node=overcloud-controller-0&var-Interface=interface-br-ex&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1490712023000&to=1490714423000

The total failure occurred at around 15:04 UTC if you are looking at the Grafana charts above. (Look at OSP Process CPU under the "Everything Else" dashboard for karaf CPU utilization, and at the MARIADB tab for MySQL monitoring data.)

Tim Rozet was able to poke around in the environment and thinks the following is causing this:

1. We have the number of neutron-server processes set to the number of cores, so on this 56-core machine that's 56 API workers and 56 RPC workers.
2. For each neutron-server process, networking-odl spawns a journal thread.
3. Each router create is a pending row, and each thread scans the entire journal table for the oldest DB row to check whether it is complete.
4. https://github.com/openstack/networking-odl/blob/master/networking_odl/db/db.py#L95 could use some optimization (a sketch of that lookup follows the list).
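
A rough SQLAlchemy sketch of what each journal thread's pending-row lookup looks like (names are illustrative, not the exact code in db.py; it corresponds to the slow query captured later in this bug):

from sqlalchemy import asc

def get_oldest_pending_row(session, journal_model):
    # journal_model is assumed to map the opendaylightjournal table.
    # With no index on 'state', this scans the whole journal table to
    # find the oldest pending row, and FOR UPDATE locks the winner, so
    # the 56 per-worker journal threads serialize on this query.
    return (session.query(journal_model)
            .filter_by(state='pending')
            .order_by(asc(journal_model.last_retried))
            .with_for_update()
            .first())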

It is worth mentioning that this does not happen when we try the above test with, say, 50 routers or fewer instead of 500, and also doesn't happen when neutron-server workers is set to 6 (a low value compared to the default of 56). This problem is also not seen when the networking-odl v1 driver is used. The issue also cannot be reproduced with ML2/OVS and hence is specific to ML2/ODL.

Two things to follow up on:
1. We have to check if we really need 56 workers or can limit them to 8 as in Ocata.
2. Investigate the journaling of networking-odl v2, as the issue is not reproducible with v1.

Version-Release number of selected component (if applicable):
RHOP 10, networking-odl v2

How reproducible:
100%

Steps to Reproduce:
1. Deploy RHOP 10 with ODL with defaults
2. Run https://github.com/openstack/browbeat/blob/master/rally/neutron/neutron-create-list-router-cc.yml with times set to 500


Actual results:
The cloud becomes totally useless

Expected results:
The scenario should finish successfully

Additional info:

Karaf user CPU consumption during test (14:54-15:04 UTC): https://snapshot.raintank.io/dashboard/snapshot/Kf1d2Qi9u7h85p5OczoQpWgIt0TB1NtG

Karaf RSS (14:54-15:04 UTC): https://snapshot.raintank.io/dashboard/snapshot/UXq1NQMHI34dhkLEsjGUnCTIuKpjvVw5

Comment 1 Dave Wilson 2017-03-29 02:43:19 UTC
Created attachment 1267219 [details]
filtered mysql slow_log showing  timestamp and query_time

Comment 2 Sai Sindhur Malleni 2017-03-29 02:49:27 UTC
Dave, it would be great if you could post what you think is the slow query.

Comment 3 Dave Wilson 2017-03-29 02:49:45 UTC
Created attachment 1267220 [details]
mysql slow_query.log

Comment 4 Dave Wilson 2017-03-29 02:50:59 UTC
The issue may be linked to the mysql ovs_neutron.opendaylightjournal table. Profiling with slow_query_log=on reveals this query:

"SELECT opendaylightjournal.seqnum AS opendaylightjournal_seqnum,
opendaylightjournal.object_type AS opendaylightjournal_object_type, opendaylightjournal.object_uuid AS opendaylightjournal_object_uuid, opendaylightjournal.operation AS opendaylightjournal_operation, opendaylightjournal.data AS opendaylightjournal_data, opendaylightjournal.state AS opendaylightjournal_state, opendaylightjournal.retry_count AS opendaylightjournal_retry_count, opendaylightjournal.created_at AS opendaylightjournal_created_at, opendaylightjournal.last_retried AS opendaylightjournal_last_retried
FROM opendaylightjournal
WHERE opendaylightjournal.state = 'pending' ORDER BY opendaylightjournal.last_retried ASC
LIMIT 1 FOR UPDATE;"

As rowcount increases, the query lock time increases to what appear to be unacceptable values (see the attached 'time.out', which is parsed from the slow log and shows the above query with timestamp, query_time, and lock_time). Possibly an index on "state" may mitigate rows_examined and reduce lock time. Also attached is the slow_query.log.
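
A hypothetical Alembic migration fragment for the index suggested above (the index name and the inclusion of last_retried are assumptions for illustration, not an actual networking-odl patch):

from alembic import op

# revision identifiers omitted for brevity in this sketch

def upgrade():
    # Composite index so the pending-row lookup can filter on state and
    # order by last_retried without scanning the full journal table.
    op.create_index('ix_opendaylightjournal_state_last_retried',
                    'opendaylightjournal',
                    ['state', 'last_retried'])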

Comment 5 Nir Yechiel 2017-07-20 08:52:01 UTC
@Mike, was this fixed already? Should we target this for RHOSP 12, move to ON_QA and let the scale team test again?

Comment 6 Mike Kolesnik 2017-07-23 11:10:10 UTC
(In reply to Nir Yechiel from comment #5)
> @Mike, was this fixed already? Should we target this for RHOSP 12, move to
> ON_QA and let the scale team test again?

It hasn't been fixed yet upstream. I'll get to it as soon as I can and will update the bug when the fix is available.

Comment 7 lpeer 2017-08-06 07:34:25 UTC
Mike - would you mind updating this bug? Add a link to the patch, etc.

Comment 12 Sai Sindhur Malleni 2017-12-07 14:01:00 UTC
Verified this for OSP12, fixed.

Comment 15 errata-xmlrpc 2017-12-13 21:20:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:3462

