Description of problem:

We are testing the performance/scale of RHOP 10 with ML2/ODL as the backend. During the execution of the following Rally scenario, the cloud breaks down and becomes unusable (none of the OpenStack commands work). The same issue is not seen with ML2/OVS running the same test. Rally reports that it could not create the requested Neutron resources due to:

"NeutronClientException: <html><body><h1>504 Gateway Time-out</h1> The server didn't respond in time. </body></html>"

Scenario: perform the following actions in sequence 500 times:
- Create network1 and subnet1
- Create network2 and subnet2
- Create a router
- Add interfaces on subnet1 and subnet2 to the router
- List routers

In the neutron-server logs we see that it is unable to get the state of its agents (https://gist.github.com/smalleni/8894bc7e6b583df0fc666ac51c95ca92), and we also see failures to get tokens (https://gist.github.com/smalleni/fc8783d8aafc4a3b336bffd3c8849164).

We investigated this from a Keystone/RabbitMQ standpoint, but a tcpdump on port 5672 verified that RabbitMQ did not cease communication, and there were no errors in the httpd/keystone logs hinting at Keystone problems. We then investigated from a MySQL/Galera point of view and verified that during the test neither the max_connections limit of 4096 nor the open-files limit was reached. However, around the time of the failure we see several locks in the DB (https://snapshot.raintank.io/dashboard/snapshot/EGjZehSK5ZHG0SgYPcUtlGJHhzO4nInX) and read_rnd_next shoots up to 1.4 million (https://snapshot.raintank.io/dashboard/snapshot/tKmO5SjJfUuz6d8syEkmq6ZEu2sqJsDj).

All the system resources can be viewed at:
http://norton.perf.lab.eng.rdu.redhat.com:3000/dashboard/db/openstack-general-system-performance?var-Cloud=opendaylight&var-NodeType=*&var-Node=overcloud-controller-0&var-Interface=interface-br-ex&var-Disk=disk-sda&var-cpus0=All&var-cpus00=All&from=1490712023000&to=1490714423000

The total failure occurred at around 15:04 UTC if you are looking at the Grafana charts above. (Look at OSP Process CPU under the "Everything Else" dashboard for karaf CPU utilization, and at the MARIADB tab for MySQL monitoring data.)

Tim Rozet was able to poke around at the environment and thinks the following is causing this (a sketch of the resulting DB access pattern follows this comment):
1. We have neutron-server processes set to the number of cores, so on this 56-core machine that is 56 API workers and 56 RPC workers.
2. For each neutron-server process, networking-odl spawns a journal thread.
3. Each router create is a pending row, and each thread scans the entire journal table for the oldest DB row to check if it is complete.
4. https://github.com/openstack/networking-odl/blob/master/networking_odl/db/db.py#L95 could use some optimization.

Worth mentioning that this does not happen when we run the above test with 50 routers or fewer instead of 500, and it also does not happen when the neutron-server worker count is set to 6 (a low value compared to the default of 56). The problem is not seen when the networking-odl v1 driver is used, and it cannot be reproduced with ML2/OVS, so it is specific to ML2/ODL.

Two things to follow up on:
1. Check whether we really need 56 workers or can limit them to 8 as in Ocata.
2. Investigate the journaling of networking-odl v2, as the issue is not reproducible with v1.

Version-Release number of selected component (if applicable):
RHOP 10, networking-odl v2

How reproducible:
100%

Steps to Reproduce:
1. Deploy RHOP 10 with ODL with defaults
2. Run https://github.com/openstack/browbeat/blob/master/rally/neutron/neutron-create-list-router-cc.yml with times set to 500

Actual results:
The cloud becomes totally unusable

Expected results:
The scenario should finish successfully

Additional info:
Karaf user CPU consumption during the test (14:54-15:04 UTC): https://snapshot.raintank.io/dashboard/snapshot/Kf1d2Qi9u7h85p5OczoQpWgIt0TB1NtG
Karaf RSS (14:54-15:04 UTC): https://snapshot.raintank.io/dashboard/snapshot/UXq1NQMHI34dhkLEsjGUnCTIuKpjvVw5
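For reference, a simplified sketch of the journal poll that each of the 112 worker threads (56 API + 56 RPC) issues against the opendaylightjournal table is shown below; the full statement shows up in the slow log in a later comment. The SHOW STATUS line is my assumption about which server counter the read_rnd_next graph maps to, included only to show how the table scans can be observed.

    -- Poll for the oldest pending journal row (simplified from the slow-log capture below):
    SELECT *
      FROM opendaylightjournal
     WHERE state = 'pending'
     ORDER BY last_retried ASC
     LIMIT 1
     FOR UPDATE;

    -- With no index on state, every poll walks the whole table; this is presumably the
    -- counter behind the read_rnd_next graph that climbed to ~1.4 million during the run:
    SHOW GLOBAL STATUS LIKE 'Handler_read_rnd_next';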
Created attachment 1267219 [details] filtered mysql slow_log showing timestamp and query_time
Dave, it would be great if you could post what you think is the slow query.
Created attachment 1267220 [details] mysql slow_query.log
The issue may be linked to the mysql ovs_neutron.opendaylightjournal table. Profiling with slow_query_log=on reveals the query: "SELECT opendaylightjournal.seqnum AS opendaylightjournal_seqnum, opendaylightjournal.object_type AS opendaylightjournal_object_type, opendaylightjournal.object_uuid AS opendaylightjournal_object_uuid, opendaylightjournal.operation AS opendaylightjournal_operation, opendaylightjournal.data AS opendaylightjournal_data, opendaylightjournal.state AS opendaylightjournal_state, opendaylightjournal.retry_count AS opendaylightjournal_retry_count, opendaylightjournal.created_at AS opendaylightjournal_created_at, opendaylightjournal.last_retried AS opendaylightjournal_last_retried FROM opendaylightjournal WHERE opendaylightjournal.state = 'pending' ORDER BY opendaylightjournal.last_retried ASC LIMIT 1 FOR UPDATE;" As the row count increases, the query lock time increases to what appear to be unacceptable values (see the attached 'time.out', parsed from the slow log, which shows the above query with timestamp, query_time, and lock_time). Possibly an index on "state" may mitigate rows_examined and reduce lock time. Also attached is the slow_query.log.
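To make the suggestion concrete, one way such an index could look is sketched below. This is only an illustration, not the upstream fix: Dave's comment calls for an index on state, and the composite over (state, last_retried) is my assumption, chosen so the ORDER BY can use the index as well.

    -- Hypothetical index covering both the WHERE and the ORDER BY of the slow query:
    CREATE INDEX ix_opendaylightjournal_state
        ON opendaylightjournal (state, last_retried);

    -- Re-check the statement; rows_examined should drop from the whole journal table
    -- to roughly the number of rows still in the 'pending' state:
    EXPLAIN SELECT seqnum
              FROM opendaylightjournal
             WHERE state = 'pending'
             ORDER BY last_retried ASC
             LIMIT 1;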
@Mike, was this fixed already? Should we target this for RHOSP 12, move to ON_QA and let the scale team test again?
(In reply to Nir Yechiel from comment #5)
> @Mike, was this fixed already? Should we target this for RHOSP 12, move to
> ON_QA and let the scale team test again?

It hasn't been fixed yet upstream. I'll get to it as soon as I can and will update the bug when the fix is available.
Mike - would you mind updating this bug? Please add a link to the patch, etc.
Verified this for OSP12, fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462