Description of problem:
With OpenDaylight journaling enabled, we observe very high CPU usage by the mysqld process when running Browbeat+Rally performance tests.

The test environment is as follows:
3 OSP Controllers
3 Computes
3 ODLs clustered on the OpenStack Controllers

We ran the Browbeat+Rally neutron test suite with concurrencies of 8, 16 and 32 and times set to 500. In effect, neutron resources such as ports, networks and routers are created 500 times, with 8, 16 or 32 resources being created in parallel depending on the concurrency. After creation, the resources are deleted and the cycle is repeated. 12 Neutron worker processes were present in the environment.

mysqld CPU usage during the entire test duration:
https://snapshot.raintank.io/dashboard/snapshot/KLVyeojHtfX46pYI1YGNdIrhyeXE57FX

It peaks at around 43 cores on a 56-core machine and slowly returns to normal once the test suite runs have finished.

MySQL slow query log:
http://8.43.86.1:8088/smalleni/overcloud-controller-0-slow-profile.log

In some cases the Galera master on the controllers even changes: mysqld on controller-0 (master) dies and mysqld on controller-1 becomes the master midway through the test, as these graphs show.

mysqld threads go from ~2k to 0 on controller-0:
https://snapshot.raintank.io/dashboard/snapshot/9aO2m4pg07vYv6MnPVnGWJbsIU7VB1gQ?orgId=2

mysqld threads go from 29 to ~2k on controller-1:
https://snapshot.raintank.io/dashboard/snapshot/UJfUVcdcYIulb7vp8017fyZAYP9PRzkC?orgId=2

After the master changes from controller-0 to controller-1, similarly high CPU usage is seen on controller-1 as well:
https://snapshot.raintank.io/dashboard/snapshot/Qc6m1t05Hr3B6QSm56v7WB1Dd1hZ6v3Q
To compare, we ran the same test suite with ML2/OVS. mysqld CPU usage stayed low, with a maximum of 112%:
https://snapshot.raintank.io/dashboard/snapshot/9fjDBAbNZqt80fUi6ElAFJXy2R1ynoHm?orgId=2

Version-Release number of selected component (if applicable):
OSP12+ODL Carbon
python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch

How reproducible:

Steps to Reproduce:
1. Deploy overcloud with ODL
2. Run Browbeat+Rally stress tests
3. Monitor system utilization through Browbeat dashboards

Actual results:
High CPU usage of the mysqld process with OpenDaylight journaling

Expected results:
CPU usage should be roughly similar to the reference implementation with ML2/OVS

Additional info:
Mysqld CPU usage with ML2/OVS: https://snapshot.raintank.io/dashboard/snapshot/9fjDBAbNZqt80fUi6ElAFJXy2R1ynoHm?orgId=2
MySQL logs from another test run in which the primary changes (mysqld can be seen shutting down):
https://gist.githubusercontent.com/smalleni/5fb89d7826bcbcb5df0d204fe4bf49ae/raw/b0b1c1212dea5adc4263c34370e6cbc5c2b7a2b4/gistfile1.txt
The bottom line is that mysqld CPU usage spikes to 360x the ML2/OVS average for the same test of creating 500 routers, 8 at a time. It also results in a mysqld crash. This test seems more like a stress test than normal operation. Can we get a data point for the CPU usage during normal operation?
Tim,

I did some longevity testing over the weekend that roughly simulates normal cloud operation. I create 40 neutron resources 2 at a time and then delete all 40. The same test was run over and over again for more than 48 hours. So at no point were there more than 40 neutron resources present, nor were resources being created fast enough to "stress" the cloud (only 2 resources being created concurrently). After about 24 hours of operation, the CPU usage hovers consistently at around 3000%. Here is a link to Grafana:
https://snapshot.raintank.io/dashboard/snapshot/kcjv6kl7tLlD2cTGuT5RRjy45Ui7ho5O

It seems to be related to the number of rows in the opendaylightjournal table, since rows keep piling up. The number of rows after about 48 hours of operation is as follows:

MariaDB [(none)]> use ovs_neutron;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [ovs_neutron]> select count(*) from opendaylightjournal;
+----------+
| count(*) |
+----------+
|   182463 |
+----------+

Looks like all the entries are pending:

MariaDB [ovs_neutron]> select count(*) from opendaylightjournal where state='pending'
    -> ;
+----------+
| count(*) |
+----------+
|   182944 |
+----------+

There are several errors in the karaf logs as well as in the neutron server logs, and I am going to leave the system in its current state for some debugging. However, given that this high CPU usage is possible even during normal operation (most likely due to the large number of rows to scan), I feel this bug is a high priority to fix. Happy to provide more information.
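To see at a glance how the journal rows are split by state, the two count queries above can be folded into a single GROUP BY. The following is only a minimal sketch: it uses an in-memory SQLite database as a stand-in for the real MariaDB ovs_neutron database, the `id` column and the 'completed' state value are illustrative assumptions, and only the table name, the `state` column and the 'pending' value come from the session above.

```python
import sqlite3


def journal_counts_by_state(conn):
    """Return {state: row_count} for the opendaylightjournal table."""
    cur = conn.execute(
        "SELECT state, COUNT(*) FROM opendaylightjournal GROUP BY state"
    )
    return dict(cur.fetchall())


def demo_connection():
    """Build a tiny in-memory stand-in table (assumption: not the real schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE opendaylightjournal (id INTEGER PRIMARY KEY, state TEXT)"
    )
    # Illustrative rows only: 5 pending entries and 2 finished ones.
    rows = [("pending",)] * 5 + [("completed",)] * 2
    conn.executemany("INSERT INTO opendaylightjournal (state) VALUES (?)", rows)
    return conn


# Usage: counts = journal_counts_by_state(demo_connection())
```

Against the real database the same SELECT statement could be run directly at the MariaDB prompt; a count of pending rows that keeps growing between runs would match the behaviour described here.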
Sai,

If I understand correctly, the following scenario should verify the fix. Over 24 hours, repeat the following:
1. Create 20 routers and add interfaces to an internal network
2. Delete the interfaces and the routers

Then collect the CPU data and check that mysqld has not consumed too much CPU.
Yes, what I used was Rally with times set to 40 and concurrency set to 2. The scenario was to create routers. Keep running that same scenario for a long time (for example, wrap the rally command in a bash script) and observe the mysqld CPU usage. At no point should it peak. It is also worth inspecting the DB after the tests.
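The long-running loop described above could also be written in Python rather than bash. This is a rough sketch, not the script that was actually used: the task file name in the usage comment is hypothetical, and the `runner` and `clock` parameters exist only so the loop can be exercised without a real Rally installation.

```python
import subprocess
import time


def run_for_duration(cmd, duration_s, runner=subprocess.run, clock=time.monotonic):
    """Re-run cmd back to back until duration_s seconds have elapsed.

    Returns the list of return codes, one per completed iteration.
    """
    deadline = clock() + duration_s
    return_codes = []
    while clock() < deadline:
        result = runner(cmd)  # blocks until this iteration finishes
        return_codes.append(result.returncode)
    return return_codes


# Usage (hypothetical task file name, 48 hours of back-to-back runs):
#   codes = run_for_duration(
#       ["rally", "task", "start", "create-routers.yaml"], 48 * 3600
#   )
#   print(len(codes), "iterations,", sum(c != 0 for c in codes), "failed")
```

mysqld CPU usage would still be watched separately, for example through the Browbeat dashboards, while the loop runs.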
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0617
I have rerun the scale tests on OSP13 + ODL Oxygen and can confirm that I am no longer seeing this. mysqld CPU usage never goes above 1 core. https://snapshot.raintank.io/dashboard/snapshot/orgKEjMKRFqM5qYE9TW5YEVm5b5byc30