Bug 1487788 - High CPU usage of Mysqld process with opendaylight journaling
Summary: High CPU usage of Mysqld process with opendaylight journaling
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-networking-odl
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: beta
Target Release: 13.0 (Queens)
Assignee: Mike Kolesnik
QA Contact: Sai Sindhur Malleni
URL:
Whiteboard: scale_lab
Depends On:
Blocks:
 
Reported: 2017-09-01 22:24 UTC by Sai Sindhur Malleni
Modified: 2018-10-18 07:19 UTC (History)
CC List: 4 users

Fixed In Version: python-networking-odl-11.0.0-3.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
N/A
Last Closed: 2018-03-28 19:08:05 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1717018 0 None None None 2017-10-18 07:23:07 UTC
OpenStack gerrit 504043 0 None MERGED Delete completed rows immediately when retention=0 2020-07-19 06:21:18 UTC
OpenStack gerrit 516030 0 None MERGED Delete completed rows immediately when retention=0 2020-07-19 06:21:18 UTC
Red Hat Product Errata RHBA-2018:0617 0 None None None 2018-03-28 19:08:59 UTC

Description Sai Sindhur Malleni 2017-09-01 22:24:28 UTC
Description of problem:

With opendaylight journaling, we are observing a very high CPU usage pattern for the mysqld process when running Browbeat+Rally performance tests.


The test environment is as follows:
3 OSP Controllers
3 Computes
3 ODLs clustered on OpenStack Controllers

We ran the Browbeat+Rally neutron test suite with concurrencies of 8, 16, and 32 and times set to 500. Essentially, neutron resources such as ports, networks, and routers are each created 500 times, with 8, 16, or 32 resources being created in parallel depending on the concurrency. After creation the resources are deleted and the cycle is repeated over and over again.
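For illustration, the workload boils down to something like the following Python sketch (this is not the Browbeat/Rally code itself; the cloud name, the use of openstacksdk, and the choice of networks as the resource are assumptions):

# Hypothetical stand-in for the Browbeat+Rally scenarios described above:
# create TIMES resources with a fixed CONCURRENCY, delete them all, repeat.
# Assumes a clouds.yaml entry named "overcloud" for the overcloud.
from concurrent.futures import ThreadPoolExecutor

import openstack

TIMES = 500        # total resources per cycle (rally "times")
CONCURRENCY = 8    # parallel creations (rally "concurrency": 8, 16 or 32)

conn = openstack.connect(cloud="overcloud")

def create_network(i):
    return conn.network.create_network(name="rally-net-%d" % i)

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    networks = list(pool.map(create_network, range(TIMES)))

# Delete everything; the test suite repeats this cycle over and over.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    list(pool.map(conn.network.delete_network, networks))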


12 Neutron worker processes were present in the environment


mysqld CPU usage during the entire test duration: https://snapshot.raintank.io/dashboard/snapshot/KLVyeojHtfX46pYI1YGNdIrhyeXE57FX

We can see that it reaches a peak of around 43 cores on a 56-core machine and slowly returns to normal once the test suite runs finish.

MySQL slow query log: http://8.43.86.1:8088/smalleni/overcloud-controller-0-slow-profile.log


In some cases we see that even the Galera master changes on the controllers:


Mysqld on controller-0 (the master) dies and mysqld on controller-1 becomes the master midway through the test, as can be seen in these graphs:

mysqld threads go from ~2k to 0 on controller-0:  https://snapshot.raintank.io/dashboard/snapshot/9aO2m4pg07vYv6MnPVnGWJbsIU7VB1gQ?orgId=2

mysqld threads go from 29 to ~2k on controller-1: https://snapshot.raintank.io/dashboard/snapshot/UJfUVcdcYIulb7vp8017fyZAYP9PRzkC?orgId=2
After controller-1 becomes the master, similar high CPU usage is seen on it as well: https://snapshot.raintank.io/dashboard/snapshot/Qc6m1t05Hr3B6QSm56v7WB1Dd1hZ6v3Q


Even after the master changes from controller-0 to controller-1, high CPU usage is observed on controller-1.

To compare, we ran the same test suite with ML2/OVS:
Low CPU usage by mysqld was observed, with a maximum of 112%: https://snapshot.raintank.io/dashboard/snapshot/9fjDBAbNZqt80fUi6ElAFJXy2R1ynoHm?orgId=2

Version-Release number of selected component (if applicable):

OSP12+ODL Carbon
python-networking-odl-11.0.0-0.20170806093629.2e78dca.el7ost.noarch
How reproducible:


Steps to Reproduce:
1. Deploy overcloud with ODL
2. Run Browbeat+Rally stress tests
3. Monitor system utilization through Browbeat dashboards

Actual results:

High CPU usage of mysqld process with opendaylight journaling

Expected results:
CPU usage should be more or less similar to the reference ML2/OVS implementation

Additional info:

Comment 1 Sai Sindhur Malleni 2017-09-01 22:25:15 UTC
Mysqld CPU usage with ML2/OVS: https://snapshot.raintank.io/dashboard/snapshot/9fjDBAbNZqt80fUi6ElAFJXy2R1ynoHm?orgId=2

Comment 2 Sai Sindhur Malleni 2017-09-08 15:07:52 UTC
MySQL logs during another test run where the primary changes (it can be seen that mysqld is shutting down)
https://gist.githubusercontent.com/smalleni/5fb89d7826bcbcb5df0d204fe4bf49ae/raw/b0b1c1212dea5adc4263c34370e6cbc5c2b7a2b4/gistfile1.txt

Comment 3 Tim Rozet 2017-09-08 15:26:34 UTC
Bottom line here is that MySQL CPU is spiking to 360x the ML2/OVS average CPU for the same test of creating 500 routers, 8 at a time. It also results in a SQL crash. This test seems more like a stress test than normal operation. Can we get a data point for the CPU usage during normal operation?

Comment 5 Sai Sindhur Malleni 2017-09-11 03:13:14 UTC
Tim,

So, I did some longevity testing over the weekend which pretty much simulates normal cloud operation. I create 40 neutron resources 2 at a time and then delete all 40. The same test was run over and over again, for more than 48 hours. So at no given point in time were there more than 40 neutron resources present, nor were resources being created fast enough to "stress" the cloud (only 2 resources being created concurrently). After about 24 hours of operation, we see the CPU usage hovering at around 3000% consistently. Here is a link to Grafana https://snapshot.raintank.io/dashboard/snapshot/kcjv6kl7tLlD2cTGuT5RRjy45Ui7ho5O

It seems to be related to the number of rows in the opendaylightjournal table, since rows keep piling up. The number of rows after about 48 hours of operation is as follows:

MariaDB [(none)]> use ovs_neutron;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [ovs_neutron]> select count(*) from opendaylightjournal;
+----------+
| count(*) |
+----------+
|   182463 |
+----------+


Looks like all the entries are pending:
MariaDB [ovs_neutron]> select count(*) from opendaylightjournal where state='pending'
    -> ;
+----------+
| count(*) |
+----------+
|   182944 |
+----------+

There are several errors in the karaf logs as well as the neutron server logs, and I am going to leave the system in its current state for some debugging. However, given that this high CPU usage is possible even during normal operation (most likely due to the large number of rows to scan), I feel this bug is high priority to fix. Happy to provide more information.
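For reference, the journal backlog can be watched by state with something like the rough sketch below (host, credentials, and the use of PyMySQL are placeholders; the ovs_neutron database and the opendaylightjournal table with its state column are taken from the session above):

# Rough monitoring sketch: count opendaylightjournal rows grouped by state.
# Connection details are placeholders for the MariaDB/Galera on a controller.
import pymysql

conn = pymysql.connect(host="localhost", user="root",
                       password="secret", database="ovs_neutron")
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT state, COUNT(*) FROM opendaylightjournal GROUP BY state")
        for state, count in cur.fetchall():
            print("%s: %d" % (state, count))
finally:
    conn.close()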

Comment 19 Itzik Brown 2018-03-22 09:14:55 UTC
Sai,
If I understand correctly, the following scenario should verify the fix:

Over 24 hours do the following
1. Create 20 routers and add interfaces to an internal network
2. Delete the interfaces and the routers 

Get the CPU data and check that Mysqld hasn't consumed too much CPU.

Comment 20 Sai Sindhur Malleni 2018-03-22 12:53:28 UTC
Yeah, what I used was Rally with times set to 40 and concurrency set to 2. The scenario was to create routers. Keep running the same scenario mentioned above for a long time (wrap the rally command for the above scenario in a bash script, for example) and observe the mysqld usage. At no point should it spike. It is also worth inspecting the DB after the tests.
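For example, something along these lines (a rough sketch, not the exact commands used; the task file name and the psutil-based CPU sampling are assumptions):

# Rough verification loop: rerun the rally task repeatedly and sample
# mysqld CPU usage between iterations. "create-routers.yaml" is a placeholder.
import subprocess
import time

import psutil

def mysqld_cpu_percent():
    # Sum CPU% over any mysqld processes on this controller.
    total = 0.0
    for proc in psutil.process_iter(["name"]):
        if proc.info["name"] == "mysqld":
            total += proc.cpu_percent(interval=1.0)
    return total

for i in range(100):
    subprocess.run(["rally", "task", "start", "create-routers.yaml"],
                   check=True)
    print("iteration %d: mysqld CPU %.0f%%" % (i, mysqld_cpu_percent()))
    time.sleep(60)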

Comment 23 errata-xmlrpc 2018-03-28 19:08:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0617

Comment 24 Sai Sindhur Malleni 2018-04-13 14:12:28 UTC
I have rerun the scale tests on OSP13 + ODL Oxygen and can confirm that I am no longer seeing this. Mysqld CPU usage never goes above 1 core.

https://snapshot.raintank.io/dashboard/snapshot/orgKEjMKRFqM5qYE9TW5YEVm5b5byc30

