Bug 1476213 - Cinder Volume lose MySQL connection during query under heavy load
Cinder Volume lose MySQL connection during query under heavy load
Status: ASSIGNED
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder (Show other bugs)
9.0 (Mitaka)
Unspecified Unspecified
high Severity high
: ---
: 9.0 (Mitaka)
Assigned To: Alan Bishop
Avi Avraham
: Triaged, ZStream
Depends On:
Blocks: 1509062
  Show dependency treegraph
 
Reported: 2017-07-28 06:06 EDT by fiezzi
Modified: 2018-05-24 21:59 EDT (History)
8 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1509062 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None (edit)
Description fiezzi 2017-07-28 06:06:05 EDT
Description of problem:
Cinder-Volume even during a single creation use 100% CPU. If multiple volumes creation happens at the same time - either orchestrated by API calls or through Heat Stacks - sometimes Cinder-Volume loses the MySQL connection during the queries. The result is that the volume hangs in the creation process forever.

I notice that this problem only happens with a mix of empty volumes and bootable volumes.
It looks like Cinder doesn't have enough resources to handle everything.

I was successfully able to reproduce the problem with an EMC VNX iSCSI backend in OSP9 and OSP10 environment as well.
Also, in my home lab environment with a tiny Ceph setup, I see the same behavior: 100% cpu usage even with a single creation.

Below the Cinder's olso.db tuning
[database]
idle_timeout = 60
min_pool_size = 100
max_pool_size = 100
max_retries = 20
max_overflow = 200
db_retry_interval = 2
db_inc_retry_interval = true
db_max_retry_interval = 2
db_max_retries = 50

Below the Cinder's oslo.messaging tuning
[DEFAULT]
rpc_response_timeout = 600
executor_thread_pool_size = 120
rpc_conn_pool_size = 120

Version-Release number of selected component (if applicable):
RHEL 7.3 latest updates.
Kernel: 3.10.0-514.26.2.el7
RH-OSP9, Cinder package: openstack-cinder-8.1.1-10.el7ost.noarc
RH-OSP10, Cinder package: openstack-cinder-9.1.4-3.el7ost.noarch

I didn't test with OSP11.

How reproducible:
First off, it doesn't always happen. Below a simple script to reproduce the error.
###
source ~/overcloudrc

for _ID in $(seq 0 80)
do
        if [[ "$((${_ID}%10))" == "0" ]]
        then
                openstack volume create --image 00fdb5c4-6dc3-4268-86fb-718af55641f9 --size 40 VOL-${_ID} 2>/dev/null &
        else
                openstack volume create --size 40 VOL-${_ID} 2>/dev/null &
        fi
done
###
You will need to use another image ID, maybe not a CirrOS or even the default RHEL 7.3 because is not representative of a real production environment.
I'm using customer images of about 9 to 10GB size.


Actual results:
Sometimes I got MySQL lost connection and the volume hangs forever in creation status.
When the "MySQL lost connection" happens 

Expected results:
Cinder shouldn't lose the MySQL connection.
Cinder shouldn't use 100% CPU for a single volume creation. 

Additional info:
################################
# Cinder LOG

2017-07-28 02:46:29.332 2385 WARNING cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (a5651d57-537d-428b-bca6-2d69c5033d8a) transitioned into state 'FAILURE' from state 'RUNNING'
1 predecessors (most recent first):
  Flow 'volume_create_manager'
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager Traceback (most recent call last):
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/taskflow/engines/action_engine/executor.py", line 82, in _execute_task
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     result = task.execute(**arguments)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/volume/flows/manager/create_volume.py", line 210, in execute
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return objects.Volume.get_by_id(context, volume_id)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 181, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     result = fn(cls, context, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/objects/base.py", line 166, in get_by_id
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     orm_obj = db.get_by_id(context, model, id, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/api.py", line 1127, in get_by_id
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return IMPL.get_by_id(context, model, id, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 193, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return f(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 4466, in get_by_id
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return _GET_METHODS[model](context, id, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 193, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return f(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 1480, in volume_get
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return _volume_get(context, volume_id)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 193, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return f(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 1422, in _volume_get
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     result = result.filter_by(id=volume_id).first()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2634, in first
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     ret = list(self[0:1])
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2457, in __getitem__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return list(res)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2736, in __iter__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return self._execute_and_instances(context)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2749, in _execute_and_instances
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     close_with_result=True)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2740, in _connection_from_session
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     **kw)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 905, in connection
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     execution_options=execution_options)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 912, in _connection_for_bind
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     conn = engine.contextual_connect(**kw)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2039, in contextual_connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self._wrap_pool_connect(self.pool.connect, None),
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2078, in _wrap_pool_connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     e, dialect, self)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1401, in _handle_dbapi_exception_noconnection
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     util.raise_from_cause(newraise, exc_info)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 200, in raise_from_cause
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     reraise(type(exception), exception, tb=exc_tb)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2074, in _wrap_pool_connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return fn()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 376, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return _ConnectionFairy._checkout(self)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 713, in _checkout
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     fairy = _ConnectionRecord.checkout(pool)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 485, in checkout
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     rec.checkin()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     compat.reraise(exc_type, exc_value, exc_tb)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 482, in checkout
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     dbapi_connection = rec.get_connection()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 594, in get_connection
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self.connection = self.__connect()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 607, in __connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     connection = self.__pool._invoke_creator(self)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 97, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return dialect.connect(*cargs, **cparams)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 385, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return self.dbapi.connect(*cargs, **cparams)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return Connection(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self.connect()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 851, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self._request_authentication()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1034, in _request_authentication
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     auth_packet = self._read_packet()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 895, in _read_packet
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     packet_header = self._read_bytes(4)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 922, in _read_bytes
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     2013, "Lost connection to MySQL server during query")
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager 
2017-07-28 02:46:29.337 2385 DEBUG cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (a5651d57-537d-428b-bca6-2d69c5033d8a) transitioned into state 'REVERTING' from state 'FAILURE' _task_receiver /usr/lib/python2.7/site-packages/taskflow/listeners/logging.py:189
2017-07-28 02:46:29.339 2385 WARNING cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (a5651d57-537d-428b-bca6-2d69c5033d8a) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
2017-07-28 02:46:29.341 2385 WARNING cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Flow 'volume_create_manager' (a1da4f5e-43bc-494d-a876-9adc5ea6e109) transitioned into state 'REVERTED' from state 'RUNNING'
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Exception during message handling: 'Failure' object has no attribute '__getitem__'
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     incoming.message))
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _dispatch
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     result = func(ctxt, **new_args)
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/manager.py", line 654, in create_volume
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     self._update_allocated_capacity(vol_ref)
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/manager.py", line 2727, in _update_allocated_capacity
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     pool = vol_utils.extract_host(vol['host'], 'pool')
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher TypeError: 'Failure' object has no attribute '__getitem__'
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher
Comment 2 fiezzi 2017-07-28 06:19:29 EDT
Sorry, I forget to mention the MySQL configs

MariaDB [(none)]> show variables like '%timeout%';
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| connect_timeout            | 28800    |
| deadlock_timeout_long      | 50000000 |
| deadlock_timeout_short     | 10000    |
| delayed_insert_timeout     | 300      |
| innodb_lock_wait_timeout   | 50       |
| innodb_rollback_on_timeout | OFF      |
| interactive_timeout        | 28800    |
| lock_wait_timeout          | 31536000 |
| net_read_timeout           | 600      |
| net_write_timeout          | 600      |
| slave_net_timeout          | 3600     |
| thread_pool_idle_timeout   | 60       |
| wait_timeout               | 28800    |
+----------------------------+----------+

The default timeouts I've changed are the following:
connect_timeout default to 10 seconds
net_read_timeout default to 30 seconds
net_write_timeout default to 60 seconds
Comment 6 Gorka Eguileor 2017-09-12 16:00:29 EDT
Reported MySQL connection drop is most likely related to high CPU usage on the node or cpu bound operations going on in Cinder.

I see some volume creation issues related to the scheduler not finding (on July dates) any host to place the volumes because the backends are down, and according to the logs the volume service was brought down on 23rd of June and never brought up again.

To analyze this issue we need full SOS reports where the reported stack trace can be found.  The log entries from the BZ description are from the 28th, while the latest logs are from the 13th of July.

Note You need to log in before you can comment on or make changes to this bug.