Bug 1476213 - [OSP10] Cinder Volume lose MySQL connection during query under heavy load
Summary: [OSP10] Cinder Volume lose MySQL connection during query under heavy load
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: z6
Target Release: 10.0 (Newton)
Assignee: Sofia Enriquez
QA Contact: Tzach Shefi
URL:
Whiteboard:
Depends On:
Blocks: 1509062
 
Reported: 2017-07-28 10:06 UTC by Federico Iezzi
Modified: 2022-07-09 10:25 UTC
CC List: 10 users

Fixed In Version: openstack-cinder-9.1.4-46.el7ost
Doc Type: Bug Fix
Doc Text:
Previously, image operations in Cinder failed due to file I/O issues. With this update, you can create volumes from an image in Cinder without file I/O issues.
Clone Of:
: 1509062 (view as bug list)
Environment:
Last Closed: 2019-04-30 16:58:37 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker OSP-11306 0 None None None 2021-12-10 15:18:38 UTC
Red Hat Product Errata RHSA-2019:0917 0 None None None 2019-04-30 16:58:44 UTC

Description Federico Iezzi 2017-07-28 10:06:05 UTC
Description of problem:
Cinder-Volume uses 100% CPU even during a single volume creation. If multiple volume creations happen at the same time - either orchestrated by API calls or through Heat stacks - Cinder-Volume sometimes loses the MySQL connection during the queries. The result is that the volumes hang in the creation process forever.

I noticed that this problem only happens with a mix of empty volumes and bootable volumes.
It looks like Cinder doesn't have enough resources to handle everything.

I was able to reproduce the problem with an EMC VNX iSCSI backend in both OSP9 and OSP10 environments.
Also, in my home lab environment with a tiny Ceph setup, I see the same behavior: 100% CPU usage even with a single creation.

Below is Cinder's oslo.db tuning:
[database]
idle_timeout = 60
min_pool_size = 100
max_pool_size = 100
max_retries = 20
max_overflow = 200
db_retry_interval = 2
db_inc_retry_interval = true
db_max_retry_interval = 2
db_max_retries = 50

Below is Cinder's oslo.messaging tuning:
[DEFAULT]
rpc_response_timeout = 600
executor_thread_pool_size = 120
rpc_conn_pool_size = 120

Version-Release number of selected component (if applicable):
RHEL 7.3 latest updates.
Kernel: 3.10.0-514.26.2.el7
RH-OSP9, Cinder package: openstack-cinder-8.1.1-10.el7ost.noarch
RH-OSP10, Cinder package: openstack-cinder-9.1.4-3.el7ost.noarch

I didn't test with OSP11.

How reproducible:
First off, it doesn't always happen. Below is a simple script to reproduce the error.
###
source ~/overcloudrc

for _ID in $(seq 0 80)
do
        if [[ "$((${_ID}%10))" == "0" ]]
        then
                openstack volume create --image 00fdb5c4-6dc3-4268-86fb-718af55641f9 --size 40 VOL-${_ID} 2>/dev/null &
        else
                openstack volume create --size 40 VOL-${_ID} 2>/dev/null &
        fi
done
###
You will need to use a different image ID; a CirrOS image, or even the default RHEL 7.3 one, is not representative of a real production environment.
I'm using customer images of about 9 to 10 GB in size.
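
While the script runs, a rough way to watch for both symptoms is sketched below (illustrative only, assuming python-openstackclient and the sysstat package are available; these commands are not part of the original report):

# Count of volumes stuck in the creating state
watch -n 10 'openstack volume list --status creating -f value -c ID | wc -l'

# CPU usage of the cinder-volume process (run on the controller hosting it)
pidstat -u -p $(pgrep -of cinder-volume) 5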


Actual results:
Sometimes I get a "Lost connection to MySQL server" error, and when that happens the volume hangs forever in the creating status.

Expected results:
Cinder shouldn't lose the MySQL connection.
Cinder shouldn't use 100% CPU for a single volume creation. 

Additional info:
################################
# Cinder LOG

2017-07-28 02:46:29.332 2385 WARNING cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (a5651d57-537d-428b-bca6-2d69c5033d8a) transitioned into state 'FAILURE' from state 'RUNNING'
1 predecessors (most recent first):
  Flow 'volume_create_manager'
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager Traceback (most recent call last):
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/taskflow/engines/action_engine/executor.py", line 82, in _execute_task
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     result = task.execute(**arguments)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/volume/flows/manager/create_volume.py", line 210, in execute
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return objects.Volume.get_by_id(context, volume_id)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 181, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     result = fn(cls, context, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/objects/base.py", line 166, in get_by_id
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     orm_obj = db.get_by_id(context, model, id, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/api.py", line 1127, in get_by_id
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return IMPL.get_by_id(context, model, id, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 193, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return f(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 4466, in get_by_id
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return _GET_METHODS[model](context, id, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 193, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return f(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 1480, in volume_get
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return _volume_get(context, volume_id)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 193, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return f(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 1422, in _volume_get
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     result = result.filter_by(id=volume_id).first()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2634, in first
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     ret = list(self[0:1])
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2457, in __getitem__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return list(res)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2736, in __iter__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return self._execute_and_instances(context)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2749, in _execute_and_instances
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     close_with_result=True)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2740, in _connection_from_session
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     **kw)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 905, in connection
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     execution_options=execution_options)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 912, in _connection_for_bind
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     conn = engine.contextual_connect(**kw)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2039, in contextual_connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self._wrap_pool_connect(self.pool.connect, None),
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2078, in _wrap_pool_connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     e, dialect, self)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1401, in _handle_dbapi_exception_noconnection
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     util.raise_from_cause(newraise, exc_info)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 200, in raise_from_cause
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     reraise(type(exception), exception, tb=exc_tb)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2074, in _wrap_pool_connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return fn()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 376, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return _ConnectionFairy._checkout(self)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 713, in _checkout
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     fairy = _ConnectionRecord.checkout(pool)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 485, in checkout
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     rec.checkin()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     compat.reraise(exc_type, exc_value, exc_tb)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 482, in checkout
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     dbapi_connection = rec.get_connection()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 594, in get_connection
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self.connection = self.__connect()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 607, in __connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     connection = self.__pool._invoke_creator(self)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 97, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return dialect.connect(*cargs, **cparams)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 385, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return self.dbapi.connect(*cargs, **cparams)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return Connection(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self.connect()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 851, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self._request_authentication()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1034, in _request_authentication
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     auth_packet = self._read_packet()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 895, in _read_packet
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     packet_header = self._read_bytes(4)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 922, in _read_bytes
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     2013, "Lost connection to MySQL server during query")
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager 
2017-07-28 02:46:29.337 2385 DEBUG cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (a5651d57-537d-428b-bca6-2d69c5033d8a) transitioned into state 'REVERTING' from state 'FAILURE' _task_receiver /usr/lib/python2.7/site-packages/taskflow/listeners/logging.py:189
2017-07-28 02:46:29.339 2385 WARNING cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (a5651d57-537d-428b-bca6-2d69c5033d8a) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
2017-07-28 02:46:29.341 2385 WARNING cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Flow 'volume_create_manager' (a1da4f5e-43bc-494d-a876-9adc5ea6e109) transitioned into state 'REVERTED' from state 'RUNNING'
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Exception during message handling: 'Failure' object has no attribute '__getitem__'
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     incoming.message))
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _dispatch
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     result = func(ctxt, **new_args)
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/manager.py", line 654, in create_volume
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     self._update_allocated_capacity(vol_ref)
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/manager.py", line 2727, in _update_allocated_capacity
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     pool = vol_utils.extract_host(vol['host'], 'pool')
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher TypeError: 'Failure' object has no attribute '__getitem__'
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher

Comment 2 Federico Iezzi 2017-07-28 10:19:29 UTC
Sorry, I forgot to mention the MySQL configuration:

MariaDB [(none)]> show variables like '%timeout%';
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| connect_timeout            | 28800    |
| deadlock_timeout_long      | 50000000 |
| deadlock_timeout_short     | 10000    |
| delayed_insert_timeout     | 300      |
| innodb_lock_wait_timeout   | 50       |
| innodb_rollback_on_timeout | OFF      |
| interactive_timeout        | 28800    |
| lock_wait_timeout          | 31536000 |
| net_read_timeout           | 600      |
| net_write_timeout          | 600      |
| slave_net_timeout          | 3600     |
| thread_pool_idle_timeout   | 60       |
| wait_timeout               | 28800    |
+----------------------------+----------+

The timeouts I've changed from their defaults are the following:
connect_timeout (default: 10 seconds)
net_read_timeout (default: 30 seconds)
net_write_timeout (default: 60 seconds)
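
For reference, a minimal sketch of how the values shown in the table above could be applied at runtime (illustrative only; SET GLOBAL affects new connections and is lost on restart, so in this environment the overrides were presumably also made persistent in the MariaDB configuration files):

mysql -e "SET GLOBAL connect_timeout = 28800;"
mysql -e "SET GLOBAL net_read_timeout = 600;"
mysql -e "SET GLOBAL net_write_timeout = 600;"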

Comment 6 Gorka Eguileor 2017-09-12 20:00:29 UTC
The reported MySQL connection drop is most likely related to high CPU usage on the node or to CPU-bound operations going on in Cinder.

I see some volume creation issues (on July dates) related to the scheduler not finding any host to place the volumes on, because the backends are down; according to the logs, the volume service was brought down on the 23rd of June and never brought back up.

To analyze this issue we need full SOS reports in which the reported stack trace can be found. The log entries from the BZ description are from the 28th of July, while the latest logs provided are from the 13th of July.
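
For anyone collecting the requested data, a standard SOS report from each controller should do (this is just the usual invocation, not specific to this bug; exact options may vary with the sos version):

# Run as root on each controller; the tarball ends up under /var/tmp
sosreport --batch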

Comment 16 Gorka Eguileor 2019-01-14 17:04:26 UTC
This seems to be happening when creating volumes from an image, so I may have already fixed it on https://review.openstack.org/615934
We backported the fix to OSP14 (bz https://bugzilla.redhat.com/show_bug.cgi?id=1661356) and OSP13 (bz https://bugzilla.redhat.com/show_bug.cgi?id=1646940), so we would have to backport it to OSP10 as well.

Comment 28 Tzach Shefi 2019-04-04 12:42:07 UTC
Verified on:
openstack-cinder-9.1.4-49.el7ost.noarch

Configured Kaminario K2 iSCSI as Cinder's backend

Bumped up the default Cinder quotas:
#openstack quota set --volumes 100 admin
#openstack quota set --gigabytes 2000 admin

Created an 8 GB disk image filled with random data and uploaded it to Glance, once as raw and once as qcow2.
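
A sketch of roughly how such images could be prepared (illustrative only, not the exact commands used; file and image names are assumed to match the script below):

# 8 GB of random data as a raw image, plus a qcow2 conversion of the same data
dd if=/dev/urandom of=8G.raw bs=1M count=8192
qemu-img convert -f raw -O qcow2 8G.raw 8G.qcow2

openstack image create --disk-format raw --container-format bare --file 8G.raw 8G.raw
openstack image create --disk-format qcow2 --container-format bare --file 8G.qcow2 8G.qcow2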



for _ID in $(seq 0 80)
do
        if [[ "$((${_ID}%10))" == "0" ]]
        then
                openstack volume create --image 8G.raw --size 40 VOL-${_ID} 2>/dev/null &
        else
                openstack volume create --size 40 VOL-${_ID} 2>/dev/null &
        fi
done

for i in $(cinder list | grep VOL | awk '{print $2}'); do cinder delete $i; done 


I repeated these two loops several times,
each cycle bumping up the number of volumes: 15, 30, 60, 70.
Up to 70, all volumes were created successfully.
I then retested with a qcow2 image instead of raw;
again, all volumes were created successfully.


However, on reaching 80 volumes I started hitting errors on the volumes created from an image.
Debugging with Gorka, we noticed we had hit system throughput and backend capacity issues.
By adding "sleep 5" to the loop I got fewer volumes with errors.

As these two factors are related to my deployment's limitations,
I'm ignoring them and verifying this BZ, as the system worked fine up to 70 volumes.

Comment 30 errata-xmlrpc 2019-04-30 16:58:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:0917

