Description of problem:
cinder-volume uses 100% CPU even during a single volume creation. When multiple volumes are created at the same time - whether orchestrated by API calls or through Heat stacks - cinder-volume sometimes loses the MySQL connection during its queries, and the affected volume then hangs in the creation process forever. I notice that this problem only happens with a mix of empty volumes and bootable volumes. It looks like Cinder doesn't have enough resources to handle everything. I was able to reproduce the problem with an EMC VNX iSCSI backend in both OSP9 and OSP10 environments. Also, in my home lab environment with a tiny Ceph setup, I see the same behavior: 100% CPU usage even with a single creation.

Below is Cinder's oslo.db tuning:

[database]
idle_timeout = 60
min_pool_size = 100
max_pool_size = 100
max_retries = 20
max_overflow = 200
db_retry_interval = 2
db_inc_retry_interval = true
db_max_retry_interval = 2
db_max_retries = 50

Below is Cinder's oslo.messaging tuning:

[DEFAULT]
rpc_response_timeout = 600
executor_thread_pool_size = 120
rpc_conn_pool_size = 120

Version-Release number of selected component (if applicable):
RHEL 7.3 with the latest updates. Kernel: 3.10.0-514.26.2.el7
RH-OSP9, Cinder package: openstack-cinder-8.1.1-10.el7ost.noarch
RH-OSP10, Cinder package: openstack-cinder-9.1.4-3.el7ost.noarch
I didn't test with OSP11.

How reproducible:
It doesn't always happen. Below is a simple script to reproduce the error:

###
source ~/overcloudrc

for _ID in $(seq 0 80)
do
    if [[ "$((${_ID}%10))" == "0" ]]
    then
        openstack volume create --image 00fdb5c4-6dc3-4268-86fb-718af55641f9 --size 40 VOL-${_ID} 2>/dev/null &
    else
        openstack volume create --size 40 VOL-${_ID} 2>/dev/null &
    fi
done
###

You will need to use your own image ID; a CirrOS or even the default RHEL 7.3 image is not representative of a real production environment. I'm using customer images of about 9 to 10 GB.
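A note on what the [database] retry settings above actually bound; this is my reading of the options, sketched in shell, not oslo.db source code. With db_retry_interval = 2, db_inc_retry_interval = true and db_max_retry_interval = 2, the doubling backoff is capped at 2 seconds immediately, so a permanently failing connection is retried every 2 seconds up to db_max_retries = 50 times:

```shell
# Sketch of the worst-case retry schedule implied by the [database]
# settings above (my interpretation, not oslo.db code).
retry_interval=2        # db_retry_interval
max_retry_interval=2    # db_max_retry_interval
max_retries=50          # db_max_retries

total=0
interval=$retry_interval
for n in $(seq 1 "$max_retries"); do
    total=$((total + interval))
    # db_inc_retry_interval doubles the wait, capped at db_max_retry_interval
    interval=$((interval * 2))
    [ "$interval" -gt "$max_retry_interval" ] && interval=$max_retry_interval
done
echo "worst-case total retry wait: ${total}s"   # 50 retries x 2s = 100s
```

So with these values the DB layer gives up after roughly 100 seconds of failed reconnects rather than retrying indefinitely.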
Actual results:
Sometimes I get a "Lost connection to MySQL server" error and the volume hangs forever in the "creating" status.

Expected results:
Cinder shouldn't lose the MySQL connection. Cinder shouldn't use 100% CPU for a single volume creation.

Additional info:
################################
# Cinder LOG, when the "MySQL lost connection" happens
2017-07-28 02:46:29.332 2385 WARNING cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (a5651d57-537d-428b-bca6-2d69c5033d8a) transitioned into state 'FAILURE' from state 'RUNNING'
1 predecessors (most recent first):
  Flow 'volume_create_manager'
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager Traceback (most recent call last):
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/taskflow/engines/action_engine/executor.py", line 82, in _execute_task
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     result = task.execute(**arguments)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/volume/flows/manager/create_volume.py", line 210, in execute
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return objects.Volume.get_by_id(context, volume_id)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 181, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     result = fn(cls, context, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/objects/base.py", line 166, in get_by_id
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     orm_obj = db.get_by_id(context, model, id, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/api.py", line 1127, in get_by_id
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return IMPL.get_by_id(context, model, id, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 193, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return f(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 4466, in get_by_id
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return _GET_METHODS[model](context, id, *args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 193, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return f(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 1480, in volume_get
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return _volume_get(context, volume_id)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 193, in wrapper
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return f(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 1422, in _volume_get
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     result = result.filter_by(id=volume_id).first()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2634, in first
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     ret = list(self[0:1])
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2457, in __getitem__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return list(res)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2736, in __iter__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return self._execute_and_instances(context)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2749, in _execute_and_instances
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     close_with_result=True)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 2740, in _connection_from_session
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     **kw)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 905, in connection
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     execution_options=execution_options)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 912, in _connection_for_bind
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     conn = engine.contextual_connect(**kw)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2039, in contextual_connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self._wrap_pool_connect(self.pool.connect, None),
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2078, in _wrap_pool_connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     e, dialect, self)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1401, in _handle_dbapi_exception_noconnection
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     util.raise_from_cause(newraise, exc_info)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 200, in raise_from_cause
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     reraise(type(exception), exception, tb=exc_tb)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 2074, in _wrap_pool_connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return fn()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 376, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return _ConnectionFairy._checkout(self)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 713, in _checkout
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     fairy = _ConnectionRecord.checkout(pool)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 485, in checkout
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     rec.checkin()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     compat.reraise(exc_type, exc_value, exc_tb)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 482, in checkout
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     dbapi_connection = rec.get_connection()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 594, in get_connection
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self.connection = self.__connect()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/pool.py", line 607, in __connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     connection = self.__pool._invoke_creator(self)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/strategies.py", line 97, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return dialect.connect(*cargs, **cparams)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 385, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return self.dbapi.connect(*cargs, **cparams)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/__init__.py", line 88, in Connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     return Connection(*args, **kwargs)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 657, in __init__
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self.connect()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 851, in connect
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     self._request_authentication()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1034, in _request_authentication
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     auth_packet = self._read_packet()
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 895, in _read_packet
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     packet_header = self._read_bytes(4)
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 922, in _read_bytes
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager     2013, "Lost connection to MySQL server during query")
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
2017-07-28 02:46:29.332 2385 ERROR cinder.volume.manager
2017-07-28 02:46:29.337 2385 DEBUG cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (a5651d57-537d-428b-bca6-2d69c5033d8a) transitioned into state 'REVERTING' from state 'FAILURE' _task_receiver /usr/lib/python2.7/site-packages/taskflow/listeners/logging.py:189
2017-07-28 02:46:29.339 2385 WARNING cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Task 'cinder.volume.flows.manager.create_volume.ExtractVolumeRefTask;volume:create' (a5651d57-537d-428b-bca6-2d69c5033d8a) transitioned into state 'REVERTED' from state 'REVERTING' with result 'None'
2017-07-28 02:46:29.341 2385 WARNING cinder.volume.manager [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Flow 'volume_create_manager' (a1da4f5e-43bc-494d-a876-9adc5ea6e109) transitioned into state 'REVERTED' from state 'RUNNING'
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher [req-9d9233ea-4d62-46d7-bad5-77f3bd869836 285456467e7e49cf985d3e2bfe73a901 1b3e50d1b06c4335a41a55b733f54f32 - - -] Exception during message handling: 'Failure' object has no attribute '__getitem__'
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher Traceback (most recent call last):
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     incoming.message))
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _dispatch
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     return self._do_dispatch(endpoint, method, ctxt, args)
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     result = func(ctxt, **new_args)
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/manager.py", line 654, in create_volume
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     self._update_allocated_capacity(vol_ref)
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher   File "/usr/lib/python2.7/site-packages/cinder/volume/manager.py", line 2727, in _update_allocated_capacity
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher     pool = vol_utils.extract_host(vol['host'], 'pool')
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher TypeError: 'Failure' object has no attribute '__getitem__'
2017-07-28 02:46:29.341 2385 ERROR oslo_messaging.rpc.dispatcher
Sorry, I forgot to mention the MySQL configs:

MariaDB [(none)]> show variables like '%timeout%';
+----------------------------+----------+
| Variable_name              | Value    |
+----------------------------+----------+
| connect_timeout            | 28800    |
| deadlock_timeout_long      | 50000000 |
| deadlock_timeout_short     | 10000    |
| delayed_insert_timeout     | 300      |
| innodb_lock_wait_timeout   | 50       |
| innodb_rollback_on_timeout | OFF      |
| interactive_timeout        | 28800    |
| lock_wait_timeout          | 31536000 |
| net_read_timeout           | 600      |
| net_write_timeout          | 600      |
| slave_net_timeout          | 3600     |
| thread_pool_idle_timeout   | 60       |
| wait_timeout               | 28800    |
+----------------------------+----------+

The defaults I've changed are the following:
connect_timeout   (default: 10 seconds)
net_read_timeout  (default: 30 seconds)
net_write_timeout (default: 60 seconds)
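For completeness, the non-default values above could have been set either in my.cnf or at runtime; a sketch of the runtime form (an assumption about how they were applied, not taken from this report):

```sql
-- The three timeouts changed from their defaults, applied at runtime
-- (assumed form; a [mysqld] section in my.cnf plus a restart also works).
SET GLOBAL connect_timeout   = 28800;  -- default: 10 seconds
SET GLOBAL net_read_timeout  = 600;    -- default: 30 seconds
SET GLOBAL net_write_timeout = 600;    -- default: 60 seconds
```

Note that SET GLOBAL only affects connections opened afterwards; sessions already open keep the old session values.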
The reported MySQL connection drop is most likely related to high CPU usage on the node or to CPU-bound operations going on in Cinder. I also see some volume creation issues (on the July dates) where the scheduler could not find any host to place the volumes because the backends were down; according to the logs, the volume service was brought down on the 23rd of June and never brought up again. To analyze this issue we need full SOS reports in which the reported stack trace can be found: the log entries in the BZ description are from the 28th, while the latest logs we have are from the 13th of July.
This seems to be happening when creating volumes from an image, so I may have already fixed it in https://review.openstack.org/615934. We backported that fix to OSP14 (bz https://bugzilla.redhat.com/show_bug.cgi?id=1661356) and OSP13 (bz https://bugzilla.redhat.com/show_bug.cgi?id=1646940), so we would have to backport it to OSP10 as well.
Verified on:
openstack-cinder-9.1.4-49.el7ost.noarch

Configured Kaminario K2 iSCSI as Cinder's backend.

Bumped up the default Cinder quotas:
# openstack quota set --volumes 100 admin
# openstack quota set --gigabytes 2000 admin

Created an 8G disk filled with random data and uploaded it to Glance, once as raw and once as qcow2.

for _ID in $(seq 0 80)
do
    if [[ "$((${_ID}%10))" == "0" ]]
    then
        openstack volume create --image 8G.raw --size 40 VOL-${_ID} 2>/dev/null &
    else
        openstack volume create --size 40 VOL-${_ID} 2>/dev/null &
    fi
done

for i in $(cinder list | grep VOL | awk '{print $2}'); do cinder delete $i; done

I repeated these two loops several times, each cycle bumping up the number of volumes: 15, 30, 60, 70. Up to 70, all volumes were created successfully. I then retested with a qcow2 image instead of raw; again all volumes were created successfully. However, on reaching 80 volumes I started hitting errors on the volumes created from an image. Debugging with Gorka, we noticed we were hitting system throughput and backend capacity limits. By adding a "sleep 5" in the loop I got fewer volumes with errors. As these two factors come from my deployment's limitations, I'm ignoring them and verifying the BZ, since the system worked fine up to 70 volumes.
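For reference, the throttled variant of the loop that reduced the error rate might look like this; the sleep value is the one used during verification, and the openstack calls are skipped when the client isn't installed so the control flow can be checked standalone:

```shell
# Throttled reproduce loop: pace volume-create requests so the backend
# and the Glance-to-volume copies can keep up. Image name and sizes
# match the verification run above.
THROTTLE=5   # seconds between requests, per the verification comment
submitted=0
for _ID in $(seq 0 80); do
    if [ $((_ID % 10)) -eq 0 ]; then
        create_args="--image 8G.raw --size 40"   # every 10th volume is image-backed
    else
        create_args="--size 40"                  # the rest are empty volumes
    fi
    # Only issue the call when the OpenStack client is available.
    if command -v openstack >/dev/null 2>&1; then
        openstack volume create $create_args "VOL-${_ID}" 2>/dev/null &
        sleep "$THROTTLE"
    fi
    submitted=$((submitted + 1))
done
wait   # let any backgrounded creates finish submitting
echo "issued ${submitted} create requests"
```

This only throttles request submission; whether it avoids the errors entirely still depends on the backend's throughput and capacity.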
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:0917