Bug 1568512 - cinder-volume hangs after all controller nodes reset
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Alan Bishop
QA Contact: Avi Avraham
Docs Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-17 16:27 UTC by Marian Krcmarik
Modified: 2020-01-07 15:32 UTC
CC List: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-07 15:32:17 UTC
Target Upstream Version:
Embargoed:



Description Marian Krcmarik 2018-04-17 16:27:33 UTC
Description of problem:
cinder-volume hangs after all controller nodes are reset on an OpenStack deployment consisting of three controller nodes that host all the controller services, including haproxy, the database (galera), rabbitmq, and cinder-volume.

There is a traceback in the cinder-volume log right before the hang, related to a lost DB connection (see below); more logs will be attached with a sosreport.
I tried to reproduce it by stopping all galera resources and restarting cinder-volume, but did not manage to. I see the problem consistently in OSP12 and OSP13 disaster recovery jobs.

2018-04-12 11:20:54.551 76 ERROR oslo_service.service [req-649ca7af-76e4-4fbf-8eb3-09c5dbc5eea5 - - - - -] Error starting thread.: DBConnectionError: (pymysql
.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'UPDATE services SET updated_at=%(updated_at)s, replication_status=%(repl
ication_status)s WHERE services.deleted = false AND services.id = %(id_1)s'] [parameters: {u'id_1': 5, 'replication_status': u'disabled', 'updated_at': dateti
me.datetime(2018, 4, 12, 11, 20, 53, 708739)}] (Background on this error at: http://sqlalche.me/e/e3q8)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service Traceback (most recent call last):
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 729, in run_service
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     service.start()
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/cinder/service.py", line 267, in start
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     self.manager.init_host_with_rpc()
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/cinder/volume/manager.py", line 556, in init_host_with_rpc
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     service.save()
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/cinder/objects/service.py", line 181, in save
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     db.service_update(self._context, self.id, updates)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/cinder/db/api.py", line 154, in service_update
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     return IMPL.service_update(context, service_id, values)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 164, in wrapper
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     return f(*args, **kwargs)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 147, in wrapper
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     ectxt.value = e.inner_exc
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     self.force_reraise()
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     six.reraise(self.type_, self.value, self.tb)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 135, in wrapper
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     return f(*args, **kwargs)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/cinder/db/sqlalchemy/api.py", line 560, in service_update
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     result = query.update(values)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/query.py", line 3369, in update
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     update_op.exec_()
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 1325, in exec_
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     self._do_exec()
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 1492, in _do_exec
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     self._execute_stmt(update_stmt)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 1332, in _execute_stmt
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     mapper=self.mapper)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1176, in execute
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     bind, close_with_result=True).execute(clause, params or {})
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 948, in execute
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     return meth(self, multiparams, params)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 269, in _execute_on_connection
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     return connection._execute_clauseelement(self, multiparams, params)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1060, in _execute_clauseelement
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     compiled_sql, distilled_params
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1200, in _execute_context
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     context)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1409, in _handle_dbapi_exception
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     util.raise_from_cause(newraise, exc_info)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 203, in raise_from_cause
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     reraise(type(exception), exception, tb=exc_tb, cause=cause)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     context)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 507, in do_execute
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     cursor.execute(statement, parameters)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 166, in execute
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     result = self._query(query)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/cursors.py", line 322, in _query
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     conn.query(q)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 856, in query
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1057, in _read_query_result
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     result.read()
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1340, in read
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     first_packet = self.connection._read_packet()
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 987, in _read_packet
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     packet_header = self._read_bytes(4)
2018-04-12 11:20:54.551 76 ERROR oslo_service.service   File "/usr/lib/python2.7/site-packages/pymysql/connections.py", line 1033, in _read_bytes
2018-04-12 11:20:54.551 76 ERROR oslo_service.service     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2018-04-12 11:20:54.551 76 ERROR oslo_service.service DBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'UPDATE services SET updated_at=%(updated_at)s, replication_status=%(replication_status)s WHERE servi
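
For context on what the traceback shows: the failure happens during startup, in init_host_with_rpc() -> service.save() -> db.service_update(), when the MySQL connection drops mid-query and oslo.db raises DBConnectionError. The snippet below is only an illustrative sketch of the kind of oslo.db retry wrapping that can guard such a call; it is not the actual cinder code path and not a proposed fix, and the function name update_service_record is hypothetical.

    # Illustrative sketch only -- hypothetical helper, not cinder's real code.
    # oslo.db's wrap_db_retry can re-run a DB call when it fails with
    # DBConnectionError, e.g. "Lost connection to MySQL server during query".
    from oslo_db import api as oslo_db_api

    from cinder import db  # provides db.service_update(), as seen in the traceback


    @oslo_db_api.wrap_db_retry(max_retries=5,
                               retry_on_disconnect=True,
                               retry_interval=1,
                               inc_retry_interval=True)
    def update_service_record(context, service_id, values):
        # Retried up to max_retries times if the MySQL connection is lost;
        # if every retry fails, DBConnectionError propagates to the caller,
        # which is what the traceback above shows happening during service start.
        return db.service_update(context, service_id, values)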

Version-Release number of selected component (if applicable):
openstack-cinder-12.0.1-0.20180326201852.46c4ec1.el7ost.noarch


How reproducible:
Often

Steps to Reproduce:
1. Deploy OSP with 3 controllers hosting the OpenStack services, including cinder-volume and DB/messaging
2. Ungracefully reset all the controller nodes at once (see the sketch below)
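
A minimal sketch of step 2, assuming IPMI access to the controllers' BMCs; the hostnames, credentials, and use of ipmitool here are illustrative assumptions, not the exact procedure used by the DR jobs:

    # Hypothetical reproduction helper -- BMC addresses and credentials are made up.
    # Sends a hard "power reset" to every controller at roughly the same time,
    # i.e. an ungraceful reset with no clean shutdown of galera/rabbitmq/cinder-volume.
    import subprocess

    CONTROLLER_BMCS = ["controller-0-bmc", "controller-1-bmc", "controller-2-bmc"]

    procs = [
        subprocess.Popen(["ipmitool", "-I", "lanplus", "-H", bmc,
                          "-U", "admin", "-P", "password", "power", "reset"])
        for bmc in CONTROLLER_BMCS
    ]
    for p in procs:
        p.wait()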

Actual results:
cinder-volume hangs at startup:
    2018-04-12 11:20:54.582 76 DEBUG oslo_concurrency.lockutils [req-649ca7af-76e4-4fbf-8eb3-09c5dbc5eea5 - - - - -] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
    2018-04-12 11:20:54.582 76 DEBUG oslo_concurrency.lockutils [req-649ca7af-76e4-4fbf-8eb3-09c5dbc5eea5 - - - - -] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:228
    2018-04-12 11:21:24.583 76 WARNING oslo_messaging.server [req-649ca7af-76e4-4fbf-8eb3-09c5dbc5eea5 - - - - -] Possible hang: wait is waiting for stop to complete

Additional info:
The hang happened on controller-0 in the attached logs.

Comment 10 Alan Bishop 2018-06-27 12:52:08 UTC
Still under investigation, updating flags to target 13z.

Comment 14 Alan Bishop 2019-11-18 20:38:28 UTC
Based on bug #1640388 comment #19, I wonder whether this BZ is a duplicate of that issue. A fix for that issue was released in OSP-13z7 (bug #1693615).

Comment 16 Alan Bishop 2020-01-07 15:32:17 UTC
I'm confident in the assessment I stated in comment #14. If the problem is observed again, this BZ can be reopened.

