Description of problem: Nova-conductor is deadlocking when accessing the db All controllers nova.conf points towards the VIP for mariadb, yet nova-conductor and only nova-conductor throws the following errror. 4-09-15 15:40:22.574 34384 ERROR nova.servicegroup.drivers.db [-] model server went away 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db Traceback (most recent call last): 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 218, in service_update 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/utils.py", line 967, in wrapper 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db return func(*args, **kwargs) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db return func(*args, **kwargs) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 491, in service_update 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db svc = self.db.service_update(context, service['id'], values) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/db/api.py", line 148, in service_update 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db return IMPL.service_update(context, service_id, values) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 146, in wrapper 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db return f(*args, **kwargs) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 533, in service_update 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db service_ref.update(values) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 447, in __exit__ 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db self.rollback() 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 58, in __exit__ 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db compat.reraise(exc_type, exc_value, exc_tb) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 444, in __exit__ 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db self.commit() 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 358, in commit 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db t[1].commit() 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1195, in commit 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db self._do_commit() 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _do_commit 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db self.connection._commit_impl() 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 491, in _commit_impl 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db self._handle_dbapi_exception(e, None, None, None, None) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1024, in _handle_dbapi_exception 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db exc_info 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 196, in raise_from_cause 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db reraise(type(exception), exception, tb=exc_tb) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 488, in _commit_impl 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db self.engine.dialect.do_commit(self.connection) 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2030, in do_commit 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db dbapi_connection.commit() 2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') None None
Russel, looks like it didnt make a difference We updated all the installed packages in all controllers and nodes and restarted all nova services but the deadlock error still there, as you can see in the attached screenshot that I¹m adding into the ticket sometimes the error is displayed into horizon interface and our customer will see it. 2014-09-19 16:51:55.642 39633 TRACE nova.servicegroup.drivers.db 2014-09-19 16:51:55.676 39652 ERROR nova.servicegroup.drivers.db [-] model server went away 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db Traceback (most recent call last): 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line 95, in _report_state 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db service.service_ref, state_catalog) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 218, in service_update 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db return self._manager.service_update(context, service, values) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/utils.py", line 967, in wrapper 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db return func(*args, **kwargs) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139, in inner 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db return func(*args, **kwargs) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 491, in service_update 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db svc = self.db.service_update(context, service['id'], values) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/db/api.py", line 148, in service_update 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db return IMPL.service_update(context, service_id, values) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 146, in wrapper 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db return f(*args, **kwargs) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 202, in wrapped 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db return f(*args, **kwargs) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 534, in service_update 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db service_ref.update(values) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 447, in __exit__ 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db self.rollback() 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 58, in __exit__ 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db compat.reraise(exc_type, exc_value, exc_tb) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 444, in __exit__ 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db self.commit() 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 358, in commit 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db t[1].commit() 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1195, in commit 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db self._do_commit() 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226, in _do_commit 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db self.connection._commit_impl() 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 491, in _commit_impl 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db self._handle_dbapi_exception(e, None, None, None, None) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1024, in _handle_dbapi_exception 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db exc_info 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 196, in raise_from_cause 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db reraise(type(exception), exception, tb=exc_tb) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 488, in _commit_impl 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db self.engine.dialect.do_commit(self.connection) 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db File "/usr/lib64/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py", line 2030, in do_commit 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db dbapi_connection.commit() 2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db OperationalError: (OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') None None
Serra Armani, Renato commented: "Some news is: if I ban galera in 2 controllers and let it up and running in just one of the Controllers this error stops and instances are launched 20 times faster than when I have more than one node. I performed more than 10 tests launching 10 instances on the same time and I didn't receive the error. So looks like the DeadLock error is directly caused by Galera sync delays I'm still thinking if is not better to change the Galera cluster to Active/passive? Instead still trying to make it work in active/active. Do you think it can fix that bug?
Mysql Log, when deadlocking 140919 17:47:36 [Note] WSREP: BF kill (1, seqno: 19588074), victim: (1962) trx: 54282168 140919 17:47:36 [Note] WSREP: Aborting query: commit 140919 17:47:36 [Note] WSREP: victim 54282168 in state 7 140919 17:47:36 [Note] WSREP: COMMIT failed, MDL released: 1962 140919 17:47:36 [Note] WSREP: cert failure, thd: 1962 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:47:36 [Note] WSREP: commit failed for reason: 3 140919 17:47:36 [Note] WSREP: conflict state: 0 140919 17:47:36 [Note] WSREP: cluster conflict due to certification failure for threads: 140919 17:47:36 [Note] WSREP: Victim thread: THD: 1963, mode: local, state: executing, conflict: cert failure, seqno: 19588078 SQL: commit 140919 17:47:36 [Note] WSREP: COMMIT failed, MDL released: 1963 140919 17:47:36 [Note] WSREP: cert failure, thd: 1963 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:47:36 [Note] WSREP: cluster conflict due to high priority abort for threads: 140919 17:47:36 [Note] WSREP: commit failed for reason: 3 140919 17:47:36 [Note] WSREP: conflict state: 0 140919 17:47:36 [Note] WSREP: cluster conflict due to certification failure for threads: 140919 17:47:36 [Note] WSREP: Victim thread: THD: 2012, mode: local, state: executing, conflict: cert failure, seqno: 19588102 SQL: commit 140919 17:47:36 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 19588101 SQL: (null) 140919 17:47:36 [Note] WSREP: Victim thread: THD: 2012, mode: local, state: executing, conflict: cert failure, seqno: 19588102 SQL: commit 140919 17:47:36 [Note] WSREP: BF kill (1, seqno: 19588101), victim: (2012) trx: 54282221 140919 17:47:36 [Note] WSREP: Aborting query: commit 140919 17:47:36 [Note] WSREP: victim 54282221 in state 7 140919 17:47:36 [Note] WSREP: COMMIT failed, MDL released: 2012 140919 17:47:36 [Note] WSREP: cert failure, thd: 2012 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:47:46 [Note] WSREP: cluster conflict due to high priority abort for threads: 140919 17:47:46 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 19588236 SQL: (null) 140919 17:47:46 [Note] WSREP: Victim thread: THD: 1953, mode: local, state: committing, conflict: no conflict, seqno: 19588238 SQL: commit 140919 17:47:46 [Note] WSREP: BF kill (1, seqno: 19588236), victim: (1953) trx: 54283255 140919 17:47:46 [Note] WSREP: Aborting query: commit 140919 17:47:46 [Note] WSREP: kill query for: 1953 140919 17:47:46 [Note] WSREP: kill trx QUERY_COMMITTING for 54283255 140919 17:47:46 [Note] WSREP: commit failed for reason: 3 140919 17:47:46 [Note] WSREP: COMMIT failed, MDL released: 1953 140919 17:47:46 [Note] WSREP: BF Aborted, thd: 1953 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:47:46 [Note] WSREP: cluster conflict due to high priority abort for threads: 140919 17:47:46 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 19588250 SQL: (null) 140919 17:47:46 [Note] WSREP: Victim thread: THD: 1962, mode: local, state: committing, conflict: no conflict, seqno: 19588253 SQL: commit 140919 17:47:46 [Note] WSREP: BF kill (1, seqno: 19588250), victim: (1962) trx: 54283283 140919 17:47:46 [Note] WSREP: Aborting query: commit 140919 17:47:46 [Note] WSREP: kill query for: 1962 140919 17:47:46 [Note] WSREP: kill trx QUERY_COMMITTING for 54283283 140919 17:47:46 [Note] WSREP: commit failed for reason: 3 140919 17:47:46 [Note] WSREP: COMMIT failed, MDL released: 1962 140919 17:47:46 [Note] WSREP: BF Aborted, thd: 1962 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:47:46 [Note] WSREP: cluster conflict due to high priority abort for threads: 140919 17:47:46 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 19588264 SQL: (null) 140919 17:47:46 [Note] WSREP: Victim thread: THD: 2012, mode: local, state: committing, conflict: no conflict, seqno: 19588265 SQL: commit 140919 17:47:46 [Note] WSREP: BF kill (1, seqno: 19588264), victim: (2012) trx: 54283340 140919 17:47:46 [Note] WSREP: Aborting query: commit 140919 17:47:46 [Note] WSREP: commit failed for reason: 3 140919 17:47:46 [Note] WSREP: conflict state: 0 140919 17:47:46 [Note] WSREP: cluster conflict due to certification failure for threads: 140919 17:47:46 [Note] WSREP: Victim thread: THD: 2012, mode: local, state: executing, conflict: cert failure, seqno: 19588265 SQL: commit 140919 17:47:46 [Note] WSREP: victim 54283340 in state 7 140919 17:47:46 [Note] WSREP: COMMIT failed, MDL released: 2012 140919 17:47:46 [Note] WSREP: cert failure, thd: 2012 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:48:06 [Note] WSREP: cluster conflict due to high priority abort for threads: 140919 17:48:06 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 19588573 SQL: (null) 140919 17:48:06 [Note] WSREP: Victim thread: THD: 1953, mode: local, state: committing, conflict: no conflict, seqno: 19588575 SQL: commit 140919 17:48:06 [Note] WSREP: BF kill (1, seqno: 19588573), victim: (1953) trx: 54285471 140919 17:48:06 [Note] WSREP: Aborting query: commit 140919 17:48:06 [Note] WSREP: kill query for: 1953 140919 17:48:06 [Note] WSREP: kill trx QUERY_COMMITTING for 54285471 140919 17:48:06 [Note] WSREP: commit failed for reason: 3 140919 17:48:06 [Note] WSREP: COMMIT failed, MDL released: 1953 140919 17:48:06 [Note] WSREP: BF Aborted, thd: 1953 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:48:06 [Note] WSREP: cluster conflict due to high priority abort for threads: 140919 17:48:06 [Note] WSREP: commit failed for reason: 3 140919 17:48:06 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 19588591 SQL: (null) 140919 17:48:06 [Note] WSREP: Victim thread: THD: 1962, mode: local, state: executing, conflict: no conflict, seqno: 19588592 SQL: commit 140919 17:48:06 [Note] WSREP: BF kill (1, seqno: 19588591), victim: (1962) trx: 54285506 140919 17:48:06 [Note] WSREP: Aborting query: commit 140919 17:48:06 [Note] WSREP: conflict state: 0 140919 17:48:06 [Note] WSREP: cluster conflict due to certification failure for threads: 140919 17:48:06 [Note] WSREP: Victim thread: THD: 1962, mode: local, state: executing, conflict: cert failure, seqno: 19588592 SQL: commit 140919 17:48:06 [Note] WSREP: victim 54285506 in state 7 140919 17:48:06 [Note] WSREP: COMMIT failed, MDL released: 1962 140919 17:48:06 [Note] WSREP: cert failure, thd: 1962 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:48:06 [Note] WSREP: cluster conflict due to high priority abort for threads: 140919 17:48:06 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 19588613 SQL: (null) 140919 17:48:06 [Note] WSREP: Victim thread: THD: 2012, mode: local, state: committing, conflict: no conflict, seqno: 19588617 SQL: commit 140919 17:48:06 [Note] WSREP: BF kill (1, seqno: 19588613), victim: (2012) trx: 54285552 140919 17:48:06 [Note] WSREP: Aborting query: commit 140919 17:48:06 [Note] WSREP: kill query for: 2012 140919 17:48:06 [Note] WSREP: kill trx QUERY_COMMITTING for 54285552 140919 17:48:06 [Note] WSREP: commit failed for reason: 3 140919 17:48:06 [Note] WSREP: COMMIT failed, MDL released: 2012 140919 17:48:06 [Note] WSREP: BF Aborted, thd: 2012 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:48:16 [Note] WSREP: commit failed for reason: 3 140919 17:48:16 [Note] WSREP: conflict state: 0 140919 17:48:16 [Note] WSREP: cluster conflict due to certification failure for threads: 140919 17:48:16 [Note] WSREP: Victim thread: THD: 1963, mode: local, state: executing, conflict: cert failure, seqno: 19588701 SQL: commit 140919 17:48:16 [Note] WSREP: COMMIT failed, MDL released: 1963 140919 17:48:16 [Note] WSREP: cert failure, thd: 1963 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:48:26 [Note] WSREP: commit failed for reason: 3 140919 17:48:26 [Note] WSREP: conflict state: 0 140919 17:48:26 [Note] WSREP: cluster conflict due to certification failure for threads: 140919 17:48:26 [Note] WSREP: Victim thread: THD: 1962, mode: local, state: executing, conflict: cert failure, seqno: 19588918 SQL: commit 140919 17:48:26 [Note] WSREP: cluster conflict due to high priority abort for threads: 140919 17:48:26 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 19588917 SQL: (null) 140919 17:48:26 [Note] WSREP: Victim thread: THD: 1962, mode: local, state: executing, conflict: cert failure, seqno: 19588918 SQL: commit 140919 17:48:26 [Note] WSREP: BF kill (1, seqno: 19588917), victim: (1962) trx: 54286316 140919 17:48:26 [Note] WSREP: Aborting query: commit 140919 17:48:26 [Note] WSREP: victim 54286316 in state 7 140919 17:48:26 [Note] WSREP: COMMIT failed, MDL released: 1962 140919 17:48:26 [Note] WSREP: cert failure, thd: 1962 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:48:26 [Note] WSREP: commit failed for reason: 3 140919 17:48:26 [Note] WSREP: conflict state: 0 140919 17:48:26 [Note] WSREP: cluster conflict due to certification failure for threads: 140919 17:48:26 [Note] WSREP: Victim thread: THD: 1963, mode: local, state: executing, conflict: cert failure, seqno: 19588921 SQL: commit 140919 17:48:26 [Note] WSREP: COMMIT failed, MDL released: 1963 140919 17:48:26 [Note] WSREP: cert failure, thd: 1963 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:48:36 [Note] WSREP: commit failed for reason: 3 140919 17:48:36 [Note] WSREP: conflict state: 0 140919 17:48:36 [Note] WSREP: cluster conflict due to certification failure for threads: 140919 17:48:36 [Note] WSREP: Victim thread: THD: 1953, mode: local, state: executing, conflict: cert failure, seqno: 19589031 SQL: commit 140919 17:48:36 [Note] WSREP: COMMIT failed, MDL released: 1953 140919 17:48:36 [Note] WSREP: cert failure, thd: 1953 is_AC: 0, retry: 0 - 1 SQL: commit 140919 17:48:46 [Note] WSREP: cluster conflict due to high priority abort for threads: 140919 17:48:46 [Note] WSREP: commit failed for reason: 3 140919 17:48:46 [Note] WSREP: Winning thread: THD: 2, mode: applier, state: executing, conflict: no conflict, seqno: 19589164 SQL: (null) 140919 17:48:46 [Note] WSREP: conflict state: 0 140919 17:48:46 [Note] WSREP: cluster conflict due to certification failure for threads: 140919 17:48:46 [Note] WSREP: Victim thread: THD: 1953, mode: local, state: executing, conflict: cert failure, seqno: 19589165 SQL: commit 140919 17:48:46 [Note] WSREP: Victim thread: THD: 1953, mode: local, state: executing, conflict: no conflict, seqno: 19589165 SQL: commit 140919 17:48:46 [Note] WSREP: BF kill (1, seqno: 19589164), victim: (1953) trx: 54286886 140919 17:48:46 [Note] WSREP: Aborting query: commit 140919 17:48:46 [Note] WSREP: victim 54286886 in state 7 140919 17:48:46 [Note] WSREP: COMMIT failed, MDL released: 1953 140919 17:48:46 [Note] WSREP: cert failure, thd: 1953 is_AC: 0, retry: 0 - 1 SQL: commit
(In reply to Jeff Dexter from comment #4) > Serra Armani, Renato commented: > "Some news is: > if I ban galera in 2 controllers and let it up and running in just one of > the Controllers this error stops and instances are launched 20 times faster > than when I have more than one node. > > I performed more than 10 tests launching 10 instances on the same time and I > didn't receive the error. > > So looks like the DeadLock error is directly caused by Galera sync delays > > I'm still thinking if is not better to change the Galera cluster to > Active/passive? Instead still trying to make it work in active/active. Do > you think it can fix that bug? Galera only really works in active/active mode. Limiting writes to a single node might resolve this though. -- Vossel
(In reply to David Vossel from comment #6) > (In reply to Jeff Dexter from comment #4) > > Serra Armani, Renato commented: > > "Some news is: > > if I ban galera in 2 controllers and let it up and running in just one of > > the Controllers this error stops and instances are launched 20 times faster > > than when I have more than one node. > > > > I performed more than 10 tests launching 10 instances on the same time and I > > didn't receive the error. > > > > So looks like the DeadLock error is directly caused by Galera sync delays > > > > I'm still thinking if is not better to change the Galera cluster to > > Active/passive? Instead still trying to make it work in active/active. Do > > you think it can fix that bug? > > Galera only really works in active/active mode. Limiting writes to a single > node might resolve this though. That's supposed to be the case with our default configuration. Can we verify that is indeed the case here?
Mike, can you take a look at the traceback here? Is nova's retry decorator not catching the right exception in this case? With the decorator in place, in theory we shouldn't be seeing tracebacks like this.
there's an older system of producing these DBDeadlockErrors that in this version of oslo.db won't work when the error is produced as a result of .commit(), as is the case in this stacktrace. We fixed this in oslo.db with a rewrite of the exception propagation system, however in this case I can produce short patches to the old system that will cover at least this particular case.
So I have reviews ready for both oslo-incubator: https://review.openstack.org/123244 and nova: https://review.openstack.org/123248 relative to the stable/icehouse branches. Will get guidance from russell on how to continue with these patches.
(In reply to Michael Bayer from comment #10) > So I have reviews ready for both oslo-incubator: > > https://review.openstack.org/123244 > > and nova: > > https://review.openstack.org/123248 > > relative to the stable/icehouse branches. Will get guidance from russell > on how to continue with these patches. I'll review these upstream tomorrow, and I'll also work with Nikola to get the fix into our package. Jeff, you can tell the customer that we've identified an additional bug and will have another hot fix on the way.
the wrapper is still not being invoked in that stack trace. possibilities include something unexpected in how nova is using the session or the code isn't updated correctly.
OK it looks like we will need another spot where this condition needs to be wrapped, I have the code ready here and will need to push it through the review process again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2014-1782.html