Bug 1141972 - nova conductor's log is full of deadlock errors
Summary: nova conductor's log is full of deadlock errors
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-nova
Version: 5.0 (RHEL 7)
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: z2
Target Release: 5.0 (RHEL 7)
Assignee: Russell Bryant
QA Contact: Sean Toner
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-09-15 21:50 UTC by Jeff Dexter
Modified: 2022-07-09 07:13 UTC
CC List: 13 users

Fixed In Version: openstack-nova-2014.1.3-3.el6ost, openstack-nova-2014.1.2-5.el7ost
Doc Type: Bug Fix
Doc Text:
Cause: Unhandled database deadlock conditions triggered by some database configuration edge cases.
Consequence: Lost database transactions.
Fix: Retry actions on deadlock conditions.
Result: Robust database communication in all cases.
Clone Of:
Environment:
Last Closed: 2014-11-03 08:26:02 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1370191 0 None None None Never
OpenStack gerrit 121940 0 None MERGED db: Add @_retry_on_deadlock to service_update() 2020-10-12 18:50:45 UTC
OpenStack gerrit 122146 0 None MERGED db: Add @_retry_on_deadlock to service_update() 2020-10-12 18:50:45 UTC
OpenStack gerrit 122147 0 None MERGED db: Add @_retry_on_deadlock to service_update() 2020-10-12 18:50:45 UTC
OpenStack gerrit 123244 0 None MERGED Add _wrap_db_error() support to Session.commit() 2020-10-12 18:50:45 UTC
OpenStack gerrit 123248 0 None ABANDONED Add _wrap_db_error() support to Session.commit() 2020-10-12 18:50:56 UTC
OpenStack gerrit 124515 0 None MERGED Add _wrap_db_error() support to SessionTransaction.commit() 2020-10-12 18:50:45 UTC
OpenStack gerrit 124883 0 None MERGED Add @_retry_on_deadlock to _instance_update() 2020-10-12 18:50:46 UTC
Red Hat Issue Tracker OSP-16547 0 None None None 2022-07-09 07:13:07 UTC
Red Hat Product Errata RHSA-2014:1782 0 normal SHIPPED_LIVE Important: openstack-nova security, bug fix, and enhancement update 2014-11-03 13:25:42 UTC

Description Jeff Dexter 2014-09-15 21:50:19 UTC
Description of problem:
Nova-conductor is deadlocking when accessing the db

Every controller's nova.conf points at the VIP for MariaDB, yet nova-conductor, and only nova-conductor, throws the following error.


2014-09-15 15:40:22.574 34384 ERROR nova.servicegroup.drivers.db [-] model
server went away
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db Traceback
(most recent call last):
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line
95, in _report_state
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
service.service_ref, state_catalog)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 218, in
service_update
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
return self._manager.service_update(context, service, values)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/utils.py", line 967, in wrapper
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
return func(*args, **kwargs)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139,
in inner
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
return func(*args, **kwargs)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 491, in
service_update
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db     svc =
self.db.service_update(context, service['id'], values)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/db/api.py", line 148, in
service_update
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
return IMPL.service_update(context, service_id, values)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 146, in
wrapper
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
return f(*args, **kwargs)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 533, in
service_update
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
service_ref.update(values)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 447,
in __exit__
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
self.rollback()
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line
58, in __exit__
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
compat.reraise(exc_type, exc_value, exc_tb)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 444,
in __exit__
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
self.commit()
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 358,
in commit
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
t[1].commit()
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1195,
in commit
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
self._do_commit()
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226,
in _do_commit
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
self.connection._commit_impl()
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 491,
in _commit_impl
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
self._handle_dbapi_exception(e, None, None, None, None)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1024,
in _handle_dbapi_exception
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
exc_info
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 196,
in raise_from_cause
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
reraise(type(exception), exception, tb=exc_tb)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 488,
in _commit_impl
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
self.engine.dialect.do_commit(self.connection)
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py",
line 2030, in do_commit
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
dbapi_connection.commit()
2014-09-15 15:40:22.574 34384 TRACE nova.servicegroup.drivers.db
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to
get lock; try restarting transaction') None None

Comment 3 Jeff Dexter 2014-09-19 20:22:47 UTC
Russell, it looks like it didn't make a difference.

We updated all the installed packages on all controllers and nodes and
restarted all nova services, but the deadlock error is still there. As you
can see in the attached screenshot that I'm adding to the ticket, the error
is sometimes displayed in the Horizon interface and our customer will see it.

2014-09-19 16:51:55.642 39633 TRACE nova.servicegroup.drivers.db
2014-09-19 16:51:55.676 39652 ERROR nova.servicegroup.drivers.db [-] model
server went away
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db Traceback
(most recent call last):
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/servicegroup/drivers/db.py", line
95, in _report_state
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
service.service_ref, state_catalog)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/conductor/api.py", line 218, in
service_update
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
return self._manager.service_update(context, service, values)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/utils.py", line 967, in wrapper
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
return func(*args, **kwargs)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/oslo/messaging/rpc/server.py", line 139,
in inner
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
return func(*args, **kwargs)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/conductor/manager.py", line 491, in
service_update
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db     svc =
self.db.service_update(context, service['id'], values)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/db/api.py", line 148, in
service_update
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
return IMPL.service_update(context, service_id, values)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 146, in
wrapper
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
return f(*args, **kwargs)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 202, in
wrapped
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
return f(*args, **kwargs)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 534, in
service_update
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
service_ref.update(values)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 447,
in __exit__
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
self.rollback()
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line
58, in __exit__
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
compat.reraise(exc_type, exc_value, exc_tb)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 444,
in __exit__
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
self.commit()
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 358,
in commit
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
t[1].commit()
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1195,
in commit
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
self._do_commit()
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1226,
in _do_commit
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
self.connection._commit_impl()
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 491,
in _commit_impl
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
self._handle_dbapi_exception(e, None, None, None, None)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1024,
in _handle_dbapi_exception
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
exc_info
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 196,
in raise_from_cause
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
reraise(type(exception), exception, tb=exc_tb)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 488,
in _commit_impl
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
self.engine.dialect.do_commit(self.connection)
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db   File
"/usr/lib64/python2.7/site-packages/sqlalchemy/dialects/mysql/base.py",
line 2030, in do_commit
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
dbapi_connection.commit()
2014-09-19 16:51:55.676 39652 TRACE nova.servicegroup.drivers.db
OperationalError: (OperationalError) (1213, 'Deadlock found when trying to
get lock; try restarting transaction') None None

Comment 4 Jeff Dexter 2014-09-19 20:27:50 UTC
Serra Armani, Renato commented:
"Some news is:
if I ban galera in 2 controllers and let it up and running in just one of the Controllers this error stops and instances are launched 20 times faster than when I have more than one node.

I performed more than 10 tests launching 10 instances on the same time and I didn't receive the error.

So looks like the DeadLock error is directly caused by Galera sync delays

I'm still thinking if is not better to change the Galera cluster to Active/passive? Instead still trying to make it work in active/active. Do you think it can fix that bug?

Comment 5 Jeff Dexter 2014-09-20 10:47:02 UTC
MySQL log when deadlocking:

140919 17:47:36 [Note] WSREP: BF kill (1, seqno: 19588074), victim: (1962)
trx: 54282168
140919 17:47:36 [Note] WSREP: Aborting query: commit
140919 17:47:36 [Note] WSREP: victim 54282168 in state 7
140919 17:47:36 [Note] WSREP: COMMIT failed, MDL released: 1962
140919 17:47:36 [Note] WSREP: cert failure, thd: 1962 is_AC: 0, retry: 0 -
1 SQL: commit
140919 17:47:36 [Note] WSREP: commit failed for reason: 3
140919 17:47:36 [Note] WSREP: conflict state: 0
140919 17:47:36 [Note] WSREP: cluster conflict due to certification
failure for threads:
140919 17:47:36 [Note] WSREP: Victim thread:
   THD: 1963, mode: local, state: executing, conflict: cert failure,
seqno: 19588078
   SQL: commit
140919 17:47:36 [Note] WSREP: COMMIT failed, MDL released: 1963
140919 17:47:36 [Note] WSREP: cert failure, thd: 1963 is_AC: 0, retry: 0 -
1 SQL: commit
140919 17:47:36 [Note] WSREP: cluster conflict due to high priority abort
for threads:
140919 17:47:36 [Note] WSREP: commit failed for reason: 3
140919 17:47:36 [Note] WSREP: conflict state: 0
140919 17:47:36 [Note] WSREP: cluster conflict due to certification
failure for threads:
140919 17:47:36 [Note] WSREP: Victim thread:
   THD: 2012, mode: local, state: executing, conflict: cert failure,
seqno: 19588102
   SQL: commit
140919 17:47:36 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno:
19588101
   SQL: (null)
140919 17:47:36 [Note] WSREP: Victim thread:
   THD: 2012, mode: local, state: executing, conflict: cert failure,
seqno: 19588102
   SQL: commit
140919 17:47:36 [Note] WSREP: BF kill (1, seqno: 19588101), victim: (2012)
trx: 54282221
140919 17:47:36 [Note] WSREP: Aborting query: commit
140919 17:47:36 [Note] WSREP: victim 54282221 in state 7
140919 17:47:36 [Note] WSREP: COMMIT failed, MDL released: 2012
140919 17:47:36 [Note] WSREP: cert failure, thd: 2012 is_AC: 0, retry: 0 -
1 SQL: commit
140919 17:47:46 [Note] WSREP: cluster conflict due to high priority abort
for threads:
140919 17:47:46 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno:
19588236
   SQL: (null)
140919 17:47:46 [Note] WSREP: Victim thread:
   THD: 1953, mode: local, state: committing, conflict: no conflict,
seqno: 19588238
   SQL: commit
140919 17:47:46 [Note] WSREP: BF kill (1, seqno: 19588236), victim: (1953)
trx: 54283255
140919 17:47:46 [Note] WSREP: Aborting query: commit
140919 17:47:46 [Note] WSREP: kill query for: 1953
140919 17:47:46 [Note] WSREP: kill trx QUERY_COMMITTING for 54283255
140919 17:47:46 [Note] WSREP: commit failed for reason: 3
140919 17:47:46 [Note] WSREP: COMMIT failed, MDL released: 1953
140919 17:47:46 [Note] WSREP: BF Aborted, thd: 1953 is_AC: 0, retry: 0 - 1
SQL: commit
140919 17:47:46 [Note] WSREP: cluster conflict due to high priority abort
for threads:
140919 17:47:46 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno:
19588250
   SQL: (null)
140919 17:47:46 [Note] WSREP: Victim thread:
   THD: 1962, mode: local, state: committing, conflict: no conflict,
seqno: 19588253
   SQL: commit
140919 17:47:46 [Note] WSREP: BF kill (1, seqno: 19588250), victim: (1962)
trx: 54283283
140919 17:47:46 [Note] WSREP: Aborting query: commit
140919 17:47:46 [Note] WSREP: kill query for: 1962
140919 17:47:46 [Note] WSREP: kill trx QUERY_COMMITTING for 54283283
140919 17:47:46 [Note] WSREP: commit failed for reason: 3
140919 17:47:46 [Note] WSREP: COMMIT failed, MDL released: 1962
140919 17:47:46 [Note] WSREP: BF Aborted, thd: 1962 is_AC: 0, retry: 0 - 1
SQL: commit
140919 17:47:46 [Note] WSREP: cluster conflict due to high priority abort
for threads:
140919 17:47:46 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno:
19588264
   SQL: (null)
140919 17:47:46 [Note] WSREP: Victim thread:
   THD: 2012, mode: local, state: committing, conflict: no conflict,
seqno: 19588265
   SQL: commit
140919 17:47:46 [Note] WSREP: BF kill (1, seqno: 19588264), victim: (2012)
trx: 54283340
140919 17:47:46 [Note] WSREP: Aborting query: commit
140919 17:47:46 [Note] WSREP: commit failed for reason: 3
140919 17:47:46 [Note] WSREP: conflict state: 0
140919 17:47:46 [Note] WSREP: cluster conflict due to certification
failure for threads:
140919 17:47:46 [Note] WSREP: Victim thread:
   THD: 2012, mode: local, state: executing, conflict: cert failure,
seqno: 19588265
   SQL: commit
140919 17:47:46 [Note] WSREP: victim 54283340 in state 7
140919 17:47:46 [Note] WSREP: COMMIT failed, MDL released: 2012
140919 17:47:46 [Note] WSREP: cert failure, thd: 2012 is_AC: 0, retry: 0 -
1 SQL: commit
140919 17:48:06 [Note] WSREP: cluster conflict due to high priority abort
for threads:
140919 17:48:06 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno:
19588573
   SQL: (null)
140919 17:48:06 [Note] WSREP: Victim thread:
   THD: 1953, mode: local, state: committing, conflict: no conflict,
seqno: 19588575
   SQL: commit
140919 17:48:06 [Note] WSREP: BF kill (1, seqno: 19588573), victim: (1953)
trx: 54285471
140919 17:48:06 [Note] WSREP: Aborting query: commit
140919 17:48:06 [Note] WSREP: kill query for: 1953
140919 17:48:06 [Note] WSREP: kill trx QUERY_COMMITTING for 54285471
140919 17:48:06 [Note] WSREP: commit failed for reason: 3
140919 17:48:06 [Note] WSREP: COMMIT failed, MDL released: 1953
140919 17:48:06 [Note] WSREP: BF Aborted, thd: 1953 is_AC: 0, retry: 0 - 1
SQL: commit
140919 17:48:06 [Note] WSREP: cluster conflict due to high priority abort
for threads:
140919 17:48:06 [Note] WSREP: commit failed for reason: 3
140919 17:48:06 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno:
19588591
   SQL: (null)
140919 17:48:06 [Note] WSREP: Victim thread:
   THD: 1962, mode: local, state: executing, conflict: no conflict, seqno:
19588592
   SQL: commit
140919 17:48:06 [Note] WSREP: BF kill (1, seqno: 19588591), victim: (1962)
trx: 54285506
140919 17:48:06 [Note] WSREP: Aborting query: commit
140919 17:48:06 [Note] WSREP: conflict state: 0
140919 17:48:06 [Note] WSREP: cluster conflict due to certification
failure for threads:
140919 17:48:06 [Note] WSREP: Victim thread:
   THD: 1962, mode: local, state: executing, conflict: cert failure,
seqno: 19588592
   SQL: commit
140919 17:48:06 [Note] WSREP: victim 54285506 in state 7
140919 17:48:06 [Note] WSREP: COMMIT failed, MDL released: 1962
140919 17:48:06 [Note] WSREP: cert failure, thd: 1962 is_AC: 0, retry: 0 -
1 SQL: commit
140919 17:48:06 [Note] WSREP: cluster conflict due to high priority abort
for threads:
140919 17:48:06 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno:
19588613
   SQL: (null)
140919 17:48:06 [Note] WSREP: Victim thread:
   THD: 2012, mode: local, state: committing, conflict: no conflict,
seqno: 19588617
   SQL: commit
140919 17:48:06 [Note] WSREP: BF kill (1, seqno: 19588613), victim: (2012)
trx: 54285552
140919 17:48:06 [Note] WSREP: Aborting query: commit
140919 17:48:06 [Note] WSREP: kill query for: 2012
140919 17:48:06 [Note] WSREP: kill trx QUERY_COMMITTING for 54285552
140919 17:48:06 [Note] WSREP: commit failed for reason: 3
140919 17:48:06 [Note] WSREP: COMMIT failed, MDL released: 2012
140919 17:48:06 [Note] WSREP: BF Aborted, thd: 2012 is_AC: 0, retry: 0 - 1
SQL: commit
140919 17:48:16 [Note] WSREP: commit failed for reason: 3
140919 17:48:16 [Note] WSREP: conflict state: 0
140919 17:48:16 [Note] WSREP: cluster conflict due to certification
failure for threads:
140919 17:48:16 [Note] WSREP: Victim thread:
   THD: 1963, mode: local, state: executing, conflict: cert failure,
seqno: 19588701
   SQL: commit
140919 17:48:16 [Note] WSREP: COMMIT failed, MDL released: 1963
140919 17:48:16 [Note] WSREP: cert failure, thd: 1963 is_AC: 0, retry: 0 -
1 SQL: commit
140919 17:48:26 [Note] WSREP: commit failed for reason: 3
140919 17:48:26 [Note] WSREP: conflict state: 0
140919 17:48:26 [Note] WSREP: cluster conflict due to certification
failure for threads:
140919 17:48:26 [Note] WSREP: Victim thread:
   THD: 1962, mode: local, state: executing, conflict: cert failure,
seqno: 19588918
   SQL: commit
140919 17:48:26 [Note] WSREP: cluster conflict due to high priority abort
for threads:
140919 17:48:26 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno:
19588917
   SQL: (null)
140919 17:48:26 [Note] WSREP: Victim thread:
   THD: 1962, mode: local, state: executing, conflict: cert failure,
seqno: 19588918
   SQL: commit
140919 17:48:26 [Note] WSREP: BF kill (1, seqno: 19588917), victim: (1962)
trx: 54286316
140919 17:48:26 [Note] WSREP: Aborting query: commit
140919 17:48:26 [Note] WSREP: victim 54286316 in state 7
140919 17:48:26 [Note] WSREP: COMMIT failed, MDL released: 1962
140919 17:48:26 [Note] WSREP: cert failure, thd: 1962 is_AC: 0, retry: 0 -
1 SQL: commit
140919 17:48:26 [Note] WSREP: commit failed for reason: 3
140919 17:48:26 [Note] WSREP: conflict state: 0
140919 17:48:26 [Note] WSREP: cluster conflict due to certification
failure for threads:
140919 17:48:26 [Note] WSREP: Victim thread:
   THD: 1963, mode: local, state: executing, conflict: cert failure,
seqno: 19588921
   SQL: commit
140919 17:48:26 [Note] WSREP: COMMIT failed, MDL released: 1963
140919 17:48:26 [Note] WSREP: cert failure, thd: 1963 is_AC: 0, retry: 0 -
1 SQL: commit
140919 17:48:36 [Note] WSREP: commit failed for reason: 3
140919 17:48:36 [Note] WSREP: conflict state: 0
140919 17:48:36 [Note] WSREP: cluster conflict due to certification
failure for threads:
140919 17:48:36 [Note] WSREP: Victim thread:
   THD: 1953, mode: local, state: executing, conflict: cert failure,
seqno: 19589031
   SQL: commit
140919 17:48:36 [Note] WSREP: COMMIT failed, MDL released: 1953
140919 17:48:36 [Note] WSREP: cert failure, thd: 1953 is_AC: 0, retry: 0 -
1 SQL: commit
140919 17:48:46 [Note] WSREP: cluster conflict due to high priority abort
for threads:
140919 17:48:46 [Note] WSREP: commit failed for reason: 3
140919 17:48:46 [Note] WSREP: Winning thread:
   THD: 2, mode: applier, state: executing, conflict: no conflict, seqno:
19589164
   SQL: (null)
140919 17:48:46 [Note] WSREP: conflict state: 0
140919 17:48:46 [Note] WSREP: cluster conflict due to certification
failure for threads:
140919 17:48:46 [Note] WSREP: Victim thread:
   THD: 1953, mode: local, state: executing, conflict: cert failure,
seqno: 19589165
   SQL: commit
140919 17:48:46 [Note] WSREP: Victim thread:
   THD: 1953, mode: local, state: executing, conflict: no conflict, seqno:
19589165
   SQL: commit
140919 17:48:46 [Note] WSREP: BF kill (1, seqno: 19589164), victim: (1953)
trx: 54286886
140919 17:48:46 [Note] WSREP: Aborting query: commit
140919 17:48:46 [Note] WSREP: victim 54286886 in state 7
140919 17:48:46 [Note] WSREP: COMMIT failed, MDL released: 1953
140919 17:48:46 [Note] WSREP: cert failure, thd: 1953 is_AC: 0, retry: 0 -
1 SQL: commit

Comment 6 David Vossel 2014-09-22 18:50:42 UTC
(In reply to Jeff Dexter from comment #4)
> Serra Armani, Renato commented:
> "Some news: if I ban Galera on 2 controllers and leave it up and running on
> just one of the controllers, this error stops and instances launch 20 times
> faster than when I have more than one node.
> 
> I performed more than 10 tests launching 10 instances at the same time and
> I didn't receive the error.
> 
> So it looks like the deadlock error is directly caused by Galera sync
> delays.
> 
> I'm still wondering whether it would be better to change the Galera cluster
> to active/passive instead of still trying to make it work in active/active.
> Do you think that could fix this bug?"

Galera only really works in active/active mode. Limiting writes to a single node might resolve this though.

-- Vossel

Comment 7 Russell Bryant 2014-09-22 19:50:47 UTC
(In reply to David Vossel from comment #6)
> (In reply to Jeff Dexter from comment #4)
> > Serra Armani, Renato commented:
> > "Some news: if I ban Galera on 2 controllers and leave it up and running
> > on just one of the controllers, this error stops and instances launch 20
> > times faster than when I have more than one node.
> > 
> > I performed more than 10 tests launching 10 instances at the same time
> > and I didn't receive the error.
> > 
> > So it looks like the deadlock error is directly caused by Galera sync
> > delays.
> > 
> > I'm still wondering whether it would be better to change the Galera
> > cluster to active/passive instead of still trying to make it work in
> > active/active. Do you think that could fix this bug?"
> 
> Galera only really works in active/active mode. Limiting writes to a single
> node might resolve this though.

That's supposed to be the case with our default configuration.

Can we verify that is indeed the case here?

Comment 8 Russell Bryant 2014-09-22 19:55:19 UTC
Mike, can you take a look at the traceback here?  Is nova's retry decorator not catching the right exception in this case?  With the decorator in place, in theory we shouldn't be seeing tracebacks like this.
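
For context, the decorator in question is @_retry_on_deadlock in
nova/db/sqlalchemy/api.py (the name used in the gerrit reviews linked on
this bug). Below is a simplified sketch of how it behaves, not the exact
nova code; the back-off interval shown is illustrative:

    # Simplified sketch of nova's @_retry_on_deadlock; illustrative, not
    # the exact nova code. In Icehouse the exception lives in the
    # oslo-incubator copy (nova.openstack.common.db.exception); the import
    # below assumes the standalone oslo.db naming.
    import functools
    import time

    from oslo.db import exception as db_exc


    def _retry_on_deadlock(f):
        """Retry a DB API call when the translated deadlock error is raised."""
        @functools.wraps(f)
        def wrapped(*args, **kwargs):
            while True:
                try:
                    return f(*args, **kwargs)
                except db_exc.DBDeadlock:
                    # Galera aborted the transaction; back off briefly, retry.
                    time.sleep(0.5)
        return wrapped

The decorator only fires on DBDeadlock; the tracebacks above end in a raw
sqlalchemy OperationalError instead, which is why no retry happens.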

Comment 9 Michael Bayer 2014-09-22 20:19:17 UTC
There's an older system for producing these DBDeadlockErrors that, in this version of oslo.db, won't work when the error is produced as a result of .commit(), as is the case in this stack trace. We fixed this in oslo.db with a rewrite of the exception propagation system; however, in this case I can produce short patches to the old system that will cover at least this particular case.
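
To illustrate the gap being described (a minimal sketch, assuming the
translation layer resembles oslo's _wrap_db_error(), whose name appears in
the linked reviews; this is not the actual oslo.db code): the wrapper turns
MySQL error 1213 into DBDeadlock, but it was applied to query execution and
flush, not to the commit path, so a deadlock raised at COMMIT time escaped
as the raw OperationalError seen above.

    # Minimal illustrative sketch of the error-translation layer.
    import re


    class DBDeadlock(Exception):
        """Stand-in for the translated deadlock exception."""


    _DEADLOCK_RE = re.compile(r'\(1213,')  # MySQL 'Deadlock found' error code


    def _wrap_db_error(f):
        def _wrap(self, *args, **kwargs):
            try:
                return f(self, *args, **kwargs)
            except Exception as e:
                if _DEADLOCK_RE.search(str(e)):
                    raise DBDeadlock(e)  # catchable by @_retry_on_deadlock
                raise
        return _wrap


    # Shape of the fix in the reviews below: also wrap the commit path, e.g.
    #
    #     class Session(SessionBase):
    #         @_wrap_db_error
    #         def commit(self, *args, **kwargs):
    #             return super(Session, self).commit(*args, **kwargs)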

Comment 10 Michael Bayer 2014-09-22 20:59:27 UTC
So I have reviews ready for both oslo-incubator:

https://review.openstack.org/123244

and nova:

https://review.openstack.org/123248

relative to the stable/icehouse branches. I'll get guidance from Russell on how to continue with these patches.

Comment 11 Russell Bryant 2014-09-22 22:42:36 UTC
(In reply to Michael Bayer from comment #10)
> So I have reviews ready for both oslo-incubator:
> 
> https://review.openstack.org/123244
> 
> and nova:
> 
> https://review.openstack.org/123248
> 
> relative to the stable/icehouse branches. I'll get guidance from Russell on
> how to continue with these patches.

I'll review these upstream tomorrow, and I'll also work with Nikola to get the fix into our package.  Jeff, you can tell the customer that we've identified an additional bug and will have another hot fix on the way.

Comment 16 Michael Bayer 2014-09-26 20:24:25 UTC
The wrapper is still not being invoked in that stack trace. Possibilities include something unexpected in how nova is using the session, or the code not being updated correctly.

Comment 17 Michael Bayer 2014-09-26 20:58:44 UTC
OK, it looks like we will need another spot where this condition is wrapped; I have the code ready here and will need to push it through the review process again.
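
Per the linked review 124515 ("Add _wrap_db_error() support to
SessionTransaction.commit()"), the additional spot is SQLAlchemy's
SessionTransaction.commit(); the "sqlalchemy/orm/session.py ... in commit /
t[1].commit()" frames in the tracebacks above sit inside that method.
Reusing the illustrative _wrap_db_error() helper sketched in comment 9, the
shape of the change is roughly:

    # Illustrative only; _wrap_db_error() as sketched in comment 9.
    from sqlalchemy.orm.session import SessionTransaction

    SessionTransaction.commit = _wrap_db_error(SessionTransaction.commit)

so that a deadlock surfacing during the ORM-level commit is also translated
into DBDeadlock before it reaches the service layer.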

Comment 56 errata-xmlrpc 2014-11-03 08:26:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2014-1782.html

