Description of problem:

For Beaker 23 we have a new "online data migration" step which is designed to run in the background after the outage is finished, for populating historical data for recipe installations. But on our production Beaker instance we can't run it while the system is live because it kills beakerd.

Version-Release number of selected component (if applicable):
23.0

How reproducible:
good luck

Steps to Reproduce:
1. On a large, heavily loaded production Beaker instance run:
   beaker-init --online-data-migration=commands-for-recipe-installations
2. Watch beakerd logs

Actual results:
beakerd will eventually die with an unhandled exception like this:

bkr.server.tools.beakerd ERROR Error in schedule_queued_recipe(2917806)
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 437, in schedule_queued_recipes
    schedule_queued_recipe(recipe_id, guest_recipe_id)
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 564, in schedule_queued_recipe
    recipe.resource.allocate()
  File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 3961, in allocate
    user=self.recipe.recipeset.job.owner)
  File "/usr/lib/python2.6/site-packages/bkr/server/model/inventory.py", line 1621, in reserve_for_recipe
    return self._reserve(service, user, u'recipe')
  File "/usr/lib/python2.6/site-packages/bkr/server/model/inventory.py", line 1640, in _reserve
    session.flush()
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/scoping.py", line 149, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 1824, in flush
    self._flush(objects)
  [...]
OperationalError: (OperationalError) (1305, 'SAVEPOINT sa_savepoint_3 does not exist') 'ROLLBACK TO SAVEPOINT sa_savepoint_3' ()

Expected results:
Need a solution to perform the online data migration without killing beakerd.
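For context on where the SAVEPOINT in that traceback comes from: the sa_savepoint_N names are generated by SQLAlchemy's nested transaction support (begin_nested). A minimal, hypothetical sketch of that pattern -- not the actual Beaker code; the connection URL and the INSERT are placeholders -- looks like this:

# Hypothetical sketch (not the actual Beaker code) of how SQLAlchemy's
# begin_nested() produces the SAVEPOINT statements seen in the traceback.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('mysql://beaker@localhost/beaker')
session = sessionmaker(bind=engine)()

session.begin_nested()              # emits: SAVEPOINT sa_savepoint_1
try:
    session.execute("INSERT INTO system_resource (system_id) VALUES (6269)")
    session.commit()                # emits: RELEASE SAVEPOINT sa_savepoint_1
except Exception:
    # If the statement hits a deadlock or lock wait timeout, InnoDB rolls the
    # statement back and (apparently) discards its savepoints. SQLAlchemy then
    # emits ROLLBACK TO SAVEPOINT sa_savepoint_1, which MySQL rejects with
    # error 1305 ("SAVEPOINT ... does not exist") -- the unhandled exception
    # that beakerd dies with.
    session.rollback()
    raise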
The "SAVEPOINT does not exist" error is due to a defect in InnoDB's savepoint handling -- when a deadlock or lock wait timeout occurs InnoDB rolls back the statement (not the transaction) but it seems to also discard all savepoints, so when SQLAlchemy sees the statement erroring out it tries to roll back the nested transaction but that fails. So the "SAVEPOINT does not exist" error is not the real error -- that would be a deadlock in this case. SHOW ENGINE INNODB STATUS reveals what is (presumably) the actual deadlock. The first transaction is the online data migration and the second transaction is probably the schedule_queued_recipe routine. ------------------------ LATEST DETECTED DEADLOCK ------------------------ 160801 4:38:00 *** (1) TRANSACTION: TRANSACTION 0 940301210, ACTIVE 59 sec, process no 28412, OS thread id 140590213740288 fetching rows mysql tables in use 5, locked 5 LOCK WAIT 119229 lock struct(s), heap size 15677424, 30784321 row lock(s), undo log entries 1548 MySQL thread id 802, query id 1418845 localhost beaker Copying to tmp table UPDATE command_queue SET installation_id = ( SELECT installation.id FROM reservation INNER JOIN system_resource ON system_resource.reservation_id = reservation.id INNER JOIN recipe_resource ON system_resource.id = recipe_resource.id INNER JOIN installation ON recipe_resource.recipe_id = installation.recipe_id WHERE system_resource.system_id = command_queue.system_id AND reservation.start_time <= command_queue.updated ORDER BY reservation.start_time DESC LIMIT 1 ) WHERE callback = 'bkr.server.model.auto_cmd_handler' AND command_queue.installation_id IS NULL ORDER BY id DESC LIMIT 4000 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 269 page no 6004 n bits 648 index `system_resource_system_id_fk` of table `beaker`.`system_resource` trx id 0 940301210 lock mode S waiting Record lock, heap no 578 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 4; hex 80000400; asc ;; 1: len 4; hex 8029944c; asc ) L;; *** (2) TRANSACTION: TRANSACTION 0 940301874, ACTIVE 34 sec, process no 28412, OS thread id 140599782676224 inserting, thread declared inside InnoDB 500 mysql tables in use 1, locked 1 160 lock struct(s), heap size 30704, 6583 row lock(s), undo log entries 6776 MySQL thread id 827, query id 1443948 localhost beaker update INSERT INTO system_resource (id, system_id, reservation_id) VALUES (2724962, 6269, NULL) *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 269 page no 6004 n bits 648 index `system_resource_system_id_fk` of table `beaker`.`system_resource` trx id 0 940301874 lock_mode X locks rec but not gap Record lock, heap no 578 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 4; hex 80000400; asc ;; 1: len 4; hex 8029944c; asc ) L;; *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 269 page no 6845 n bits 1272 index `system_resource_system_id_fk` of table `beaker`.`system_resource` trx id 0 940301874 lock_mode X locks gap before rec insert intention waiting Record lock, heap no 159 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 4; hex 8000187e; asc ~;; 1: len 4; hex 80153aca; asc : ;;
Right now the only solution I can think of is to put the online data migration into beakerd, done serially with all its other processing (I guess interleaving one "batch" of migration into each iteration). That will prevent the various scheduler functions from deadlocking with the online data migration, since they won't be running concurrent transactions with each other.

We would need a way for the database to record which online data migrations have already completed successfully, since we will presumably add more in future and we don't want beakerd to try running each of them every time it starts up (even though they are safe to re-run, they still have a cost). Something similar to the alembic_version table could be used, where a new row indicates that an online migration has been completed. beakerd could check that at start-up to decide if it needs to do migrations as well as its regular scheduling.

In some ways this would be preferable to what we have now with beaker-init, since the admin would no longer need to remember to invoke the migrations manually; beakerd would know which migrations are needed and run them automatically itself.
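Roughly what I have in mind, as a sketch only (the names run_scheduler_routines, migrate_one_batch, is_finished and mark_finished are hypothetical, not taken from any patch):

# Sketch only: names here are hypothetical, not from any real patch.
def beakerd_main_loop(session, online_migrations):
    while True:
        # Existing scheduler work (schedule_queued_recipes() and friends)
        # runs first, exactly as it does today.
        run_scheduler_routines(session)

        # Then at most one small batch of online data migration per
        # iteration, serially with the scheduler, so the two never run
        # concurrent transactions and hence can never deadlock.
        for migration in online_migrations:
            if migration.is_finished(session):
                continue
            rows_migrated = migration.migrate_one_batch(session)
            if rows_migrated == 0:
                # Record completion (e.g. a row in an alembic_version-like
                # table) so future start-ups know to skip this migration.
                migration.mark_finished(session)
            break
        session.commit()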
This patch moves the online data migrations from beaker-init into beakerd, as described above:

http://gerrit.beaker-project.org/5110

It introduces a new table in a maintenance release, which is not something we normally do, so that makes life a bit more difficult. The code has to handle the case where the table doesn't exist (at least for the life of the 23.x series). So I think I have covered all the possibilities:

* upgraded from < 23 to 23.0, ran beaker-init --online-data-migration successfully:
  => data_migration table does not exist, beakerd skips all migrations (when later upgrading to 24.0 the table will be created and the migrations will be run but will have no effect)

* upgraded from < 23 to 23.1:
  => data_migration table is created and beakerd handles all online migrations; beaker-init --online-data-migration does not exist and is not needed

* upgraded from < 23 to 23.0, but hit this bug making online data migrations impossible:
  => install 23.1, run beaker-init to create the new table (no outage required), then beakerd will finish off any incomplete data migration
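The "table may not exist yet" check can be done with SQLAlchemy reflection; a sketch, assuming the table is named data_migration as above (this is not the actual patch):

# Sketch of the "data_migration table may not exist yet" check described
# above; not the actual patch.
from sqlalchemy.engine.reflection import Inspector

def online_migrations_supported(engine):
    # On a schema that was upgraded straight to 23.0 the data_migration table
    # was never created, so beakerd has to silently skip all online data
    # migrations until beaker-init (from 23.1) or the 24.0 upgrade creates it.
    return 'data_migration' in Inspector.from_engine(engine).get_table_names()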
Beaker 23.1 has been released.