Description of problem:

For Beaker 23 we have a new "online data migration" step which is designed to run in the background after the outage is finished, for populating historical data for recipe installations. But on our production Beaker instance we can't run it while the system is live because it kills beakerd.

Version-Release number of selected component (if applicable):
23.0

How reproducible:
good luck

Steps to Reproduce:
1. On a large, heavily loaded production Beaker instance run:
   beaker-init --online-data-migration=commands-for-recipe-installations
2. Watch beakerd logs

Actual results:
beakerd will eventually die with an unhandled exception like this:

bkr.server.tools.beakerd ERROR Error in schedule_queued_recipe(2917806)
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 437, in schedule_queued_recipes
    schedule_queued_recipe(recipe_id, guest_recipe_id)
  File "/usr/lib/python2.6/site-packages/bkr/server/tools/beakerd.py", line 564, in schedule_queued_recipe
    recipe.resource.allocate()
  File "/usr/lib/python2.6/site-packages/bkr/server/model/scheduler.py", line 3961, in allocate
    user=self.recipe.recipeset.job.owner)
  File "/usr/lib/python2.6/site-packages/bkr/server/model/inventory.py", line 1621, in reserve_for_recipe
    return self._reserve(service, user, u'recipe')
  File "/usr/lib/python2.6/site-packages/bkr/server/model/inventory.py", line 1640, in _reserve
    session.flush()
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/scoping.py", line 149, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/lib64/python2.6/site-packages/sqlalchemy/orm/session.py", line 1824, in flush
    self._flush(objects)
  [...]
OperationalError: (OperationalError) (1305, 'SAVEPOINT sa_savepoint_3 does not exist') 'ROLLBACK TO SAVEPOINT sa_savepoint_3' ()

Expected results:
Need a solution to perform the online data migration without killing beakerd.
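For context on where the SAVEPOINT in that traceback comes from: the sa_savepoint_N names are generated by SQLAlchemy's nested transaction support (begin_nested). A minimal, hypothetical sketch of that pattern -- not the actual Beaker code; the connection URL and the INSERT are placeholders -- looks like this:

# Hypothetical sketch (not the actual Beaker code) of how SQLAlchemy's
# begin_nested() produces the SAVEPOINT statements seen in the traceback.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('mysql://beaker@localhost/beaker')
session = sessionmaker(bind=engine)()

session.begin_nested()              # emits: SAVEPOINT sa_savepoint_1
try:
    session.execute("INSERT INTO system_resource (system_id) VALUES (6269)")
    session.commit()                # emits: RELEASE SAVEPOINT sa_savepoint_1
except Exception:
    # If the statement hits a deadlock or lock wait timeout, InnoDB rolls the
    # statement back and (apparently) discards its savepoints. SQLAlchemy then
    # emits ROLLBACK TO SAVEPOINT sa_savepoint_1, which MySQL rejects with
    # error 1305 ("SAVEPOINT ... does not exist") -- the unhandled exception
    # that beakerd dies with.
    session.rollback()
    raise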
The "SAVEPOINT does not exist" error is due to a defect in InnoDB's savepoint handling -- when a deadlock or lock wait timeout occurs InnoDB rolls back the statement (not the transaction) but it seems to also discard all savepoints, so when SQLAlchemy sees the statement erroring out it tries to roll back the nested transaction but that fails. So the "SAVEPOINT does not exist" error is not the real error -- that would be a deadlock in this case. SHOW ENGINE INNODB STATUS reveals what is (presumably) the actual deadlock. The first transaction is the online data migration and the second transaction is probably the schedule_queued_recipe routine. ------------------------ LATEST DETECTED DEADLOCK ------------------------ 160801 4:38:00 *** (1) TRANSACTION: TRANSACTION 0 940301210, ACTIVE 59 sec, process no 28412, OS thread id 140590213740288 fetching rows mysql tables in use 5, locked 5 LOCK WAIT 119229 lock struct(s), heap size 15677424, 30784321 row lock(s), undo log entries 1548 MySQL thread id 802, query id 1418845 localhost beaker Copying to tmp table UPDATE command_queue SET installation_id = ( SELECT installation.id FROM reservation INNER JOIN system_resource ON system_resource.reservation_id = reservation.id INNER JOIN recipe_resource ON system_resource.id = recipe_resource.id INNER JOIN installation ON recipe_resource.recipe_id = installation.recipe_id WHERE system_resource.system_id = command_queue.system_id AND reservation.start_time <= command_queue.updated ORDER BY reservation.start_time DESC LIMIT 1 ) WHERE callback = 'bkr.server.model.auto_cmd_handler' AND command_queue.installation_id IS NULL ORDER BY id DESC LIMIT 4000 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 269 page no 6004 n bits 648 index `system_resource_system_id_fk` of table `beaker`.`system_resource` trx id 0 940301210 lock mode S waiting Record lock, heap no 578 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 4; hex 80000400; asc ;; 1: len 4; hex 8029944c; asc ) L;; *** (2) TRANSACTION: TRANSACTION 0 940301874, ACTIVE 34 sec, process no 28412, OS thread id 140599782676224 inserting, thread declared inside InnoDB 500 mysql tables in use 1, locked 1 160 lock struct(s), heap size 30704, 6583 row lock(s), undo log entries 6776 MySQL thread id 827, query id 1443948 localhost beaker update INSERT INTO system_resource (id, system_id, reservation_id) VALUES (2724962, 6269, NULL) *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 269 page no 6004 n bits 648 index `system_resource_system_id_fk` of table `beaker`.`system_resource` trx id 0 940301874 lock_mode X locks rec but not gap Record lock, heap no 578 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 4; hex 80000400; asc ;; 1: len 4; hex 8029944c; asc ) L;; *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 269 page no 6845 n bits 1272 index `system_resource_system_id_fk` of table `beaker`.`system_resource` trx id 0 940301874 lock_mode X locks gap before rec insert intention waiting Record lock, heap no 159 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 4; hex 8000187e; asc ~;; 1: len 4; hex 80153aca; asc : ;;
Right now the only solution I can think of is to put the online data migration into beakerd, done serially with all its other processing (I guess interleaving one "batch" of migration into each iteration). That will prevent the various scheduler functions from deadlocking with the online data migration, since they won't be running concurrent transactions with each other.

We would need a way for the database to record which online data migrations have already completed successfully, since we will presumably add more in future and we don't want beakerd to try running each of them every time it starts up (even though they are safe to re-run, they still have a cost). Something similar to the alembic_version table could be used, where a new row indicates that an online migration has been completed. beakerd could check that at start-up to decide if it needs to do migrations as well as its regular scheduling.

In some ways this would be preferable to what we have now with beaker-init, since the admin would no longer need to remember to invoke the migrations manually; beakerd would know which migrations are needed and run them automatically itself.
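Roughly what I have in mind, as a sketch only (the names run_scheduler_routines, migrate_one_batch, is_finished and mark_finished are hypothetical, not taken from any patch):

# Sketch only: names here are hypothetical, not from any real patch.
def beakerd_main_loop(session, online_migrations):
    while True:
        # Existing scheduler work (schedule_queued_recipes() and friends)
        # runs first, exactly as it does today.
        run_scheduler_routines(session)

        # Then at most one small batch of online data migration per
        # iteration, serially with the scheduler, so the two never run
        # concurrent transactions and hence can never deadlock.
        for migration in online_migrations:
            if migration.is_finished(session):
                continue
            rows_migrated = migration.migrate_one_batch(session)
            if rows_migrated == 0:
                # Record completion (e.g. a row in an alembic_version-like
                # table) so future start-ups know to skip this migration.
                migration.mark_finished(session)
            break
        session.commit()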
This patch moves the online data migrations from beaker-init into beakerd, as described above:

http://gerrit.beaker-project.org/5110

It introduces a new table in a maintenance release, which is not something we normally do, so that makes life a bit more difficult. The code has to handle the case where the table doesn't exist (at least for the life of the 23.x series). So I think I have covered all the possibilities:

* upgraded from < 23 to 23.0, ran beaker-init --online-data-migration successfully:
  => data_migration table does not exist, beakerd skips all migrations (when later upgrading to 24.0 the table will be created and the migrations will be run but will have no effect)

* upgraded from < 23 to 23.1:
  => data_migration table is created and beakerd handles all online migrations; beaker-init --online-data-migration does not exist and is not needed

* upgraded from < 23 to 23.0, but hit this bug making online data migrations impossible:
  => install 23.1, run beaker-init to create the new table (no outage required), then beakerd will finish off any incomplete data migration
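The "table may not exist yet" check can be done with SQLAlchemy reflection; a sketch, assuming the table is named data_migration as above (this is not the actual patch):

# Sketch of the "data_migration table may not exist yet" check described
# above; not the actual patch.
from sqlalchemy.engine.reflection import Inspector

def online_migrations_supported(engine):
    # On a schema that was upgraded straight to 23.0 the data_migration table
    # was never created, so beakerd has to silently skip all online data
    # migrations until beaker-init (from 23.1) or the 24.0 upgrade creates it.
    return 'data_migration' in Inspector.from_engine(engine).get_table_names()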
Beaker 23.1 has been released.