At the moment, after a system is provisioned for a recipe, the recipe state changes to Running, but no other details are recorded against the job. When the installer hits Anaconda %pre we also record the /start result, but that's it.
Beaker knows much more about the progress of the installation than that. It should make all of that information available in the recipe results, to help users figure out what went wrong if the recipe Aborts during the installation.
* There should be an extra recipe state, Installing (or Provisioning?), while the install is running. The recipe state changes to Running only once the harness has reported in and started the first task.
* The recipe results should have a little expando or some other thing which shows timestamps of which stages the provisioning has reached. The current stages we track are rebooted, Anaconda %pre, and Anaconda %post. Anamon could also be expanded to report more detail. This info should be hidden by default, because if the job has run then users won't care about the install.
* Optionally, it could also show detailed logs of exactly what Beaker is doing for the provision, including netboot configs written and power scripts executed. Right now all the server has is the commands and their status, but beaker-provision could send more detailed info about each command as it runs.
We also need to document the new element that Beaker 0.15.3 adds to the recipe result XML:
<installation install_finished="2014-01-22 22:43:15" install_started="2014-01-22 22:35:36" postinstall_finished="2014-01-22 22:44:30"/>
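For anyone consuming the recipe results XML, the new element can be read with standard XML tooling. A minimal sketch using Python's ElementTree (the surrounding <recipe/> wrapper here is illustrative, not the full results document):

```python
import xml.etree.ElementTree as ET

# Illustrative fragment of recipe results XML containing the new element
# that Beaker 0.15.3 adds.
results = """
<recipe>
  <installation install_started="2014-01-22 22:35:36"
                install_finished="2014-01-22 22:43:15"
                postinstall_finished="2014-01-22 22:44:30"/>
</recipe>
"""

recipe = ET.fromstring(results)
installation = recipe.find('installation')
print(installation.get('install_started'))      # 2014-01-22 22:35:36
print(installation.get('postinstall_finished')) # 2014-01-22 22:44:30
```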
The UI side of this bug is covered by the recipe page redesign which we have been working on:
so this bug will just be about the server-side changes needed to make that possible.
The first part is introducing a new table to track installations, which will consolidate a lot of the various pieces that have been tacked onto other tables over the years (the installation progress timestamps on recipe_resource being the main ones):
That patch is complete and ready to be merged, and should be enough to get the UI implementation done.
The remaining piece for this bug is introducing the new Installing status and removing the /start result and the first_task.start() stuff that we currently do to indicate that the recipe has started installing. Those changes are *not* necessary for the UI implementation although they *are* necessary for taskless reservations (bug 1168527). This second piece is not done yet.
Need to also find a faster way of doing this migration. What we have now is correct but too slow.
Even in our devel environment, which has only a modest amount of historical data, this query took more than 10 minutes:
UPDATE command_queue
INNER JOIN activity ON activity.id = command_queue.id
INNER JOIN reservation ON reservation.system_id = command_queue.system_id
        AND activity.created >= reservation.start_time
        AND (activity.created <= reservation.finish_time
             OR reservation.finish_time IS NULL)
INNER JOIN system_resource ON system_resource.reservation_id = reservation.id
INNER JOIN recipe_resource ON recipe_resource.id = system_resource.id
INNER JOIN installation ON installation.recipe_id = recipe_resource.recipe_id
SET command_queue.installation_id = installation.id
WHERE command_queue.callback = 'bkr.server.model.auto_cmd_handler';
(In reply to Dan Callaghan from comment #4)
So the problematic part of this query is:
> INNER JOIN reservation ON reservation.system_id = command_queue.system_id
> AND activity.created >= reservation.start_time
> AND (activity.created <= reservation.finish_time
> OR reservation.finish_time IS NULL)
specifically, the >= <= condition to match up the command queue entry with the reservation during which it occurred.
But that's the one piece we absolutely need in this query, there's no other way I can think of to match up command queue entries with their recipe. (Indeed that is the whole purpose of this bug, to give us a way to do that in future, through the installation table.)
The other contributing factor is that command_queue is one of our biggest tables (> 12 million rows in production), and activity is even bigger (> 32 million rows).
So I think it's unavoidable that we will have to cop the very high cost of this query. The only alternative is that we won't be able to show any kernel options or power commands on the new recipe page for any recipes which ran prior to the upgrade.
What we *can* do is run this query as a kind of "online data migration". It doesn't create any schema that the new Beaker code needs in order to run; it just backpopulates data so that Beaker will show accurate info for existing recipes. That means we don't strictly need to run it during an outage: it can be run after the actual schema migrations are completed and Beaker is back up and running.
I am thinking we could introduce a new option to beaker-init for running specific online data migrations, after the schema migrations are done. Something like,
beaker-init --background --online-data-migration=commands-for-recipe-installations
We would need to split the query into batches and make it idempotent (so it does not repopulate already-populated rows), so that the online migration process can be interrupted and restarted safely.
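The batching/idempotency idea can be sketched like this. This is a toy illustration only: sqlite stands in for MySQL, the schema is a stand-in for command_queue, the hard-coded installation_id value replaces the real multi-table join, and BATCH_SIZE is an arbitrary placeholder:

```python
import sqlite3

# Toy schema standing in for command_queue; installation_id is NULL
# until a row has been migrated.
conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE command_queue (id INTEGER PRIMARY KEY, installation_id INTEGER);
    INSERT INTO command_queue (id) VALUES (1), (2), (3), (4), (5);
""")

BATCH_SIZE = 2  # placeholder; the real value would be tuned for production

def migrate_batch(conn):
    # Idempotent: only touches rows not yet populated, so the process can
    # be killed and restarted at any point. Returns rows updated.
    # '42' stands in for the installation.id found via the real join.
    cur = conn.execute("""
        UPDATE command_queue SET installation_id = 42
        WHERE id IN (SELECT id FROM command_queue
                     WHERE installation_id IS NULL LIMIT ?)
    """, (BATCH_SIZE,))
    conn.commit()
    return cur.rowcount

while migrate_batch(conn):
    pass

remaining = conn.execute(
    "SELECT COUNT(*) FROM command_queue "
    "WHERE installation_id IS NULL").fetchone()[0]
print(remaining)  # 0
```

Because each batch commits independently and skips populated rows, restarting the loop after a crash simply resumes where it left off.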
(In reply to Dan Callaghan from comment #5)
(In reply to Dan Callaghan from comment #3)
> The remaining piece for this bug is introducing the new Installing status
> and removing the /start result and the first_task.start() stuff that we
> currently do to indicate that the recipe has started installing. Those
> changes are *not* necessary for the UI implementation although they *are*
> necessary for taskless reservations (bug 1168527). This second piece is not
> done yet.
Need to find and fix these warnings:
2016-04-14 09:46:40,069 alembic.migration INFO Running upgrade 3c5510511fd9 -> 51637c12cbd9, Create installation table
/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py:324: Warning: Column 'created' cannot be null
/usr/lib64/python2.6/site-packages/sqlalchemy/engine/default.py:324: Warning: Column 'kernel_options' cannot be null
(In reply to Dan Callaghan from comment #8)
> Need to find and fix these warnings:
> 2016-04-14 09:46:40,069 alembic.migration INFO Running upgrade 3c5510511fd9
> -> 51637c12cbd9, Create installation table
> Warning: Column 'created' cannot be null
> cursor.execute(statement, parameters)
This one would be because we are populating installation.created based on recipe.start_time for guest_resource rows. But for guest recipes, a guest_resource row is always created as soon as the host recipe is scheduled -- even if the host recipe is cancelled or fails/aborts before it ever even starts, meaning that recipe.start_time is NULL.
In that case, strictly speaking installation.created should be set to the timestamp at which the host recipe was provisioned because that is how it will be done for new recipes after this patch. The closest we can probably get is to use reservation.start_time for the host recipe. It will make the query a bit fat but it should still perform fine.
> Warning: Column 'kernel_options' cannot be null
> cursor.execute(statement, parameters)
Similarly, for guest_resource rows we are just copying kernel_options from recipe.kernel_options but that can be NULL on older rows. We can just COALESCE with empty string.
(In reply to Dan Callaghan from comment #9)
> The closest we can probably
> get is to use reservation.start_time for the host recipe.
... except we have 4121 guest recipes where the guest_resource was allocated but the system_resource was not, somehow. Sigh.
(In reply to Dan Callaghan from comment #10)
> ... except we have 4121 guest recipes where the guest_resource was allocated
> but the system_resource was not, somehow. Sigh.
For those we will have to just fall back to 1970-01-01 because there is literally no other timestamp we can use. At least it avoids the warning from inserting NULL into a non-NULLable column. For the rest of the rows we can use the host recipe's reservation.start_time.
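The two fallbacks (empty string for kernel_options, epoch for the timestamp) are both just COALESCE chains. A toy demonstration, with sqlite standing in for MySQL and a deliberately simplified schema (the real migration would COALESCE through the host recipe's reservation.start_time before falling back to the epoch):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE recipe (id INTEGER PRIMARY KEY,
                         start_time TEXT, kernel_options TEXT);
    -- One normal recipe, and one old row where both values are NULL.
    INSERT INTO recipe VALUES (1, '2014-01-22 22:35:36', 'ks=http://...'),
                              (2, NULL, NULL);
""")

rows = conn.execute("""
    SELECT id,
           COALESCE(start_time, '1970-01-01 00:00:00') AS created,
           COALESCE(kernel_options, '') AS kernel_options
    FROM recipe ORDER BY id
""").fetchall()
for row in rows:
    print(row)
```

The NULLs never reach the non-NULLable installation columns, so the insert no longer trips the "cannot be null" warnings.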
(In reply to Dan Callaghan from comment #9)
> Similarly, for guest_resource rows we are just copying kernel_options from
> recipe.kernel_options but that can be NULL on older rows. We can just
> COALESCE with empty string.
(In reply to Dan Callaghan from comment #7)
One problem with the new Installing status is that beah considers any status aside from Waiting or Running to mean that the recipe is finished. I have a beah patch for that here:
but it means for backwards compatibility with older beah versions we should also ensure that Beaker sets the status to Waiting once the installation is finished and the recipe is waiting for beah to start up and pick up tasks.
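The compatibility requirement amounts to one extra transition in the status lifecycle: Installing must hand off to Waiting before the harness starts. A rough sketch of the intended flow (function and status names here are illustrative, not Beaker's actual model code):

```python
# Illustrative recipe status flow; not Beaker's actual model code.
# Older beah treats any status other than Waiting or Running as "recipe
# finished", so after Installing we must pass through Waiting rather than
# staying in Installing until the first task starts.

def next_status(current, installation_finished=False, first_task_started=False):
    if current == 'Installing' and installation_finished:
        # Back-compat with older beah: go to Waiting, not straight to Running.
        return 'Waiting'
    if current == 'Waiting' and first_task_started:
        return 'Running'
    return current

print(next_status('Installing', installation_finished=True))  # Waiting
print(next_status('Waiting', first_task_started=True))        # Running
```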
(In reply to Dan Callaghan from comment #12)
> but it means for backwards compatibility with older beah versions we should
> also ensure that Beaker sets the status to Waiting once the installation is
> finished and the recipe is waiting for beah to start up and pick up tasks.
This is still not working right, causing beah to end recipes with:
2016-04-21 03:29:59,126 backend handle_recipe_exception: INFO The recipe has finished.
(In reply to Dan Callaghan from comment #14)
The problem is, this code:
and its corresponding test case:
are assuming that job.update_status() will set the status back to Waiting once the installation.postinstall_finished timestamp has been set. And it would -- except that the job will never be marked dirty and so beakerd will never invoke update_status() on the job.
This is really a flaw in the way the test cases invoke job.update_status() directly, even though in reality it would never be invoked, because the job is not actually marked dirty.
I am thinking to avoid this in future we might be able to assert in job.update_status() that it was invoked on a job that is actually marked as dirty already.
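That guard could be as simple as an assertion at the top of update_status(). A hypothetical sketch (the real Beaker method and attribute names may differ):

```python
class Job:
    # Hypothetical sketch of the proposed guard; not Beaker's real model class.
    def __init__(self):
        self.is_dirty = False

    def mark_dirty(self):
        self.is_dirty = True

    def update_status(self):
        # Proposed guard: catch test cases (or bugs) that invoke
        # update_status() on a job beakerd would never actually process.
        assert self.is_dirty, 'update_status() called on a clean job'
        # ... recompute status from recipes/tasks ...
        self.is_dirty = False

job = Job()
job.mark_dirty()
job.update_status()       # fine: the job was dirty
try:
    job.update_status()   # clean job: the assertion fires
except AssertionError as e:
    print(e)              # update_status() called on a clean job
```

With that in place, the test case above would have failed immediately instead of silently exercising a code path beakerd never takes.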
(In reply to Dan Callaghan from comment #15)
I forgot one last piece, which was to actually make the Installation tab say that it was installing when the status was Installing. You might say that's the entire point of this bug :-)
Beaker 23.0 has been released.