Similar to bug 1322700, we should delete the log_recipe_task and log_recipe rows for the deleted jobs.
On a production db dump:
MariaDB [beaker]> SELECT count(*) FROM log_recipe_task INNER JOIN recipe_task ON log_recipe_task.recipe_task_id = recipe_task.id INNER JOIN recipe on recipe_task.recipe_id = recipe.id INNER JOIN recipe_set on recipe.recipe_set_id = recipe_set.id INNER JOIN job on recipe_set.job_id = job.id WHERE job.deleted IS NOT NULL;
| count(*) |
| 18403070 |
1 row in set (10 min 33.02 sec)
MariaDB [beaker]> SELECT count(*) FROM log_recipe INNER JOIN recipe on log_recipe.recipe_id = recipe.id INNER JOIN recipe_set on recipe.recipe_set_id = recipe_set.id INNER JOIN job on recipe_set.job_id = job.id WHERE job.deleted IS NOT NULL;
| count(*) |
| 2915042 |
1 row in set (39.77 sec)
(In reply to Jon Orris from comment #1)
Pushed to wrong branch. Correct change is:
This bug fix is included in beaker-server-24.0.git.210.01bf4bf which is currently available for testing here:
Matt, Jon: has someone actually checked that all the log_* rows we will be deleting here have already had the log files deleted on disk as well?
Otherwise this change is actually going to make things worse, not better (since now we will have a pile of old logs left behind on disk with no reference to them in the database to locate them). In that case the proper fix would be to figure out why beaker-log-delete has not already deleted these files and their corresponding log_* rows.
But in the meantime we need to roll this patch out of 24.0, right?
(In reply to Dan Callaghan from comment #7)
> But in the meantime we need to roll this patch out of 24.0, right?
Yeah, I have a patch to revert the patch.
I'm actually putting this one back to NEW, since with the recent team re-shuffling it highly likely that someone else needs to finish it.
beaker-log-delete and the related model code in Job.delete() has received a lot of bug fixes over the years since it was originally introduced in 2011, so it might not be worthwhile digging through that to find exactly why it ended up leaving some old jobs in this state: where the job is marked as deleted, but some logs are still left behind.
I did look out for old code which looked like it might be failing to delete the logs but then still committing the db transaction -- but actually, all the fixes I could see seemed more likely to be the other way around -- that is the log files would be deleted but then the db transaction *not* committed.
Regardless, we probably just need to add some code to beaker-log-delete to look for jobs which are marked as deleted, but still have logs recorded against them, and to delete those again "properly".
It would be good to get this done sooner rather than later, since if there really are 20 million log files which could be deleted this would potentially save quite a bit of storage.
So this mess with "deleted" vs. "soft-deleted" vs. "to_delete" has been bugging me for years... It might be finally time to clean that up:
https://gerrit.beaker-project.org/5765 remove Job "sanitise" methods
https://gerrit.beaker-project.org/5766 remove unused params on Job.find_jobs
https://gerrit.beaker-project.org/5767 use Job.is_deleted in queries
https://gerrit.beaker-project.org/5768 rationalise Job.delete_jobs, Job._delete_criteria, job.delete_jobs
https://gerrit.beaker-project.org/5769 clarify deleted vs. expired vs. purged for jobs
With that clean-up in place, the problem here is a lot easier to describe: we have some old jobs which have been *deleted* but not successfully *purged*. And therefore we can just update the db to clear the purged timestamp and let beaker-log-delete purge them again.
https://gerrit.beaker-project.org/5770 make beaker-log-delete fix up old jobs which still have logs
Beaker 25.0 has been released.
Release notes are available upstream: https://beaker-project.org/docs/whats-new/release-25.html