With increasing numbers of CI builds in Copr, tarballs generated from git (usually one per pull request and/or commit) can consume an essentially unbounded amount of storage (for N builds: N * size_of(tarball)). It would be nice to have a policy and a garbage collector (perhaps as an opt-in) for lookaside cache data.
Per @clime's comment from the very valuable (but long) discussion https://pagure.io/copr/copr/issue/68:

> The value is that COPR maintainers will not need to spend nights on fixing
> machine that is broken because it is being flooded with tons of data that
> no-one really needs for anything.

I completely agree. We should let users define *what's* important to them. There's already a configurable "build garbage collector" on the backend side; let's have something similar for the "dist-git" side.
Sorry, but this is not a good idea from my point of view. It would be very difficult to define what is garbage and what is not. The implementation would be cumbersome, and we cannot garbage collect Git repos into which users push their own work if copr-dist-git becomes open.
I agree it is not easy, to say the least.

As a starting point, we could remove dist-git data related to coprs which are already deleted (completely, including the git content). A request for removal could be generated similarly to the request to delete data on the backend, except it would be processed by the copr-dist-git service.

---

Otherwise, as a second step, pruning the dist-git data for "still existing" coprs is a sensitive topic. I doubt we can even touch the git repositories (though git data eats only a small chunk of storage compared to the lookaside cache). And when garbage collecting the lookaside cache, we need to be super cautious (maybe we should have a two-phase process, see the pseudocode below).

We could have a script which walks through the remaining dist-git repositories and removes files which are:

* not referenced by any `sources` file in dist-git branches
* not referenced by any `sources` file in git history, say for N commits back
* marked for deletion for two consecutive runs

Simplified pseudocode could be:

```
for each user and copr project:
    keep_files = {}
    for each dist-git branch:
        for N last commits back:
            for each (sum, name) in sources:
                keep_files[name] = sum

    # Data from the previous run (e.g. 1 month back)
    old_files_to_remove = load('<backup>')

    files_to_remove = []
    for each file in lookaside:
        if file is younger than X days:
            continue
        if file not in keep_files:
            # Candidate for removal!
            if file not in old_files_to_remove:
                # First time marked for removal based on 'sources' files;
                # don't remove it now.
                files_to_remove.append(file)
            else:
                # Marked for removal for the second time.
                rm -f file

    store(files_to_remove, '<backup>')
```

The `<backup>` file should be available over HTTP, e.g.:

```
$ curl http://dist-git.copr.com/praiskup/project/to_be_removed.txt
# These files will be removed at the earliest on 2019-05-01
ef079f3e8a5160f33a09719093a08efb  README.xz
9e854df51ca3fef8bfe566dbd7b89241  linux-3.18.tar.xz
...
```
So users can review what's going to be removed "soon" -- and tell us if we have a bug in our script before we actually remove the data. Such a garbage collector shouldn't be unconditionally enabled -- we should rather have a toggle button for it.
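The two-phase logic from the pseudocode above can be sketched in Python. This is a minimal, hypothetical sketch only (the names `LookasideFile` and `two_phase_collect` are made up, and a real implementation would walk the actual `sources` files and lookaside directory on disk); it shows the core safety property: a file is deleted only on the second consecutive run in which it was found unreferenced.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LookasideFile:
    """Hypothetical stand-in for one file in the lookaside cache."""
    name: str
    age_days: float  # time since the file was uploaded

def two_phase_collect(lookaside, keep_files, previously_marked,
                      min_age_days=30):
    """Return (delete_now, mark_for_next_run).

    A file is deleted only when it is old enough, not referenced by any
    'sources' file (keep_files), and was already marked for removal by
    the previous run -- i.e. on the second consecutive run.
    """
    delete_now = []
    mark_for_next_run = []
    for f in lookaside:
        if f.age_days < min_age_days:
            continue  # too young; never touch recently uploaded files
        if f.name in keep_files:
            continue  # still referenced by some branch or recent commit
        if f.name in previously_marked:
            delete_now.append(f.name)         # second consecutive run
        else:
            mark_for_next_run.append(f.name)  # first run: only mark it
    return delete_now, mark_for_next_run

if __name__ == "__main__":
    files = [
        LookasideFile("README.xz", 400),
        LookasideFile("linux-3.18.tar.xz", 400),
        LookasideFile("fresh.tar.gz", 2),
    ]
    keep = {"linux-3.18.tar.xz"}   # still referenced in 'sources'
    marked = {"README.xz"}         # marked by the previous run
    print(two_phase_collect(files, keep, marked))
```

The `mark_for_next_run` list is what would be stored into the `<backup>` file and published over HTTP, so users get at least one full cycle to object before anything is deleted.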
Implemented in: https://pagure.io/copr/copr/issue/612
Meh, I closed the wrong bug, sorry.
The bugs related to the Copr build system have now been migrated to the default Copr team tracker: https://github.com/fedora-copr/copr/issues/2648