Bug 1427431 - RFE: dist-git: policy for garbage collecting of lookaside cache tarballs
Summary: RFE: dist-git: policy for garbage collecting of lookaside cache tarballs
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Copr
Classification: Community
Component: backend
Version: unspecified
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Copr Team
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1402861
 
Reported: 2017-02-28 08:33 UTC by Pavel Raiskup
Modified: 2023-04-03 18:45 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-03 18:45:02 UTC
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Bugzilla | 1402861 | 0 | unspecified | CLOSED | Cleanup of dist-git and results | 2021-02-22 00:41:40 UTC

Internal Links: 1402861

Description Pavel Raiskup 2017-02-28 08:33:02 UTC
With the increasing number of CI builds in Copr, tarballs generated from git
(usually for each pull-request and/or commit) could eat a basically unlimited
amount of storage (for N builds: N * size_of(tarball)).

It would be nice to have a policy and garbage collector (maybe as an
opt-in) for lookaside cache data.

Comment 1 Pavel Raiskup 2017-06-01 09:26:22 UTC
Per @clime's comment from the very valuable (but long) discussion at
https://pagure.io/copr/copr/issue/68:

> The value is that COPR maintainers will not need to spend nights on fixing
> machine that is broken because it is being flooded with tons of data that
> no-one really needs for anything.

I completely agree.  We should let users define *what's* important for them.
There's already a configurable "build garbage collector" on the backend side;
let's have something similar for the "dist-git" side.

Comment 2 clime 2017-06-02 09:26:54 UTC
Sorry but this is not a good idea from my point of view. It will be very difficult to define what is garbage and what is not. The implementation would be cumbersome and we cannot garbage collect Git repos into which users push their work if copr-dist-git becomes open.

Comment 3 Pavel Raiskup 2017-06-02 13:50:26 UTC
I agree it is not easy, at least.  As a starting point we could remove
dist-git data related to coprs which are already deleted (completely;
including the git content).  A request for removal could be generated
similarly to the request to delete data on the backend ... except that it
would be processed by the copr-dist-git service.

---

Otherwise, as a second step -- pruning the dist-git data for "still
existing" coprs is a sensitive topic.  I doubt we can even touch the git
repositories (though git data eats a small chunk of storage, compared to
lookaside).  And when garbage collecting the lookaside cache, we need to
be super cautious (maybe we should have a two-phase process, see the
pseudocode below).

We could have a script which walks through the remaining dist-git
repositories and removes files which are:

  * not referenced by any source file in dist-git branches
  * not referenced by any source file in git history, say for N commits
    back
  * marked to be deleted for two consecutive runs

Simplified pseudo-code could be:

  for each user and copr project:
    keep_files = {}
    for each dist-git branch:
      for each of the last N commits:
        for each (name, sum) listed in 'sources':
          keep_files[name] = sum

    # Data from previous run (1 month back e.g.)
    old_files_to_remove = load('<backup>')

    files_to_remove = []
    for each file in lookaside:
      if file is younger than X days:
        continue

      if file not in keep_files:
        # Candidate for removal!
        if file not in old_files_to_remove:
          # Marked for removal for the first time (based on 'sources'
          # files), don't remove it now.
          files_to_remove.append(file)
        else:
          # Marked for removal for the second time.
          rm -f file

    store(files_to_remove, '<backup>')
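
For illustration only, the mark-then-remove pass from the pseudo-code could
look roughly like this in Python.  This is a sketch under assumptions, not an
existing Copr API: `sweep_lookaside`, its parameters and the JSON state file
are hypothetical names, the set of still-referenced files is assumed to be
computed elsewhere from the 'sources' files, and files are identified by their
path relative to the project's lookaside directory rather than by name and
checksum.

```python
import json
import time
from pathlib import Path


def sweep_lookaside(lookaside_dir, keep_files, state_file,
                    min_age_days=30, now=None):
    """Run one garbage-collection pass over a project's lookaside directory.

    keep_files: set of relative file names still referenced by 'sources'.
    state_file: JSON list of names marked for removal by the previous pass
                (hypothetical format; must live outside lookaside_dir).
    Returns (newly_marked, removed) as two lists of relative names.
    """
    lookaside_dir = Path(lookaside_dir)
    state_file = Path(state_file)
    now = time.time() if now is None else now

    # Candidates recorded by the previous run (empty on the very first run).
    previously_marked = set()
    if state_file.exists():
        previously_marked = set(json.loads(state_file.read_text()))

    newly_marked, removed = [], []
    for path in sorted(p for p in lookaside_dir.rglob("*") if p.is_file()):
        name = str(path.relative_to(lookaside_dir))
        age_days = (now - path.stat().st_mtime) / 86400
        # Young files and files still referenced are never touched.
        if age_days < min_age_days or name in keep_files:
            continue
        if name in previously_marked:
            # Unreferenced in two consecutive runs: remove it.
            path.unlink()
            removed.append(name)
        else:
            # Unreferenced for the first time: only record it.
            newly_marked.append(name)

    state_file.write_text(json.dumps(newly_marked))
    return newly_marked, removed
```

A file that becomes referenced again between two runs simply drops out of the
candidate list, so only files unreferenced in both runs are deleted.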

The <backup> file should be available over http, e.g.:

  $ curl http://dist-git.copr.com/praiskup/project/to_be_removed.txt
  # These files will be removed at the earliest on 2019-05-01
  ef079f3e8a5160f33a09719093a08efb README.xz
  9e854df51ca3fef8bfe566dbd7b89241 linux-3.18.tar.xz
  ...

So users can review what's going to be removed "soon" and tell us that we
have some bug in our script before we even remove the data.
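
A minimal sketch of how such a listing could be written and read back,
assuming the layout shown in the example above (a "#" comment line with the
earliest removal date, then one "checksum name" pair per line).  The function
names `render_removal_list` and `parse_removal_list` are hypothetical, not
part of copr-dist-git.

```python
from datetime import date


def render_removal_list(entries, earliest):
    """Format (checksum, name) pairs as the public to_be_removed.txt text."""
    header = ("# These files will be removed at the earliest on "
              + earliest.isoformat())
    # Sort by file name so consecutive listings are easy to compare.
    body = [f"{checksum} {name}"
            for checksum, name in sorted(entries, key=lambda e: e[1])]
    return "\n".join([header] + body) + "\n"


def parse_removal_list(text):
    """Return the (checksum, name) pairs from a listing, skipping comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        checksum, name = line.split(maxsplit=1)
        entries.append((checksum, name))
    return entries
```

Keeping the format this plain means users can check it with curl and grep,
and the collector can reload the same file as its state from the previous run.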

Such a garbage collector shouldn't be unconditionally enabled -- we should
rather have a toggle button for it ...

Comment 4 Pavel Raiskup 2020-12-14 14:09:03 UTC
Implemented in:
https://pagure.io/copr/copr/issue/612

Comment 5 Pavel Raiskup 2020-12-14 14:09:52 UTC
Meh, I closed the wrong bug, sorry.

Comment 6 Jakub Kadlčík 2023-04-03 18:45:02 UTC
The bugs related to the Copr build system have been migrated to the
default Copr team tracker:
https://github.com/fedora-copr/copr/issues/2648

