Bug 1340565 - each invocation leaves new orphaned directories in /tmp
Summary: each invocation leaves new orphaned directories in /tmp
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: rpmdeplint
Classification: Community
Component: general
Version: unreleased
Hardware: Unspecified
OS: Unspecified
high
unspecified
Target Milestone: 1.0
Assignee: Roman Joost
QA Contact: tools-bugs
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-05-27 22:13 UTC by Dan Callaghan
Modified: 2016-08-11 23:24 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-19 23:51:56 UTC
Embargoed:


Attachments (Terms of Use)

Description Dan Callaghan 2016-05-27 22:13:00 UTC
Each time rpmdeplint is run, repodata is fetched into a new /tmp directory. These are never reused and also never cleaned up.

Comment 1 Dan Callaghan 2016-05-27 22:15:27 UTC
We probably want to use a well-defined cache directory ($XDG_CACHE_HOME/rpmdeplint) with each repo's repodata cached in a subdirectory based on the repo name or URL.

To prevent it growing unboundedly we should probably do LRU expiration on startup -- that is, if there are more than X subdirectories in the cache, pick the oldest ones and delete them to reduce the cache down to X subdirectories, and then proceed.

Comment 2 Jon Orris 2016-05-31 21:55:11 UTC
Ideally we should also be able to cache repos by name, then only refresh all the metadata when it has actually changed.

Comment 3 Dan Callaghan 2016-05-31 22:04:12 UTC
I think librepo will do that for us already, if we just use a consistent path for the local cache across invocations, right?

Comment 4 Jon Orris 2016-06-01 14:43:29 UTC
(In reply to Dan Callaghan from comment #3)
> I think librepo will do that for us already, if we just use a consistent
> path for the local cache across invocations, right?

I don't think librepo has any functionality to do this automatically. It's a pretty low level wrapper around C librepo. 

DNF, which uses librepo, doesn't do this automatically either. It maintains a notion of cache expiry based on the age of the primary.xml.gz file, and will re-download if the data is too old (default 48 hours).

Comment 5 Jon Orris 2016-06-02 19:08:59 UTC
Basically, we want to do something like:

c = pycurl.Curl()
c.setopt(c.URL,'http://example.com/some_distro/x86_64/os/repodata/repomd.xml')
c.setopt(c.NOBODY, True)
c.setopt(pycurl.HTTPHEADER, ['If-Modified-Since: %s' % our_repomd_timestamp])
c.perform()
response = c.getinfo(pycurl.HTTP_CODE) 

if response == 302:
   return
elif response == 200:
   refresh_metadata()

Comment 6 Jon Orris 2016-06-02 19:10:36 UTC
typo: 304, not 302

Comment 7 Dan Callaghan 2016-06-06 22:56:47 UTC
So I guess there are really two separate issues here, which I was conflating:

* rpmdeplint is downloading the repodata every time it's invoked (which makes it take easily 10-60 seconds for a single run) rather than caching it on disk and reusing it whenever possible, like yum/dnf does

* rpmdeplint is leaving behind repodata in /tmp which is never cleaned up and will quickly grow to waste a lot of space

I assumed that the re-use of cached repodata would come basically for free from librepo, but it seems like right now it's actually handled by dnf instead, not by librepo, so we would have to reimplement that same logic in rpmdeplint.

It seems like in the (near?) future, we could switch to libhif which wraps up librepo and libhawkey and other libraries to provide more package-manager-like functionality, presumably including this logic for caching and re-using downloaded repodata. But that's a fairly big piece of work.

So for now let's limit this bug to just be about the repodata not cleaned up in /tmp. I will file a separate RFE for caching and re-using repodata.

Comment 8 Dan Callaghan 2016-06-06 22:58:20 UTC
(In reply to Dan Callaghan from comment #7)
> I assumed that the re-use of cached repodata would come basically for free
> from librepo, but it seems like right now it's actually handled by dnf
> instead, not by librepo, so we would have to reimplement that same logic in
> rpmdeplint.

The other thing that makes this more difficult is we would also need to implement some locking mechanisms to make it safe for multiple rpmdeplint processes to refer to the same cache without trampling on each other. Presumably this is also something that libhif will provide (it's handled by dnf now).

Comment 9 Dan Callaghan 2016-06-06 23:02:40 UTC
Bug 1343247 is the RFE for caching repodata.

Comment 10 Jon Orris 2016-06-07 21:11:13 UTC
Since we're creating sacks via hawkey.Sack(make_cache_dir=True), why don't we just use it's cache dir as the base of ours?

It's already creating it in a standard location that we can access via sack.cache_dir. If we created subdirectories like this:

sack.cache_dir
    rpmdeplint/
        reponame1/
        reponame2/

It would make it easier both to clean up unused data as well as caching and refreshing repo data as in bug 1343247


http://hawkey.readthedocs.io/en/latest/tutorial-py.html#building-and-reusing-the-repo-cache-label

Comment 11 Roman Joost 2016-06-07 23:24:33 UTC
Dear Jon,

isn't that a different item you're referring to? We're handing the repositories to hawkeys Sack, but before we do that, librepo downloads the repository information to a temporary directory.

Which means there are two instances which needs to keep track of temporary items: the sack and the repositories.

I've updated my patch to clean up created temporary directories:

https://gerrit.beaker-project.org/#/c/4948/6

Comment 12 Jon Orris 2016-06-08 02:09:02 UTC
What I'm suggesting may go against some of the refactorings that are in progress leading up to this change. My idea eliminates temp directories.

My suggestion is this:

1) Create hawkey.Sack(make_cache_dir=True) first

2) sack.cache_dir then points at a valid directory, i.e. /var/tmp/hawkey-jorris-bpjjLD. If this directory already exists it will be reused. If not, it is automatically created.

3) Use the sack.cache_dir as the root of any loaded repos, instead of creating temp directories. Then when using rpmdeplint like this:

rpmdeplint check-sat --repo=fedora24,http://example.com/fedora24/x86_64/os/ /tmp/python-rpmfluff-0.4.3-3.fc24.noarch.rpm

You would then have a directory structure like:

/var/tmp/hawkey-jorris-bpjjLD/rpmdeplint/fedora24/repodata/repomd.xml
/var/tmp/hawkey-jorris-bpjjLD/rpmdeplint/fedora24/repodata/{hash}-primary.xml.gz
/var/tmp/hawkey-jorris-bpjjLD/rpmdeplint/fedora24/repodata/{otherhash}-filelists.xml.gz

This enables us to easily support all three use cases with repo dirs whe running rpmdeplint:

1) rpmdeplint/fedora24 hasn't been touched in 3 months, I'm not using it now, might as well delete it.
2) rpmdeplint/fedora24 already exists, I'm using it now, and it's up to date.
2) rpmdeplint/fedora24 already exists, I'm using it now, and it's out of date. Update it and run.

Comment 13 Roman Joost 2016-07-19 23:51:56 UTC
This is released with rpmdeplint 1.0 available on Fedora Copr:

https://copr.fedorainfracloud.org/coprs/dcallagh/rpmdeplint/

for RHEL 7 (EPEL), F23, F24 and Rawhide.


Note You need to log in before you can comment on or make changes to this bug.