Bug 1343247 - [RFE] cache and re-use repodata instead of fetching it every time
Summary: [RFE] cache and re-use repodata instead of fetching it every time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: rpmdeplint
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: unspecified
Target Milestone: ---
Assignee: Miroslav Vadkerti
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-06-06 23:00 UTC by Dan Callaghan
Modified: 2023-11-03 18:26 UTC
CC List: 6 users

Fixed In Version: rpmdeplint-2.0~rc3-1.fc39
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-03 18:26:08 UTC
Type: Bug
Embargoed:



Description Dan Callaghan 2016-06-06 23:00:44 UTC
Rpmdeplint should take advantage of repomd.xml timestamps and HTTP caching mechanisms to avoid re-downloading repodata whenever possible, similar to how dnf does it currently.

Ideally this would be handled for us by libhif or some other library. See also bug 1340565.

Comment 1 Jon Orris 2016-06-10 00:13:05 UTC
I'll take this on as soon as the sorting bits in bug 1340565 are sorted out.

Comment 2 Jon Orris 2016-06-10 20:07:02 UTC
Patch submitted:

https://gerrit.beaker-project.org/#/c/4980/

Comment 3 Roman Joost 2016-07-19 23:43:34 UTC
The patch in comment 2 was abandoned, because we anticipate that users will most likely run rpmdeplint concurrently on the same machine (in a Jenkins environment). Caching data correctly in a concurrent environment turned out to be non-trivial, so we'll defer this for now as a possible nice-to-have for later.

Comment 5 Dan Callaghan 2016-12-14 05:04:07 UTC
Blake, I know you have been working on this for a while already so I don't want to derail you but I thought I would give a bit of a braindump of some thoughts I had around this, in case any of this is helpful...

First up, note that we are not constrained by RHEL7 anymore, since we are planning to deploy rpm-test-trigger on F25. So if we decide to implement this caching using something from libhif/libdnf, such that rpmdeplint is no longer usable on RHEL7, I *think* that might be an acceptable tradeoff. (Ideally we would come up with some fallback code path for people who still want to use rpmdeplint on RHEL7, though.) That is one difference from Jon's previous attempt. But... it looks like libhif/libdnf is still vaporware, in the sense that there is no usable release yet. :-(

Some other "requirements" in the sense of things that it will be important that we get right:

* We need to always ensure we are using the latest repodata; the cache is purely a way of avoiding wasted bandwidth. There should never be a question of "are my test results incorrect because I need to clear the cache?", because cache freshness is checked every time.

* HTTP provides mechanisms for dealing with caching (conditional requests with If-Modified-Since) which we should take advantage of. The repodata itself also has mechanisms to help with this (repomd.xml has timestamps and checksums for the other files in the repodata) so we should make use of that also.

* We know we will have many concurrent rpmdeplint processes running on the same machine. At the very least, we need that to work correctly (no rpmdeplint process should ever use an incomplete cached repodata) while still letting them re-use each other's caches.

* Even better if rpmdeplint processes which are using different repos are not unnecessarily serialized on each other (for example, if process A needs RHEL7+HA and process B needs RHEL7+Optional, process B should not get stuck waiting for process A to download HA before process B downloads Optional).

* We can't assume anything about the repo names/ids because they are generated and meaningless (RHEL0, RHEL1, etc) and RHEL0 in one rpmdeplint process is possibly different from RHEL0 in another. This means we can't key the cache on the repo name (like yum and dnf do). We probably have to key the cache on the full repo URL. I wrote check-yum-repos.py in beaker-project.org to do that, for example, with - substituted for / to make the URLs filesystem-safe.
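The URL-keyed scheme from the last bullet could be sketched as below. This is a minimal illustration, not rpmdeplint's actual code; the function name is hypothetical, and the only behaviour assumed is the one described above (substituting `-` for `/` to make the URL filesystem-safe, as in check-yum-repos.py):

```python
def cache_key_for_repo_url(url: str) -> str:
    """Derive a filesystem-safe cache directory name from a repo URL.

    Hypothetical helper: replaces the scheme separator and every '/'
    with '-' so the full URL can serve directly as a directory name.
    """
    return url.replace("://", "-").replace("/", "-")
```

Keying on the full URL this way means two processes pointed at the same repo share a cache entry regardless of what meaningless repo id (RHEL0, RHEL1, ...) each was given.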

Comment 6 Dan Callaghan 2016-12-14 05:20:41 UTC
As far as implementing the storage of the cache itself, we have two basic approaches which can help us:

* flock
* the standard write-temp-file-then-rename dance
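The write-temp-file-then-rename dance could be sketched like this (a generic POSIX atomic-write pattern; the function name is hypothetical):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data to path atomically: write to a temp file in the same
    directory, then rename it over the final name. Readers see either
    the old complete file or the new complete file, never a partial one."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # rename() is atomic on POSIX when source and target are on
        # the same filesystem -- hence the temp file lives alongside
        # the destination, not in /tmp.
        os.rename(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```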

The simplest approach would be for every rpmdeplint process to use /var/cache/rpmdeplint and hold a flock over that directory for the entire duration of the run. That's what yum does. The downside is that it will effectively serialize every rpmdeplint process on the whole system, which is not ideal and will basically ruin the performance of rpm-test-trigger.

A tempting approach might be to cache repodata in /var/cache/rpmdeplint/<url-of-repo> and have rpmdeplint acquire a flock on the corresponding directory for each repo it needs. The problem there is that you must be very careful to acquire the flocks in the same order in every process every time, otherwise they will deadlock. It's also still wasteful: every rpmdeplint process testing RHEL7, or any combination of RHEL7 plus addons or layered products, will be serialized on the lock for the RHEL7 base repo's cache.

Essentially, any approach which relies on acquiring a flock, updating the cache, and then holding the flock while performing the rest of the analysis is going to ruin concurrency of rpmdeplint processes on the system, because they will always be contending.

What we really want is for each rpmdeplint process to be able to populate or re-use its cache for each repo it wants, and then proceed with its analysis using a snapshot of that cache which may be reused (but not overwritten!) by any other rpmdeplint process.

One approach might be to key the cache not just on the repo URL, but on the repo URL plus timestamp (or ETag or similar). Each rpmdeplint process can then either re-use that repodata if it's still fresh, or write a *new separate* copy if it has fetched a newer version, with the old version left behind for any existing rpmdeplint processes still using it. It would just need some LRU cleanup mechanism to clear out stale cache entries.

I had another idea too, which might be crazier or it might be nice and simple... inspired by git's object store, and the fact that repomd.xml gives us the checksum of each repodata file. We could use a "content-addressable" scheme to store repodata files (except repomd.xml itself, which is small and constantly changing), for example:

/var/cache/rpmdeplint/<checksum-type>/<checksum-first-letter>/<rest-of-checksum>

Each file is immutable because it's named after its checksum. We can use the write-temp-file-then-rename dance to ensure that each file in the cache is atomically created and processes do not accidentally use an incomplete file. When rpmdeplint wants to use a repo, it first unconditionally fetches repomd.xml (caching these is not worthwhile anyway) and then for each file in the repodata, it can use the checksum to look in the cache if it already has the file, else it downloads it and puts it into the cache.

With that scheme, every rpmdeplint process can run concurrently without holding any locks, and unrelated repos can be downloaded concurrently as well without serializing. As above, it would just need an LRU cleanup mechanism in place to ensure that /var/cache/rpmdeplint doesn't grow forever: basically, just before writing any new files into the cache, go through and unlink any file more than 1 week old (or something like that).
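A minimal sketch of this content-addressable layout, including the age-based cleanup, might look like the following. All names here are hypothetical, the checksum is assumed to come from repomd.xml, and the `.part` temp name is a simplification (a mkstemp-style random name would avoid collisions between concurrent downloaders):

```python
import os
import time

def cache_path(cache_root: str, checksum_type: str, checksum: str) -> str:
    """Map a repodata file's checksum (taken from repomd.xml) to its
    content-addressed location: <root>/<type>/<first-letter>/<rest>."""
    return os.path.join(cache_root, checksum_type, checksum[0], checksum[1:])

def fetch_repodata_file(cache_root, checksum_type, checksum, download):
    """Return the path of the cached file, downloading and atomically
    installing it on a cache miss. `download` is a hypothetical
    callable returning the file's bytes."""
    path = cache_path(cache_root, checksum_type, checksum)
    if os.path.exists(path):
        return path  # cache hit: the file is immutable, safe to reuse
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".part"
    with open(tmp, "wb") as f:
        f.write(download())
    os.rename(tmp, path)  # atomic: no reader ever sees a partial file
    return path

def expire_cache(cache_root, max_age_seconds=7 * 24 * 3600):
    """Unlink cache files not modified within max_age_seconds -- the
    one-week cleanup suggested above."""
    cutoff = time.time() - max_age_seconds
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.unlink(path)
```

Because each cached file is named after its own checksum and never rewritten, readers need no locks at all: either the file exists (complete, by construction) or it doesn't.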

Comment 7 Tyrone Abdy 2016-12-14 05:35:02 UTC
One caveat with the rename process above: you might want to check for an in-progress temp file, to avoid the possibility of two rpmdeplint processes downloading the same data for the cache.

e.g.

if you run two commands concurrently, such as:
rpmdeplint --repo=RHEL0,http://repo1 package1.rpm
rpmdeplint --repo=RHEL0,http://repo1 package2.rpm

they might conflict on the caching front.

Comment 8 Dan Callaghan 2016-12-14 05:40:08 UTC
(In reply to Tyrone Abdy from comment #7)

Yep that's a good point. There are two options:

1. allow each process to download the same file at the same time, and rely on the temp-file-rename dance to ensure they both successfully add their copy to the cache without crashing or breaking atomicity (so everything works, but we waste bandwidth downloading the same file twice); or

2. use a predictable name for the temp file, and do O_CREAT|O_EXCL with a flock to ensure that only one process at a time can be downloading the file (the other process will wait for the flock and then just use the copy which the first process downloaded)

Option 2 saves bandwidth but adds a lot more complexity to get right. So it might be worth going with option 1 first (you kind of get it for free anyway, if the temp-file-then-rename dance is implemented correctly) and then we can expand it to option 2 later.
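Option 2 could be sketched roughly as below. This is an illustrative simplification, not the actual implementation: the helper names are hypothetical, and it uses a separate predictable lock file with flock (rather than O_EXCL on the temp file itself) so that the waiting process simply finds the finished download once the lock is released:

```python
import fcntl
import os

def download_once(final_path, lock_path, download):
    """Ensure only one process downloads final_path at a time; others
    block on the flock and then reuse the first process's copy.
    `download` is a hypothetical callable returning the file's bytes."""
    lock_fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        # Blocks while another process holds the lock (i.e. is downloading).
        fcntl.flock(lock_fd, fcntl.LOCK_EX)
        if not os.path.exists(final_path):
            tmp = final_path + ".part"
            with open(tmp, "wb") as f:
                f.write(download())
            os.rename(tmp, final_path)  # atomic install into the cache
    finally:
        fcntl.flock(lock_fd, fcntl.LOCK_UN)
        os.close(lock_fd)
    return final_path
```

The existence check happens *inside* the lock, so the second process to acquire it sees the completed file and skips the download entirely.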

Comment 9 Dan Callaghan 2016-12-14 09:23:26 UTC
Oh I just remembered we can't use /var/cache/rpmdeplint because we will be running rpmdeplint as unprivileged processes. Safely sharing the cache across different user accounts will be way too much of a headache to bother with. So probably we should just put the cache in $XDG_CACHE_HOME/rpmdeplint.

And then for rpm-test-trigger, we should be running it as its own user account with a homedir of /var/lib/rpm-test-trigger and so its rpmdeplint cache would therefore go into /var/lib/rpm-test-trigger/.cache/rpmdeplint which is probably what we want.

https://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html
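Resolving the cache directory per that spec is straightforward; a sketch (function name hypothetical):

```python
import os

def rpmdeplint_cache_dir() -> str:
    """Resolve the cache directory per the XDG Base Directory spec:
    $XDG_CACHE_HOME/rpmdeplint if set, else ~/.cache/rpmdeplint."""
    base = os.environ.get("XDG_CACHE_HOME") or os.path.expanduser("~/.cache")
    return os.path.join(base, "rpmdeplint")
```

With HOME=/var/lib/rpm-test-trigger and XDG_CACHE_HOME unset, this yields the /var/lib/rpm-test-trigger/.cache/rpmdeplint path mentioned above.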

Comment 10 Jon Orris 2016-12-15 00:05:56 UTC
The way I was thinking about this was to have caching be optional, not the default. Additionally, the user would choose the cache directory. That way we don't have to worry about locking into one specific cache location vs another.

In the case of multiple processes, I wonder whether rpmdeplint/test trigger is the right place to solve that problem. I think it would be better to have a separate process, using Dan's ideas in comment #6, manage the multi-process case. The individual rpmdeplint processes could then use the shared cache directories without each having to manage the cache itself.

Comment 11 Blake McIvor 2016-12-21 03:05:31 UTC
https://gerrit.beaker-project.org/#/c/5558/

Comment 12 Dan Callaghan 2017-04-28 04:51:50 UTC
So this is merged for 1.3 but there are a few minor problems I have noticed.

The current version is not keeping cache entries as:

    .cache/rpmdeplint/<checksum-first-letter>/<rest-of-checksum>

but actually as:

    .cache/rpmdeplint/<checksum-first-letter>/<rest-of-checksum>/<checksum>-primary.xml.gz

It turns out that is necessary, with the current hawkey APIs, because if the filename is passed in as just a long hex string (with no file extension) libsolv does not see the .gz extension so it does not know to transparently decompress while reading. That manifests as an exception from hawkey:

    _hawkey.Exception: repo_add_repomdxml/rpmmd() has failed.

but if you poke into the libsolv pool_errstr() using gdb the error is:

    $6 = 0x555555f7a860 "repo_rpmmd: Document is empty\n at line 1:1"

because the file only contains compressed gzip binary garbage.

The downside with that naming scheme, where each cache entry is its own directory, is that you end up with a very large number of empty directories left behind. The cache cleaning logic only removes files; it doesn't handle removing empty parent directories.

The other, maybe more serious, problem has the same root cause (hawkey needs to be given a filename, not an open file descriptor). rpmdeplint first opens a cache entry to make sure it exists, then closes it and passes the cache filename to hawkey so that libsolv can open it again. The window between closing and reopening is a race during which another rpmdeplint process could unlink the cache entry.

It turns out if we were making libsolv calls directly, this is easily avoidable (libsolv lets the caller handle file opening, and it has solv_xfopen which takes a filename plus open file descriptor which is what we need for our cache handling here). Igor actually suggested switching to libsolv calls directly in bug 1422873 and it seems like it's quite a reasonable idea.

So I think we should disable this caching behaviour by default in 1.3 (or take it out entirely, if we need to) and perhaps try switching to libsolv and then use that to address the above issues.

Comment 13 Dan Callaghan 2017-04-28 05:28:37 UTC
I have tagged rpmdeplint 1.3 based on 1a6b1bf9cebced3a2793228ced9b180d91ff1f52 which is the commit before this was introduced (but contains all our other fixes). So this is still MODIFIED because it's on master but will be in a later release, once we figure out what we want to do.

Comment 14 Dan Callaghan 2017-04-29 00:51:44 UTC
The other thing is that cache expiry is based on file modification time, but when rpmdeplint reuses a cached file it doesn't touch the file to bump the modification time. So the cache is not actually retaining the most recently used entries right now.
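The fix for this is a one-liner on the cache-hit path; a sketch (function name hypothetical):

```python
import os

def reuse_cache_entry(path: str) -> str:
    """Bump the entry's mtime on every cache hit so that mtime-based
    expiry actually approximates least-recently-used, addressing the
    problem noted above."""
    os.utime(path, None)  # set atime and mtime to the current time
    return path
```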

Comment 15 Dan Callaghan 2017-06-29 06:32:36 UTC
Moving to 2.0 since it needs the libsolv patch.

Comment 17 Dan Callaghan 2017-07-03 05:16:24 UTC
One more problem with the original caching patch... rpmdeplint now leaks *two* directories in /var/tmp/rpmdeplint-* per repo per invocation. So we'll need to fix that.

Comment 18 Dan Callaghan 2017-07-03 05:27:38 UTC
Also, the code to download primary.xml.gz/filelists.xml.gz using requests doesn't work if the repo uses a metalink or mirrorlist; it assumes there is only a single baseurl.

Comment 19 Igor Gnatenko 2017-07-03 06:16:12 UTC
(In reply to Dan Callaghan from comment #18)
> Also the code to download primary.xml.gz/filelists.xml.gz using requests
> doesn't work if the repo is using a metalink or mirrorlist, it assumes there
> is only a single baseurl.

At this point you will start reimplementing librepo :)

Comment 20 Dan Callaghan 2017-07-03 06:30:27 UTC
Yeah... I really don't want to do that! The problem is librepo doesn't give much control over where/how the downloaded files are written to disk, so that we can implement this hash-addressed caching scheme. That's why we started using python-requests for some of the downloads.

I was also just looking at the package downloading code (currently still using librepo, but I want it to go through the same cache like the repodata downloads).

I probably need to see if there are some enhancements we could request in the librepo API to support this use case -- with our custom cache layout. Maybe to allow passing an open file descriptor where to write each downloaded file or something...

Comment 21 Jan Hutař 2017-08-04 06:16:21 UTC
What about changing the non-unique, non-descriptive repo names (RHEL0 -> https://cdn.redhat.com/content/dist/rhel/server/6/6Server/x86_64/os/; RHEL1 -> https://cdn.redhat.com/content/dist/rhel/server/6/6Server/x86_64/highavailability/os/) to ugly labels which uniquely encode the repo URL (RHEL0 -> https___cdn_redhat_com_content_dist_rhel_server_6_6Server_x86_64_os_), so that librepo could be used even with multiple rpmdeplint processes running in parallel? Or would that serialize them anyway?

Comment 22 Dan Callaghan 2017-08-07 23:27:23 UTC
(In reply to Jan Hutař from comment #21)

Yeah the point of this somewhat unique cache design is to allow many concurrent processes using the cache *and* still reuse repodata files when they are seen repeatedly (even if under different ids or URLs).

A normal package manager doesn't have to deal with either of those cases (that is, you expect only one process on the system, and you expect the repo configuration to be relatively stable rather than changing for every run), which is why librepo's API doesn't really work for us here.

Comment 23 Dan Callaghan 2017-08-07 23:28:58 UTC
This bug is on MODIFIED because we have a bunch of code merged for it already, but the problems in comment 17 and comment 18 need to be fixed before we can release it, so I'm flipping it back to ASSIGNED.

Comment 26 Fedora Update System 2023-10-02 08:03:51 UTC
FEDORA-2023-a965252f36 has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-a965252f36

Comment 27 Fedora Update System 2023-10-03 03:39:33 UTC
FEDORA-2023-a965252f36 has been pushed to the Fedora 39 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2023-a965252f36`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2023-a965252f36

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 28 Fedora Update System 2023-11-03 18:26:08 UTC
FEDORA-2023-a965252f36 has been pushed to the Fedora 39 stable repository.
If the problem still persists, please make note of it in this bug report.

