Bug 1465736 - Testing packages with lots of files is slow
Testing packages with lots of files is slow
Product: rpmdeplint
Classification: Community
Component: general (Show other bugs)
Unspecified Unspecified
unspecified Severity unspecified (vote)
: 2.0
: ---
Assigned To: Dan Callaghan
Depends On:
  Show dependency treegraph
Reported: 2017-06-28 00:57 EDT by Roman Joost
Modified: 2017-11-05 19:42 EST (History)
5 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Attachments (Terms of Use)

  None (edit)
Description Roman Joost 2017-06-28 00:57:24 EDT
Description of problem:

I'm running rpmdeplint against a package with lots of files being packaged (aprox. 10000). While arguably we should perhaps change the packaging (which will be tracked in a different bug), it still remains an interesting test case for rpmdeplint. The whole check for the package takes about 40min. It seems the lion share of the time spend is on the file conflict checking.

Version-Release number of selected component (if applicable):


How reproducible:

Steps to Reproduce:
1. Pick a package with lots of files. The package under test has lots of rubygems bundled up.
2. Run rpmdeplint against the RPM.

Actual results:

Overall check takes very long (about 40min).

Expected results:

Quicker result would be handy.
Comment 3 Dan Callaghan 2017-06-30 03:21:42 EDT
I got as far as figuring out that the main slowness happens in _solvables_with_file, and it only happens for certain filenames and not others. Looking up solvables containing /etc/errata takes 13 ms but looking up solvables containing /var/www/errata_rails takes almost 300 ms. Could it be due to primary vs. filelists metadata (REPO_EXTEND_SOLVABLES)? Need to bust out gdb or gperf to figure it out. Start from libsolv's dataiterator_filelistcheck function.
Comment 4 Dan Callaghan 2017-10-30 19:19:43 EDT
I hacked up a C implementation roughly equivalent to what check-conflicts is doing here, and ran callgrind over it while it was doing the step to look up all the solvables containing each file.

The callgrind profile shows that it spends a large proportion of its time 32% in datamatcher_checkbasename here, doing string comparisons:


The total time for that dataiterator_step function is 96% of which 46% is self, so that function itself is also doing a lot of work aside from the hot string comparisons in datamatcher_checkbasename.

There is definitely some difference internally within libsolv between how it represents files from primary vs. filelists which I think is making a difference here. I probably need to dig further into how the representations work inside libsolv to see if there is some better way to find packages containing a file.
Comment 5 Dan Callaghan 2017-10-30 22:25:46 EDT
I got completely lost in the internals of libsolv and how it manages its repodata in memory... but I *did* stumble across some success by fiddling with the flags when loading repodata into the pool. I guess because it influences how the various pieces get represented.

I started out with:

    repo_add_rpmmd(repo, primary, 0, 0);
    repo_add_rpmmd(repo, filelists, 0, REPO_EXTEND_SOLVABLES|REPO_LOCALPOOL);

which I probably just cargo-culted from the libsolv examples. It also matches what hawkey will do.

My test program measures the time to load RHEL6 repodata, and then the time to find all packages containing /etc/errata, and the time to find all packages containing /var/www/errata_rails/Makefile (100 times over to get an average).

With this version the times I got were:

Loading repodata: 11096 ms
Packages containing /etc/errata: 7 ms
Packages containing /var/www/errata_rails/Makefile: 214 ms

Next I tried taking out REPO_LOCALPOOL, since we only ever have one pool in rpmdeplint and "polluting" is not an issue (we probably *want* that).

    repo_add_rpmmd(repo, primary, 0, 0);
    repo_add_rpmmd(repo, filelists, 0, REPO_EXTEND_SOLVABLES);

It trades a bit of extra loading time for a slightly faster lookup time:

Loading repodata: 11364 ms
Packages containing /etc/errata: 7 ms
Packages containing /var/www/errata_rails/Makefile: 205 ms

However I then noticed in libsolv-bindings.txt it says:

Do not internalize the added repository data. This is useful if
you plan to add more data because internalization is a costly

I figured it would make sense to skip internalizing on the first round and let it happen only after loading the filelists.

    repo_add_rpmmd(repo, primary, 0, REPO_NO_INTERNALIZE);
    repo_add_rpmmd(repo, filelists, 0, REPO_EXTEND_SOLVABLES);

And it turns out this makes a huge difference to the performance:

Loading repodata: 9061 ms
Packages containing /etc/errata: 1 ms
Packages containing /var/www/errata_rails/Makefile: 2 ms

Presumably this is because, if everything is "internalized" immediately after loading primary, it *doesn't happen again* and the filelists are then left un-internalized.
Comment 6 Dan Callaghan 2017-10-30 22:36:05 EDT
Hmm unfortunately that also makes it not actually match any packages when searching for the file :-( which is presumably why it's so fast...
Comment 7 Dan Callaghan 2017-10-30 23:51:26 EDT
So if we assume that libsolv is going to be slow at finding all packages containing a particular filename (because its in-memory representation of packages is just not optimized for doing that over and over again with different filenames)... we need a different algorithm.

Instead of iterating over each filename in the package under test, and then searching for other packages containing that filename, we can just iterate over *every* other package and retrieve its filelist, and use Python set operations for look for any overlaps.


This takes the running time of the example in comment 2 down from ~40 minutes to about 23 seconds (excluding the time taken to download repodata). The memory cost will be higher but we don't need to worry much about that I think.

Note You need to log in before you can comment on or make changes to this bug.