Bug 1465736 - Testing packages with lots of files is slow
Summary: Testing packages with lots of files is slow
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: rpmdeplint
Version: rawhide
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
Assignee: Miroslav Vadkerti
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2017-06-28 04:57 UTC by Roman Joost
Modified: 2023-11-03 18:26 UTC (History)
5 users (show)

Fixed In Version: rpmdeplint-2.0~rc3-1.fc39
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-03 18:26:19 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)

Description Roman Joost 2017-06-28 04:57:24 UTC
Description of problem:

I'm running rpmdeplint against a package with lots of files being packaged (aprox. 10000). While arguably we should perhaps change the packaging (which will be tracked in a different bug), it still remains an interesting test case for rpmdeplint. The whole check for the package takes about 40min. It seems the lion share of the time spend is on the file conflict checking.


Version-Release number of selected component (if applicable):

1.3

How reproducible:
100%

Steps to Reproduce:
1. Pick a package with lots of files. The package under test has lots of rubygems bundled up.
2. Run rpmdeplint against the RPM.

Actual results:

Overall check takes very long (about 40min).

Expected results:

Quicker result would be handy.

Comment 3 Dan Callaghan 2017-06-30 07:21:42 UTC
I got as far as figuring out that the main slowness happens in _solvables_with_file, and it only happens for certain filenames and not others. Looking up solvables containing /etc/errata takes 13 ms but looking up solvables containing /var/www/errata_rails takes almost 300 ms. Could it be due to primary vs. filelists metadata (REPO_EXTEND_SOLVABLES)? Need to bust out gdb or gperf to figure it out. Start from libsolv's dataiterator_filelistcheck function.

Comment 4 Dan Callaghan 2017-10-30 23:19:43 UTC
I hacked up a C implementation roughly equivalent to what check-conflicts is doing here, and ran callgrind over it while it was doing the step to look up all the solvables containing each file.

The callgrind profile shows that it spends a large proportion of its time 32% in datamatcher_checkbasename here, doing string comparisons:

https://github.com/openSUSE/libsolv/blob/0.6.30/src/repodata.c#L1796

The total time for that dataiterator_step function is 96% of which 46% is self, so that function itself is also doing a lot of work aside from the hot string comparisons in datamatcher_checkbasename.

There is definitely some difference internally within libsolv between how it represents files from primary vs. filelists which I think is making a difference here. I probably need to dig further into how the representations work inside libsolv to see if there is some better way to find packages containing a file.

Comment 5 Dan Callaghan 2017-10-31 02:25:46 UTC
I got completely lost in the internals of libsolv and how it manages its repodata in memory... but I *did* stumble across some success by fiddling with the flags when loading repodata into the pool. I guess because it influences how the various pieces get represented.

I started out with:

    repo_add_rpmmd(repo, primary, 0, 0);
    ...
    repo_add_rpmmd(repo, filelists, 0, REPO_EXTEND_SOLVABLES|REPO_LOCALPOOL);

which I probably just cargo-culted from the libsolv examples. It also matches what hawkey will do.

My test program measures the time to load RHEL6 repodata, and then the time to find all packages containing /etc/errata, and the time to find all packages containing /var/www/errata_rails/Makefile (100 times over to get an average).

With this version the times I got were:

Loading repodata: 11096 ms
Packages containing /etc/errata: 7 ms
Packages containing /var/www/errata_rails/Makefile: 214 ms

Next I tried taking out REPO_LOCALPOOL, since we only ever have one pool in rpmdeplint and "polluting" is not an issue (we probably *want* that).

    repo_add_rpmmd(repo, primary, 0, 0);
    ...
    repo_add_rpmmd(repo, filelists, 0, REPO_EXTEND_SOLVABLES);

It trades a bit of extra loading time for a slightly faster lookup time:

Loading repodata: 11364 ms
Packages containing /etc/errata: 7 ms
Packages containing /var/www/errata_rails/Makefile: 205 ms

However I then noticed in libsolv-bindings.txt it says:

*REPO_NO_INTERNALIZE*::
Do not internalize the added repository data. This is useful if
you plan to add more data because internalization is a costly
operation.

I figured it would make sense to skip internalizing on the first round and let it happen only after loading the filelists.

    repo_add_rpmmd(repo, primary, 0, REPO_NO_INTERNALIZE);
    ...
    repo_add_rpmmd(repo, filelists, 0, REPO_EXTEND_SOLVABLES);

And it turns out this makes a huge difference to the performance:

Loading repodata: 9061 ms
Packages containing /etc/errata: 1 ms
Packages containing /var/www/errata_rails/Makefile: 2 ms

Presumably this is because, if everything is "internalized" immediately after loading primary, it *doesn't happen again* and the filelists are then left un-internalized.

Comment 6 Dan Callaghan 2017-10-31 02:36:05 UTC
Hmm unfortunately that also makes it not actually match any packages when searching for the file :-( which is presumably why it's so fast...

Comment 7 Dan Callaghan 2017-10-31 03:51:26 UTC
So if we assume that libsolv is going to be slow at finding all packages containing a particular filename (because its in-memory representation of packages is just not optimized for doing that over and over again with different filenames)... we need a different algorithm.

Instead of iterating over each filename in the package under test, and then searching for other packages containing that filename, we can just iterate over *every* other package and retrieve its filelist, and use Python set operations for look for any overlaps.

https://gerrit.beaker-project.org/5894

This takes the running time of the example in comment 2 down from ~40 minutes to about 23 seconds (excluding the time taken to download repodata). The memory cost will be higher but we don't need to worry much about that I think.

Comment 10 Fedora Update System 2023-10-02 08:04:04 UTC
FEDORA-2023-a965252f36 has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-a965252f36

Comment 11 Fedora Update System 2023-10-03 03:39:44 UTC
FEDORA-2023-a965252f36 has been pushed to the Fedora 39 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2023-a965252f36`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2023-a965252f36

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 12 Fedora Update System 2023-11-03 18:26:19 UTC
FEDORA-2023-a965252f36 has been pushed to the Fedora 39 stable repository.
If problem still persists, please make note of it in this bug report.


Note You need to log in before you can comment on or make changes to this bug.