Description of problem: I'm running rpmdeplint against a package with lots of files being packaged (aprox. 10000). While arguably we should perhaps change the packaging (which will be tracked in a different bug), it still remains an interesting test case for rpmdeplint. The whole check for the package takes about 40min. It seems the lion share of the time spend is on the file conflict checking. Version-Release number of selected component (if applicable): 1.3 How reproducible: 100% Steps to Reproduce: 1. Pick a package with lots of files. The package under test has lots of rubygems bundled up. 2. Run rpmdeplint against the RPM. Actual results: Overall check takes very long (about 40min). Expected results: Quicker result would be handy.
I got as far as figuring out that the main slowness happens in _solvables_with_file, and it only happens for certain filenames and not others. Looking up solvables containing /etc/errata takes 13 ms but looking up solvables containing /var/www/errata_rails takes almost 300 ms. Could it be due to primary vs. filelists metadata (REPO_EXTEND_SOLVABLES)? Need to bust out gdb or gperf to figure it out. Start from libsolv's dataiterator_filelistcheck function.
I hacked up a C implementation roughly equivalent to what check-conflicts is doing here, and ran callgrind over it while it was doing the step to look up all the solvables containing each file. The callgrind profile shows that it spends a large proportion of its time 32% in datamatcher_checkbasename here, doing string comparisons: https://github.com/openSUSE/libsolv/blob/0.6.30/src/repodata.c#L1796 The total time for that dataiterator_step function is 96% of which 46% is self, so that function itself is also doing a lot of work aside from the hot string comparisons in datamatcher_checkbasename. There is definitely some difference internally within libsolv between how it represents files from primary vs. filelists which I think is making a difference here. I probably need to dig further into how the representations work inside libsolv to see if there is some better way to find packages containing a file.
I got completely lost in the internals of libsolv and how it manages its repodata in memory... but I *did* stumble across some success by fiddling with the flags when loading repodata into the pool. I guess because it influences how the various pieces get represented. I started out with: repo_add_rpmmd(repo, primary, 0, 0); ... repo_add_rpmmd(repo, filelists, 0, REPO_EXTEND_SOLVABLES|REPO_LOCALPOOL); which I probably just cargo-culted from the libsolv examples. It also matches what hawkey will do. My test program measures the time to load RHEL6 repodata, and then the time to find all packages containing /etc/errata, and the time to find all packages containing /var/www/errata_rails/Makefile (100 times over to get an average). With this version the times I got were: Loading repodata: 11096 ms Packages containing /etc/errata: 7 ms Packages containing /var/www/errata_rails/Makefile: 214 ms Next I tried taking out REPO_LOCALPOOL, since we only ever have one pool in rpmdeplint and "polluting" is not an issue (we probably *want* that). repo_add_rpmmd(repo, primary, 0, 0); ... repo_add_rpmmd(repo, filelists, 0, REPO_EXTEND_SOLVABLES); It trades a bit of extra loading time for a slightly faster lookup time: Loading repodata: 11364 ms Packages containing /etc/errata: 7 ms Packages containing /var/www/errata_rails/Makefile: 205 ms However I then noticed in libsolv-bindings.txt it says: *REPO_NO_INTERNALIZE*:: Do not internalize the added repository data. This is useful if you plan to add more data because internalization is a costly operation. I figured it would make sense to skip internalizing on the first round and let it happen only after loading the filelists. repo_add_rpmmd(repo, primary, 0, REPO_NO_INTERNALIZE); ... repo_add_rpmmd(repo, filelists, 0, REPO_EXTEND_SOLVABLES); And it turns out this makes a huge difference to the performance: Loading repodata: 9061 ms Packages containing /etc/errata: 1 ms Packages containing /var/www/errata_rails/Makefile: 2 ms Presumably this is because, if everything is "internalized" immediately after loading primary, it *doesn't happen again* and the filelists are then left un-internalized.
Hmm unfortunately that also makes it not actually match any packages when searching for the file :-( which is presumably why it's so fast...
So if we assume that libsolv is going to be slow at finding all packages containing a particular filename (because its in-memory representation of packages is just not optimized for doing that over and over again with different filenames)... we need a different algorithm. Instead of iterating over each filename in the package under test, and then searching for other packages containing that filename, we can just iterate over *every* other package and retrieve its filelist, and use Python set operations for look for any overlaps. https://gerrit.beaker-project.org/5894 This takes the running time of the example in comment 2 down from ~40 minutes to about 23 seconds (excluding the time taken to download repodata). The memory cost will be higher but we don't need to worry much about that I think.
FEDORA-2023-a965252f36 has been submitted as an update to Fedora 39. https://bodhi.fedoraproject.org/updates/FEDORA-2023-a965252f36
FEDORA-2023-a965252f36 has been pushed to the Fedora 39 testing repository. Soon you'll be able to install the update with the following command: `sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2023-a965252f36` You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2023-a965252f36 See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.
FEDORA-2023-a965252f36 has been pushed to the Fedora 39 stable repository. If problem still persists, please make note of it in this bug report.