Bug 1225501
| Summary: | query performance does not scale | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Daniel Mach <dmach> | ||||||
| Component: | libdnf | Assignee: | rpm-software-management | ||||||
| Status: | CLOSED UPSTREAM | QA Contact: | Fedora Extras Quality Assurance <extras-qa> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | unspecified | ||||||||
| Version: | rawhide | CC: | dmach, jmracek, jzeleny, lkocman, mluscon, mvanross, packaging-team-maint, pbrobinson, rpm-software-management, vmukhame | ||||||
| Target Milestone: | --- | Keywords: | Performance, Reopened, Tracking, Triaged | ||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | libdnf-0.14 | Doc Type: | Bug Fix | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | Environment: | ||||||||
| Last Closed: | 2018-05-29 14:20:35 UTC | Type: | Bug | ||||||
| Bug Blocks: | 1080837, 1156501 | ||||||||
| We would appreciate more data from profiler, please, to get it fixed.

*** Bug 1272109 has been marked as a duplicate of this bug. ***

This package has changed ownership in the Fedora Package Database. Reassigning to the new owner of this component.

Fedora 22 changed to end-of-life (EOL) status on 2016-07-19. Fedora 22 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.

btw, I have to note that it's impossible to get proper performance with hawkey/libdnf/dnf for repoclosure, because they don't expose libsolv objects. If you need speed, use libsolv directly.

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    14                                           @profile
    15                                           def main():
    16         1         3801   3801.0      0.0      d = dnf.Base()
    17         1           14     14.0      0.0      d.conf.cachedir = "./dnf-cache"
    18         1          789    789.0      0.0      repo = dnf.repo.Repo("repo-0", d.conf)
    19         1           83     83.0      0.0      repo.baseurl = "http://dl.fedoraproject.org/pub/fedora/linux/releases/22/Server/x86_64/os/"
    20         1           12     12.0      0.0      d.repos.add(repo)
    21                                           
    22         1         6644   6644.0      0.0      d.fill_sack(load_system_repo=False, load_available_repos=True)
    23                                           
    24                                           
    25         1           25     25.0      0.0      print("DICT CACHE")
    26                                           
    27         1            5      5.0      0.0      t10 = datetime.now()
    28                                           
    29         1          899    899.0      0.0      RELDEP_RE = re.compile("^(?P<name>.*)( (?P<flag>[<>=]+) (?P<version>.*))?$")
    30                                           
    31         1            1      1.0      0.0      pkgs_by_dep = {}   # provides_name -> [pkgs]
    32         1            1      1.0      0.0      pkgs_by_file = {}  # /file/path -> [pkgs]
    33                                           
    34      2482         7217      2.9      0.0      for pkg in d.sack.query():
    35     54090        79387      1.5      0.4          for prov in pkg.provides:
    36     51609        82624      1.6      0.4              match = RELDEP_RE.match(str(prov))
    37     51609        60466      1.2      0.3              name = match.groupdict()["name"]
    38     51609        70579      1.4      0.3              pkgs_by_dep.setdefault(name, set()).add(pkg)
    39    172182       230086      1.3      1.0          for prov in pkg.files:
    40    169701       341161      2.0      1.5              pkgs_by_file.setdefault(str(prov), set()).add(pkg)
    41                                           
    42     20158        22652      1.1      0.1      for key in pkgs_by_dep:
    43     20157      1316334     65.3      6.0          pkgs_by_dep[key] = d.sack.query().filter(pkg=pkgs_by_dep[key]).apply()
    44                                           
    45                                           
    46        11           18      1.6      0.0      for i in range(ITERATIONS):
    47     24820       129733      5.2      0.6          for pkg in d.sack.query():
    48    226700       404662      1.8      1.8              for req in pkg.requires:
    49    201890       446666      2.2      2.0                  match = RELDEP_RE.match(str(req))
    50    201890       280349      1.4      1.3                  name = match.groupdict()["name"]
    51    201890       218405      1.1      1.0                  if name.startswith("/"):
    52      4490         7620      1.7      0.0                      pkgs_by_file.get(name, [])
    53                                                           else:
    54    197400       260652      1.3      1.2                      q = pkgs_by_dep.get(name, None)
    55    197400       327988      1.7      1.5                      if q:
    56    130580      3404829     26.1     15.4                          q.filter(provides=req)
    57                                                               else:
    58     66820        58576      0.9      0.3                          []
    59                                           
    60         1            7      7.0      0.0      t11 = datetime.now()
    61         1            2      2.0      0.0      delta = t11 - t10
    62         1           38     38.0      0.0      print("total: %ss" % delta.total_seconds())
    63                                           
    64         1           20     20.0      0.0      print()
    65         1            9      9.0      0.0      print("-----")
    66         1            7      7.0      0.0      print()
    67                                           
    68         1            8      8.0      0.0      print("QUERIES")
    69                                           
    70         1            4      4.0      0.0      t20 = datetime.now()
    71                                           
    72        11           18      1.6      0.0      for i in range(ITERATIONS):
    73     24820        79194      3.2      0.4          for pkg in d.sack.query():
    74    226700       386157      1.7      1.7              for req in pkg.requires:
    75    201890     13841045     68.6     62.7                  list(d.sack.query().filter(provides=req))
    76                                           
    77         1            7      7.0      0.0      t21 = datetime.now()
    78         1            2      2.0      0.0      delta = t21 - t20
    79         1           48     48.0      0.0      print("total: %ss" % delta.total_seconds())
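
The "DICT CACHE" half of the listing above builds name-keyed dictionaries once and then consults them, instead of asking the sack to search all packages for every requirement. A minimal, dnf-free sketch of that idea follows. The package data and the function names (`providers_by_scan`, `build_index`, `providers_by_index`) are hypothetical and only illustrate the pattern; the sketch matches on capability names alone and does not reproduce the version comparison that the profiled script delegates to `q.filter(provides=req)`.

```python
from collections import defaultdict

# Hypothetical stand-in data: (package name, names of provided capabilities).
PACKAGES = [
    ("bash", ["bash", "/bin/sh"]),
    ("dash", ["dash", "/bin/sh"]),
    ("python3", ["python3", "python(abi)"]),
]


def providers_by_scan(packages, capability):
    """Unindexed lookup: every call walks the whole package list."""
    return {name for name, provides in packages if capability in provides}


def build_index(packages):
    """Walk all packages once and map each capability to its providers."""
    index = defaultdict(set)
    for name, provides in packages:
        for capability in provides:
            index[capability].add(name)
    return index


def providers_by_index(index, capability):
    """Indexed lookup: a single dictionary access per call."""
    # .get() avoids creating empty entries for unknown capabilities.
    return index.get(capability, set())


index = build_index(PACKAGES)
print(sorted(providers_by_scan(PACKAGES, "/bin/sh")))
print(sorted(providers_by_index(index, "/bin/sh")))
```

Both lookups return the same set; the difference is that the scan repeats the full walk on every call, while the index pays for one walk up front. In the profile above, the narrowed lookup (listing line 56) costs 26.1 timer units per hit, against 68.6 per hit for the unindexed query (listing line 75).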
This bug appears to have been reported against 'rawhide' during the Fedora 26 development cycle. Changing version to '26'.

Created attachment 1307485 [details]
new reproducer working with dnf 2.x and f26

Cache performance has degraded significantly (a regression in libdnf/hawkey? unicode literals?), but the overall query performance stays where it was.

Query performance was fixed upstream; the fix is to be released as part of libdnf-0.14. | 
Created attachment 1030579 [details]
reproducer

hawkey (or libsolv?) performs a sequential scan for every single query argument. This makes queries slower than on yum, which probably benefits from using a database (sqlite3) backend with indexed data.

Results from my test, comparing queries where I cached data in memory and narrowed down the package sets for individual queries ("dict cache") against queries without caching ("queries"):

1 iteration:
dict cache: 2.5s <-- cache building overhead
queries: 1.8s

5 iterations:
dict cache: 3.9s
queries: 9.5s

10 iterations:
dict cache: 5.4s
queries: 18.2s

20 iterations:
dict cache: 9.0s
queries: 36.2s

100 iterations:
dict cache: 35.3s
queries: 191.3s
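
To make the scaling in these numbers explicit, the short calculation below estimates the cost per iteration and the fixed overhead for both variants. It is only a rough sketch: it draws a straight line through the smallest and largest run, which assumes the cost grows linearly with the iteration count, and the helper names are made up for this illustration.

```python
# Timings copied from the results above: iterations -> seconds.
DICT_CACHE = {1: 2.5, 5: 3.9, 10: 5.4, 20: 9.0, 100: 35.3}
QUERIES = {1: 1.8, 5: 9.5, 10: 18.2, 20: 36.2, 100: 191.3}


def per_iteration_cost(timings):
    """Slope between the smallest and the largest run, in seconds per iteration."""
    lo, hi = min(timings), max(timings)
    return (timings[hi] - timings[lo]) / (hi - lo)


def fixed_overhead(timings):
    """Cost extrapolated back to zero iterations (for example, cache building)."""
    lo = min(timings)
    return timings[lo] - per_iteration_cost(timings) * lo


cache_slope = per_iteration_cost(DICT_CACHE)
query_slope = per_iteration_cost(QUERIES)

print("dict cache: %.2f s/iteration, %.2f s overhead"
      % (cache_slope, fixed_overhead(DICT_CACHE)))
print("queries:    %.2f s/iteration, %.2f s overhead"
      % (query_slope, fixed_overhead(QUERIES)))
print("uncached queries cost %.1fx more per iteration"
      % (query_slope / cache_slope))
```

Under that linear assumption, the cached variant costs about 0.33 s per iteration on top of roughly 2.2 s of one-time cache building, while the uncached queries cost about 1.9 s per iteration with almost no fixed overhead. The cache therefore pays off from the second iteration onward.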