501103 – enhancement request : improve 'rpm -q' performance

Bug 501103 - enhancement request : improve 'rpm -q' performance

Summary: enhancement request : improve 'rpm -q' performance

Keywords:
Status:	CLOSED DUPLICATE of bug 438625
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	rpm
Sub Component:
Version:	rawhide
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Panu Matilainen
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2009-05-16 08:18 UTC by Jason Vas Dias
Modified:	2009-06-03 14:05 UTC (History)
CC List:	4 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-06-03 14:05:46 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Jason Vas Dias 2009-05-16 08:18:34 UTC

Description of problem:

This is an enhancement request, not a bug - where do enhancement requests 
go in bugzilla ? I'm sure there used to be a "enhancement" severity ? ...

sqlite + berkeley DB are great tools, but the way you've deployed them
for RPM they are VERY slow .

If I want to find all packages with "office" in the name, for example :

$ time rpm -qa '*office*'
openoffice.org-writer-core-3.0.1-15.3.fc10.x86_64
...
openoffice.org-draw-core-3.0.1-15.3.fc10.x86_64

real    0m3.49s
user    0m1.78s
sys     0m0.19s

It takes around 3.5 seconds on a 2.2GHZx2 x86_64 with 2GB RAM under FC-10 
(userspace CPU frequency governor set to max cpufreq (2.2Ghz)) .  

Using a database system that constructs good indexes, such as AT&T Daytona
( http://www.research.att.com/~daytona/ ) :

$ time ./rpmByName office     
openoffice.org-writer-core3.0.115.3.fc10x86_6411236669384sApplications/Productivityhttp://www.openoffice.org/Word Processor libraries for openoffice.org
...
openoffice.org-draw-core3.0.115.3.fc10x86_6411236669384sApplications/Productivityhttp://www.openoffice.org/Drawing libraries for openoffice.org

real    0m0.09s
user    0m0.00s
sys     0m0.02s

That's 3.5/0.09 = 38.8 times faster !

On FC10, /var/lib/rpm is @ 83MB in size :

$ du -ks /var/lib/rpm
83776   /var/lib/rpm

The Daytona database ASCII text file is @ 250KB :
$ ls -l RPM.db
-rw-rw-r-- 1 root daytona 248930 2009-05-12 01:25 RPM.db

which contains all data necessary to identify an RPM
 (name,version,release,epoch,arch,buildtime,source,spec,url,group,summary)
plus an integer identifier that can be used as a "KEY" by which a separate
eg. "RPM_File.db" can reference an RPM , for all RPMs on the system.

But on my machine, one could copy all the data under /var/lib/rpm into memory
24 times over, and still have 6GB of swap left over ; as crude a method 
as mapping all the files under /var/lib/rpm into memory and then doing a 
"grep" of the process memory image could be done on my machine in less than 2
seconds. 

Surely there must be a better way improve RPM's lookup performance ?

Even compiling a query from scratch with Daytona is quicker than running
'rpm -q' :
 
$  time ksh -c "echo 'select ^Name^, ^Version^, ^Release^ from ^RPM^ where ( ^Name^ Matches \"rpm\" )' | DS QQ -"

%msg01)delim)|
%msg02)
%msg03)Query File:  /tmp/DS.QQ.16826
%msg04)
%msg05)
%msg06)recls)DS_QQ_16826
%msg07)flds)Name|Version|Release
%msg08)types)STR(*)|STR(*)|STR(*)
%msg09)
rpm-python|4.6.0|2.fc10
redhat-rpm-config|9.0.3|3.fc10
rpm|4.6.0|2.fc10
rpm-devel|4.6.0|2.fc10
rpm-libs|4.6.0|2.fc10
rpm-build|4.6.0|2.fc10

real    0m1.51s
user    0m0.81s
sys     0m0.45s

Of which @ 99% is compilation time :

$ echo 'select ^Name^, ^Version^, ^Release^ from ^RPM^ where ( ^Name^ Matches "rpm" )' > rpm.rpms.S
$ DS_Mk rpm.rpms.S rpm.rpms
...
$ time ./rpm.rpms
%msg01)delim)|
%msg02)
%msg03)Query File:  /home/root/rpms.S
%msg04)
%msg05)
%msg06)recls)RPMS
%msg07)flds)Name|Version|Release
%msg08)types)STR(*)|STR(*)|STR(*)
%msg09)
rpm-python|4.6.0|2.fc10
redhat-rpm-config|9.0.3|3.fc10
rpm|4.6.0|2.fc10
rpm-devel|4.6.0|2.fc10
rpm-libs|4.6.0|2.fc10
rpm-build|4.6.0|2.fc10

real    0m0.08s
user    0m0.00s
sys     0m0.02s

The "rpmByName" query demonstrates Daytona's Cymbal, into which SQL can be 
compiled :
 $ cat rpmByName.cy
local:
 STR .str
  RE .re;

set [ .str ] = read( from _cmd_line_ but_if_absent [ ".*" ] ) ;
set .re = (RE).str;

for_each_time there_is_a RPM where
( Name Matches .re )
{ where ( this_is_a RPM )  with_format _data_ do Describe; }

Now, I'm not suggesting using a proprietary database such as AT&T's Daytona for
RPM - but Daytona's RPM.db is only an ASCII text file, which is indexed using
algorithms in the public domain - indeed, which have been there for over twenty
years ! 

The "Name" index file is used for the above query automatically by Daytona - and Index files could easily be constructed for the sqlite or Berkeley DB files used
by RPM - I could make some suggestions as to how this might be done if you like.
The query time of @ 0.08 seconds above includes the time necessary to verify
that the index files are in-sync with the data files . 

Consider how much CPU time globally is spent by scripts doing an 'rpm -q' !
 
Version-Release number of selected component (if applicable):
all

How reproducible:
100%

Steps to Reproduce:
see above
  
Actual results:
takes @ 3.5 seconds on a 2.2ghz x 2 CPU 

Expected results:
could take < 0.1 seconds

Additional info:
I suggest constructing indexes for "RPM By Name", "RPM By Name-Version-Release",
"RPM By Name-Version-Release-Arch", "RPM By Arch", "RPM By Release", ... etc.

Comment 1 Jeff Johnson 2009-05-16 12:03:13 UTC

Daytona & Cymbal is hardly a fair comparison. Go ahead,
install software using Daytona if you wish.

The usual performance complaint (your RFE comes along regular
as clockwork for almost a decade), is that signatures/digests
are verified when headers are read. Adding --nosignature
and --nodigest will likely speed your rpm -qa by an order of magnitude.
Disabling the signature check can be configured per-user/per-system/per-vendor
persistently as well.

You are correct that generating additional tables would improve performance,
and one could generate all the variants of the tuple {N,V,R,E,A,O} N==name etc
if there were need. The ability to generate additional indices is also persistently
configurable (although there are limitations on the tag data boundaries).

But there just aren't that many "rpm -q" scripts around (as you claim), nor
does the data in an rpmdb change so rapidly that one _MUST_ query an
rpmdb (with the performance penalities of signature/digest checking and
lack of optimized table index optimizations). One can easily query once
and do whatever optimizations one wishes, with Daytona/Cymbal or perl/awk or ...
on the data retrieved.

Comment 2 Panu Matilainen 2009-06-03 14:05:46 UTC


*** This bug has been marked as a duplicate of bug 438625 ***

Note You need to log in before you can comment on or make changes to this bug.