Bug 752897 - rpm database extremely slow after preupgrade to f16
rpm database extremely slow after preupgrade to f16
Product: Fedora
Classification: Fedora
Component: rpm (Show other bugs)
i686 Linux
unspecified Severity medium
: ---
: ---
Assigned To: Panu Matilainen
Fedora Extras Quality Assurance
: 754761 811544 (view as bug list)
Depends On:
  Show dependency treegraph
Reported: 2011-11-10 13:33 EST by Rob Spanton
Modified: 2012-04-11 11:53 EDT (History)
23 users (show)

See Also:
Fixed In Version: rpm-
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Last Closed: 2012-02-11 16:55:28 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:

Attachments (Terms of Use)
xz'ed Packages file (46 bytes, text/plain)
2011-11-10 13:35 EST, Rob Spanton
no flags Details
strace on rpm --initdb --dbpath /run/slowpoke (1.86 MB, application/x-xz)
2011-11-12 16:10 EST, Yaroslav Fedevych
no flags Details

  None (edit)
Description Rob Spanton 2011-11-10 13:33:06 EST
Description of problem:

Performing normal operations in yum took a very long time after upgrading my desktop to f16.  After some prodding directed by the fine fellows in #yum, it was concluded that this was a problem with my rpm database.  It seems that interacting with the Packages database takes a lot of time.

First thing that I tried was to rebuild the rpm database (following the "oh, something's wrong with RPM, better rebuild its database" strategy), after deleting the __db* files.  Rebuilding the database takes a large amount of time:

[rob@zarniwoop ~(master)]$ time sudo rpm --rebuilddb

real	5m0.445s
user	0m11.372s
sys	0m3.372s


[root@zarniwoop rpm]$ time python -c 'import yum; yum.YumBase().rpmdb.simplePkgList()'
Loaded plugins: auto-update-debuginfo, fastestmirror, presto, refresh-packagekit
real    3m22.227s
user    0m1.779s
sys     0m1.097s

My desktop has 3237 packages installed.

One suspicion was the disks in my desktop: However, they appear to be completely ship-shape.  There's no funk in dmesg, I can write to a file in /tmp (same FS as /var/lib/rpm) at 80MB/sec using dd, and parted reports my partitions are aligned correctly.  filefrag reports that the Packages file is in 55 chunks, which doesn't seem crazy.

skvidal suggested that I tried copying my desktop's rpm database over to my laptop and tried rebuilding it on there to see if it was a hardware problem.  My laptop also runs F16 and does not exhibit this slowness with its own rpm database (it takes 20s to rebuild and has 1907 packages installed).  My laptop also displays slowness when rebuilding the rpm database from my desktop:

[rob@megadodo rpm]$ time rpm --root=/tmp/desktop-chroot/ --rebuilddb

real	2m2.058s
user	0m12.884s
sys	0m3.166s

For comparison, my laptop rebuilding its _own_ rpm database:

[rob@megadodo rpm]$ time sudo rpm --rebuilddb

real	0m19.878s
user	0m6.495s
sys	0m1.132s

So it appears that the problem is with the database itself, rather than the machine it is running on.

Other things of potential interest:
 * `package-cleanup --dupes` reports no duplicates after churning with RPM for a long time...
 * Upon observing that I had a lot of gpg-pubkey's installed (http://pastebin.com/YXtTvNMv) on my desktop, skvidal suggested I removed them and tried rebuilding.  This didn't help.  The Packages file below is from before I removed all of them.

You should find a copy of my Packages file here: https://xgoat.com/share/rpm-slowdb/Packages.xz

Version numbers:
 rpm: rpm-
 uname: Linux zarniwoop.blob 3.1.0-7.fc16.i686.PAE #1 SMP Tue Nov 1 20:53:45 UTC 2011 i686 i686 i386 GNU/Linux

How reproducible: Very!
Comment 1 Rob Spanton 2011-11-10 13:35:16 EST
Created attachment 532898 [details]
xz'ed Packages file

I discovered I can attach from an URL...
Comment 2 Kevin Kofler 2011-11-10 18:24:55 EST
Is this the same as https://bugzilla.redhat.com/show_bug.cgi?id=748790 (well, not the "every 5-10 minutes" part, that's the same as https://bugzilla.redhat.com/show_bug.cgi?id=752564 , but the "100% CPU" part)?
Comment 3 Rob Spanton 2011-11-10 19:45:55 EST
Hi Kevin,

I don't think it is the same.  This situation appears to be entirely IO bound.  Very little CPU time is used whilst it rebuilds the db.

You can observe this out for yourself by downloading the Packages file I uploaded and rebuilding it on a F16 machine.  (You probably want to rebuild twice...)


Comment 4 Panu Matilainen 2011-11-11 03:24:29 EST
Okay the insane slowness is reproducable for me too (for once), excellent. Will
have a look.

Just to get this straight: on F15 things were business as usual, and it only
started acting up after the upgrade to F16, right? Its pretty peculiar as the
rpm version is exactly the same on both.

Are others seeing this as well, or does this just happen to be an otherwise popular/interesting bug (judging by the number of CC's we have already)?
Comment 5 Rob Spanton 2011-11-11 07:49:09 EST
Well I only noticed the slowdown after the F16 upgrade.  I believe it was not slow before that, so I guess whatever operations happened on the database during the upgrade lead to the database becoming slow.

I still have another desktop that is running F15.  I rebuilt this database on there to find that it's also slow on there.  That machine is running rpm-

I think it might just be an interesting bug...
Comment 6 Panu Matilainen 2011-11-12 07:49:38 EST
FWIW the "insane slowness" I referred to occurs when you do 'rpm --rebuilddb' from just the Packages file, triggering index autocreation which in this situation ends up turning fsync on for the new database being built when it shouldn't. The end result is that 'rpmdb --rebuilddb' takes over 20 *minutes* on my systems, when in normal circumstances it completes in a few seconds on the same systems. That's just a dumb (but harmless) bug, will fix.

The 2-5min times for --rebuilddb are quite likely to be something else - when the index/fsync bug is triggered you'll see "warning: Generating <N> missing indexes" message during --rebuilddb. It's a kinda known issue that rpmdb performance starts deteriorating over time (or rather, huge number of upgrades etc), rebuilding the db brings it back to normal and what you're seeing might be just that: the f15 -> f16 upgrade just happened to push it past whatever the limit where performance goes down the drain.

And that apparently different "a few minutes" time range case I'm not able to reproduce. Index generation and db rebuild when done separately run in the expected "a few seconds" timeframe:

[pmatilai@turre ~]$ mkdir /tmp/slowdb
[pmatilai@turre ~]$ xz -cd ~/Download/Packages.xz > /tmp/slowdb/Packages
[pmatilai@turre ~]$ time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real	0m5.577s
user	0m3.888s
sys	0m0.453s
[pmatilai@turre ~]$ time rpmdb --rebuilddb --dbpath /tmp/slowdb/

real	0m7.630s
user	0m4.907s
sys	0m0.698s
Comment 7 Panu Matilainen 2011-11-12 08:19:46 EST
Oh, one more thing, so obvious it's easy to forget: was the system actually otherwise idle when you noticed the slowness? After a distro install/upgrade, there can be any number of other io-intensive things like prelink, mlocate and man-db cronjobs running which will severely slow down other io-intensive operations such as --rebuilddb.

Robert, if you still have the full original "slow" db (all of it, not just the Packages file) backup somewhere, can you try reproducing it with that, after ensuring the system is really idle? Also the timings for separate --initdb and --rebuilddb as shown in comment #6 might be interesting.
Comment 8 Yaroslav Fedevych 2011-11-12 16:05:15 EST
(In reply to comment #6)
My system has this bug.

It's a rawhide system, rpm version, yum version 3.4.3.

First I thought I'll just blame yum, which indeed deserves some blame of its own, but after profiling I've found that slowness is somewhere in the RPM itself.

Judging from strace, it seems that the Package database is being read over and over and over again, via pread(), in chunks of 4 KiB. Given that the Packages is 95 MiB in size, that is long time.

Increasing mp_mmapsize to 256 megabytes from 128, and the cache to 128 from 64, hasn't helped it a bit.

strace tells me the files are in fact mmap()'ed, but my puny C knowledge says you usually just use the file as an array of bytes in memory. However, lots of pread() calls issued after this, and lots of pwrite() calls issued when indexes are being written, somehow don't fit my picture of how mmap()ed files should be written or read. Correct me if I'm wrong.

As for timings, they are horrible. 30 minutes for --initdb step.

[root@think slowpoke]# time rpm --initdb --dbpath /root/slowpoke
warning: Generating 12 missing index(es), please wait...

real	30m17.679s
user	0m31.035s
sys	0m14.904s

I'm afraid of doing --rebuilddb; it really seems that either nofsync flag is ignored, or that mmap() is ignored.
Comment 9 Yaroslav Fedevych 2011-11-12 16:10:24 EST
Created attachment 533295 [details]
strace on rpm --initdb --dbpath /run/slowpoke

Notice how many 4K-pread()'s and pwrite()'s are there.
Comment 10 Rob Spanton 2011-11-12 19:22:56 EST
Hi Panu,

Yes, rebuilding the database from the Packages file takes considerably longer.  On two machines that I have it took approximately 50 minutes.  It's the second rebuilding that takes several minutes.

The numbers I quoted in the initial post for rebuilding and simplePkgList() are on a database that has already been rebuilt from from just the Packages file.

The system I was using was idle.  When the rpm stuff was going, there was pretty much nothing else accessing the disk (was watching it in iotop and gkrellm).  There was pretty much nothing else using the cpu either (was watching in top).

I'll generate some --initdb stats if you like, but since I've replicated the same behaviour on different machines with quite different hardware, I am somewhat sceptical that this'll be useful.


Comment 11 Shawn Starr 2011-11-12 20:04:19 EST
In my rawhide box:

[root@segfault ~]# time rpm --rebuilddb -v
real    0m48.277s

On a Fedora 15 -> 16 preupgrade:

[root@coredump ~]# time rpm --rebuilddb -v
real    2m35.502s
Comment 12 Panu Matilainen 2011-11-14 09:02:04 EST
How much memory do the systems exhibiting this have, and what fs is /var on?

Like said on my own systems I can't reproduce anything but the index-creation/fsync issue, but on a virtual machine with less than 2GB memory the performance suddenly goes down the drain real hard, into the "several minutes" range. What seems to cure it is *decreasing* cache size:

In /usr/lib/rpm/macros change "mp_size=64Mb" to "mp_size=1Mb" (which incidentally is the value it had for years and years), remove the db environment: 'rm -f /var/lib/rpm/__db*', and performance should be back to normal - at least that's what I see on the virtual machines.

Now to figure out what the heck is actually going on in there...
Comment 13 Rob Spanton 2011-11-14 13:22:32 EST
Hi Panu,

My rootfs is ext3, which is what /var is on.

Changing mp_size to 1Mb does indeed bring performance back.  (And changing it back to 64Mb takes it away again.)


Comment 14 Panu Matilainen 2011-11-15 02:55:07 EST
So it is the same thing that I'm seeing then and no extra factors like completely different filesystem (ext3 vs ext4 is clone enough), good.

The (dramatic) slowdowndown is a somewhat peculiar combination of rpmdb size, system RAM and cache size: with "average" rpmdb sizes things are ok with the current 64Mb cache (this value was the same throughout F15) even with lower RAM, so for example your laptop's own db didn't exhibit the slowness but the desktop systems larger db crawled on the laptop.

Rebuilding the db is a special case though and doesn't really benefit from caching, whereas larger (than 1Mb) cache IS beneficial for other functionality. So its a question of finding a sweet spot that doesn't kill lower end systems while having a reasonable cache on "normal" operations...
How much memory do your systems have, circa 1GB or so?
Comment 15 Rob Spanton 2011-11-15 06:35:50 EST
Ah, apologies.  Forgot to write how much RAM these machine have:

 * desktop: 6GB (32-bit install)
 * laptop: 2GB

Comment 16 Samuel Sieb 2011-11-15 13:23:02 EST
I'm seeing this as well.  I was wondering why yum was so slow, but then I figured out it was actually the rpm Packages database.  It only started after upgrading to F16.  The strange thing I noticed is that a warm start doesn't help at all as if there is no caching being done.  Any rpm operation is slow, even if repeated immediately after.  My laptop has 5GB of RAM.

That workaround definitely fixes it for me, it's actually speedy now.
Comment 17 Panu Matilainen 2011-11-16 01:32:22 EST
So its not limited to systems with "low" memory. I'm curious as to why I'm not seeing this (except with lower available memory), when upping the cache size I did much of the testing on a laptop with 2Gb memory and was seeing definite improvements with the larger cache.

At this point better its quite obviously best to just back the cache size down to the previous value that is working so much better for others, at least until the issue is better understood.
Comment 18 Samuel Sieb 2011-11-16 01:53:44 EST
My laptop does have a lot of things loaded most of the time, but still a 64MB cache seems pretty minimal.  I also have two desktops with 2GB RAM that I upgraded and have this same problem.  Did you upgrade yours or was it a clean install?  I will see if I can do some testing with that.  Although the issue may be related to the number of packages installed as well as all the computers involved have a lot of packages.  The desktops have around 2300 and the laptop nearly 2800.

In the process of getting that information remotely while no one was logged in, I didn't see the problem.  The first run of "rpm -qa" took quite a while with heavy disk access, but the second run was fast, basically entirely cached.  I wonder if there is something about the disk reads with the larger cache setting that trigger something in the kernel to not cache when there's significant memory load that doesn't happen with smaller cache setting.  I will see if I can find the time to do some more testing.
Comment 19 Hin-Tak Leung 2011-11-20 13:58:23 EST
*** Bug 754761 has been marked as a duplicate of this bug. ***
Comment 20 Hin-Tak Leung 2011-11-20 17:12:34 EST
As in 754761, I started noticing rpm being very slow (it used to be just yum in f15, and I assume it was the python overhead) in f16, whenever I have a lot of idle browser windows open - i.e. the system is mostly idle but with a lot of idle processes which can be swap out. I am just expecting an interactive process, especially a root-owned one like rpm/yum, to push more of those idle processes into swap to proceed.

FWIW, my system is 2GB RAM + 2GB swap. During most of F15 I had probably 3000+ packages installed but I desperately needed texlive 201x so I got it on from one of the fedora packages-being-reviewed places a week before I upgraded to F16. That is packaged into thousands of packages (from 10's in f15/f16's texlive 2007) so my package list is just under 7000 at the moment, I think. Since the switch from texlive 2007 to texlive 2011 happens just a week before f16, that probably made it worse.

If this bug is affected by large package list, putting on texlive 2011 from http://jnovy.fedorapeople.org/texlive/

(In reply to comment #12)
> In /usr/lib/rpm/macros change "mp_size=64Mb" to "mp_size=1Mb" (which
> incidentally is the value it had for years and years)...

Just for interest, when did it got changed?
Comment 21 Panu Matilainen 2012-02-02 01:39:37 EST
Seeing this: http://lwn.net/Articles/478440/ (original report at http://article.gmane.org/gmane.linux.kernel/1244680 for those who can't yet access the LWN article) makes me wonder if that's what is hitting rpm.

Just a theory for now, but at any rate that's the kind of thing that would explain why the very same DB configuration worked fine in (early) F15 but things got dirt slow in F16: F15 started with kernel 2.6.38 (but later updates brought in 3.0.x masquerading as 2.6.40 iirc) whereas F16 had 3.x from day one. If somebody wants to try and recompile their kernel to see if it makes any difference, this is what's supposed to fix it: http://article.gmane.org/gmane.linux.kernel.mm/73293
Comment 22 Stephen Tweedie 2012-02-07 06:11:31 EST
I don't think it's a disk performance issue.  It's the sheer volume of disk activity that occurs.

Without the mp_size change, I'm seeing vast amounts of disk IO from rpm even from simple queries like "yum list available ...".  It sustains writes (yes, writes!) of between 3MB and 15MB per second for 90 seconds... that's an insane amount of write IO for a readonly query of the bdb, and I think it's a red herring to be looking for kernel reasons why that amount of IO is slow.  rpm/bdb really shouldn't be generating that much write traffic in the first place.

btw, "iostat" is a good way to watch the traffic resulting here.  I usually use "iostat 2" to get it updated every couple of seconds.

With mp_size=1MB as in comment #12, the problem disappears completely.  With the default /usr/lib/macros, rpm is almost unusable, on every system I've got which has been updated from older versions of Fedora.
Comment 23 Stephen Tweedie 2012-02-07 06:12:44 EST
btw, is there any reason not to revert mp_size while we work out what the underlying problem is?  This is a pretty major issue for those of us it affects.
Comment 24 Stephen Tweedie 2012-02-07 06:36:17 EST
One more data point: in slow mode, the write IO all seems to be to the bdb internal journal files.  (This done while executing a simple yum query, "yum list available foobar".)  The most recently modified files in /var/lib/rpm look like

[rpm]# ls -lath
total 280M
-rw-r--r--.  1 root root  24K Feb  7 11:20 __db.001
-rw-r--r--.  1 root root 3.7M Feb  7 11:20 __db.002
-rw-r--r--.  1 root root  81M Feb  7 11:20 __db.003
-rw-r--r--.  1 root root 536K Feb  7 11:20 __db.004
-rw-r--r--.  1 rpm  rpm  6.5M Feb  7 10:47 Dirnames

Reverting mp_size to 1MB by itself doesn't fix this, the results look the same.  But clearing the __db* files and rerunning, I still get the write traffic, but much less; and the db files look like

[rpm]# ls -lath
total 198M
-rw-r--r--.  1 root root  24K Feb  7 11:27 __db.001
-rw-r--r--.  1 root root 240K Feb  7 11:27 __db.002
-rw-r--r--.  1 root root 1.3M Feb  7 11:27 __db.003
-rw-r--r--.  1 root root 536K Feb  7 11:27 __db.004
-rw-r--r--.  1 rpm  rpm  6.5M Feb  7 10:47 Dirnames

Curiously, the __db.003 file (which I think is the bdb mmap cache file) has shrunk by almost exactly the same factor of 64 by which we shrank mp_size.  But the actual amount of disk write according to iostat has shrunk to just a few kB.

So the real question is why we're generating so much traffic to the mmap file; and if we can't avoid that, then we should probably just minimise the issue by reverting the size of that file back to its older value.
Comment 25 Met Merilius 2012-02-07 08:07:47 EST
I can confirm this bug on both of my F16 installations. One were upgraded with preupgrade, the other with yum --releasever=16 upgrade.

In both cases I get huge amounts of IO, it takes yum up to 10min to list packages to be upgraded. And then the next 10min to finish yum after `cleaning' the last package. On one system updated with yum (not preupgrade) the trick with mp_size=1MB did the job (on the other I didn't try yet)

**I think it is a high-priority bug**
Comment 26 Panu Matilainen 2012-02-08 03:20:06 EST
The "mp" (for "memory pool") thing is BDB jargon for its internal (shared) cache, through which all db IO goes. That's why you get a lot of write-IO on read-only queries too, especially after reboot which erases the whole BDB environment including the cache.

On my four (F16) systems, mp_size=64Mb vs 1Mb makes no difference whatsoever to the --initdb times (a dumb benchmark for caching related issues as caching is of no benefit for this operation, but serves as a simple reproducer for this issue). However on a VM with just 512MB this makes a huge differece, quite obviously something horribly stupid happens with the cache when memory is on the low side.

And it also does have to do with the kernel. With mp_size=1Mb there's no measurable difference, but with 64Mb the difference between original F15 kernel and current F16 one is quite drastic. Using the packages from comment #1 as a test-case, here are the timing on the virtual machine with 512M memory, the magnitude of difference is easily reproducable here:

[root@localhost slowdb]# uname -r
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real    2m39.459s
user    0m6.173s
sys     0m2.478s

[root@localhost slowdb]# uname -r
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real    10m34.010s
user    0m5.728s
sys     0m2.117s

FWIW, mp_size=1Mb does the deed in roughly 8 seconds on the same VM on both kernels. This is the kind of time I get on my laptop running this VM, regardless of mp_size setting.

So, obviously the larger cache is severely hurting rpm performance in some situations, much more than its winning in the best scenarios. I'll switch it back to the former default for now, finding an optimum is going to take some added testing and tuning on different types of systems. But there's quite a regression in kernel performance as well, whether it's the plugging issue I referred to in comment #21 or something else I've no clue.
Comment 27 Panu Matilainen 2012-02-08 03:24:43 EST
Oh and the timings were produced on the same F16 installation, the only difference is booting with F15 vs F16 kernel.
Comment 28 Panu Matilainen 2012-02-08 05:54:03 EST
Dunno how much the Fedora kernel version can be trusted these days, but testing with a sampling of kernels grabbed from koji, this doesn't seem to start at 2.6.39 but 2.6.40 and gradually getting worse. This is the same test-case as with the above numbers, 3.2.3 retested for curiosity and producing a little less horrible figure this time, but still by far slowest of them all:

[root@localhost slowdb]# uname -r
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real    2m39.459s
user    0m6.173s
sys     0m2.478s

[root@localhost slowdb]# uname -r
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real    2m24.814s
user    0m5.939s
sys     0m2.245s

[root@localhost slowdb]# uname -r
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real    6m26.406s
user    0m5.772s
sys     0m2.152s

[root@localhost slowdb]# uname -r
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real    5m42.944s
user    0m5.796s
sys     0m2.164s

[root@localhost slowdb]# uname -r
[root@localhost slowdb]# time rpmdb --initdb --dbpath /tmp/slowdb/
warning: Generating 12 missing index(es), please wait...

real    7m53.031s
user    0m5.750s
sys     0m2.217s
Comment 29 Fedora Update System 2012-02-09 03:54:44 EST
rpm- has been submitted as an update for Fedora 16.
Comment 30 Fedora Update System 2012-02-09 03:57:31 EST
rpm- has been submitted as an update for Fedora 15.
Comment 31 Fedora Update System 2012-02-09 19:56:21 EST
Package rpm-
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing rpm-'
as soon as you are able to.
Please go to the following url:
then log in and leave karma (feedback).
Comment 32 Christoph Wickert 2012-02-10 12:55:23 EST
I have been suffering from this for ages and tried a lot, but this update finally fixes my problems. Transactions are now going fast again. Thanks!
Comment 33 Fedora Update System 2012-02-11 16:55:28 EST
rpm- has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 34 Fedora Update System 2012-02-11 17:04:10 EST
rpm- has been pushed to the Fedora 15 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 35 Mike Ratcliffe 2012-04-11 11:53:11 EDT
*** Bug 811544 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.