Description of problem:
A problem with db4 killed my rpm database with F7.

Version-Release number of selected component (if applicable):
Almost nothing (just Firefox AFAIR) updated beyond F7 final.

How reproducible:
Very rare; I hit this sort of problem 2-3 times a year.

Steps to Reproduce:
1. do something with rpm/yum/yumex

Actual results:
Sometimes db4 breaks rpm. Most of the time "rm -f /var/lib/rpm/__db.*; rpm --rebuilddb" fixes it, but not this time. I ended up with 151 or so packages left in the database (every program works).

Expected results:
rpm should be rock solid!

Additional info:
I think it's maybe the old RH8/9 db4 problem with NPTL (CentOS uses a non-NPTL rpm for that reason AFAIK). Here's what I did:

--- cut ---
[root@laptop2 ~]# yum shell
Loading "installonlyn" plugin
Loading "downloadonly" plugin
Loading "fastestmirror" plugin
Loading "skip-broken" plugin
Loading "fedorakmod" plugin
Loading "changelog" plugin
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from dbenv->open: DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db3 - (-30977)
error: cannot open Packages database in /var/lib/rpm
Traceback (most recent call last):
  File "/usr/bin/yum", line 29, in <module>
    yummain.main(sys.argv[1:])
  File "/usr/share/yum-cli/yummain.py", line 82, in main
    base.getOptionsConfig(args)
  File "/usr/share/yum-cli/cli.py", line 146, in getOptionsConfig
    errorlevel=opts.errorlevel)
  File "/usr/lib/python2.5/site-packages/yum/__init__.py", line 153, in _getConfig
    self._conf = config.readMainConfig(startupconf)
  File "/usr/lib/python2.5/site-packages/yum/config.py", line 601, in readMainConfig
    yumvars['releasever'] = _getsysver(startupconf.installroot, startupconf.distroverpkg)
  File "/usr/lib/python2.5/site-packages/yum/config.py", line 664, in _getsysver
    idx = ts.dbMatch('provides', distroverpkg)
--- cut ---

The fix for this used to be "rm -f /var/lib/rpm/__db.*; rpm --rebuilddb", but now:

--- cut ---
[root@laptop2 rpm]# rpm --rebuilddb
rpmdb: page 11409: illegal page type or format
rpmdb: PANIC: Invalid argument
rpmdb: /var/lib/rpm/Packages: pgin failed for page 11409
error: db4 error(-30977) from dbcursor->c_get: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->close: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from dbenv->close: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->close: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->close: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from dbenv->close: DB_RUNRECOVERY: Fatal error, run database recovery
--- cut ---

Funny, the second "rpm --rebuilddb" fixed it... at least I thought so, but:

[root@laptop2 ~]# rpm -qa | wc
    151     151    3436

and... only 6 MB in /var/lib/rpm. I can now try to use /var/log/rpmpkgs and rpm --justdb, but... maybe I'll just reinstall...
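For the record, what I have in mind with /var/log/rpmpkgs and --justdb is roughly the sketch below - untested, and it assumes the matching .rpm files can first be fetched into a local directory (the /tmp/pkgs path is hypothetical):

# Sketch only: set the broken database aside and re-register every package
# named in /var/log/rpmpkgs into a fresh one, without touching installed files.
# /tmp/pkgs is a hypothetical directory holding re-downloaded copies of the
# same .rpm files (e.g. fetched with yumdownloader).
mv /var/lib/rpm /var/lib/rpm.broken
mkdir /var/lib/rpm && rpm --initdb
while read pkg; do
    rpm -Uvh --justdb --nodeps --noscripts "/tmp/pkgs/$pkg"
done < /var/log/rpmpkgs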
You want to be sure to do rm -f /var/lib/rpm/__db* before doing rpm --rebuilddb. BTW, is this an x86_64 box?
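That is, something like this (the final count check is my own addition, to catch the silently truncated database seen above):

rm -f /var/lib/rpm/__db*
rpm --rebuilddb
# Sanity check: the package count should match what you had before.
rpm -qa | wc -l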
No, it's an Intel(R) CPU T1350 @ 1.86GHz. I did 'rm -f /var/lib/rpm/__db*' before 'rpm --rebuilddb'. Could this be related to bug #242368 and bug #242299? Maybe not...
DB_RUNRECOVERY is a hard-core rpmdb (and Berkeley DB) failure, usually due to NPTL locking being fubar somehow. yum behavior, depsolver or otherwise, is unrelated to DB_RUNRECOVERY.
User pnasrat's account has been closed
Reassigning to owner after bugzilla made a mess, sorry about the noise...
Most likely we'll never know what happened... it has worked since then... (Is closing with 'insufficient data' OK?)
I have a machine on which this happens all the time. This started when it was running F7 and persisted through an update to F8. For all practical purposes, after nearly any operation with rpm (well, pretty close) I have to do '\rm /var/lib/rpm/__db*; rpm --rebuilddb;', which takes around 7 minutes, and then things are OK for a short while. After that I am back to square one again. If you have a bigger update with yum you can be pretty sure that you will get kicked out in the middle of a transaction and left with a substantial mess on your hands. Big sigh! This does not happen on other machines running F8. One is only so lucky.

Here is what I managed to catch (this one after 'rpm -Va >& report'):

rpmdb: page 223: illegal page type or format
rpmdb: PANIC: Invalid argument
error: db4 error(-30977) from dbcursor->c_get: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->cursor: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->put: DB_RUNRECOVERY: Fatal error, run database recovery
.....

and off we go in a merry loop. rpm-4.4.2.2-7.fc8 on an i386 installation.
So what's different about that one system compared to the others where this problem doesn't happen? Also, don't count out the possibility of hardware failure of some sort - I'd suggest running memcheck on the failing system and checking whether there's anything suspicious in the logs...
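By "suspicious in the logs" I mean e.g. disk or machine-check errors; a rough way to look (just a sketch, the patterns are my own guesses):

# Scan syslog and the kernel ring buffer for common hardware-error signatures.
grep -iE 'machine check|mce|i/o error|ata[0-9]' /var/log/messages*
dmesg | grep -iE 'error|fail'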
> So what's different about that one system to the others

Different box, different hardware and BIOS, possibly slightly different timings on memory access, ....

> Also don't count out the possibility of hardware failure

That is a possibility, as this is not a new system. Still, take into account comment #3 and the detail that after 'rpm --rebuilddb' everything works fine for some time, and in general the whole system seems to behave. One would think that faulty memory would produce more random failure patterns. Unfortunately this is not my machine and I have only remote, limited access to it, so running memtest may turn out to be difficult. I'll see if I can arrange for that, but it will take a longer while.
It came to my mind that I should explain how the feat of an F7->F8 update was accomplished on that hardware. 'rpm -Uvh fedora-release*' with F8 packages was followed by 'rpm --rebuilddb; yum update "rpm*" "yum*";'. That mostly worked, with rpm falling flat at the end of the "cleanup" phase. It was followed by some manual cleanup, 'rpm --rebuilddb' and 'yum update'. Something on the order of 1200, or a bit more, packages were retrieved. After some ten or twenty packages were installed the database decided to go south again and the whole transaction got aborted. At this stage a loop like this was deployed:

# Force-install each freshly retrieved package one at a time,
# clearing the db4 environment files before every install.
for p in "$@" ; do
    rm -f /var/lib/rpm/__db*
    rpm -Uvh --force --nodeps "$p"
done

with $p ranging through all newly retrieved package files. This worked, although 'rpm --rebuilddb' and a bit of cleanup were required again. The above does not seem to be very consistent with a box having hardware troubles; at least at first look.
> I'd suggest running memcheck on the failing system

A sixteen-hour run of memtest on the system in question came back with zero errors. At the moment the workaround seems to be to run 'rpm --rebuilddb' regularly.
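If it comes to that, the workaround could be automated with a cron entry along these lines (my own sketch; the file name, the 4 a.m. schedule and the pgrep guard are arbitrary choices, nothing rpm ships):

# Hypothetical /etc/cron.d/rpmdb-rebuild: rebuild the rpm database nightly,
# skipping the run if rpm appears to be mid-transaction.
0 4 * * * root pgrep -x rpm >/dev/null || { rm -f /var/lib/rpm/__db.*; /usr/bin/rpm --rebuilddb; }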
There are some variations. Now I got:

.....
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
.....

in an apparent infinite loop. Sigh! "Run recovery" helps for a short while.
Created attachment 280571 [details]
gdb trace from crashing 'rpm -vv --rebuilddb'

It looks like the misbehaving system is ultimately screwed up and the only option remaining will be to rebuild it from scratch. Whatever is in the backups does not work either, and it was bad for a while anyway. rpm operations invariably end in segfaults. At least now 'rpm -vv --rebuilddb' consistently segfaults in one place. The last lines printed that way on multiple tries look like this:

D: adding 115 entries to Filemd5s index.
error: rpmdbNextIterator: skipping h# 380 Header V3 DSA signature: BAD, key ID 4f2a6fd2
D: read h# 364 Header V3 DSA signature: NOKEY, key ID 4f2a6fd2
error: rpmdbNextIterator: skipping h# 380 Header V3 DSA signature: BAD, key ID 4f2a6fd2
D: +++ h# 717 Header V3 DSA signature: NOKEY, key ID 4f2a6fd2
D: adding "xorg-x11-xkb-utils" to Name index.
D: adding 18 entries to Basenames index.
D: adding "User Interface/X" to Group index.
D: adding 14 entries to Requirename index.
D: adding 6 entries to Providename index.
D: adding 4 entries to Dirnames index.
D: adding 14 entries to Requireversion index.
D: adding 6 entries to Provideversion index.
D: adding 1 entries to Installtid index.
D: adding 1 entries to Sigmd5 index.
D: adding "7ff1a76054eef935f2a226fd0b2efedc67669232" to Sha1header index.
D: adding 18 entries to Filemd5s index.

The attached gdb trace was made after rpm-debuginfo-4.4.2.2-7.fc8 was added with the help of 'rpm2cpio' and 'cpio'. Just to make sure that things are not corrupted, I made fresh copies of the rpm files by the same method from newly retrieved rpm packages. The trace is combined from two runs; catching the output of 'where' on a terminal was a bit too much. :-) All executables are from rpm-4.4.2.2-7.fc8 on i386.
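For completeness, the extraction and the trace went roughly like this (a sketch from memory; the package path is a placeholder):

# Unpack the debuginfo package directly, bypassing the broken rpmdb.
cd / && rpm2cpio /tmp/rpm-debuginfo-4.4.2.2-7.fc8.i386.rpm | cpio -idmv
# Run the rebuild under gdb and capture a backtrace at the segfault.
gdb --args rpm -vv --rebuilddb
(gdb) run
(gdb) where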
In comment #8 Panu Matilainen suggested: "Also don't count out the possibility of hardware failure of some sort ...". On 2007/12/14 the machine in question was reinstalled from scratch, without any hardware changes, and its configuration and home directories were restored, as the situation was getting out of hand. From that time on it has run and updated without any incidents - until the next time the rpm database decides to pack up. This machine provides various services, so it is on all the time. If somebody is interested, I have a copy of the old contents of /var/lib/rpm stashed away. This is some 53 MB of data.
See also http://lkml.org/lkml/2008/10/14/429, with "Possible ext3 corruption with 1K block size" as its subject. Looks strangely familiar.
Indeed. See bug 181363; there are people reporting the problem got cured by moving the rpmdb to a fs with a 4K block size.
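A quick way to check whether an affected system is in that situation (a sketch only; the df-based device lookup is approximate and won't see through e.g. LVM):

# Find the device backing /var/lib/rpm and report the filesystem block size.
dev=$(df -P /var/lib/rpm | awk 'NR==2 {print $1}')
tune2fs -l "$dev" | grep -i 'block size'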
It is possible; I could have formatted my root FS with a 1K block size (if '/usr', '/var/log', ... are separate filesystems, then '/' mostly contains small files, and at one time all the '/dev' nodes). If this is true then one should be able to reproduce this by just installing Fedora on an empty ext3 FS with a 1K block size, without letting anaconda reformat it... it would be great to be able to reproduce it somehow.
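Something along these lines might do for a reproducer (a sketch; /dev/sdb1 is just a placeholder):

# Force a 1K block size - mke2fs defaults to 1K only on very small devices.
mkfs.ext3 -b 1024 /dev/sdb1
# Confirm it took.
tune2fs -l /dev/sdb1 | grep 'Block size'
# Then install onto it without letting anaconda reformat, or simply move
# /var/lib/rpm onto it and hammer the database with rebuilds and updates.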
This message is a reminder that Fedora 8 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 8. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 8 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
The bug is most likely still there (modulo the suggestions from comment #15), just not that easy to reproduce. In any case, the system with the "hardware failures" suggested in comment #8 has now, after a reinstallation, worked for about a year without any troubles. It is quite likely that the reinstallation in question changed the block size on /var, as the system had quite a long history before the corruption struck, but I cannot be sure. At that time I had no idea that any attention should be paid to block sizes.
There's another problem with similar failure symptoms that correlates with the ext4 patches: https://bugzilla.redhat.com/show_bug.cgi?id=468437

Berkeley DB uses mmap(2). If mmap(2) (or the underlying file system store) returns inconsistent or incorrect results, then you will see rpmdb failures. An rpmdb, with locks and data sanity checks, is just the canary in the mine shaft.
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.