Bug 242092 - old db4 problem kills rpm database
Status: CLOSED WONTFIX
Product: Fedora | Component: rpm | Version: 8 | Hardware: All | OS: Linux
Priority: low | Severity: medium
Assigned to: Panu Matilainen

Reported: 2007-06-01 16:47 UTC by Doncho Gunchev
Last modified: 2009-01-09 07:06 UTC
CC: 2 users (michal, n3npq)

Doncho Gunchev 2007-06-01 16:47:20 UTC

Description of problem:
A problem with db4 killed my rpm database on F7.

Version-Release number of selected component (if applicable):
Almost nothing (just Firefox, AFAIR) updated beyond F7 final.

How reproducible:
Very rare; I hit this sort of problem 2-3 times a year.

Steps to Reproduce:
1. Do something with rpm/yum/yumex.

Actual results:
Sometimes db4 breaks rpm. Most of the time "rm -f /var/lib/rpm/__db.*; rpm --rebuilddb" fixes it, but not this time. I ended up with 151 or so packages left in the database (every program still works).

Expected results:
rpm should be rock solid!

Additional info:
I think it may be the old RH8/9 db4 problem with NPTL (CentOS uses a non-NPTL rpm for that reason, AFAIK). Here's what I did:

--- cut ---
[root@laptop2 ~]# yum shell
Loading "installonlyn" plugin
Loading "downloadonly" plugin
Loading "fastestmirror" plugin
Loading "skip-broken" plugin
Loading "fedorakmod" plugin
Loading "changelog" plugin
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from dbenv->open: DB_RUNRECOVERY: Fatal error, run database recovery
error: cannot open Packages index using db3 - (-30977)
error: cannot open Packages database in /var/lib/rpm
Traceback (most recent call last):
  File "/usr/bin/yum", line 29, in <module>
    yummain.main(sys.argv[1:])
  File "/usr/share/yum-cli/yummain.py", line 82, in main
    base.getOptionsConfig(args)
  File "/usr/share/yum-cli/cli.py", line 146, in getOptionsConfig
    errorlevel=opts.errorlevel)
  File "/usr/lib/python2.5/site-packages/yum/__init__.py", line 153, in _getConfig
    self._conf = config.readMainConfig(startupconf)
  File "/usr/lib/python2.5/site-packages/yum/config.py", line 601, in readMainConfig
    yumvars['releasever'] = _getsysver(startupconf.installroot, startupconf.distroverpkg)
  File "/usr/lib/python2.5/site-packages/yum/config.py", line 664, in _getsysver
    idx = ts.dbMatch('provides', distroverpkg)
--- cut ---

The fix for this used to be "rm -f /var/lib/rpm/__db.*; rpm --rebuilddb", but now:

--- cut ---
[root@laptop2 rpm]# rpm --rebuilddb
rpmdb: page 11409: illegal page type or format
rpmdb: PANIC: Invalid argument
rpmdb: /var/lib/rpm/Packages: pgin failed for page 11409
error: db4 error(-30977) from dbcursor->c_get: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->close: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from dbenv->close: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->close: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->close: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from dbenv->close: DB_RUNRECOVERY: Fatal error, run database recovery
--- cut ---

Funny, the second "rpm --rebuilddb" fixed it... at least I thought so, but:

[root@laptop2 ~]# rpm -qa | wc
    151     151    3436

and... only 6 MB in /var/lib/rpm. I can now try to use /var/log/rpmpkgs and rpm --justdb, but... maybe I'll just reinstall...

Jeff Johnson 2007-06-06 04:51:45 UTC

You want to be sure to do "rm -f /var/lib/rpm/__db*" before doing "rpm --rebuilddb". BTW, is this an x86_64 box?

Doncho Gunchev 2007-06-06 17:12:26 UTC

No, it's an Intel(R) CPU T1350 @ 1.86GHz. I did 'rm -f /var/lib/rpm/__db*' before 'rpm --rebuilddb'. Could this be related to bug #242368 and bug #242299? Maybe not...

Jeff Johnson 2007-06-07 02:00:49 UTC

DB_RUNRECOVERY is a hard-core rpmdb (and Berkeley DB) failure, usually due to NPTL locking being fubar somehow. yum behavior, depsolver or otherwise, is unrelated to DB_RUNRECOVERY.
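The recovery procedure discussed above (remove the stale __db.* environment files, then rebuild) can be sketched as a small shell function. This is an illustrative sketch added here, not part of the bug report; the function name `rebuild_rpmdb` and the path argument are assumptions (on a real system you would pass /var/lib/rpm and run as root).

```shell
# Hypothetical helper sketching the recovery steps discussed above.
# rebuild_rpmdb <dbpath>: removes stale Berkeley DB environment files
# (__db.001, __db.002, ...) that hold the shared region and its locks,
# then asks rpm to rebuild its indices from the Packages file.
rebuild_rpmdb() {
    db_path="$1"

    # Drop the shared environment; a fresh one is created on next open.
    rm -f "$db_path"/__db.*

    # Rebuild indices only if there is actually a Packages database and
    # rpm is available (guard added so the sketch is safe to dry-run).
    if [ -f "$db_path/Packages" ] && command -v rpm >/dev/null 2>&1; then
        rpm --dbpath "$db_path" --rebuilddb || \
            echo "rpm --rebuilddb reported an error" >&2
    fi
}
```

Note that in the failure described in this report the rebuild itself hit corrupted pages in Packages, so this procedure is not guaranteed to recover everything.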
Red Hat Bugzilla 2007-08-21 05:34:30 UTC

User pnasrat@redhat.com's account has been closed.

Panu Matilainen 2007-08-22 06:35:00 UTC

Reassigning to owner after bugzilla made a mess, sorry about the noise...

Doncho Gunchev 2007-09-27 12:04:02 UTC

Most likely we'll never know what happened... it has worked since then. (Is closing with 'insufficient data' OK?)

Michal Jaegermann 2007-11-26 22:52:26 UTC

I have a machine on which this happens all the time. This started when it was running F7 and persists after an update to F8. For all practical purposes, after nearly any operation with rpm (well, pretty close) I have to do '\rm /var/lib/rpm/__db*; rpm --rebuilddb;', which takes around 7 minutes, and then things are OK for a short while. After that I am back to square one again. If you have a bigger update with yum, you can be pretty sure that you will get kicked out in the middle of a transaction and left with a substantial mess on your hands. Big sigh! This does not happen on other machines running F8. One is only so lucky. Here is what I managed to catch (this one after 'rpm -Va >& report'):

rpmdb: page 223: illegal page type or format
rpmdb: PANIC: Invalid argument
error: db4 error(-30977) from dbcursor->c_get: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->cursor: DB_RUNRECOVERY: Fatal error, run database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->put: DB_RUNRECOVERY: Fatal error, run database recovery
.....

and off we go in a merry loop. rpm-4.4.2.2-7.fc8 on an i386 installation.

Panu Matilainen 2007-11-27 06:25:59 UTC

So what's different about that one system to the others where this problem doesn't happen? Also don't count out the possibility of hardware failure of some sort - I'd suggest running memcheck on the failing system and checking if there's anything suspicious in the logs...
Michal Jaegermann 2007-11-27 18:16:51 UTC

> So what's different about that one system to the others

Different box, different hardware and BIOS, possibly slightly different timings on memory access, ....

> Also don't count out the possibility of hardware failure

That is a possibility, as this is not a new system. Still, take into account comment #3 and the detail that after 'rpm --rebuilddb' everything works fine for some time, and in general the whole system seems to behave. One would think that a faulty memory's failure patterns would be more random. Unfortunately this is not my machine and I have only remote, limited access to it, so running memtest may turn out to be difficult. I'll see if I can arrange for that, but this will take a longer while.

Michal Jaegermann 2007-11-27 19:50:28 UTC

It came to my mind that I should explain how the feat of an F7->F8 update was accomplished on that hardware. 'rpm -Uvh fedora-release*' with F8 packages was followed by 'rpm --rebuilddb; yum update "rpm*" "yum*";'. That mostly worked, with rpm falling flat at the end of the "cleanup" phase. It was followed by some manual cleanup, 'rpm --rebuilddb' and 'yum update'. Something on the order of 1200, or a bit more, packages were retrieved. After some ten or twenty packages were installed, the database decided to go south again and the whole transaction got aborted. At this stage a loop like this was deployed:

for p in "$@"; do
    rm /var/lib/rpm/__db*
    rpm -Uvh --force --nodeps $p
done

with $p ranging over all newly retrieved package files. This worked, although 'rpm --rebuilddb' and a bit of cleanup was required again. The above does not seem to be very consistent with a box having hardware troubles; at least at first look.

Michal Jaegermann 2007-11-30 17:54:38 UTC

> I'd suggest running memcheck on the failing system

A sixteen-hour run of memtest on the system in question came back with zero errors. At the moment the workaround seems to be to run 'rpm --rebuilddb' regularly.
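The brute-force per-package loop from the 2007-11-27 comment above can be written as a self-contained function: clear the stale Berkeley DB environment before every single-package install, so each rpm invocation starts with a fresh region. This is a sketch added for illustration; the function name and the dry-run fallback (printing "would install" when rpm or the package file is missing) are assumptions, not part of the original report.

```shell
# Sketch of the workaround loop described above. Takes the rpmdb path
# followed by the package files to install, one rpm invocation each.
install_one_by_one() {
    db_path="$1"; shift
    for p in "$@"; do
        # Clear the shared Berkeley DB environment before each install.
        rm -f "$db_path"/__db.*

        # Force-install a single package, as in the original loop;
        # fall back to a dry-run message so the sketch runs anywhere.
        if command -v rpm >/dev/null 2>&1 && [ -f "$p" ]; then
            rpm --dbpath "$db_path" -Uvh --force --nodeps "$p"
        else
            echo "would install: $p"
        fi
    done
}
```

As the reporter notes, --force --nodeps bypasses all dependency checking, so a final 'rpm --rebuilddb' and manual cleanup are still needed afterwards.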
Michal Jaegermann 2007-12-04 22:32:57 UTC

There are some variations. Now I got:

.....
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
.....

in an apparent infinite loop. Sigh! "Run recovery" helps for a short while.

Michal Jaegermann 2007-12-07 05:09:34 UTC

Created attachment 280571 [details]: gdb trace from crashing 'rpm -vv --rebuilddb'

It looks like that misbehaving system is ultimately screwed up and the only option remaining will be to rebuild it from scratch. Whatever is in backups does not work either, and it was bad for a while anyway. rpm operations invariably end in segfaults. At least now 'rpm -vv --rebuilddb' consistently segfaults in one place. The last lines printed that way on multiple tries look like this:

D: adding 115 entries to Filemd5s index.
error: rpmdbNextIterator: skipping h# 380 Header V3 DSA signature: BAD, key ID 4f2a6fd2
D: read h# 364 Header V3 DSA signature: NOKEY, key ID 4f2a6fd2
error: rpmdbNextIterator: skipping h# 380 Header V3 DSA signature: BAD, key ID 4f2a6fd2
D: +++ h# 717 Header V3 DSA signature: NOKEY, key ID 4f2a6fd2
D: adding "xorg-x11-xkb-utils" to Name index.
D: adding 18 entries to Basenames index.
D: adding "User Interface/X" to Group index.
D: adding 14 entries to Requirename index.
D: adding 6 entries to Providename index.
D: adding 4 entries to Dirnames index.
D: adding 14 entries to Requireversion index.
D: adding 6 entries to Provideversion index.
D: adding 1 entries to Installtid index.
D: adding 1 entries to Sigmd5 index.
D: adding "7ff1a76054eef935f2a226fd0b2efedc67669232" to Sha1header index.
D: adding 18 entries to Filemd5s index.

The attached gdb trace was done after rpm-debuginfo-4.4.2.2-7.fc8 was added with the help of 'rpm2cpio' and 'cpio'. Just to make sure that things were not corrupted, I made fresh copies of the rpm files by the method above from newly retrieved rpm packages. The trace is combined from two runs; catching the output of 'where' on a terminal was a bit too much. :-) All executables are from rpm-4.4.2.2-7.fc8 on i386.

Michal Jaegermann 2008-01-17 21:56:51 UTC

In comment #8 Panu Matilainen suggested: "Also don't count out the possibility of hardware failure of some sort ...". On 2007/12/14 the machine in question was reinstalled from scratch, without any hardware changes, and its configuration and home directories were restored, as the situation was getting out of hand. Since then it runs and updates without any incidents - until the next time the rpm databases decide to pack up. This machine provides various services, so it is on all the time. If somebody is interested, I have a copy of the old contents of /var/lib/rpm stashed away. This is some 53 MB of data.

Michal Jaegermann 2008-10-15 04:44:34 UTC

See also http://lkml.org/lkml/2008/10/14/429 with "Possible ext3 corruption with 1K block size" for its subject. Looks strangely familiar.

Panu Matilainen 2008-10-15 05:25:42 UTC

Indeed. See bug 181363; there are people reporting the problem got cured by moving the rpmdb to a filesystem with a 4K block size.

Doncho Gunchev 2008-10-26 21:31:25 UTC

It is possible; I could have formatted my root FS with a 1K block size (if '/usr', '/var/log', ... are separate filesystems, then '/' mostly contains small files, and at one time all the '/dev' nodes). If this is true, then one should be able to reproduce this by just installing Fedora on an empty ext3 FS with a 1K block size without letting anaconda reformat it... it would be great to be able to reproduce it somehow.

Bug Zapper 2008-11-26 07:16:18 UTC

This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 8. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 8 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Michal Jaegermann 2008-11-26 16:49:05 UTC

Most likely the bug is still there (modulo the suggestions from comment #15), but it is not that easy to reproduce. In any case, the system with the "hardware failures" suggested in comment #8 has now, after a reinstallation, worked for about a year without any troubles. It is quite likely that the reinstallation in question changed the block size on /var, as the system had quite a long history before the corruption struck, but I cannot be sure. At that time I had no idea that attention should be paid to block sizes.
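The block-size theory from the preceding comments can be checked without reinstalling: GNU `stat -f` reports the properties of the filesystem holding a given path. This is a diagnostic sketch added here (not from the bug report); the fallback to /var when /var/lib/rpm is absent is an assumption for portability of the sketch.

```shell
# Report the fundamental block size of the filesystem that holds the
# rpm database. %S is GNU stat's fundamental block size directive; an
# ext3 filesystem formatted with 1K blocks would print 1024 here.
db_dir=/var/lib/rpm
[ -d "$db_dir" ] || db_dir=/var    # fall back if the rpmdb path is absent

stat -f --format='%S' "$db_dir"
```

Per the reports referenced above (bug 181363 and the LKML thread), a value of 1024 on the filesystem holding /var/lib/rpm would match the configurations where moving the rpmdb to a 4K-block filesystem cured the corruption.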
Jeff Johnson 2008-11-27 18:31:21 UTC

There's another problem with similar failure symptoms that correlates with the ext4 patches: https://bugzilla.redhat.com/show_bug.cgi?id=468437

Berkeley DB uses mmap(2). If mmap(2) (or the underlying file system store) returns inconsistent or incorrect results, then you will see rpmdb failures. An rpmdb, with locks and data sanity checks, is just the canary in the mine shaft.

Bug Zapper 2009-01-09 07:06:49 UTC

Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug, and we are sorry it could not be fixed.