Bug 242092

Summary: old db4 problem kills rpm database
Product: Fedora
Component: rpm
Version: 8
Hardware: All
OS: Linux
Status: CLOSED WONTFIX
Severity: medium
Priority: low
Reporter: Doncho Gunchev <dgunchev>
Assignee: Panu Matilainen <pmatilai>
CC: michal, n3npq
Keywords: Reopened
Doc Type: Bug Fix
Last Closed: 2009-01-09 07:06:49 UTC
Attachments:
gdb trace from crashing 'rpm -vv --rebuilddb'

Description Doncho Gunchev 2007-06-01 16:47:20 UTC
Description of problem:
A problem with db4 killed my rpm database with F7.

Version-Release number of selected component (if applicable):
Almost nothing (just Firefox, AFAIR) updated beyond F7 final.

How reproducible:
Very rarely; I hit this sort of problem 2-3 times a year.

Steps to Reproduce:
1. do something with rpm/yum/yumex
  
Actual results:
Sometimes db4 breaks rpm. Most of the time "rm -f /var/lib/rpm/__db.*; rpm
--rebuilddb" fixes it, but not this time. I ended up with 151 or so packages
left in the database (every program works).

Expected results:
rpm should be rock solid!

Additional info:
I think it may be the old RH8/9 db4 problem with NPTL (CentOS ships a non-NPTL rpm
for that reason, AFAIK). Here's what I did:

--- cut ---
[root@laptop2 ~]# yum shell
Loading "installonlyn" plugin
Loading "downloadonly" plugin
Loading "fastestmirror" plugin
Loading "skip-broken" plugin
Loading "fedorakmod" plugin
Loading "changelog" plugin
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from dbenv->open: DB_RUNRECOVERY: Fatal error, run
database recovery
error: cannot open Packages index using db3 -  (-30977)
error: cannot open Packages database in /var/lib/rpm
Traceback (most recent call last):
  File "/usr/bin/yum", line 29, in <module>
    yummain.main(sys.argv[1:])
  File "/usr/share/yum-cli/yummain.py", line 82, in main
    base.getOptionsConfig(args)
  File "/usr/share/yum-cli/cli.py", line 146, in getOptionsConfig
    errorlevel=opts.errorlevel)
  File "/usr/lib/python2.5/site-packages/yum/__init__.py", line 153, in _getConfig
    self._conf = config.readMainConfig(startupconf)
  File "/usr/lib/python2.5/site-packages/yum/config.py", line 601, in readMainConfig
    yumvars['releasever'] = _getsysver(startupconf.installroot,
startupconf.distroverpkg)
  File "/usr/lib/python2.5/site-packages/yum/config.py", line 664, in _getsysver
    idx = ts.dbMatch('provides', distroverpkg)
--- cut ---

The fix for this used to be: "rm -f /var/lib/rpm/__db.*; rpm --rebuilddb", but now:

--- cut ---
[root@laptop2 rpm]# rpm --rebuilddb
rpmdb: page 11409: illegal page type or format
rpmdb: PANIC: Invalid argument
rpmdb: /var/lib/rpm/Packages: pgin failed for page 11409
error: db4 error(-30977) from dbcursor->c_get: DB_RUNRECOVERY: Fatal error, run
database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->close: DB_RUNRECOVERY: Fatal error, run
database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from dbenv->close: DB_RUNRECOVERY: Fatal error, run
database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->close: DB_RUNRECOVERY: Fatal error, run
database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->close: DB_RUNRECOVERY: Fatal error, run
database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from dbenv->close: DB_RUNRECOVERY: Fatal error, run
database recovery
--- cut ---
Funny, the second "rpm --rebuilddb" fixed it... at least I thought so, but:
[root@laptop2 ~]# rpm -qa | wc
    151     151    3436
and... only 6 MB in /var/lib/rpm.

I can now try to use /var/log/rpmpkgs and rpm --justdb, but... maybe I'll just
reinstall...
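
For reference, the /var/log/rpmpkgs + '--justdb' route would look roughly like
the sketch below.  /var/log/rpmpkgs only lists package file names, so this
assumes the matching .rpm files have already been fetched somewhere locally
(the /tmp/pkgs directory is purely illustrative) and that the broken
/var/lib/rpm has been moved aside first:

        mkdir -p /var/lib/rpm && rpm --initdb   # start from an empty database
        while read -r pkg; do
          # record each package in the database only; no files are touched
          rpm -ivh --justdb --nodeps --noscripts "/tmp/pkgs/$pkg" \
            || echo "missing: $pkg"
        done < /var/log/rpmpkgs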

Comment 1 Jeff Johnson 2007-06-06 04:51:45 UTC
You want to be sure to do
    rm -f /var/lib/rpm/__db*
before doing rpm --rebuilddb.
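
For reference, the full sequence being suggested is roughly this (the backup
step is just a sensible precaution, not part of the original advice):

        cp -a /var/lib/rpm /var/lib/rpm.bak   # keep a copy of the broken db, just in case
        rm -f /var/lib/rpm/__db*              # drop the stale Berkeley DB environment files
        rpm -vv --rebuilddb                   # rebuild Packages and the indices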

BTW, is this an x86_64 box?

Comment 2 Doncho Gunchev 2007-06-06 17:12:26 UTC
No, it's an Intel(R) CPU T1350 @ 1.86GHz. I did 'rm -f /var/lib/rpm/__db*' before
'rpm --rebuilddb'. Could this be related to bug #242368 and bug #242299? Maybe
not...

Comment 3 Jeff Johnson 2007-06-07 02:00:49 UTC
DB_RUNRECOVERY is a hard-core rpmdb (and Berkeley DB) failure, usually due to NPTL locking being fubar 
somehow. yum behavior, depsolver or otherwise, is unrelated to DB_RUNRECOVERY.

Comment 4 Red Hat Bugzilla 2007-08-21 05:34:30 UTC
User pnasrat's account has been closed

Comment 5 Panu Matilainen 2007-08-22 06:35:00 UTC
Reassigning to owner after bugzilla made a mess, sorry about the noise...

Comment 6 Doncho Gunchev 2007-09-27 12:04:02 UTC
Most likely we'll never know what happened... it has worked since then... (Is closing
with 'insufficient data' OK?)

Comment 7 Michal Jaegermann 2007-11-26 22:52:26 UTC
I have a machine on which this happens all the time.  This started
when it was running F7 and persists after an update to F8.  For all
practical purposes, after nearly any operation with rpm (well, pretty
close) I have to do '\rm /var/lib/rpm/__db*; rpm --rebuilddb;', which
takes around 7 minutes, and that is OK for a short while.  After that
I am back to square one again.  If you have a bigger update with
yum you can be pretty sure that you will get kicked out in the middle
of a transaction and left with a substantial mess on your hands.  Big sigh!

This does not happen on other machines running F8.  One is only
so lucky.

Here is what I managed to catch (this one after 'rpm -Va >& report'):

rpmdb: page 223: illegal page type or format
rpmdb: PANIC: Invalid argument
error: db4 error(-30977) from dbcursor->c_get: DB_RUNRECOVERY: Fatal error, run
database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->cursor: DB_RUNRECOVERY: Fatal error, run
database recovery
rpmdb: PANIC: fatal region error detected; run recovery
error: db4 error(-30977) from db->put: DB_RUNRECOVERY: Fatal error, run database
recovery
.....

and off we go in a merry loop.

rpm-4.4.2.2-7.fc8 on i386 installation.



Comment 8 Panu Matilainen 2007-11-27 06:25:59 UTC
So what's different about that one system compared to the others where this problem
doesn't happen? Also, don't count out the possibility of hardware failure of some
sort - I'd suggest running memcheck on the failing system and checking whether
there's anything suspicious in the logs...

Comment 9 Michal Jaegermann 2007-11-27 18:16:51 UTC
> So what's different about that one system to the others
Different box, different hardware and BIOS, possibly slightly
different timings on a memory access, ....

> Also don't count out the possibility of hardware failure
That is a possibility, as this is not a new system.  Still, take
into account comment #3 and the detail that after
'rpm --rebuilddb' everything works fine for some time, and in
general the whole system seems to behave.  One would think that
faulty memory would produce more random failure patterns.

Unfortunately this is not my machine and I have only remote,
limited access to it, so running memtest may turn out to be difficult.
I'll see if I can arrange for that, but this will take a longer
while. 


Comment 10 Michal Jaegermann 2007-11-27 19:50:28 UTC
It came to my mind that I should explain how the feat of an
F7->F8 update was accomplished on that hardware.

'rpm -Uvh fedora-release*' with F8 packages was followed by
'rpm --rebuilddb; yum update "rpm*" "yum*";'.  That mostly worked,
with rpm falling flat at the end of the "cleanup" phase.  It was
followed by some manual cleanup, 'rpm --rebuilddb' and 'yum update'.
Something on the order of 1200, or a bit more, packages were
retrieved.  After some ten or twenty packages were installed,
the database decided to go south again and the whole transaction
got aborted.  At this stage a loop like this was deployed:

        for p in "$@" ; do
          # clear the stale Berkeley DB environment before each install
          rm -f /var/lib/rpm/__db*
          rpm -Uvh --force --nodeps "$p"
        done

with $p ranging through all the newly retrieved package files.
This worked, although 'rpm --rebuilddb' and a bit of cleanup
were required again.

The above does not seem to be very consistent with a box
having hardware troubles; at least at first glance.

Comment 11 Michal Jaegermann 2007-11-30 17:54:38 UTC
> I'd suggest running memcheck on the failing system

A sixteen-hour run of memtest on the system in question
came back with zero errors.

At the moment the workaround seems to be to run
'rpm --rebuilddb' regularly.
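
If that workaround has to be applied regularly, a (hypothetical) cron.daily
script along these lines would automate it; the file name and redirection are
illustrative, nothing like this is shipped by rpm:

        #!/bin/sh
        # /etc/cron.daily/rpmdb-rebuild (hypothetical): apply the workaround daily
        rm -f /var/lib/rpm/__db*
        /usr/bin/rpm --rebuilddb > /dev/null 2>&1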

Comment 12 Michal Jaegermann 2007-12-04 22:32:57 UTC
There are some variations.  Now I got:
.....
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
rpmdb: PANIC: fatal region error detected; run recovery
.....
in an apparent infinite loop.  Sigh! "Run recovery" helps
for a short while.


Comment 13 Michal Jaegermann 2007-12-07 05:09:34 UTC
Created attachment 280571 [details]
gdb trace from crashing 'rpm -vv --rebuilddb'

It looks like the misbehaving system is ultimately screwed up and
the only remaining option will be to rebuild it from scratch.  Whatever
is in backups does not work either, and it was bad for a while anyway.
rpm operations invariably end in segfaults.

At least now 'rpm -vv --rebuilddb' consistently segfaults in one place.
Last lines printed that way on multiple tries look like this:

D: adding 115 entries to Filemd5s index.
error: rpmdbNextIterator: skipping h#	  380 Header V3 DSA signature: BAD, key
ID 4f2a6fd2
D:  read h#	364 Header V3 DSA signature: NOKEY, key ID 4f2a6fd2
error: rpmdbNextIterator: skipping h#	  380 Header V3 DSA signature: BAD, key
ID 4f2a6fd2
D:   +++ h#	717 Header V3 DSA signature: NOKEY, key ID 4f2a6fd2
D: adding "xorg-x11-xkb-utils" to Name index.
D: adding 18 entries to Basenames index.
D: adding "User Interface/X" to Group index.
D: adding 14 entries to Requirename index.
D: adding 6 entries to Providename index.
D: adding 4 entries to Dirnames index.
D: adding 14 entries to Requireversion index.
D: adding 6 entries to Provideversion index.
D: adding 1 entries to Installtid index.
D: adding 1 entries to Sigmd5 index.
D: adding "7ff1a76054eef935f2a226fd0b2efedc67669232" to Sha1header index.
D: adding 18 entries to Filemd5s index.

The attached gdb trace was done after rpm-debuginfo-4.4.2.2-7.fc8 was
added with the help of 'rpm2cpio' and 'cpio'.  Just to make sure that
things were not corrupted, I made fresh copies of the rpm files by the
method above from newly retrieved rpm packages.

That trace is combined from two runs.  Catching the output of
'where' on a terminal was a bit too much. :-)

All executables for rpm-4.4.2.2-7.fc8 on i386.

Comment 14 Michal Jaegermann 2008-01-17 21:56:51 UTC
In comment #8 Panu Matilainen suggested: "Also don't count out
the possibility of hardware failure of some sort ...".

On 2007/12/14 the machine in question was reinstalled "from scratch",
without any hardware changes, and its configuration and home directories
restored, as the situation was getting out of hand.  From that time on
it has run and updated without any incidents - until the next time the
rpm databases decide to pack up.  This machine provides various
services, so it is on all the time.

If somebody is interested, I have a copy of the old contents of
/var/lib/rpm stashed away.  This is some 53 MB of data.


Comment 15 Michal Jaegermann 2008-10-15 04:44:34 UTC
See also http://lkml.org/lkml/2008/10/14/429, with
"Possible ext3 corruption with 1K block size" as its subject.
Looks strangely familiar.

Comment 16 Panu Matilainen 2008-10-15 05:25:42 UTC
Indeed. See bug 181363; there are people reporting the problem was cured by moving the rpmdb to a filesystem with a 4K block size.

Comment 17 Doncho Gunchev 2008-10-26 21:31:25 UTC
It is possible; I could have formatted my root FS with a 1K block size (if '/usr', '/var/log', etc. are separate filesystems then '/' mostly contains small files, and at one time all the '/dev' nodes).

If this is true then one should be able to reproduce this by just installing Fedora on an empty ext3 FS with a 1K block size without letting anaconda reformat it... it would be great to be able to reproduce it somehow.
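
Checking whether a given installation actually matches the 1K-block-size theory
is straightforward; the device names below are illustrative:

        df /var/lib/rpm                            # find the device backing /var/lib/rpm
        tune2fs -l /dev/sda2 | grep 'Block size'   # 1024 would fit the theory, 4096 would not

        # To attempt a reproducer, an ext3 FS could be created with a 1K block
        # size up front and handed to anaconda without reformatting:
        mkfs.ext3 -b 1024 /dev/sdXN                # target partition left as a placeholder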

Comment 18 Bug Zapper 2008-11-26 07:16:18 UTC
This message is a reminder that Fedora 8 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 8.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '8'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 8's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 8 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 19 Michal Jaegermann 2008-11-26 16:49:05 UTC
The bug is most likely still there (modulo the suggestions in comment #15) but not that easy to reproduce.  In any case, the system with the "hardware failures" suggested in comment #8 has now, after a reinstallation, worked for about a year without any trouble.

It is quite likely that the reinstallation in question changed the block size on /var, as the system had quite a long history before the corruption struck, but I cannot be sure.  At that time I had no idea that attention should be paid to block sizes.

Comment 20 Jeff Johnson 2008-11-27 18:31:21 UTC
There's another problem with similar failure symptoms that correlates with the ext4 patches:
    https://bugzilla.redhat.com/show_bug.cgi?id=468437

Berkeley DB uses mmap(2). If mmap(2) (or the underlying file system store) returns inconsistent
or incorrect results, then you will see rpmdb failures. An rpmdb, with locks and data
sanity checks, is just the canary in the mine shaft.
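
Independently of rpm, the raw Packages file can be checked for the kind of bad
pages reported above with Berkeley DB's own verifier (from the db4-utils
package, if memory serves):

        db_verify /var/lib/rpm/Packages   # walks the database pages, reports illegal page types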

Comment 21 Bug Zapper 2009-01-09 07:06:49 UTC
Fedora 8 changed to end-of-life (EOL) status on 2009-01-07. Fedora 8 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.