Bug 114978

Summary: duelling rpm operations corrupt the rpm database
Product: [Retired] Red Hat Linux Reporter: Peter Wolfenden <pw>
Component: rpm Assignee: Jeff Johnson <jbj>
Status: CLOSED NOTABUG QA Contact: Mike McLean <mikem>
Severity: high Docs Contact:
Priority: medium    
Version: 7.2   
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2004-02-05 19:27:32 UTC Type: ---

Description Peter Wolfenden 2004-02-05 02:25:25 UTC
Description of problem:
=======================
I compiled the rpm 4.1.1-1.8x source RPM on one of our Red Hat
7.2 systems. It seems to work fine until I run two copies
of script1.pl and two copies of script2.pl concurrently (see
"Steps to Reproduce" below).

After just a few seconds of normal operation (the scripts
should generate a large number of messages about contention
for the rpm database), the scripts start to log lots of error
messages indicating that the rpm database has been corrupted
(for details see "Actual Results" below).

Sometimes, one of the rpm operations hangs (for a sample
backtrace, see the "Actual Results" below).

Version-Release number of selected component (if applicable):
=============================================================
rpm-4.1.1-1.8x.src.rpm

How reproducible:
=================
Every time.

Steps to Reproduce:
===================
1. Replace SOME_SYSTEM_RPM_V1 and SOME_SYSTEM_RPM_V2 in script2.pl
   (see below) with the paths of any two versions of a system rpm
   that exist in your local filesystem, V1 being the older one
   since it is installed with --oldpackage (the contents of the
   package shouldn't matter).

2. Run the scripts defined below as follows:

  ./script1.pl >& one1.out&
  ./script1.pl >& one2.out&
  ./script2.pl >& two1.out&
  ./script2.pl >& two2.out&

script1.pl:
-----------
#!/usr/bin/perl

sub catch_zap {
  my $signame = shift;
  print $output;  # print, not printf: the rpm output is not a format string
  die;
}

$SIG{INT} = \&catch_zap;

while (1) {
  $output = `rpm -qa`;
  $now = scalar localtime(time());
  $counter++; printf("$counter ($now)\n");
}

script2.pl:
-----------
#!/usr/bin/perl

sub catch_zap {
  my $signame = shift;
  print "output1=[$output1]\n";
  print "output2=[$output2]\n";
  die;
}

$SIG{INT} = \&catch_zap;

while (1) {
  ($output1,$output2) = ('-','-');
  $counter++; $now = scalar localtime(time());
  $output1 = `rpm -U --oldpackage SOME_SYSTEM_RPM_V1`;
  $output2 = `rpm -U SOME_SYSTEM_RPM_V2`;
  printf("$counter ($now)\n");
}
  
Actual results:
===============
After a few seconds (less than a minute on a 2GHz Celeron) of
"normal" failure messages about contention for the rpm database,
weird error messages begin to appear in the output files, e.g.:

  rpmdb: fatal region error detected; run recovery
  error: db4 error(-30982) from db->sync: DB_RUNRECOVERY: Fatal error, run database recovery

Or this (though these lines may actually have been produced by
one of the earlier versions of rpm):

  rpmdb: /var/lib/rpm/Packages: unexpected file type or format
  error: cannot open Packages index using db3 - Invalid argument (22)

Sometimes (typically within a couple of minutes), one of the rpm
processes hangs. In this case you will notice that the output file
for the associated Perl script ceases to be updated. The backtrace
of a hung rpm process typically looks something like this:

#0  0x08117d90 in __memp_fget_rpmdb ()
#1  0x080f04f0 in __db_free_rpmdb ()
#2  0x080f105a in __db_doff_rpmdb ()
#3  0x080ffa55 in __ham_del_pair_rpmdb ()
#4  0x080f9f7d in __ham_c_del ()
#5  0x080e8770 in __db_c_del_rpmdb ()
#6  0x080905df in db3cdel ()
#7  0x0808cf61 in rpmdbRemove ()
#8  0x0806374e in rpmpsmStage ()
#9  0x08062f5e in rpmpsmStage ()
#10 0x080632a4 in rpmpsmStage ()
#11 0x0807dc92 in rpmtsRun ()
#12 0x0806f24a in rpmInstall ()
#13 0x08049554 in main ()
#14 0x08155672 in __libc_start_main ()

Sometimes, both versions of the system package which is upgraded
& downgraded by script2.pl end up in the rpm database at the same
time.
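
One way to spot this symptom is to list package names that appear more
than once in the database. The following is only a minimal sketch: the
function name duplicate_names is hypothetical, and some packages (for
example kernels) may legitimately be installed in several versions, so a
hit is a hint rather than proof of corruption.

```shell
#!/bin/sh
# Hypothetical helper: read package names on stdin, one per line,
# and print every name that occurs more than once.
duplicate_names() {
    sort | uniq -d
}

# Self-contained demonstration with made-up input.
printf 'bash\nglibc\nbash\n' | duplicate_names

# Against the live database (a read-only query) one would run:
#   rpm -qa --qf '%{NAME}\n' | duplicate_names
```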

Sometimes, the rpm database gets so mangled that the magic number
of /var/lib/rpm/Packages becomes corrupted, and 'file' reports
the Packages file as 'data' instead of 'Berkeley DB'.
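
The 'file' check described above could be scripted roughly as follows.
This is a sketch under the assumption that a healthy Packages file is
reported with the words 'Berkeley DB' somewhere in the 'file' output; the
function name packages_looks_ok is hypothetical and the exact wording of
'file' output varies between versions.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a line of `file` output identifies
# its subject as a Berkeley DB file, fail otherwise.
packages_looks_ok() {
    case "$1" in
        *"Berkeley DB"*) return 0 ;;
        *)               return 1 ;;
    esac
}

# Intended use on a real system:
#   packages_looks_ok "$(file /var/lib/rpm/Packages)" \
#       || echo "Packages no longer looks like a Berkeley DB file"
```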

Expected results:
=================
The scripts should produce error messages about contention for the
rpm database, but should never freeze for more than a few seconds.
It should be possible to run the scripts for a whole week (disk space
for output files permitting) without any rpm processes hanging or the
rpm database becoming corrupted, or two versions of the same package
being added to the rpm database.

Additional information:
=======================
I've tried this same experiment with several versions of rpm. All
have failed, but in different ways:

RPM Version    Behavior of my 4 Perl scripts
-----------    -----------------------------
4.0.4x         rpm hangs after ~20-30 minutes
4.0.5          rpm hangs after ~20-30 minutes (deadlock)
4.1.1          rpm database becomes corrupted
               (and rpm sometimes hangs)

For details re 4.0.5 and 4.0.4x, see Bug 11480.

Bug 89728 and Bug 12443 look similar, but that's based only on a
superficial reading of their notes.

Comment 1 Jeff Johnson 2004-02-05 04:05:29 UTC
Well, we meet again ;-) But your script is still rather unrealistic.

Your database can be recovered by doing
    cd /var/lib/rpm
    mv Packages Packages-ORIG
    db_dump Packages-ORIG | db_load Packages
    rpm --rebuilddb -vv
Use rpmdb_dump and rpmdb_load in /usr/lib/rpm if you have them.

Locks can be displayed by doing
    cd /var/lib/rpm
    db_stat -CA
Use /usr/lib/rpm/rpmdb_stat if you have that.

What is the output?
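
The choice between the db_* tools and the rpmdb_* copies in /usr/lib/rpm
could be scripted along these lines. This is only a dry-run sketch: the
function names pick_tool and recover_plan are hypothetical, and the script
prints the recovery commands listed above instead of running them, so
nothing in /var/lib/rpm is touched.

```shell
#!/bin/sh
# Hypothetical helper: prefer the rpm-bundled copy of a Berkeley DB tool
# (e.g. /usr/lib/rpm/rpmdb_dump) if it is executable, else use the plain
# tool name (e.g. db_dump) and rely on PATH.
pick_tool() {
    # $1 = plain tool name, $2 = directory holding the rpmdb_* tools
    bundled="$2/rpm$1"
    if [ -x "$bundled" ]; then
        echo "$bundled"
    else
        echo "$1"
    fi
}

# Print (do not execute) the recovery steps with the chosen tools.
recover_plan() {
    rpmlib=${1:-/usr/lib/rpm}
    dump=$(pick_tool db_dump "$rpmlib")
    load=$(pick_tool db_load "$rpmlib")
    echo "cd /var/lib/rpm"
    echo "mv Packages Packages-ORIG"
    echo "$dump Packages-ORIG | $load Packages"
    echo "rpm --rebuilddb -vv"
}

recover_plan
```

Printing the plan first makes it possible to review the commands, and to
stop the test scripts, before anything is moved or rebuilt.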

Comment 2 Peter Wolfenden 2004-02-05 19:22:34 UTC
This bug is invalid, because my tests were never run with
rpm version 4.1.1 after all. It turns out that my organization
is committed to an older (patched) version of glibc, which
unfortunately rules out rpm versions 4.1.* and 4.2.*. The
tests that I thought were being run with rpm version 4.1.1
(also mentioned in Bug 11400) were in fact being run with
version 4.0.5.

Oops. Here's my corrected table o' knowledge:

rpm version  glibc version  behavior of my Perl scripts
-----------  -------------  ---------------------------
4.0.4        2.2            rpm hangs after ~20-30 minutes
                            (deadlock)
4.0.5        2.2            rpm database becomes corrupted
                            (and rpm sometimes hangs)
4.1.1        2.3            ?

Apologies for the confusion and thanks, Jeff, for your time.

I would of course be interested to know what happens when
my Perl scripts run for an hour with the latest & greatest
version of rpm, but I won't have time to build a machine for
this purpose at least for the next couple of months. Please
let me know if someone else gets a chance to try this
experiment.