Bug 114810

Summary:	rpmq freezes on select
Product:	[Retired] Red Hat Linux	Reporter:	Peter Wolfenden <pw>
Component:	rpm	Assignee:	Jeff Johnson <jbj>
Status:	CLOSED WONTFIX	QA Contact:	Mike McLean <mikem>
Severity:	high	Docs Contact:
Priority:	medium
Version:	7.2
Hardware:	All
OS:	Linux
Fixed In Version:		Doc Type:	Bug Fix
Last Closed:	2004-02-04 18:55:13 UTC	Type:	---

Description Peter Wolfenden 2004-02-03 02:49:49 UTC

Description of problem:
=======================
I compiled the latest 4.0.5 series source RPM on one of
our Red Hat 7.2 systems. It seems to work fine until I run
two copies of script1.pl and two copies of script2.pl
concurrently (see "Steps to Reproduce" below) for a few minutes.

One of the instances of script2.pl always gets stuck, and
using gdb to backtrace it, I see that it is getting stuck
on a select, which returned a failure code of -1 (for more
details, see "Additional Information" below).

I've never had a problem killing the stuck 'rpmq' process,
and I've never had to rebuild the database afterwards. But
applying the same test to later versions of the rpm binaries
has given me systems with corrupted db3 (and db4) rpm databases.

Version-Release number of selected component (if applicable):
=============================================================
rpm-4.0.5-1.7x.src.rpm

How reproducible:
=================
Every time.

Steps to Reproduce:
===================
1. Replace SOME_SYSTEM_RPM_V1 and SOME_SYSTEM_RPM_V2 in script2.pl
   (see below) with any two versions of a system rpm which
   exist in your local filesystem (the contents of the
   package shouldn't matter).

2. Run the scripts defined below as follows:

  ./script1.pl >& one1.out&
  ./script1.pl >& one2.out&
  ./script2.pl >& two1.out&
  ./script2.pl >& two2.out&

script1.pl:
-----------
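#!/usr/bin/perl
# Query the full package list in a tight loop.

# On SIGINT, dump the output of the last query before exiting.
sub catch_zap {
  my $signame = shift;
  print $output;   # print, not printf: the output is data, not a format string
  die;
}

$SIG{INT} = \&catch_zap;

while (1) {
  $output = `rpm -qa`;
  $now = scalar localtime(time());
  $counter++; print "$counter ($now)\n";
}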

script2.pl:
-----------
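#!/usr/bin/perl
# Install two versions of the same package alternately in a tight loop.

# On SIGINT, dump the output of the last two rpm commands before exiting.
sub catch_zap {
  my $signame = shift;
  print "output1=[$output1]\n";
  print "output2=[$output2]\n";
  die;
}

$SIG{INT} = \&catch_zap;

while (1) {
  ($output1,$output2) = ('-','-');
  $counter++; $now = scalar localtime(time());
  # Replace the two placeholders with two versions of one package.
  $output1 = `rpm -U --oldpackage SOME_SYSTEM_RPM_V1`;
  $output2 = `rpm -U SOME_SYSTEM_RPM_V2`;
  print "$counter ($now)\n";
}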

Actual results:
===============
After a few minutes (anywhere from 15 minutes to
half an hour on a 2GHz Celeron) you'll notice that
either the two1.out file or the two2.out file has
stopped being updated, and after you kill the Perl
scripts you'll see the associated hung rpm process
via 'ps auwxf | grep rpm'.
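
The stalled output file can also be detected automatically instead of
by eye. A minimal sketch, assuming GNU stat and a date that supports
+%s; the function name is_stalled is hypothetical, and any threshold
passed to it is an arbitrary choice:

```shell
# Hypothetical helper: exit 0 if FILE was last modified more than
# MAX_AGE seconds ago, 1 if it is still fresh, 2 if it cannot be read.
is_stalled() {
  stalled_file=$1
  stalled_max_age=$2
  # GNU stat: %Y is the modification time in seconds since the epoch.
  stalled_mtime=$(stat -c %Y "$stalled_file" 2>/dev/null) || return 2
  stalled_now=$(date +%s)
  [ $((stalled_now - stalled_mtime)) -gt "$stalled_max_age" ]
}
```

For example, with an assumed threshold of five minutes:

  for f in two1.out two2.out; do
    is_stalled "$f" 300 && echo "$f looks stalled"
  done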

Expected results:
=================
The scripts should produce error messages about RPM
database lock contention, but should never freeze
for more than a few seconds. It should be possible
to run the scripts for a whole week (disk space for
log files permitting) without any rpm processes
hanging or the rpm database becoming corrupted.

Additional info:
================
Here's what the gdb backtrace typically looks like:
---------------------------------------------------
...
Loaded symbols for /lib/ld-linux.so.2
0x45814b3e in __select () at __select:-1
-1      __select: No such file or directory.
---Type <return> to continue, or q <return> to quit---
        in __select
(gdb) bt
#0  0x45814b3e in __select () at __select:-1
#1  0x4569c28c in __DTOR_END__ () from /usr/lib/librpmdb-4.0.4.so
#2  0x45672f65 in __os_yield_rpmdb () from /usr/lib/librpmdb-4.0.4.so
#3  0x455ff9dd in __db_tas_mutex_lock_rpmdb () from /usr/lib/librpmdb-4.0.4.so
#4  0x4566d25e in __memp_fopen_int_rpmdb () from /usr/lib/librpmdb-4.0.4.so
#5  0x4566d09f in __memp_fopen () from /usr/lib/librpmdb-4.0.4.so
#6  0x45624f55 in __db_dbenv_setup_rpmdb () from /usr/lib/librpmdb-4.0.4.so
#7  0x4563520f in __db_dbopen_rpmdb () from /usr/lib/librpmdb-4.0.4.so
#8  0x45634fbc in __db_open_rpmdb () from /usr/lib/librpmdb-4.0.4.so
#9  0x455fdf0e in db3open () from /usr/lib/librpmdb-4.0.4.so
#10 0x455f623b in dbiOpen () from /usr/lib/librpmdb-4.0.4.so
#11 0x455f72f3 in openDatabase () from /usr/lib/librpmdb-4.0.4.so
#12 0x455f747d in rpmdbOpen () from /usr/lib/librpmdb-4.0.4.so
#13 0x455b2eef in rpmQuery () from /usr/lib/librpm-4.0.4.so
#14 0x08049efe in main ()
#15 0x45749657 in __libc_start_main (main=0x8049620 <main>, argc=3,
    ubp_av=0xb5cc6db4, init=0x8049068 <_init>, fini=0x804a180 <_fini>,
    rtld_fini=0x45556cd4 <_dl_fini>, stack_end=0xb5cc6dac)
    at ../sysdeps/generic/libc-start.c:129
(gdb) Quit

Comment 1 Jeff Johnson 2004-02-03 12:59:12 UTC

The trace shows a deadlock.

Try rpm-4.1.1 if you want concurrent
access to the database. rpm-4.0.5 is
already end-of-life.

Comment 2 Peter Wolfenden 2004-02-03 17:37:44 UTC

I don't care about "concurrent access" to the rpm database -
serialized access would be just fine. I simply want all rpm
operations to succeed or fail without deadlocks or database
corruption, and I don't want to have to "wrap" them with my
own custom contention resolution system to achieve this.
Are you saying that deadlocking behavior is a known and
accepted problem in the 4.0.5 series?

I tried my "four Perl script test" (see the initial description
above) with version 4.1.1-1.8x, and the results were infinitely
*worse* - after half an hour, the rpm database became corrupted,
so much so that the magic number in the 'Packages' file got
mangled (look for the 'data' file below):
-----------------------------------------
[root@localhost rpmq]# file /var/lib/rpm/*
/var/lib/rpm/Basenames:      Berkeley DB (Hash, version 7, native byte-order)
/var/lib/rpm/Conflictname:   Berkeley DB (Hash, version 7, native byte-order)
/var/lib/rpm/Dirnames:       Berkeley DB (Btree, version 8, native byte-order)
/var/lib/rpm/Filemd5s:       Berkeley DB (Hash, version 7, native byte-order)
/var/lib/rpm/Group:          Berkeley DB (Hash, version 7, native byte-order)
/var/lib/rpm/Installtid:     Berkeley DB (Btree, version 8, native byte-order)
/var/lib/rpm/Name:           Berkeley DB (Hash, version 7, native byte-order)
/var/lib/rpm/Packages:       data
/var/lib/rpm/Providename:    Berkeley DB (Hash, version 7, native byte-order)
/var/lib/rpm/Provideversion: Berkeley DB (Btree, version 8, native byte-order)
/var/lib/rpm/Requirename:    Berkeley DB (Hash, version 7, native byte-order)
/var/lib/rpm/Requireversion: Berkeley DB (Btree, version 8, native byte-order)
/var/lib/rpm/Sha1header:     Berkeley DB (Hash, version 7, native byte-order)
/var/lib/rpm/Sigmd5:         Berkeley DB (Hash, version 7, native byte-order)
/var/lib/rpm/Triggername:    Berkeley DB (Hash, version 7, native byte-order)

I say this is "infinitely worse" because you can't fix this
problem with 'rpm --rebuilddb'. If you use rpm to manage your
system packages (and we do), then the system is well and truly
hosed when you get into a situation like this. Even a system
where processes lock up once in a while is better than one which
needs to be entirely rebuilt every once in the same while!

As far as I'm concerned this bug can be closed if someone is
able to run my "four Perl scripts" test for more than 48 hours
with no deadlocking or database corruption with some "official"
version of rpm. In this case, please indicate the version, and
I will verify it on one of our systems.

Comment 3 Jeff Johnson 2004-02-04 13:35:44 UTC

Yes, known problem in rpm-4.0.5, which effectively
has no locking whatsoever.

30 minutes of repeated upgrades of an identical
package is about what I would expect. There is
no known application that needs even this degree
of write access to the database.

Yes, you will need to arrange serialization outside of rpm
if you wish to achieve running your scripts for 48 hours.
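
One way to arrange that serialization outside rpm is to run every rpm
invocation under a single advisory lock. A minimal sketch, assuming the
flock(1) utility from util-linux is available on the system; the
function name serialized, the lock file path, and the 60 second default
are hypothetical choices, not part of rpm:

```shell
# Assumed lock file and wait limit; both can be overridden by the caller.
LOCKFILE="${LOCKFILE:-/var/lock/rpm-serial.lock}"
LOCK_WAIT="${LOCK_WAIT:-60}"

# Hypothetical wrapper: run a command while holding an exclusive
# advisory lock, so only one wrapped command runs at a time.
serialized() {
  # -w: give up after LOCK_WAIT seconds instead of waiting forever;
  # flock then exits with status 1 without running the command.
  flock -w "$LOCK_WAIT" "$LOCKFILE" "$@"
}
```

The scripts would then call, for example:

  serialized rpm -U --oldpackage SOME_SYSTEM_RPM_V1

This only serializes callers that go through the wrapper, and it does
not address the underlying deadlock: an rpm process that hangs keeps
the lock until it is killed, and the timeout merely bounds how long the
other callers wait.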

Resolution is WONTFIX because rpm-4.0.5 is end-of-life.

Comment 4 Peter Wolfenden 2004-02-04 16:44:57 UTC

Obviously no application needs to upgrade and downgrade the
same package continuously. The point of doing this is to
reproduce the failure reliably in only 20 minutes, instead
of having to wait 2-3 months for a "real" app to fail.

Can't you please at least tell me some version (any version!)
of rpm that is immune to the problem instead of repeating
the fact that 4.0.5 is "end of life"? As I already indicated
in comment #2 above, version 4.1.1-1.8x has a *worse* failure
mode. At least version 4.0.5 never corrupts the rpm database
(presumably because it *does* have locking - only there's a
bug in the locking logic that sometimes causes deadlock).

If there aren't any new versions of rpm that address this
issue, then the issue isn't solved and the bug should
remain open!

Comment 5 Jeff Johnson 2004-02-04 18:55:13 UTC

Sure. rpm-4.0.2 had an exclusive lock on /var/lib/rpm/Packages
using fcntl about 3 years ago. That will kill the 2nd backgrounded
Perl script upgrade almost instantly with "can't open rpmdb".

The issue is known, yelling louder ain't gonna change anything.

Again, WONTFIX, because rpm-4.0.5 is end of life.

Comment 6 Peter Wolfenden 2004-02-04 19:23:03 UTC

Before trying rpm-4.0.5, we were using rpm-4.0.4x, which
also uses an exclusive lock on the rpm database. And those
"can't open rpmdb" messages are normal - in fact, I mention
them in the "Expected results" of the original bug description.
But rpm-4.0.4x also breaks after being subjected to my Perl
scripts for about 20 minutes - one of the 'rpmq' processes
hangs (but I don't have a backtrace for this).

To summarize:

RPM Version    Behavior of my 4 Perl scripts
-----------    -----------------------------
4.0.4x         rpm hangs after ~20-30 minutes
4.0.5          rpm hangs after ~20-30 minutes (deadlock)
4.1.1          rpm database becomes corrupted

I'll try version 4.0.2 and see if this improves the situation.
If so, I'll add a note here and thank you for your time. If
not, I'll reopen this bug again.

Comment 7 Peter Wolfenden 2004-02-05 19:34:57 UTC

Correction - it turns out that none of my tests were run
with version 4.1.1 of rpm. The tests that I had thought were
run with version 4.1.1-1.8x were in fact run with version
4.0.5. Sorry for the confusion.

Unfortunately, my organization is committed to a patched
version of glibc 2.2, which precludes us from using rpm
versions 4.1.1 and 4.2.*

The general picture looks like this:

rpm version  glibc version  behavior of my Perl scripts
-----------  -------------  ---------------------------
4.0.4        2.2            rpm hangs after ~20-30 minutes
                            (deadlock)
4.0.5        2.2            rpm database becomes corrupted
                            (and rpm sometimes hangs)
4.1.1        2.3            ?
4.2.1        2.3            ?

I would of course be interested to know the results of
running my Perl scripts with later versions of rpm, but
for the purposes of serving my organization I'm stuck
with the task of coming up with a fix for one of the 4.0
series versions.