Description of problem:
=======================
Version 4.2.1 of rpm seems to work fine on a Red Hat Enterprise Linux system
until I run two copies of script1.pl and two copies of script2.pl (see "Steps
to Reproduce" below).

Version-Release number of selected component (if applicable):
=============================================================
rpm version 4.2.1

How reproducible:
=================
Every time.

Steps to Reproduce:
===================
1. Replace SYSTEM_RPM_V1 and SYSTEM_RPM_V2 in script2.pl (see below) with any
   two versions of a system rpm that exist in your local filesystem (the
   contents of the package shouldn't matter).
2. Run the scripts defined below as follows:

   ./script1.pl >& one1.out &
   ./script1.pl >& one2.out &
   ./script2.pl >& two1.out &
   ./script2.pl >& two2.out &

script1.pl:
-----------
#!/usr/bin/perl

sub catch_zap {
    my $signame = shift;
    # Use print, not printf: $output may contain '%' characters from
    # package names, which printf would treat as format directives.
    print $output;
    die;
}
$SIG{INT} = \&catch_zap;

while (1) {
    $output = `rpm -qa`;
    $now = scalar localtime(time());
    $counter++;
    print "$counter ($now)\n";
}

script2.pl:
-----------
#!/usr/bin/perl

sub catch_zap {
    my $signame = shift;
    print "output1=[$output1]\n";
    print "output2=[$output2]\n";
    die;
}
$SIG{INT} = \&catch_zap;

while (1) {
    ($output1, $output2) = ('-', '-');
    $counter++;
    $now = scalar localtime(time());
    $output1 = `rpm -U --oldpackage SYSTEM_RPM_V1`;
    $output2 = `rpm -U SYSTEM_RPM_V2`;
    print "$counter ($now)\n";
}

Actual results:
===============
After just a few seconds of operation (and a few "package already installed"
messages from the script2.pl instances), the scripts all lock up on stuck
'rpm' commands. Each rpm process requires a 'kill -9' to make it stop.
The __db lock files in the /var/lib/rpm/ directory must be removed before any
further rpm commands will work, but even after 'rpm --rebuilddb' the rpm
database indicates multiple instances of the same package, and multiple
instances of the same package version, e.g.:

[root@us01-sllt20 rpmq]# rpm -qa | grep SYSTEM_RPM
SYSTEM_RPM-1.3.37-1
SYSTEM_RPM-1.3.37-1
SYSTEM_RPM-1.3.36-1
SYSTEM_RPM-1.3.36-1

Expected results:
=================
The scripts may produce error messages about contention for the rpm database,
but should never freeze for more than a few seconds, and should never manage
to corrupt the database or put two versions of the same package into the
database. It should be possible to run the scripts for a whole week (disk
space for output files permitting) without any rpm processes hanging or the
rpm database becoming corrupted.

Additional info:
================
Here's what the Red Hat Enterprise Linux system looks like:

[root@us01-sllt20 rpmq]# uname -a
Linux us01-sllt20 2.4.21-4.ELsmp #1 SMP Fri Oct 3 17:52:56 EDT 2003 i686 i686 i386 GNU/Linux
[root@us01-sllt20 rpmq]# cat /etc/redhat-release
Red Hat Enterprise Linux ES release 3 (Taroon)
[root@us01-sllt20 rpmq]# rpm --version
RPM version 4.2.1

I've tried this same experiment with several versions of rpm. All have
failed, but in different ways:

rpm version   glibc version   behavior of my Perl scripts
-----------   -------------   ---------------------------
4.0.4         2.2             rpm hangs after ~20-30 minutes (deadlock)
4.0.5         2.2             rpm database becomes corrupted (and rpm sometimes hangs)
4.1.1         2.3             ?
4.2.1         2.3             rpm hangs almost immediately (deadlock)

For details re 4.0.4 and 4.0.5, see Bug 114810. Bug 89728 and Bug 12443 look
similar, but that's based only on a superficial reading of their notes.
Deadlocks are deadlocks. You expected what?
To see what I expected, read "Expected Results". If rpm is supposed to be useful for managing system packages on real live production systems, then it had better function correctly in an environment full of human admins, software update agents, and random crontab events. Deadlocks are bad. Database corruption is bad. Neither one is acceptable in an "infrastructure" technology that is supposed to be used to manage a machine in production. The fact that rpm 4.2.1 exhibits *both* means that the software is *attempting* to protect the DB integrity (hence the deadlock) and failing (hence the corruption). That's pretty obviously a bug, right? Now, by closing this bug (opened against 4.2.1, the latest released version of rpm) as WONTFIX, you seem to be saying that rpm will *never* be safe for use in a production environment. Is that really what you mean to say? If so, I'm *really* glad I opened this bug!
In case any poor souls out there are having a similar problem and are looking
for helpful suggestions, here's a "hack" that can prevent deadlock and a
corrupted rpm database (at the cost of slowing things down quite a bit):

1) Write a "wrapper" program or script that uses the setlock program from
   DJB's daemontools suite, e.g.:

   #!/bin/bash
   /usr/local/bin/setlock /tmp/.rpm_lock /bin/rpm "$@"

2) Modify all your admin scripts, crontabs, and agents to call the "wrapper"
   instead of rpm.
I see the state of this bug has changed from REOPENED to NEEDINFO. What information is needed?
Info was needed on how to get access to a machine to reproduce a similar problem that is (unfortunately) piggybacked privately here. Apologies for the disturbance.
Fixed by using db-4.2.52 internal to rpm, and the fix is in a RHEL3 update.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-501.html
After rpm-4.2.3-13 installs:

rpmdb: Program version 4.2 doesn't match environment version
error: db4 error(22) from dbenv->open: Invalid argument
error: cannot open Packages index using db3 - Invalid argument (22)
error: cannot open Packages database in /var/lib/rpm

which is fixed by:

rm -f /var/lib/rpm/__db*
rpm --rebuilddb

The above was included in previous advisories; maybe it should be part of the
post-install? It's really annoying having to rebuild rpm databases on all our
updated servers.
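One way the cleanup could be automated in the package itself (a sketch only, under the assumption the maintainers would want it; this is not what the shipped package does) is a scriptlet in the rpm spec file that removes the stale Berkeley DB environment files left by the old db4 version:

```
%posttrans
# Hypothetical scriptlet: drop stale __db environment files created by the
# previous db4 version, then recreate them with the new library. '|| :'
# keeps the scriptlet from failing the transaction.
rm -f /var/lib/rpm/__db.*
rpm --rebuilddb >/dev/null 2>&1 || :
```

Rebuilding rpm's own database from inside an rpm transaction is exactly the sort of thing that can go wrong, which may be why the advisories documented the manual steps instead.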
Scratch my last comment re: having to rebuild rpm databases on all our updated servers. So far, all of the servers where the updated rpm package was installed via RHN automatic errata were affected, but one where I just ran 'up2date -u' appears not to be affected - the __db* files are rebuilt.
Katherine Lim: Thank you, Thank you, Thank you!!!! Thanks for putting comment 41 in here about that error you got after letting RHN update your rpm. I thought I was going crazy and was not looking forward to trying to fix the problem on one of my servers.
Hm - nobody has reopened this one yet... But I can still reproduce this
behaviour - even with the update suggested in the errata. Given my special
setup I'm probably wasting your time - but it would be nice to see it working
here too.

What I've got/done:

1) A so-called 'mother' host running Debian Sarge with a vanilla kernel
   2.4.29 patched with:
   http://www.openwall.com/linux/linux-2.4.29-ow1.tar.gz
   http://www.13thfloor.at/vserver/s_release/v1.2.10/patch-2.4.29-vs1.2.10.diff
   (both apply cleanly - and a nice setup of other RPM-based vservers on this
   very same 'mother' host doesn't show the behaviour the initial bug
   reporter described)
2) I've installed RHEL3-WS on a different machine, applied all upgrades, and
   made a tarball of the installation (booted from a boot CD and tar'ed
   *everything*).
3) I then extracted this tarball on the 'mother' vserver (see 1) and saw it
   running fine - everything started as it should and everything was
   perfectly OK...
4) I then started to build rpms - but upon installation of these RPMs the
   'bug' manifested itself as described (see the HEADER lines of):
   http://download.opengroupware.org/packages/rhel3/
5) I've reinstalled the whole server - with no success regarding the issue.
   (I even rebuilt/installed the source rpm of 'rpm' on this host.)

BUT:
====
I've hopefully got it fixed by using the RPMs for Red Hat 9(!) suggested in:
http://www.fedora.us/wiki/LegacyRPMUpgrade
Since I installed these RPMs
(http://download.fedora.us/patches/redhat/9/i386/RPMS.stable/ to be exact)
everything works as expected and I haven't seen it deadlock again.

There's at least someone else running into this issue:
http://archives.linux-vserver.org/200503/0006.html

Maybe we can get together to reproduce this in order to fix it for upcoming
RHEL releases. TIA!

cheers,
frank
"""a so called 'mother' host running a Debian Sarge with a vanilla kernel 2.4.29 patched with: openwall, vserver""" So this kernel doesn't contain any NPTL knowledge, which I'd assume is your problem. That might mean that running rpm with LD_ASSUME_KERNEL could provoke the same failures you're seeing on debian, but that's a bad idea anyway.