Description of problem:
=======================
Version 4.2.1 of rpm seems to work fine on a Red Hat Enterprise Linux system
until I run two copies of script1.pl and two copies of script2.pl (see "Steps
to Reproduce" below).

Version-Release number of selected component (if applicable):
=============================================================
rpm version 4.2.1

How reproducible:
=================
Every time.

Steps to Reproduce:
===================
1. Replace SYSTEM_RPM_V1 and SYSTEM_RPM_V2 in script2.pl (see below) with any
   two versions of a system rpm that exist in your local filesystem (the
   contents of the package shouldn't matter).
2. Run the scripts defined below as follows:

   ./script1.pl >& one1.out &
   ./script1.pl >& one2.out &
   ./script2.pl >& two1.out &
   ./script2.pl >& two2.out &

script1.pl:
-----------
#!/usr/bin/perl

sub catch_zap {
    my $signame = shift;
    # Use print, not printf: $output may contain '%' characters from
    # package names, which printf would treat as format directives.
    print $output;
    die;
}
$SIG{INT} = \&catch_zap;

while (1) {
    $output = `rpm -qa`;
    $now = scalar localtime(time());
    $counter++;
    print "$counter ($now)\n";
}

script2.pl:
-----------
#!/usr/bin/perl

sub catch_zap {
    my $signame = shift;
    print "output1=[$output1]\n";
    print "output2=[$output2]\n";
    die;
}
$SIG{INT} = \&catch_zap;

while (1) {
    ($output1, $output2) = ('-', '-');
    $counter++;
    $now = scalar localtime(time());
    $output1 = `rpm -U --oldpackage SYSTEM_RPM_V1`;
    $output2 = `rpm -U SYSTEM_RPM_V2`;
    print "$counter ($now)\n";
}

Actual results:
===============
After just a few seconds of operation (and a few "package already installed"
messages from the script2.pl instances), the scripts all lock up on stuck
'rpm' commands. Each rpm process requires a 'kill -9' to make it stop.
The __db lock files in the /var/lib/rpm/ directory must be removed before any
further rpm commands will work, but even after 'rpm --rebuilddb' the rpm
database indicates multiple instances of the same package, and multiple
instances of the same package version, e.g.:

[root@us01-sllt20 rpmq]# rpm -qa | grep SYSTEM_RPM
SYSTEM_RPM-1.3.37-1
SYSTEM_RPM-1.3.37-1
SYSTEM_RPM-1.3.36-1
SYSTEM_RPM-1.3.36-1

Expected results:
=================
The scripts may produce error messages about contention for the rpm database,
but should never freeze for more than a few seconds, and should never manage
to corrupt the database or put two versions of the same package into the
database. It should be possible to run the scripts for a whole week (disk
space for output files permitting) without any rpm processes hanging or the
rpm database becoming corrupted.

Additional info:
================
Here's what the Red Hat Enterprise Linux system looks like:

[root@us01-sllt20 rpmq]# uname -a
Linux us01-sllt20 2.4.21-4.ELsmp #1 SMP Fri Oct 3 17:52:56 EDT 2003 i686 i686 i386 GNU/Linux
[root@us01-sllt20 rpmq]# cat /etc/redhat-release
Red Hat Enterprise Linux ES release 3 (Taroon)
[root@us01-sllt20 rpmq]# rpm --version
RPM version 4.2.1

I've tried this same experiment with several versions of rpm. All have
failed, but in different ways:

rpm version   glibc version   behavior of my Perl scripts
-----------   -------------   ---------------------------
4.0.4         2.2             rpm hangs after ~20-30 minutes (deadlock)
4.0.5         2.2             rpm database becomes corrupted (and rpm sometimes hangs)
4.1.1         2.3             ?
4.2.1         2.3             rpm hangs almost immediately (deadlock)

For details re 4.0.4 and 4.0.5, see Bug 114810. Bug 89728 and Bug 12443 look
similar, but that's based only on a superficial reading of their notes.
Deadlocks are deadlocks. You expected what?
To see what I expected, read "Expected Results". If rpm is supposed to be useful for managing system packages on real live production systems, then it had better function correctly in an environment full of human admins, software update agents, and random crontab events. Deadlocks are bad. Database corruption is bad. Neither one is acceptable in an "infrastructure" technology that is supposed to be used to manage a machine in production. The fact that rpm 4.2.1 exhibits *both* means that the software is *attempting* to protect the DB integrity (hence the deadlock) and failing (hence the corruption). That's pretty obviously a bug, right? Now, by closing this bug (opened against 4.2.1, the latest released version of rpm) as WONTFIX, you seem to be saying that rpm will *never* be safe for use in a production environment. Is that really what you mean to say? If so, I'm *really* glad I opened this bug!
In case any poor souls out there are having a similar problem and are looking
for helpful suggestions, here's a "hack" that can prevent deadlock and a
corrupted rpm database (at the cost of slowing things down quite a bit):

1) Write a "wrapper" program or script that uses the setlock program from
   DJB's daemontools suite, e.g.:

   #!/bin/bash
   /usr/local/bin/setlock /tmp/.rpm_lock /bin/rpm "$@"

2) Modify all your admin scripts, crontabs, and agents to call the "wrapper"
   instead of rpm.
I see the state of this bug has changed from REOPENED to NEEDINFO. What information is needed?
Info was needed on how to get access to a machine to reproduce a similar problem that is (unfortunately) piggybacked privately here. Apologies for the disturbance.
Fixed by using db-4.2.52 internal to rpm, and the fix is in a RHEL3 update.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2004-501.html
After rpm-4.2.3-13 installs:

rpmdb: Program version 4.2 doesn't match environment version
error: db4 error(22) from dbenv->open: Invalid argument
error: cannot open Packages index using db3 - Invalid argument (22)
error: cannot open Packages database in /var/lib/rpm

which is fixed by:

rm -f /var/lib/rpm/__db*
rpm --rebuilddb

The above was included in previous advisories; maybe it should be part of the
post-install? It's really annoying having to rebuild rpm databases on all our
updated servers.
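One way the cleanup could be automated in the package itself (a sketch only, under the assumption the maintainers would want it; this is not what the shipped package does) is a scriptlet in the rpm spec file that removes the stale Berkeley DB environment files left by the old db4 version:

```
%posttrans
# Hypothetical scriptlet: drop stale __db environment files created by the
# previous db4 version, then recreate them with the new library. '|| :'
# keeps the scriptlet from failing the transaction.
rm -f /var/lib/rpm/__db.*
rpm --rebuilddb >/dev/null 2>&1 || :
```

Rebuilding rpm's own database from inside an rpm transaction is exactly the sort of thing that can go wrong, which may be why the advisories documented the manual steps instead.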
Scratch my last comment re: having to rebuild rpm databases on all our updated servers. So far, all of the servers where the updated rpm package was installed via RHN automatic errata were affected, but one where I just ran 'up2date -u' appears not to be affected - the __db* files are rebuilt.
Katherine Lim: Thank you, Thank you, Thank you!!!! Thanks for putting comment 41 in here about that error you got after letting RHN update your rpm. I thought I was going crazy and was not looking forward to trying to fix the problem on one of my servers.
Hm - nobody has reopened this one yet... But I can still reproduce this
behaviour - even with the update suggested in the errata. Given my special
setup I'm probably wasting your time - but it would be nice to see it working
here too.

What I've got/done:

1) A so-called 'mother' host running Debian Sarge with a vanilla kernel
   2.4.29 patched with:
   http://www.openwall.com/linux/linux-2.4.29-ow1.tar.gz
   http://www.13thfloor.at/vserver/s_release/v1.2.10/patch-2.4.29-vs1.2.10.diff
   (both apply cleanly - and a nice setup of other RPM-based vservers on this
   very same 'mother' host doesn't show the behaviour the initial bug
   reporter described)
2) I've installed RHEL3-WS on a different machine, applied all upgrades, and
   made a tarball of the installation (booted from a boot CD and tar'ed
   *everything*).
3) I then extracted this tarball on the 'mother' vserver (see 1) and saw it
   running fine - everything started as it should and everything was
   perfectly OK...
4) I then started to build rpms - but upon installation of these RPMs the
   'bug' manifested itself as described (see the HEADER lines of):
   http://download.opengroupware.org/packages/rhel3/
5) I've reinstalled the whole server - with no success regarding the issue.
   (I even rebuilt/installed the source rpm of 'rpm' on this host.)

BUT:
====
I've hopefully got it fixed by using the RPMs for Red Hat 9(!) suggested in:
http://www.fedora.us/wiki/LegacyRPMUpgrade
Since I installed these RPMs
(http://download.fedora.us/patches/redhat/9/i386/RPMS.stable/ to be exact)
everything works as expected and I haven't seen it deadlock again.

There's at least someone else running into this issue:
http://archives.linux-vserver.org/200503/0006.html

Maybe we can get together to reproduce this in order to fix it for upcoming
RHEL releases. TIA!

cheers,
frank
"""a so called 'mother' host running a Debian Sarge with a vanilla kernel 2.4.29 patched with: openwall, vserver""" So this kernel doesn't contain any NPTL knowledge, which I'd assume is your problem. That might mean that running rpm with LD_ASSUME_KERNEL could provoke the same failures you're seeing on debian, but that's a bad idea anyway.