Bug 68056 - rpm randomly has trouble with removing packages
Summary: rpm randomly has trouble with removing packages
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: rpm
Version: 8.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Jeff Johnson
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2002-07-05 21:45 UTC by Nathan G. Grennan
Modified: 2007-04-18 16:43 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2002-11-16 13:39:49 UTC
Embargoed:


Attachments
rpm hanging in a select() loop. (355.98 KB, text/plain)
2002-07-06 04:21 UTC, Sam Varshavchik

Description Nathan G. Grennan 2002-07-05 21:45:45 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.5 (X11; Linux i686; U;) Gecko/20020625

Description of problem:
"rpm -e package" randomly hangs. Ctrl-C and Ctrl-Z have no effect. "ps ax" shows
the process in state "S". kill "pid of rpm" doesn't work; kill -9 "pid of rpm"
does. Running "rpm -e package" again after killing it hangs in the same way, and
querying the rpm database with "rpm -qa" after the failed "rpm -e package" is
killed also hangs. After a reboot, "rpm -e package" works.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. rpm -e package

	

Actual Results:  Process hangs

Expected Results:  Process to finish and return to the command prompt

Additional info:

I have seen this at least 3-4 times over the last few days.

Comment 1 Nathan G. Grennan 2002-07-05 21:47:21 UTC
You can ctrl-c out of rpm commands executed after the first hung rpm process is
killed.

Comment 2 Sam Varshavchik 2002-07-06 04:19:22 UTC
I can reproduce this bug reliably.  rpm starts spinning in an infinite select()
loop with an empty file descriptor set.


Comment 3 Sam Varshavchik 2002-07-06 04:21:04 UTC
Created attachment 63924
rpm hanging in a select() loop.

Comment 4 Jeff Johnson 2002-07-06 15:44:09 UTC
Version of rpm, please:
	rpm -q rpm

Try doing
	rm -rf /var/lib/rpm/__db*

Comment 5 Sam Varshavchik 2002-07-06 16:50:00 UTC
rpm-4.1-0.34

Removing the __db files allows 'rpm -e' to finish.



Comment 6 Jeff Johnson 2002-07-06 17:00:35 UTC
That's the workaround until there's a useful
pthread_mutexattr_setpshared() (ideal) or I get
a chance to factor db open permissions onto a
setgid helper.

Comment 7 Chris Ricker 2002-07-12 05:36:59 UTC
FWIW, I'm seeing what appears to be the same bug sometimes with rpm -Uvh. 
There, it also loops over a null select, and killing it and removing the __db*
files and trying again has so far worked around it.  This is also with rpm-4.1-0.34

Comment 8 Sam Varshavchik 2002-07-12 05:48:10 UTC
I think there's another bug somewhere that leaves __db* files behind, which
trigger this bug on a subsequent rpm transaction, but I haven't been able to
coax rpm into leaving the garbage files behind.


Comment 9 Chris Ricker 2002-07-12 06:34:33 UTC
Hmm, if it's pre-existing __db* files causing the hangs, then they are being
created during install; I've been seeing rpm -Uvh hangs on systems where the rpm
-Uvh command is the first transaction I've run against the database after a
clean install

Comment 10 Jeff Johnson 2002-07-12 13:33:52 UTC
This is a known problem. The current implementation
is adequate, but not perfect.

The __db files are used to share locks.

A ^C will leave the file around, but the next execution
as root, or next reboot, removes the __db files.

Depending on the exact moment when ^C is hit,
there may or may not be a lock held that another
process may stumble upon.

The problem that cannot be solved without a
setgid helper is, if root does ^C, then non-root
cannot remove the file, and can hang on dead locks.

The setgid helper will be added, but not to rpm-4.1.

Comment 11 Nathan G. Grennan 2002-07-12 15:38:41 UTC
Ok, so when will rpm-4.2 be out?

Comment 12 Nathan G. Grennan 2002-08-04 15:22:47 UTC
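Below is a script to put in /usr/local/bin, after you put /usr/local/bin at the
beginning of your path, to work around this bug.

#!/bin/bash

# Clear leftover __db* files, then pass every argument through unchanged.
# "$@" is used because $10 and above expand as ${1}0, ${1}1, and so on.
# If this script is itself named rpm, call the real binary by its full path.
rm -f /var/lib/rpm/__db* >/dev/null 2>&1
rpm "$@"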


Comment 13 Jeff Johnson 2002-08-04 15:54:15 UTC
Hmmm, signals are now trapped when an rpmdb is
opened, so the stale lock avoidance hack is not only
no longer necessary but can also be harmful, as removing
	/var/lib/rpm/__db*
can/will open lock races on those files.
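
The race Jeff describes arises when the __db* files are removed while another process still has the database open. A minimal shell sketch of a more cautious cleanup follows. The function name clear_stale_db and its arguments are hypothetical, and checking only for a process called rpm is an assumption that would miss other users of the database such as up2date or apt.

```shell
#!/bin/bash
# Hypothetical helper, not part of rpm: remove leftover __db* files only
# when no process with the given name appears to be running.
clear_stale_db() {
    local dbdir=${1:-/var/lib/rpm}   # database directory
    local procname=${2:-rpm}         # process name to look for (assumption)

    # pgrep -x matches the process name exactly; exit status 0 means found.
    if pgrep -x "$procname" >/dev/null 2>&1; then
        echo "refusing: a $procname process is still running" >&2
        return 1
    fi

    # If nothing matches the glob, rm -f receives the literal pattern
    # and exits quietly.
    rm -f "$dbdir"/__db*
}
```

A call such as `clear_stale_db /var/lib/rpm rpm && rpm -e package` would then only proceed when the check passes. This narrows the window for a race but does not close it.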

Comment 14 Jeff Johnson 2002-08-04 15:55:39 UTC
Fixed since rpm-4.1-0.59.

Comment 15 Nathan G. Grennan 2002-08-06 20:16:29 UTC
Still see this problem with rpm-4.1-0.66

Comment 16 Nathan G. Grennan 2002-08-07 05:25:29 UTC
Have had the problem with rpm-4.1-0.69

Comment 17 Jeff Johnson 2002-08-07 12:59:51 UTC
I need a reproducible case, not repeated confirmations,
if you want a fix.

Comment 18 Peter Bowen 2002-08-17 15:29:47 UTC
I saw this bug today with 4.1-0.81 while trying to run 'rpm -e
kernel-2.4.18-10.98'.  Again, Ctrl-C didn't do anything, and attaching strace to
the pid shows that it is stuck in a select loop.  I don't think that there were
existing /var/lib/rpm/__db.0* files, but I can't be sure.

Leaving as NEEDINFO as I don't think this is enough info to find the bug.

Comment 19 Nathan G. Grennan 2002-08-20 17:53:05 UTC
I am going to close this since I can't seem to reproduce it on demand, I haven't
seen it in a week, and I might have been confusing it with rpm taking
excessively long because of high hard disk load after it was supposed to have
been fixed. If I do see it again I will reopen this bug.

Comment 20 Peter Bowen 2002-08-22 21:23:38 UTC
Still seeing this bug in latest rawhide (rpm-4.1-0.84).  

strace shows an infinite loop on:
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)

kill <pid> doesn't work, but kill -9 <pid> does.

I was running rpm -Fvh *.rpm for about 50 rpms, and it hung after the first one
(popt).

After killing the process and removing /var/lib/rpm/__db.0*, running rpm -Fvh
again works.  I don't think there were any __db.0* files there prior to the
first run, but I can't be sure.
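
Several reports in this thread rest on the same two observations: the process state shown by ps and the repeating line shown by strace. With no file descriptors to watch, that select() call can only time out, so it behaves as a one-second sleep. A small sketch for collecting the process state, assuming a Linux /proc filesystem; proc_state is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print the one-letter state of a pid
# (R running, S interruptible sleep, D uninterruptible sleep, T stopped,
# Z zombie) as reported in /proc/<pid>/status on Linux.
proc_state() {
    local pid=$1
    awk '/^State:/ { print $2 }' "/proc/$pid/status"
}

# Example (not run here): inspect the oldest process named rpm,
# then attach strace to it.
#   pid=$(pgrep -o -x rpm)
#   proc_state "$pid"
#   strace -p "$pid"
```

A state of S together with strace repeating the one-second select() timeout matches the hang described in this report.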

Comment 21 Jeff Johnson 2002-08-22 21:31:11 UTC
You will *always* need to do
	rm -f /var/lib/rpm/__db*
after "kill -9".

I still don't see any reproducible problem here ...

Comment 22 Peter Bowen 2002-08-22 21:36:37 UTC
OK, I found another piece of data:

redhat-config-packages was running on the machine in question _while_ I was
trying to do the upgrade.  So, hazarding a guess, rpm 4.1 doesn't recognize
librpm404's lock on the database.

Once I cleared the __db.0* files then it worked because 4.0.4's locking is
partially broken, as mentioned on rpm-list.

Comment 23 Peter Bowen 2002-08-22 21:37:03 UTC
reopening due to midair collision

Comment 24 Jeff Johnson 2002-08-27 22:50:03 UTC
No, rpm-4.1 sets the lock to keep rpm-4.0.4
based apps happy.

This is getting purty far off topic, so I'm
gonna (again) close this bug. Feel free to
reopen Yet Another Bug.

Comment 25 Thomas Vander Stichele 2002-09-29 13:06:08 UTC
I am using rpm 4.1 from 8.0 and I experience the same bug.
If I never had to abort rpm while it's running, then I guess the response
to this bug report would be acceptable.  However, since there's a bug (that
might be similar) in rpm that causes it to lock up in the first place, I find it
strange that this bug is so easily glossed over, especially now with the new
release of 8.0.  This bug will likely affect a lot of users.

I just made a fresh chroot environment, installed lots of base packages in, then
installed a second set of base packages, and it locked up after completing
e2fsprogs (there is up to now no apparent reason for what package triggers the
problem, I've had it on kernel-source and glibc-common as well).

It is currently locked up.  Attaching to it with strace shows it's in pause, and
a backtrace with gdb shows this:

Attaching to program: /mnt/music/B/bitches/root/bin/rpm, process 730
0x08184847 in __libc_pause ()
(gdb) bt
#0  0x08184847 in __libc_pause ()
#1  0x0814267f in pause ()
#2  0x0805fccc in psmWait ()
#3  0x08060236 in runScript ()
#4  0x08060838 in runInstScript ()
#5  0x08062a83 in rpmpsmStage ()
#6  0x0806240b in rpmpsmStage ()
#7  0x080628d5 in rpmpsmStage ()
#8  0x0807d085 in rpmtsRun ()
#9  0x0806dbf1 in rpmInstall ()
#10 0x08048e4d in main ()
#11 0x0815ad62 in __libc_start_main ()

It's been like that for five minutes, and I'm now going to go to the store to
give it at least 15 more minutes, just to prove that it is really spinning idly
in some sort of race condition, and it's not me being trigger-happy with the
kill command.

This bug has been encountered by a few users according to this bug report, and I
fourth this.  Just because a bug doesn't happen for you doesn't mean it's not
valid, and I find it hard to believe that you're unable to experience it.  I
would suggest trying a few fresh installs in a chroot manually to see it for
yourself.  I can't imagine this not being a very critical bug for Red Hat.

Comment 26 Jordan Russell 2002-10-02 00:38:44 UTC
I have the exact same strace & gdb results as thomas.ac.be.

In my working with Red Hat 8.0 + RPM 4.1 today for the first time, 
approximately half of all 'rpm -e' and 'rpm -U' commands I've executed have 
hung (in all I've seen 30+ hangs), requiring a 'kill -9' of rpm each time, 
followed by the obligatory 'rm -f /var/lib/rpm/__db*; rpm --rebuilddb'. The 
hangs appear to always occur in between packages; I haven't seen any hangs 
occur in the middle of erasing/upgrading a package.

Since I'm the second one to experience the 'rpm -e' hangs with RH 8.0, 
shouldn't this report be reopened?

Comment 27 Nathan G. Grennan 2002-10-02 05:12:56 UTC
I have experienced a lockup with rpm -e on Red Hat 8.0.

I may be completely off and misinterpreting what I saw, but it appeared that it
might be related to having the application running while trying to remove the
package. If I remember right I had some application running, say Mozilla, I
tried rpm -e and it was in S state. I used kill -9 and then closed the
application. Then I rpm -e again and it worked.

Comment 28 john arkim 2002-10-02 08:11:16 UTC
I am having the same problem as the others here on Red Hat 8.0: it just hangs,
and I have to delete those __* files to get rpm working again without having to
restart.

Comment 29 Warren S 2002-10-04 22:21:29 UTC
My first rpm command was rpm -e httpd and it didn't hang, but did tell me about
dependencies.  So I was going to --force a re-install of the rpm and it hung.  I
ran kill -9 on it two more times before coming here, then rm'd three __db.00*
files.  My rpm -ivh --force worked fine after this.  So even an attempt with -e
will leave the __db files behind!

Comment 30 Warren Togami 2002-10-05 07:47:46 UTC
Red Hat 8.0, my RPM hung right after it installed a single RPM package with -ivh.

I attached strace to the pid and it repeats the following message forever:
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)

Comment 31 André Johansen 2002-10-08 15:19:22 UTC
I've experienced the same or a similar problem several times after I've just 
upgraded from RH73 to RH80. 
 
[root@zeus /]# rpm -q rpm 
rpm-4.1-1.06 
[root@zeus /]# ls /var/lib/rpm/ 
Basenames     __db.003  Installtid   Provideversion  Sha1header 
Conflictname  Dirnames  Name         Pubkeys         Sigmd5 
__db.001      Filemd5s  Packages     Requirename     Triggername 
__db.002      Group     Providename  Requireversion 
[root@zeus /]# rm -f /var/lib/rpm/__db.00* 
[root@zeus /]# service lpd status 
lpd is stopped 
[root@zeus /]# rpm -e LPRng 
error: Failed dependencies: 
        LPRng >= 3.7.4-9 is needed by (installed) 
redhat-config-printer-0.4.24-1 
[root@zeus /]# rpm -e LPRng redhat-config-printer 
error: Failed dependencies: 
        redhat-config-printer = 0.4.24-1 is needed by (installed) 
redhat-config-printer-gui-0.4.24-1 
[root@zeus /]# rpm -e LPRng redhat-config-printer redhat-config-printer-gui 
warning: /etc/alchemist/namespace/printconf/local.adl saved as 
/etc/alchemist/namespace/printconf/local.adl.rpmsave 
[root@zeus /]# ls /var/lib/rpm/ 
Basenames     __db.003  Installtid   Provideversion  Sha1header 
Conflictname  Dirnames  Name         Pubkeys         Sigmd5 
__db.001      Filemd5s  Packages     Requirename     Triggername 
__db.002      Group     Providename  Requireversion 


Comment 32 Pete Zaitcev 2002-10-11 05:45:48 UTC
Unfortunately, rpm can get its database so inconsistent during the hang,
that "rm /var/lib/rpm/__db*" does not help. For example, I did
"rpm -e at", it hung, with strace -p showing
 select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)

Now, running "rpm -e at" produces:
error reading information on service atd: No such file or directory
error: %preun(at-3.1.8-31) scriptlet failed, exit status 1

If it really fails just because it races with a script subprocess,
one is left to wonder why it was too hard to use fork, exec, and wait4.
psmWait, indeed.
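
The alternative Pete alludes to is to block on the child's pid and collect its status directly, rather than sleeping until a signal arrives. The sketch below is a shell analogy, not rpm's code: `&` stands in for fork and exec, the `wait` builtin stands in for wait4, and run_child is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical sketch: run a command as a child process and wait for that
# specific pid. The shell keeps the exit status of the background child
# until it is waited for, so a child that finishes early is not missed.
run_child() {
    "$@" &
    local pid=$!
    wait "$pid"    # returns the child's exit status
}
```

For example, `run_child sh -c 'exit 3'; echo $?` prints 3 whether the child exits before or after the wait begins.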


Comment 33 Roy Stogner 2002-10-11 21:27:14 UTC
I'd like to suggest that everyone experiencing this bug keep their RPM database
backed up.  One of my "killall -9 rpm; rm -f /var/lib/rpm/__db*; rpm
--rebuilddb" cycles left me with a database that has forgotten about 3/4 of the
packages on my  system.
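
A backup along the lines Roy suggests can be as simple as archiving the database directory before any cleanup. This is a minimal sketch: backup_rpmdb, the archive name, and the default destination /root are all assumptions, and the copy is only meaningful if taken while no process is writing to the database.

```shell
#!/bin/bash
# Hypothetical helper: archive an rpm database directory into a
# timestamped tarball and print the archive path.
backup_rpmdb() {
    local dbdir=${1:-/var/lib/rpm}
    local outdir=${2:-/root}          # destination directory (assumption)
    local archive
    archive="$outdir/rpmdb-$(date +%Y%m%d-%H%M%S).tar.gz"

    # -C changes into the parent directory so the archive stores a
    # relative path rather than an absolute one.
    tar -czf "$archive" -C "$(dirname "$dbdir")" "$(basename "$dbdir")" || return 1
    echo "$archive"
}
```

Running `backup_rpmdb` before the "rm -f /var/lib/rpm/__db*; rpm --rebuilddb" cycle leaves a copy to restore from if the rebuild loses packages.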

Comment 34 Dale Lovelace 2002-11-14 19:50:14 UTC
Hi Jeff! If you use apt (http://psyche.freshrpms.net/rpm.html?id=243) to 
install an RPM the __db.00? files are left behind on every occasion. Perhaps 
looking at this can give you a clue.

Comment 35 Warren Togami 2002-11-16 12:24:58 UTC
I ran into a hang on a fresh system today running rpm-4.1-9 while removing the
kernel-2.4.18-14 package.  strace showed those timeouts forever.

This system had run apt-get a few times.  Are you saying that apt-get may be
triggering this rpm bug?  Maybe this is why Red Hat was unable to reproduce this
after 4.1-9 was released.

Comment 36 Ville Skyttä 2002-11-16 13:39:42 UTC
I've seen this several times too, with -U, -F and -e, and I *think* it has only happened with packages that have some %pre or %post (or the corresponding -un) scripts, possibly if'ing on the $1 argument to determine whether an upgrade or erase is going on.

$ rpm -q rpm
rpm-4.1-1.06


Comment 37 Jeff Johnson 2002-11-16 14:17:05 UTC
ville.skytta: You need/want the missed SIGCHLD fix
in rpm-4.1-9 at ftp://people.redhat.com/jbj/test-4.1.

There are far too many different problems here for me
to solve any problem efficiently, so I'm
gonna close this bug.

Feel free to open individual bugs and I'll try to
get you sorted out.

Comment 38 R.K.Aa. 2002-11-16 16:20:33 UTC
could bug 77857 be related? The two systems i saw this bug on were both systems
i had upgraded from 7.2/7.3. I wasn't aware i had to manually delete symlinks
and reinstall rpm if i upgrade to RH8.0. It isn't mentioned in the release-notes
either.

Comment 39 Jeff Johnson 2002-11-16 16:36:25 UTC
Symlinks are unlikely to be the problem, other
than that rpm may not upgrade at all unless
symlinks are removed.

Comment 40 Warren Togami 2002-11-16 22:25:23 UTC
If you are very rarely seeing lockup problems with rpm-4.1-9 and you use
apt-get, please read Bug 77988 and help us figure out this problem.


Comment 41 Tim Wunder 2003-01-15 16:50:51 UTC
WOW! This is just closed as WORKSFORME?
Well, FWIW, this DOESN'T work for me. It seems that whenever (or nearly) I use
the Package Manager, or up2date when su'd to root, rpm locks up on me. Removing
the __db.? files from /var/lib/rpm clears it up. I've since only run the Package
Manager and Up2Date while logged in as root and have managed to avoid the
problem. Last time it happened, I recall a python process was still hanging
around that *may* have actually been the problem. Since I need to run up2date
soon again, I'll do it su'd to root and will hopefully be able to reproduce the
problem.
Is there another bug somewhere that I should be looking at (a bugzilla search
didn't find anything, but I may not have provided the right search terms...)

Comment 42 Shawn Walker 2003-02-06 18:07:40 UTC
I am seeing this issue as well, I do not use apt-get. I had this issue just
today after removing old kernel packages one after another when I updated to
2.4.18-24. Usually only rebooting the system fixes it (for some reason), and
sometimes even then I have to do an rpm --rebuilddb before installs and removals
work again.

This is very frustrating.

Comment 43 Daniel J Blueman 2003-07-17 10:31:09 UTC
I have been experiencing this issue on Red Hat Linux 8.0 systems with
rpm-4.1-1.06. This occurs a significant amount of the time - like 30-50%!

rpm sits in pause(2) waiting for signals, but never receives any, as reported 
above.

Please advise if you'd like a fresh bug report opening. I think it will add to 
the confusion though, and could be considered a duplicate!

Comment 44 Bill Marrs 2003-09-01 18:42:04 UTC
One of my RH9 servers seems to have gotten hosed, I don't know how to get it on 
its feet again.  The symptoms seem similar to this bug, but perhaps worse (I 
also can't run up2date).

Redhat 9
% rpm -e at-3.1.8-33       
error reading information on service atd: No such file or directory
error: %preun(at-3.1.8-33) scriptlet failed, exit status 1
# rpm -q rpm
rpm-4.2-0.69
# rm -rf /var/lib/rpm/__db*
zsh: no matches found: /var/lib/rpm/__db*
# up2date -l                               
zsh: 20499 segmentation fault  up2date -l



Comment 45 Bill Marrs 2003-09-01 18:58:56 UTC
Ignore that last bit about up2date not working for me... I reinstalled Python 
and it's better now.  But, I'm still getting that error from rpm -e .

Comment 46 Brian "netdragon" Bober 2003-11-02 09:10:45 UTC
Is there a new bug report for this? It still isn't fixed and I'd say
enough people have commented to say it's reproducible. I just had this
problem again on Red Hat 9 trying to remove (rpm -e) proftpd. I can't
remove proftpd, and rpm always locks up. I tried the db thing, and
after a few tries, it worked. This is a persistent problem that has
existed in all distributions. I have apt-get along with rpm version 4.2

