75647 – rpm command hangs

Summary: rpm command hangs

Status:	CLOSED RAWHIDE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	rpm
Version:	8.0
Hardware:	i386
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Jeff Johnson

Reported:	2002-10-10 16:49 UTC by Joel Barrios
Modified:	2008-05-01 15:38 UTC
CC List:	11 users
Last Closed:	2002-10-19 21:42:07 UTC

Description Joel Barrios 2002-10-10 16:49:57 UTC

Description of Problem:

The rpm command HANGS very nastily and frequently while doing ANYTHING when used
as root (this means anything from a simple rpm -q to rpm -e). The process cannot
be stopped with CTRL-C or kill -15, only with a nasty kill -9.

rpm command behavior goes back to normal only after rebooting or doing a rm -f
/var/lib/rpm/__db*.

Version-Release number of selected component (if applicable):
rpm-4.1-1.06

How Reproducible:
It is not reproducible at will. It happens at random, especially after
several hours of using the system.

Additional Information:
	
This bug seems to be related to
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75553

Comment 1 Martin Garton 2002-10-11 10:46:38 UTC

I have the exact same problem.  I have tried a few different rpm commands and
each time it hangs. In case it's useful, pstack output when stuck looks like this:

0x420d3b2e: __GI_select + 0x1e (0, 0, f4240, 4054b818, 401453b0, 40169b88)
0x401453f1: __os_yield_rpmdb + 0x41 (0, f4240, 400d4a4b, 40169b88, 40587f30, 405727d8)
0x400d4aaf: __db_tas_mutex_lock_rpmdb + 0x6f (805b760, 405727d8, 4054b858, 0, bfffe494, 40169b88) + 50
0x40130bc7: __lock_get_internal + 0x757 (805ba60, 37, 0, 805beac, 1, 0) + 10
0x401303ff: __lock_get_rpmdb + 0xef (805b760, 37, 0, 805beac, 1, 805bee0) + 20
0x400fc988: __db_cursor_rpmdb + 0x158 (805bb00, 0, bfffe558, 0, 400ac000, 40169cd4) + 30
0x4011dcaa: __ham_open_rpmdb + 0x7a (805bb00, bfffe628, 0, 10, 0, bfffe598) + 20
0x400f9b0f: __db_dbopen_rpmdb + 0xef (805bb00, bfffe628, 10, 1a4, 0, 4000a190) + 40
0x400f986e: __db_open_rpmdb + 0x32e (805bb00, bfffe628, 0, 2, 10, 1a4) + 80
0x400d3f5d: db3open + 0x40d (8053270, 0, bfffe6e8, ffffffff, 400ac000, 40169ef4) + 30
0x400cab70: dbiOpen + 0xf0 (8053270, 0, 0, 1a4, 0, bfffe728) + 20
0x400cc0ae: openDatabase + 0x11e (8052f60, 0, 3, 80530dc, 0, 1a4) + 10
0x400cc296: rpmdbOpen + 0x56 (8052f60, 80530dc, 0, 1a4, bfffe7c8, 4000a190) + 10
0x4008bee9: rpmtsOpenDB + 0x59 (80530a0, 0, 804ad2c, 804a2e0, bffff898, 4000a190) + 10
0x4008c063: rpmtsInitIterator + 0x33 (80530a0, 2, bffffaa1, 0, 0, 0) + 10b0
0x400796c3: rpmQueryVerify + 0x443 (804a2e0, 80530a0, bffffaa1, 804ab90, bffff918, 0) + 10
0x4007a413: rpmcliQuery + 0xa3 (80530a0, 804a2e0, 804ad28, 804a020, 0, 4212a2d0) + 30
0x080497ed: main + 0x37d (4, bffff964, bffff978, 400124b8, 4, 8049240)
0x420158d4: __libc_start_main + 0xa4 (8049470, 4, bffff964, 8048f0c, 8049c08, 4000a950) + 400006a8
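
Editorial note: the top frames of this trace (__db_tas_mutex_lock_rpmdb calling __os_yield_rpmdb, which sleeps in select) are consistent with a process retrying a test-and-set mutex and sleeping between attempts. If the value f4240 passed to __os_yield_rpmdb is a microsecond count (an assumption, since pstack does not label arguments), it is 1000000 in decimal, i.e. a one-second sleep per retry. The Python sketch below is only an illustration of that pattern, not Berkeley DB's code: it uses a hypothetical lock file in place of the shared-memory mutex, and the names acquire, release, retry_delay and timeout are invented for the example. It shows why a lock whose owner died without releasing it makes every later caller wait forever unless a timeout is added.

```python
import os
import time


def acquire(lock_path, retry_delay=0.01, timeout=None):
    """Try to take a test-and-set style lock, sleeping between attempts.

    Returns True once the lock is held, or False if `timeout` seconds pass.
    With timeout=None this loops forever when the previous owner died
    without releasing the lock, which matches the hang described above.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL is atomic: only one caller can create the file.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            if deadline is not None and time.monotonic() >= deadline:
                return False
            # Stands in for the select()-based yield seen in the trace.
            time.sleep(retry_delay)


def release(lock_path):
    """Release the lock by removing the lock file."""
    os.unlink(lock_path)
```

A caller that is killed with kill -9 between acquire and release never runs release, so the lock file (here) or the mutex state (in the real database environment) stays behind as a stale lock.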

Comment 2 Jeff Johnson 2002-10-11 15:04:01 UTC

This smells like a stale lock, but I can't tell.

What was going on before the command that hung?
For example, if "kill -9" was previously executed, this
will leave a stale lock.

Comment 3 Martin Garton 2002-10-11 15:29:05 UTC

It currently hangs on any command, so your stale lock idea seems likely.  The
first time it hung was at the end of an "rpm -Uvh" command on a previously
uninstalled rpm (kernel-source, in case it's important).
The progress meter reached 100% and then hung.  Unfortunately I didn't get a
stack trace then. If you tell me how to remove the stale lock I can try to
reproduce and see where it hangs the first time.

Comment 4 Jeff Johnson 2002-10-11 15:40:32 UTC

OK, it seems you killed rpm as it was closing the
database leaving a stale lock that caused subsequent commands
to fail.

Stale locks are removed by
	rm -f /var/lib/rpm/__db*
rpm itself cannot do that automagically w/o
opening lock race windows. Well I might be able
to detect stale locks, but that algorithm is very
race prone indeed.
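
Editorial note: a minimal sketch of the manual cleanup described above, with a guard that refuses to delete the __db* files while a process named rpm appears to be running. This is a hedged illustration, not part of rpm: the function names rpm_is_running and remove_stale_db_files are hypothetical, the /proc scan is Linux-specific, and the check only matches the process name "rpm" (it would miss other programs linked against the rpm library). It also has exactly the race window the comment warns about, since an rpm process can start between the check and the removal.

```python
import glob
import os


def rpm_is_running(proc_root="/proc"):
    """Heuristic: True if another process was started as 'rpm' (Linux /proc only)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit() or int(entry) == os.getpid():
            continue
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                argv0 = f.read().split(b"\0")[0]
        except OSError:
            continue  # process exited while we were scanning
        if os.path.basename(argv0) == b"rpm":
            return True
    return False


def remove_stale_db_files(db_dir="/var/lib/rpm", is_busy=rpm_is_running):
    """Remove __db* files from db_dir unless an rpm process seems active.

    Returns the list of removed paths. Not race-free: a new rpm process
    can start after is_busy() returns and before the files are removed.
    """
    if is_busy():
        raise RuntimeError("an rpm process is still running; not removing files")
    removed = []
    for path in sorted(glob.glob(os.path.join(db_dir, "__db*"))):
        os.remove(path)
        removed.append(path)
    return removed
```

Run as root, remove_stale_db_files() with its defaults would have the same effect as the rm -f command above when no rpm process is found. Passing a different db_dir makes it possible to try the function on a scratch directory first.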

Comment 5 Colin Tinker 2002-10-13 16:44:00 UTC

I have seen this problem on 5 systems so far. As originally said, it happens
after prolonged usage or when the system is under heavy load; for example,
running seti at nice 19 causes it to happen fairly frequently.

Comment 6 Ben Herrick 2002-10-15 04:57:27 UTC

This actually happened to me once. I'm pretty sure it was caused by my nscd
instance crashing though. The only way to stop rpm was to 'kill -9' it, and then
the only way to get subsequent rpm runs to work was to reboot. During the
reboot, I noticed that nscd failed to shutdown.

I use nscd to query a pair of ldaps servers for all my user information. Since
upgrading to Redhat8 it seems to crash an awful lot. Are the people who are
affected by this problem running nscd? Perhaps even nscd with ldap(s)? I hope
this helps.

P.S. If my nscd problem continues, I will open a bugzilla ticket for it in the
next few days...

Comment 7 Daniel Tschan 2002-10-16 23:05:00 UTC

I can confirm this bug. rpm deadlocks at every other run, no matter what options
are used. After executing rm -f /var/lib/rpm/__db* I can use rpm once, then it
deadlocks again.

Comment 8 Joel Barrios 2002-10-17 01:16:59 UTC

I have discovered something interesting about this bug: it happens when
installing or managing packages created with rpm-4.0.x, especially large
ones. Since I have been managing only new packages created with rpm-4.1, the rpm
command has not hung in the last week. I haven't had problems with any RPM
created with version 4.1.

Comment 9 Daniel Tschan 2002-10-17 09:33:22 UTC

The problem disappeared after upgrading apt. I usually use apt to install
software. The old apt version was linked against rpm 4.0.4, the new one against
rpm 4.1. The version of the RPM file itself doesn't seem to matter in my case.

Comment 10 Rick Johnson 2002-10-19 00:10:40 UTC

I can confirm that this seems to happen on high load machines. I run 3 RH 8.0
servers, 2 running background distributed clients (FAH or Distributed.net).
The third machine is sitting idle. The two machines under load both hung and
required a force kill, then deletion of the /var/lib/rpm/__db.00? files,
followed by an rpm --rebuilddb. Subsequent (and --force'd) installations went
just fine. The RPM package was logwatch-4.1-3.noarch.rpm from www.logwatch.org
(unsure of the version of RPM used to generate it).

Comment 11 Steph Gosling 2002-10-19 21:42:01 UTC

Hi all.

Been seeing the same to the extent I hosed my database and decided to reinstall.
Thought it was happening to me only when installing packages until this happened:

here comes strace...:

[root@heroditus root]# strace rpm -e apmd
execve("/bin/rpm", ["rpm", "-e", "apmd"], [/* 22 vars */]) = 0
fcntl64(0, F_GETFD)                     = 0
fcntl64(1, F_GETFD)                     = 0
fcntl64(2, F_GETFD)                     = 0
uname({sys="Linux", node="heroditus", ...}) = 0
geteuid32()                             = 0
getuid32()                              = 0
getegid32()                             = 0
getgid32()                              = 0
getrlimit(0x3, 0xbffff4f8)              = 0
setrlimit(RLIMIT_STACK, {rlim_cur=2044*1024, rlim_max=RLIM_INFINITY}) = 0
getpid()                                = 1218
rt_sigaction(SIGRTMIN, {0x8141594, [], SA_RESTORER, 0x8161198}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x8140b20, [], SA_RESTORER, 0x8161198}, NULL, 8) = 0
rt_sigaction(SIGRT_2, {0x8141600, [], SA_RESTORER, 0x8161198}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [RTMIN], NULL, 8) = 0
_sysctl({{CTL_KERN, KERN_VERSION}, 2, 0xbffff500, 30, (nil), 0}) = 0

...
...
This carries on as normal until:
...
...

gettimeofday({1035062495, 950883}, NULL) = 0
gettimeofday({1035062495, 950941}, NULL) = 0
close(8)                                = 0
gettimeofday({1035062495, 951006}, NULL) = 0
gettimeofday({1035062495, 951047}, NULL) = 0
gettimeofday({1035062495, 951078}, NULL) = 0
munmap(0x4041a000, 8192)                = 0
dup(1)                                  = 8
gettimeofday({1035062495, 951211}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[], ~[KILL STOP], 8) = 0
rt_sigaction(SIGCHLD, {0x8142ac4, [], SA_RESTORER, 0x8161198}, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
fork()                                  = 1219
--- SIGCHLD (Child exited) ---
wait4(0, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG, NULL) = 1219
sigreturn()                             = ? (mask now [RTMIN])
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
pause(

It sits there for a minute or two, I get impatient and then from another xterm I
try 3 times to kill the rpm process, before kill -9'ing it:

)                                 = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) ---
sigreturn()                             = ? (mask now [RTMIN])
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
pause()                                 = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) ---
sigreturn()                             = ? (mask now [RTMIN])
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
pause()                                 = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) ---
sigreturn()                             = ? (mask now [RTMIN])
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
pause()                                 = ? ERESTARTNOHAND (To be restarted)
+++ killed by SIGKILL +++
[root@heroditus root]#



Note, when next trying to remove this package things all worked flawlessly. The
full strace is available on request.

I hope this is of some use.

Cheers,

Comment 12 Jeff Johnson 2002-10-25 19:09:55 UTC

Most of these seem to be stale lock problems,
fix by doing
	rm -f /var/lib/rpm/__db*

The last strace looks like a missing SIGCHLD, the
fix is in rpm-4.1-9 packages at
	ftp://people.redhat.com/jbj/test-4.1
Errata coming.

Meanwhile I'm gonna close this bug, there are too many
problems to respond effectively.
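
Editorial note: one plausible reading of the strace in Comment 11 is that the child exited, and its SIGCHLD was handled, before the parent reached pause(), so pause() then waited for a signal that had already been delivered. The Python sketch below illustrates the usual way to close that window: block SIGCHLD before fork() so it stays pending, then wait for it with sigwait(). It is an illustration of the general technique only, not the change made in rpm-4.1-9; the function name run_child_and_wait is hypothetical, and the code assumes a Unix system and a call from the main thread.

```python
import os
import signal


def run_child_and_wait(child_fn):
    """Fork, run child_fn in the child, and wait for it without losing SIGCHLD.

    Returns True if the child exited normally with status 0.
    """
    # Block SIGCHLD *before* forking so it stays pending instead of being
    # delivered in the gap between fork() and the wait.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            exit_code = 1
            try:
                child_fn()
                exit_code = 0
            finally:
                os._exit(exit_code)  # never return into the parent's code
        # sigwait() returns as soon as SIGCHLD is pending, even if the child
        # exited before this line ran; a bare signal.pause() at this point
        # could sleep forever, as in the strace.
        signal.sigwait({signal.SIGCHLD})
        _, status = os.waitpid(pid, 0)
        return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

In this small example a blocking os.waitpid(pid, 0) on its own would also be race-free; the sigwait() call is kept to show how waiting on the signal itself can be done safely.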
