Bug 112811 - rpm jams, futex problem
Summary: rpm jams, futex problem
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: rpm
Version: rawhide
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Paul Nasrat
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2004-01-02 16:35 UTC by Toni Willberg
Modified: 2007-11-30 22:10 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Cloned to: 485780
Environment:
Last Closed: 2005-04-19 18:40:51 UTC
Type: ---
Embargoed:



Description Toni Willberg 2004-01-02 16:35:53 UTC
Description of problem:

strace:
 open("/var/lib/rpm/Packages", O_RDONLY|O_LARGEFILE) = 3
 fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
 fstat64(3, {st_mode=S_IFREG|0644, st_size=29184000, ...}) = 0
 futex(0xb7bed9c8, FUTEX_WAIT, 0, NULL


Version-Release number of selected component (if applicable):
 Not sure, rpm doesn't work, remember?
 All packages from Rawhide

How reproducible:
 This is the first time this has happened with this rpm version; now
it's 100% reproducible.

Something wrong in my rpmdb?

Steps to Reproduce:
1. rpm -q rpm

Comment 1 Toni Willberg 2004-01-02 19:50:32 UTC
Ok, after removing the famous __db* files it magically works again.

The versions of stuff I have:
 rpm-4.2.2-0.8
 glibc-2.3.3-3
 db4-4.2.52-1
 popt-1.8.2-0.8
 kernel-2.6.0-1.21

Of those, rpm, popt, glibc and the kernel were updated today; db4 last week.



Comment 2 Jeff Johnson 2004-01-03 07:02:01 UTC
Yup, rm -f /var/lib/rpm/__db* removes locks.

Comment 3 Toni Willberg 2004-01-03 08:53:18 UTC
Not a bug? What is it then?

Removing those db files manually every time it hangs is not a
correct solution to this problem.


Comment 4 Jeff Johnson 2004-01-03 15:45:46 UTC
Do you have a reproducer for the "every time" hang? If so, then
this is probably a bug against glibc or the kernel. If not, then
your problem is a usage problem. For example, the user is expected
to clean up the consequences of "kill -9", which live in the __db*
files, i.e. stale database locks will be left behind on "kill -9"
termination.

If you have a reproducer for the "every time" hang, please reopen
and I will forward it to the appropriate glibc/kernel people.

Otherwise, yes, notabug.
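
For illustration, a minimal sketch of how a "kill -9" produces a
stale lock, and of the recovery (the package name here is just an
arbitrary example):

    rpm -Uvh some-package.rpm &   # start a transaction; rpm opens the shared db environment
    kill -9 $!                    # SIGKILL allows no cleanup, so lock state remains in __db*
    rpm -q rpm                    # a later rpm invocation may now block in FUTEX_WAIT
    rm -f /var/lib/rpm/__db*      # recovery: discard the stale environment files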

Comment 5 Trond Eivind Glomsrød 2004-01-07 13:27:54 UTC
Here's one way to reproduce it on RHEL3 (reproduces similarly on other
versions) - run the following commands in parallel:

while true; do date; rpm -e kernel-source; date; rpm -i kernel-source-2.4.21-4.0.1.EL.i386.rpm; done

while true; do date; rpm -e tcpdump; date; rpm -i tcpdump-3.7.2-7.i386.rpm; done

After a few iterations both processes hang. Output from ps -aux:
root 1248  3.4  3.1 16060 11896 pts/0   S    14:38   0:04 rpm -e kernel-source
root 1264 96.9  0.9  7748 3600 pts/1    R    14:38   1:55 rpm -e tcpdump

strace -p 1248:
Process 1248 attached - interrupt to quit
futex(0xb70c846c, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1248 detached

strace -p 1264:
Process 1264 attached - interrupt to quit
futex(0xb6fece58, FUTEX_WAIT, 2, NULL)  = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb6fece58, FUTEX_WAIT, 2, NULL)  = -1 EAGAIN (Resource temporarily unavailable)
(repeats indefinitely)

It may of course be glibc, but I will let you reassign if you
believe that is correct.
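
For anyone re-running this reproducer unattended, here is a sketch
with a timeout guard so a hang is flagged rather than waited out (it
assumes the coreutils "timeout" command, which may not be available
on these releases; package names as above):

    while true; do
        date
        timeout 600 rpm -e tcpdump || echo "erase hung or failed"
        date
        timeout 600 rpm -i tcpdump-3.7.2-7.i386.rpm || echo "install hung or failed"
    done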

Comment 6 Jeff Johnson 2004-01-08 16:30:15 UTC
Um, that isn't exactly deterministically reproducible, nor is there
a known rpm application that needs that degree of concurrent access.
Sure, there may be deadlock problems at that degree of concurrency.

How many is "a few", in units of packages or wall-clock time?


Comment 7 Ragnar Kjørstad 2004-01-08 16:44:31 UTC
I'd say 2-5 iterations of the kernel-source package (tcpdump is
much smaller, so it gets reinstalled more times in the same period).

I reproduced this problem 5 times yesterday, the last two on
completely clean systems (RHL9 and RHEL3AS).

And regarding the need for this degree of concurrent access: the
reason we detected the problem in the first place is that we see it
happen all the time in a real-world application.

Comment 8 Jeff Johnson 2004-01-08 20:15:53 UTC
Hmmm, 2-5 iterations is at least 2 orders of magnitude
more often than what I was expecting (and what I saw when I last
checked).

What kernel? What glibc? What rpm?

Comment 9 Ragnar Kjørstad 2004-01-09 11:46:59 UTC
kernel-smp-2.4.20-28.9
glibc-2.3.2-27.9.7
rpm-4.2-0.69

Might it be relevant that this was tested on an SMP system?

Comment 10 Trond Eivind Glomsrød 2004-03-08 08:19:09 UTC
The status of the bug doesn't match the last information given.

Comment 11 John Ky 2004-05-15 14:33:42 UTC
Help, help.  I got myself into a situation where RPM doesn't work
anymore.  I even tried removing the "famous" DB files, without
success. The following transcript shows what I mean.  Thanks.


[root@localhost lib]# rpm -qa | grep http
httpd-manual-2.0.49-2
system-config-httpd-1.2.0-3
httpd-devel-2.0.49-2
libghttp-1.0.9-9.1.1
httpd-2.0.49-2
[root@localhost lib]# rpm -e httpd-devel-2.0.49-2
error: can't create transaction lock
[root@localhost lib]# slocate __db
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/union__dbswap.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiVec.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__dbi.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndex.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__db3.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexSet.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexItem.html
/var/lib/rpm/__db.003
/var/lib/rpm/__db.002
/var/lib/rpm/__db.001
[root@localhost lib]# rm -f /var/lib/rpm/__db.003 /var/lib/rpm/__db.002 /var/lib/rpm/__db.001
[root@localhost lib]# slocate __db
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/union__dbswap.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiVec.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__dbi.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndex.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__db3.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexSet.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexItem.html
/var/lib/rpm/__db.003
/var/lib/rpm/__db.002
/var/lib/rpm/__db.001
[root@localhost lib]# rpm -e httpd-devel
error: can't create transaction lock



Comment 12 Barry K. Nathan 2004-05-15 19:55:50 UTC
Try rebooting (instead of deleting the __db files), if you haven't
tried that already. If that still doesn't work, then try something
like this:

cp -a /var/lib/rpm /var/lib/rpm.backup
rpm --rebuilddb

(i.e. back up the rpm database in case something goes bad, then try
rebuilding it)
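
Spelled out as a single sequence (a sketch combining the steps above
with the stale-lock cleanup from the earlier comments; run as root):

    cp -a /var/lib/rpm /var/lib/rpm.backup   # back up the database first
    rm -f /var/lib/rpm/__db*                 # drop the shared environment / stale locks
    rpm --rebuilddb                          # rebuild the indexes from Packages
    rpm -q rpm                               # sanity check that queries work again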

Comment 13 John Ky 2004-05-16 00:58:20 UTC
Hey, thanks.  Everything works now.  It looks like, in a fit of
panic, I had also deleted "/var/lock/rpm", which only made things worse.

[root@localhost lib]# cp -a /var/lib/rpm /var/lib/rpm.backup
[root@localhost lib]# rpm --rebuilddb
error: can't create transaction lock
[root@localhost lib]# cd rpm
[root@localhost rpm]# ls
Basenames     Group       Providename     Requireversion
Conflictname  Installtid  Provideversion  Sha1header
Dirnames      Name        Pubkeys         Sigmd5
Filemd5s      Packages    Requirename     Triggername
[root@localhost rpm]# ls -al
total 61248
drwxr-xr-x   2 root root     4096 May 16 10:47 .
drwxr-xr-x  26 root root     4096 May 16 10:48 ..
-rw-r--r--   1 rpm  rpm  10452992 May 15 23:29 Basenames
-rw-r--r--   1 rpm  rpm     12288 May 15 23:22 Conflictname
-rw-r--r--   1 rpm  rpm   1716224 May 15 23:29 Dirnames
-rw-r--r--   1 rpm  rpm  10543104 May 15 23:29 Filemd5s
-rw-r--r--   1 rpm  rpm     16384 May 15 23:29 Group
-rw-r--r--   1 rpm  rpm     16384 May 15 23:29 Installtid
-rw-r--r--   1 rpm  rpm     45056 May 15 23:29 Name
-rw-r--r--   1 rpm  rpm  43945984 May 15 23:29 Packages
-rw-r--r--   1 rpm  rpm    335872 May 15 23:29 Providename
-rw-r--r--   1 rpm  rpm     94208 May 15 23:29 Provideversion
-rw-r--r--   1 rpm  rpm     12288 May 15 23:29 Pubkeys
-rw-r--r--   1 rpm  rpm    241664 May 15 23:29 Requirename
-rw-r--r--   1 rpm  rpm    167936 May 15 23:29 Requireversion
-rw-r--r--   1 rpm  rpm     90112 May 15 23:29 Sha1header
-rw-r--r--   1 rpm  rpm     45056 May 15 23:29 Sigmd5
-rw-r--r--   1 rpm  rpm     12288 May 15 23:29 Triggername
[root@localhost rpm]# slocate __db
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/union__dbswap.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiVec.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__dbi.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndex.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__db3.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexSet.html
/usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexItem.html
/var/lib/rpm/__db.003
/var/lib/rpm/__db.002
/var/lib/rpm/__db.001
[root@localhost rpm]# rpm -ql rpm
/bin/rpm
  *snip*
/usr/share/man/sk/man8/rpm.8.gz
/var/lib/rpm
/var/lock/rpm
/var/spool/repackage
[root@localhost rpm]# cd /var/lock
[root@localhost lock]# ls
lvm  subsys
[root@localhost lock]# mkdir rpm
[root@localhost lock]# ls -al
total 20
drwxrwxr-x   5 root lock 4096 May 16 10:50 .
drwxr-xr-x  22 root root 4096 May 11 23:47 ..
drwx------   2 root root 4096 May 11 23:52 lvm
drwxr-xr-x   2 root root 4096 May 16 10:50 rpm
drwxr-xr-x   2 root root 4096 May 16 10:44 subsys
[root@localhost lock]# chmod a+rx rpm
[root@localhost lock]# ls -l
total 12
drwx------  2 root root 4096 May 11 23:52 lvm
drwxr-xr-x  2 root root 4096 May 16 10:50 rpm
drwxr-xr-x  2 root root 4096 May 16 10:44 subsys
[root@localhost lock]# rpm --rebuilddb
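
As the transcript shows, the missing piece was the /var/lock/rpm
directory. Rather than guessing its mode, the packaged owner and
permissions can be read back out of the rpmdb itself; a sketch using
standard rpm query tags:

    rpm -q --qf '[%{FILEMODES:perms} %{FILEUSERNAME} %{FILEGROUPNAME} %{FILENAMES}\n]' rpm \
        | grep /var/lock/rpm     # shows the packaged owner, group and mode
    mkdir -p /var/lock/rpm       # then recreate the directory to match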



Comment 14 Phil Mayers 2004-11-15 18:43:33 UTC
We have also just experienced this problem on Red Hat Enterprise
Linux 3, although I have experienced it multiple times on Fedora
installs doing nothing more complicated than a "yum update"; so much
so that my personal system is structured to let me reinstall in
under 10 minutes. Its presence on RHEL is, however, a whole order of
magnitude more irritating.

The "offending" applications were rhnsd (running rhn_check) and
"up2date". Both RedHat applications. I don't think "notabug" is an
appropriate classification for this, since the RedHat update mechanism
(for which we pay) just destroyed a production server - since the
futex hang happened in the middle of an up2date, the system now thinks
it has e.g. two version of glibc installed, and god knows what else.
Given the role of the server, a reinstall to known setup is mandated
by our operational procedures, and by good system administration sense
in general.

Relevant versions:

kernel-smp-2.4.21-9.EL
kernel-smp-2.4.21-15.EL
kernel-2.4.21-9.EL
kernel-pcmcia-cs-3.1.31-13
kernel-2.4.21-15.EL
kernel-2.4.21-15.0.2.EL
kernel-2.4.21-9.0.1.EL
kernel-source-2.4.21-15.0.2.EL
kernel-smp-2.4.21-9.0.1.EL
kernel-utils-2.4-8.37.5
kernel-smp-2.4.21-15.0.2.EL

glibc (note these are broken - the actual files for the newer packages
are installed, but the RPM entries for the older ones are still there)

glibc-2.3.2-95.20
glibc-2.3.2-95.27
glibc-common-2.3.2-95.20
glibc-common-2.3.2-95.27
glibc-devel-2.3.2-95.20
glibc-headers-2.3.2-95.20
glibc-kernheaders-2.4-8.34
glibc-profile-2.3.2-95.20
glibc-utils-2.3.2-95.20

RPM:

redhat-rpm-config-8.0.28-2
rhnlib-1.3-12
rhnlib-1.8-6.p22
rpm-4.2.2-0.14
rpm-build-4.2.2-0.14
rpmdb-redhat-3-0.20040505
rpm-devel-4.2.2-0.14
rpm-python-4.2.2-0.14
up2date-4.2.33-1


It hung on a futex; further examination revealed that an "rhn_check"
was running. A "kill -9" of that allowed up2date to proceed to one
more package (chkconfig), which is now even more broken:

chkconfig-1.3.8-3

...but files for chkconfig-1.3.11-0.3 are installed (as determined by
MD5 checking).

There was no way to persuade up2date to complete, hence a "kill -9"
and "rm -f /var/lib/rpm/__*" were required.

Relevant diagnostics:

[nsg@mr3 nsg]$ uname -a
Linux mr3.cc.ic.ac.uk 2.4.21-15.0.2.ELsmp #1 SMP Wed Jun 16 22:52:07 EDT 2004 i686 i686 i386 GNU/Linux

[root@mr3 root]# gdb --pid 25681
(gdb) bt
#0  0xb75d51fb in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#1  0xb6098930 in ?? ()
#2  0xb6098918 in ?? ()
#3  0x0895fd68 in ?? ()
#4  0xb73ef204 in ?? () from /usr/lib/librpmdb-4.2.so
#5  0xb60bef48 in ?? ()
#6  0xb6098918 in ?? ()
#7  0xbfffb058 in ?? ()
#8  0xb73afee1 in __lock_get_rpmdb () from /usr/lib/librpmdb-4.2.so
Previous frame identical to this frame (corrupt stack?)



[root@mr3 root]# strace -d -v -p 25681
Process 25681 attached - interrupt to quit
 [wait(0x137f) = 25681]
pid 25681 stopped, [SIGSTOP]
 [wait(0x57f) = 25681]
pid 25681 stopped, [SIGTRAP]
futex(0xb6098940, FUTEX_WAIT, 2, NULL

[root@mr3 root]# ltrace -S -p 25681
SYS_futex(0xb6098940, 0, 2, 0, 2)                                    = -4

"kill -CONT 25681" did not work (I would not have expected it to, but
worth a go).

[root@mr3 root]# cat /proc/25681/maps
08048000-080ee000 r-xp 00000000 68:16 571375     /usr/bin/python
080ee000-0810b000 rw-p 000a5000 68:16 571375     /usr/bin/python
0810b000-0a94f000 rw-p 00000000 00:00 0
b503d000-b503e000 rw-p 0004c000 00:00 0
b5313000-b5627000 rw-p 00005000 00:00 0
b56b8000-b59cc000 rw-p 0017a000 00:00 0
b5a48000-b5a7a000 rw-p 00027000 00:00 0
b5b16000-b5b42000 rw-p 00000000 00:00 0
b5b43000-b5b8e000 rw-p 0002d000 00:00 0
b5b8f000-b5bc7000 rw-p 0004c000 00:00 0
b5bc8000-b5bf5000 rw-p 00039000 00:00 0
b5bf6000-b5d72000 rw-p 0002e000 00:00 0
b604f000-b60bf000 rw-s 00000000 68:12 65556      /var/lib/rpm/__db.003
b60bf000-b6201000 rw-s 00000000 68:12 65555      /var/lib/rpm/__db.002
b6201000-b6205000 rw-s 00000000 68:12 65554      /var/lib/rpm/__db.001
b62d1000-b635a000 rw-p 0002e000 00:00 0
b6434000-b64c6000 rw-p 00029000 00:00 0
<snip - lots more info, let me know if needed>

Note that the futex is within the /var/lib/rpm/__db.003 file.
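
For the record, a sketch of the address-to-file mapping step used
above (the PID and address are the ones from this transcript):

    grep __db /proc/25681/maps    # list the __db.* mappings with their address ranges
    # 0xb6098940 falls inside b604f000-b60bf000, i.e. /var/lib/rpm/__db.003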

Please explain, because I don't understand, why a locking scheme
whose locks disappear along with the owning process is not used.

I have an "lsof" of the process as well (lots of RPMs open, etc.),
which is also available if needed.

I cannot provide more diagnosis as the system is now being
reinstalled, but I would very much like to ascertain the cause of
this problem, and a method of preventing it, before we lose another
server.

One thing that's possibly of interest: the "up2date" was run from
within a "screen" session. I doubt it's relevant, but I recall
SIGPIPE and similar RPM bugs in the past.

(Grin - just use postgres guys, it can't possibly be worse than
Berkeley UnreliableDB)

Comment 15 Jeff Johnson 2004-12-08 04:59:46 UTC
Berkeley DB is quite reliable and high-performing, thank you,
and runs rings around Postgres in performance. OTOH,
diagnosing problems is perhaps harder, particularly
with the complexity of NPTL.

There are several issues here, let me try to take them
one-by-one.

First of all, one needs to distinguish stale locks from active
locks; "hung" alone is not a sufficient diagnosis.

Berkeley DB concurrent locks can be displayed by doing
    cd /var/lib/rpm
    /usr/lib/rpm/rpmdb_stat -CA

When the database is quiescent (i.e. no other processes are
accessing it), no locks should be present. So run that command to
identify stale locks: locks that persist for a long time are
"stale".

Stale locks occur under exceptional conditions, including
   a) use of kill -9
   b) a segfault in rpm (or another application using the rpmdb)
   c) machine reboots
etc. If you have a reproducer that generates a stale lock from an
(arguably) non-exceptional event, then you will have my instant
attention.

The fix for "stale" locks is
    rm -f /var/lib/rpm/__db*
which can be done at any time (in practice there is a possibility of
a small lock race, but that is certainly an acceptable risk if
things are clearly "hung").
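
Putting those two together, a sketch of a detect-then-clear pass
(fuser is from psmisc and is only used here to confirm that no live
process still has the database open):

    cd /var/lib/rpm
    /usr/lib/rpm/rpmdb_stat -CA   # a quiescent system should list no locks
    fuser Packages                # verify nothing is actively using the rpmdb
    rm -f /var/lib/rpm/__db*      # then discard the stale environment files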

Stale locks are persistent because the name (i.e. the futex
address) is saved in the __db* files to be shared concurrently by
processes (I believe). Certainly the most desirable locking scheme
is one that evaporates when the running process exits; that was the
intent in choosing (as recommended by Sleepycat) POSIX mutexes based
on futexes. If locks are actually persisting, well, reality has to
be lived with: shared POSIX mutexes plus rm -f /var/lib/rpm/__db*
are about as onerous as, say, ipcs.

The other locks that may be present are active, and you may well be
seeing an interaction between rhnsd, rhnapplet, up2date, and perhaps
rpm itself that might be tuned better. Usually, however, the cause
of "hung" problems is stale locks. If that is incorrect, then please
control for stale locks and report a reliable reproducer.

Next are the unpleasant side effects of "hung", like multiply (and
perhaps partially) installed packages. For starters, there is a
watchdog deadlock detector that can be run to break deadlocks after
a period of time. Not pretty, yes, but perhaps easier than cleaning
up the aftermath again and again. I will dig out details if
absolutely necessary.
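
Presumably this refers to Berkeley DB's db_deadlock utility; a
sketch, assuming rpm ships its copy as /usr/lib/rpm/rpmdb_deadlock
next to the rpmdb_stat used above:

    /usr/lib/rpm/rpmdb_deadlock -h /var/lib/rpm -t 5 &   # rescan every 5 seconds, aborting one locker to break any deadlock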

I have also seen problems with rpm + Berkeley DB on ppc SMP.
Judging from the sophistication and the concern, your machine sounds
like an SMP box. The symptom that I saw was segfaults within minutes
of running, say, 2 rpm -qa loops on a 4-way ppc SMP machine. The
cure was to change to db-4.2.52 within rpm, which has been built for
the next RHEL update, but I'd be happy to make the packages
available if necessary.

Judging by current bug reports, I know of no outstanding issues like
SIGPIPE with the current RHEL3 rpm. The 4-way ppc issue was the last
known one (which I suspect was caused by the lack of
patch.4.1.25.2), and db-4.2.52 was the fix.

I can't think of anything that screen might add to the puzzle.

I can describe how to fix multiple rpmdb entries more easily than
reinstalling, if you wish, but perhaps that is information overload.

So, in summary,

1) Are you seeing "stale" locks? If so, nuke them with
    rm -f /var/lib/rpm/__db*
(but certainly try to identify why they are happening).

2) If active locks are causing the hang, what processes are
involved?

3) If other cleanup is necessary, I can either assist with
instructions on what to do, or even attempt to remove the duplicate
entries myself (often easier than explaining). Because of the nature
of the code paths within rpm, partially installed packages are very
rare; usually it's just duplicate entries in the rpm database, and
perhaps a few files left on the file system.

Try examining
    rpm -qa --last
output to assess what needs fixing.
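
A quick way to spot the duplicate entries mentioned above, as a
sketch using standard rpm query formatting:

    rpm -qa --qf '%{NAME}\n' | sort | uniq -d   # package names recorded more than once in the rpmdb
    rpm -qa --last | head                       # newest installs first; duplicates show up as pairs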

And a request, please: it's a little easier to handle one problem
report per person; bug pile-ups, particularly rpmdb "hung" problems,
are difficult to sort out in a crowd. So please open separate bug
reports.

Comment 16 Jeremy Katz 2005-04-19 18:40:51 UTC
Closing due to inactivity.  If this issue still occurs with current releases,
please reopen and set the release in which you've encountered the problem.

Comment 17 Trond Eivind Glomsrød 2005-04-20 08:44:21 UTC
Doing rpm installs/uninstalls in parallel does seem to work on
RHEL4... at least, it's been going OK for 20 minutes.

