Bug 81645

Summary:	rpm-4.2 hangs: blocked on futex(2)
Product:	[Retired] Red Hat Raw Hide	Reporter:	Nathan G. Grennan <redhat-bugzilla>
Component:	rpm	Assignee:	Jeff Johnson <jbj>
Status:	CLOSED WORKSFORME	QA Contact:	Mike McLean <mikem>
Severity:	high	Docs Contact:
Priority:	medium
Version:	1.0	CC:	corporal_pisang, ebfekete, graham, ij2fdc402, kjetilho, kmaraas, mitr, noa-bugzilla-redhat
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2003-01-21 13:36:51 UTC	Type:	---

Description Nathan G. Grennan 2003-01-12 02:39:00 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 Galeon/1.2.7 (X11; Linux i686; U;) Gecko/20021216

Description of problem:
rpm hung while I was running "rpm -e kernel-2.4.20-2.9". "strace -p pid" showed
"futex(0x404236fc, FUTEX_WAIT, 0, NULL". I then ran "killall -9 rpm", and also
"killall -9 rpmq", since I had run rpm -qa after the hang. Next I ran
"rm /var/lib/__db*", followed by "rpm --rebuilddb" just in case, which gave "error:
db4 error(16) from dbenv->remove: Device or resource busy".

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. rpm -e package

    

Actual Results:  rpm hung

Expected Results:  rpm to finish and return to the command prompt

Additional info:

This is while running kernel-2.4.20-2.10 and using glibc-2.3.1-32.

Comment 1 Nathan G. Grennan 2003-01-13 08:02:55 UTC

Another hang, this time while upgrading packages with rpm -F.

kernel-2.4.20-2.12, glibc-2.3.1-34, rpm-4.2-0.54

futex(0x405be4bc, FUTEX_WAIT, 0, NULL

Comment 2 Jeff Johnson 2003-01-14 21:53:32 UTC

I'm gonna turn this into a category bug, as there's
sure to be lots and lots of weirdness during the
transition to NPTL. Fwiw, nptl works for rpm, quite
well thank you.

Please include glibc and kernel version, release, and arch
with any reports.

Comment 3 Kjartan Maraas 2003-01-16 18:13:30 UTC

I'm seeing this now with the latest glibc and kernel from rawhide. Compaq
laptop, intel based.

Comment 4 Kjetil T. Homme 2003-01-17 09:53:07 UTC

Another incident: mozilla was blocked on futex(). SIGTERM had no effect,
SIGKILL hung the machine, and not even Alt+SysRq+B had any effect.

Comment 5 Kjetil T. Homme 2003-01-17 09:55:24 UTC

Sorry, I forgot the version information:

glibc-2.3.1-21
kernel-smp-2.4.20-2.10
mozilla-1.2.1-4
XFree86-4.2.99.2-0.20021217.0 (nv driver)

Comment 6 Nathan G. Grennan 2003-01-18 07:44:27 UTC

This hanging problem in rawhide is worse than it was with rpm-4.1-1.06 in Red Hat 8.0.

Comment 7 Jeff Johnson 2003-01-20 16:05:09 UTC

AFAIK, the latest beta has functional futexes in kernel-2.4.20-2.22
and rocksolid pthreads in glibc-2.3.1-38, so there's little
reason to keep open reports from previous versions for problems
that have already been fixed.

Please upgrade to (at least) kernel-2.4.20-2.22, glibc-2.3.1-38 and
rpm-4.2-0.56 before reporting problems.

Comment 8 Nathan G. Grennan 2003-01-20 16:11:00 UTC

I will try what you suggest via rawhide. The new beta has glibc-2.3.1-36, and
kernel-2.4.20-2.21.

Comment 9 Nathan G. Grennan 2003-01-21 04:28:04 UTC

As expected, it is still there.

kernel-2.4.20-2.22
glibc-2.3.1-38
rpm-4.2-0.56

[root@cygnusx-1 ~]# ps ax | grep rpm
27453 pts/2    S      0:01 rpm -i xfig-3.2.3d-11.i386.rpm
[root@cygnusx-1 ~]# strace -p 27453
futex(0x405c30cc, FUTEX_WAIT, 0, NULL
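
For readers comparing the strace lines quoted throughout this report: the arguments are the futex address, the operation, the expected value, and the timeout, and a NULL timeout on FUTEX_WAIT means the call blocks until another thread or process wakes it. The following is only a sketch for picking those fields out of a captured strace line; the names parse_futex_line and waits_forever are made up for illustration and are not part of strace or rpm.

```python
import re
from typing import NamedTuple, Optional


class FutexCall(NamedTuple):
    address: int            # user-space address of the futex word
    op: str                 # e.g. "FUTEX_WAIT" or "FUTEX_WAKE"
    value: int              # third argument as printed by strace
    timeout: Optional[str]  # None when strace printed NULL (no timeout)


# Matches the start of a futex() line, finished or unfinished.
_FUTEX_RE = re.compile(
    r"futex\((0x[0-9a-fA-F]+),\s*(\w+),\s*(-?\d+),\s*(NULL|\{[^}]*\})"
)


def parse_futex_line(line: str) -> Optional[FutexCall]:
    """Extract the futex arguments from one strace line, or return None."""
    match = _FUTEX_RE.search(line)
    if match is None:
        return None
    address, op, value, timeout = match.groups()
    return FutexCall(
        address=int(address, 16),
        op=op,
        value=int(value),
        timeout=None if timeout == "NULL" else timeout,
    )


def waits_forever(call: FutexCall) -> bool:
    """True for a FUTEX_WAIT with no timeout, the pattern seen in the hangs."""
    return call.op == "FUTEX_WAIT" and call.timeout is None
```

Applied to the line above, this reports a FUTEX_WAIT with no timeout, which matches a process that will not return until something else issues a wake on the same address.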

Comment 10 Jeff Johnson 2003-01-21 13:16:11 UTC

FWIW, xfig-3.2.3d-11.i386.rpm installs for me.

Have you done
   rm -f /var/lib/rpm/__db*
after upgrade?

Is the problem reproducible, removing __db* files first?

Comment 11 Nathan G. Grennan 2003-01-21 13:31:40 UTC

I killed rpm, removed the __db files and ran it again. It didn't hang the second
time. It isn't reproducible every time. It is random, like the hangs with 4.1-1.06
from Red Hat 8.0 are.

Comment 12 Jeff Johnson 2003-01-21 13:36:51 UTC

Hmmm, the missed SIGCHLD in rpm-4.1-1.06 and this behavior are
very unlikely to be related.

I'm gonna close WORKSFORME because I don't see any way to reproduce.

Comment 13 Nathan G. Grennan 2003-01-21 13:47:10 UTC

Are you going to just ignore this bug? Will you reopen it if I can get others to
reproduce it and report?

As for suggestions on how to reproduce it: try running setiathome in the
background and do a big directory copy at the same time. In general, cause a high
load. This is what has always seemed to help induce hangs.

A theory I have never tested is that it is caused, or helped, by running the low
latency patch.
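
An intermittent hang like this one could be counted by running the same command many times under a time limit while the background load is active. This is a minimal sketch, not a tested reproducer: the function names, the attempt count and the timeout are all assumptions, and the command to run (for example an rpm query) is left to the caller. Note that subprocess.run kills the child when the timeout expires, so after a detected hang the __db* files may need to be removed before the next attempt, as described elsewhere in this report.

```python
import subprocess
import sys
from typing import Sequence


def run_detect_hang(cmd: Sequence[str], timeout_s: float) -> bool:
    """Run cmd once; return True if it did not finish within timeout_s."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return True
    return False


def count_hangs(cmd: Sequence[str], attempts: int, timeout_s: float) -> int:
    """Run cmd `attempts` times and count how many runs timed out."""
    return sum(run_detect_hang(cmd, timeout_s) for _ in range(attempts))


if __name__ == "__main__":
    # Harmless self-check using the Python interpreter itself; on a test
    # machine one might substitute a command such as ["rpm", "-qa"].
    print(count_hangs([sys.executable, "-c", "pass"], attempts=3, timeout_s=30.0))
```

A long timeout relative to the command's normal run time keeps slow runs under heavy load from being miscounted as hangs.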

Comment 14 Jeff Johnson 2003-01-21 13:56:48 UTC

I'm not ignoring this bug. I can't fix what I can't see, however.

If you can reproduce, I'll be happy to fix.

Comment 15 Daniel Resare 2003-02-15 11:34:12 UTC

I've experienced a few cases of this hang with 

kernel-2.4.20-2.48
glibc-2.3.1-45
rpm-4.2-0.66

However, after I rebuilt my database with 'rpm --rebuilddb' I haven't been able
to reproduce it.

Comment 16 Corporal Pisang 2003-02-19 03:42:59 UTC

glibc-2.3.1-46
rpm-4.2-0.68
db4-4.0.14-20.i386.rpm

rpm is not usable at all.

I've deleted __db*.

rpm --rebuilddb gives - 
rpmdb: unable to join the environment
error: db4 error(11) from dbenv->open: Resource temporarily unavailable
error: cannot open Packages index

rpm -qa gives -
rpmdb: unable to join the environment
error: db4 error(11) from dbenv->open: Resource temporarily unavailable
error: cannot open Packages index using db3 - Resource temporarily unavailable (11)
error: cannot open Packages database in /var/lib/rpm
no packages

Comment 17 Kjetil T. Homme 2003-03-04 00:59:46 UTC

the bug is still present, although it strikes only intermittently.

# rpm -ivh --replacepkgs XFree86-devel_4.2.99.901-20030213.0_i386.rpm 
warning: XFree86-devel_4.2.99.901-20030213.0_i386.rpm: V3 DSA signature: NOKEY,
key ID 897da07a
Preparing...                ########################################### [100%]
   1:XFree86-devel          ########################################### [100%]
[hang]
# strace -f -p 1625
futex(0x405bf1ec, FUTEX_WAIT, 0, NULL) [hang]
# kill -KILL 1625 (SIGTERM has no effect)

after this, the same rpm command will not do anything.
# rpm -ivh --replacepkgs XFree86-devel_4.2.99.901-20030213.0_i386.rpm 
[hang]
# strace -f -p 2363
futex(0x4061803c, FUTEX_WAIT, 0, NULL <unfinished ...>

the LD_ASSUME_KERNEL=2.2.5 sledgehammer works.  removing the __db.00X files also
gets rpm going again.

I have kept a copy of the __db.00X files so that I can reproduce this.
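
Keeping a copy of the __db.00X files while getting rpm going again could be done by moving them aside rather than deleting them. The sketch below assumes no rpm process is running at the time; the function name and the backup location are hypothetical, and the database directory is passed in by the caller (the comments in this report use /var/lib/rpm).

```python
import shutil
from pathlib import Path
from typing import List


def move_db_regions_aside(db_dir: str, backup_dir: str) -> List[str]:
    """Move every __db* file from db_dir into backup_dir.

    Returns the names of the files that were moved. Other files in
    db_dir (such as Packages) are left untouched.
    """
    source = Path(db_dir)
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)

    moved = []
    for region_file in sorted(source.glob("__db*")):
        if region_file.is_file():
            shutil.move(str(region_file), str(target / region_file.name))
            moved.append(region_file.name)
    return moved
```

Moving the files back into place later should restore the hanging state for anyone who wants to debug it, assuming the hang really is tied to those files as the reports here suggest.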

system is stock Phoebe 5:
glibc-2.3.1-46
kernel-smp-2.4.20-2.48
rpm-4.2-0.66

Comment 18 Need Real Name 2003-03-15 14:45:03 UTC

Comment 17 says it all... 

I get exactly the same with:
glibc-2.3.1-38
kernel-2.4.20-2.54
rpm-4.2-0.56

If democracy means anything, I vote to reopen this bug as
HONESTTHEREREALLYISABUG.

Comment 19 Kjetil T. Homme 2003-03-15 16:13:39 UTC

my offer of an rpmdb exhibiting the problem still stands, but I'm wary of
uploading more than 40 MiB of data to Bugzilla if there isn't a demand.

perhaps we need to open a new bug to make them listen?

Comment 20 Kjartan Maraas 2003-03-15 16:57:18 UTC

Tried the new packages from here?

ftp://people.redhat.com/jbj/test-4.2

Comment 21 Kjetil T. Homme 2003-03-15 18:00:41 UTC

okay, upgraded to 4.2.0-71.  now it segfaults on my working database :)

# rpm -vvqa
[...]
D:  read h#    2541 Header V3 DSA signature: NOKEY, key ID e42d547b
aalib-devel-1.4rc5-fr1
Segmentation fault

this was the last package in the database.  no debugging info in the RPM, so
running it under GDB yields nothing.  the tail of strace is

futex(0x4212e028, FUTEX_WAKE, 2147483647, NULL) = 0
futex(0x4212e028, FUTEX_WAKE, 2147483647, NULL) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
write(1, "aalib-devel-1.4rc5-fr1\n", 23) = 23
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++

plugging in my broken database, it hangs as before.

Comment 22 Ivan Makfinsky 2003-06-07 20:54:53 UTC

This bug seems to be related to a corrupt Packages file in /var/lib/rpm/. I
cannot uncorrupt the file and cannot install any rpms either. I am going to be
very upset if I have to re-install the system due to this.

I am also hanging on the message:

futex(0x405b3940, FUTEX_WAIT, 0, NULL

Comment 23 Eric Schwartz 2003-07-31 16:22:34 UTC

I see the same hang on futex with an [internally developed] program of ours that uses the db
(Sleepycat) library... I noticed rpm uses it as well. When the db library is built without pthreads,
specifically without HAVE_MUTEX_PTHREADS & HAVE_MUTEX_X86_GCC_ASSEMBLY,
everything works for our program. With those defined it *always* hangs on:
... 
time([1059434367])                      = 1059434367 
futex(0x81eb1f8, FUTEX_WAIT, -606348324, NULL 
*hang* 
 
so I think the problem is some interaction between the db library and the new thread stuff... 
 
versions we are using: 
RH9 - kernel-2.4.20-8 
gcc-3.2.2-5 
glibc-2.3.2-11.9 
db4-4.0.14-20 
 
this problem is not present on RH8 or previous...

Comment 24 Ihar Filipau 2003-08-04 13:37:42 UTC

[ I'm reopening this just to make a little bit more noise ]
Fully upgraded RH9 (rhn-applet-tui says 'Ignored. No updates available.')
It hung.
Any call to rpm hangs on the futex syscall.
Bugs:
 1. ^C/^Z doesn't work (Shame!!! And what am I supposed to do
       if this is the only terminal I have?)
 2. It hangs
 3. Cleaning files manually is pretty annoying.
Yes - after 'rm /var/lib/rpm/__db*' it works.
But still - to have this kind of bug in this kind of tool...

And yes - files are recreated after any run of rpm under root - even rpm -qa.

P.S. Actually this is the second time this bug has beaten me: the first time was on the
 first run of RH8.0 - rpm just hung. I spent more than half a day trying
to figure out what was going wrong. I didn't know this magic recipe 'rm
/../__db*', but a reboot helped then - I do not know why... :-(

Comment 25 Steve Bonds 2003-08-05 10:05:55 UTC

Just to add another data point, I've seen the same intermittent futex() hang on
RedHat 9 with the following:

kernel-smp-2.4.20-18.9
gcc-3.2.2-5 (not that it matters since I'm using the RedHat binaries)
glibc-2.3.2-27.9
db4-4.0.14-20
rpm-4.2-0.69

We can't make RedHat keep the bug open, but we can start recommending that our
friends and businesses stay off RedHat 9.  ;-)

Comment 26 Steve Bonds 2003-08-05 16:21:48 UTC

For all you watching this bug, take a look at bug # 101062.  It looks like the
newest version of RPM (4.2-1 from ftp://ftp.rpm.org/pub/rpm/dist/rpm-4.2.x)
fixes the stale lock problem that causes this hang.

I haven't thoroughly tested it, but it seems there is some hope.

  -- Steve
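
As a rough analogy for the stale lock problem mentioned above: a lock shared between processes stays locked if its holder is killed, so every later waiter blocks. The sketch below uses Python's multiprocessing lock as a stand-in; it is not rpm's or Berkeley DB's code, the function names are invented for illustration, and it assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import os
import signal
import time


def _hold_lock_forever(lock, locked_event):
    """Child process: take the lock, announce it, then never release it."""
    lock.acquire()
    locked_event.set()
    time.sleep(3600)


def demo_stale_lock(wait_s: float = 1.0) -> bool:
    """Kill a lock holder, then try to take the lock with a timeout.

    Returns the result of the acquire attempt: False means the lock was
    left locked by the dead holder, which is the stale-lock situation.
    """
    ctx = mp.get_context("fork")  # assumption: POSIX with fork available
    lock = ctx.Lock()
    locked = ctx.Event()

    holder = ctx.Process(target=_hold_lock_forever, args=(lock, locked))
    holder.start()
    if not locked.wait(10):
        holder.kill()
        holder.join()
        raise RuntimeError("holder never acquired the lock")

    # Simulate "killall -9": the holder dies without releasing the lock.
    os.kill(holder.pid, signal.SIGKILL)
    holder.join()

    # Without the timeout this call would block indefinitely.
    return lock.acquire(timeout=wait_s)
```

The timeout is only there so the demonstration terminates; a waiter with no timeout, like the FUTEX_WAIT calls with a NULL timeout quoted in this report, would simply never return.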

Comment 27 Graham Mainwaring 2004-09-22 19:29:52 UTC

I'm on a fully updated RH9 and just experienced this bug.

kernel: 2.4.20-28.9
glibc: 2.3.2-27.9.7
rpm: 4.2-0.69

So it isn't fixed by rpm 4.2. For whatever that information is worth 
at this point.

Comment 28 Jeff Johnson 2004-09-22 19:37:16 UTC

Yes, the problem is not fixed by rpm-4.2-0.69 as in RHL 9.

The problem *IS* fixed in rpm-4.2-1 available from ftp.rpm.org.

No errata is planned, RHL 9 is already end-of-life.