Bug 81645
Summary: | rpm-4.2 hangs: blocked on futex(2) | ||
---|---|---|---|
Product: | [Retired] Red Hat Raw Hide | Reporter: | Nathan G. Grennan <redhat-bugzilla> |
Component: | rpm | Assignee: | Jeff Johnson <jbj> |
Status: | CLOSED WORKSFORME | QA Contact: | Mike McLean <mikem> |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 1.0 | CC: | corporal_pisang, ebfekete, graham, ij2fdc402, kjetilho, kmaraas, mitr, noa-bugzilla-redhat |
Hardware: | i686 | ||
OS: | Linux | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2003-01-21 13:36:51 UTC | Type: | --- |
Description
Nathan G. Grennan
2003-01-12 02:39:00 UTC
Another hang while upgrading packages with rpm -F.
kernel-2.4.20-2.12, glibc-2.3.1-34, rpm-4.2-0.54

    futex(0x405be4bc, FUTEX_WAIT, 0, NULL

I'm gonna turn this into a category bug, as there's sure to be lots and lots of weirdness during the transition to NPTL. Fwiw, nptl works for rpm, quite well thank you. Please include glibc and kernel version, release, and arch with any reports.

I'm seeing this now with the latest glibc and kernel from rawhide. Compaq laptop, Intel based.

Another incident: mozilla was blocked on futex(). SIGTERM had no effect, SIGKILL hung the machine, and not even Alt+SysRq+B had any effect.

Sorry, forgot version information:
glibc-2.3.1-21
kernel-smp-2.4.20-2.10
mozilla-1.2.1-4
XFree86-4.2.99.2-0.20021217.0 (nv driver)

This hanging problem with rawhide is worse than in 4.1-1.06 in RedHat 8.0.

AFAIK, the latest beta has functional futexes in kernel-2.4.20-2.22 and rock-solid pthreads in glibc-2.3.1-38, so there's little reason to keep open reports from previous versions for problems that have already been fixed. Please upgrade to (at least) kernel-2.4.20-2.22, glibc-2.3.1-38 and rpm-4.2-0.56 before reporting problems.

I will try what you suggest via rawhide. The new beta has glibc-2.3.1-36 and kernel-2.4.20-2.21.

As expected, it is still there.
kernel-2.4.20-2.22
glibc-2.3.1-38
rpm-4.2-0.56

    [root@cygnusx-1 ~]# ps ax | grep rpm
    27453 pts/2 S 0:01 rpm -i xfig-3.2.3d-11.i386.rpm
    [root@cygnusx-1 ~]# strace -p 27453
    futex(0x405c30cc, FUTEX_WAIT, 0, NULL

FWIW, xfig-3.2.3d-11.i386.rpm installs for me. Have you done rm -f /var/lib/rpm/__db* after upgrade? Is the problem reproducible, removing __db* files first?

I killed rpm, removed the __db files and ran it again. It didn't hang the second time. It isn't reproducible every time. It is random, like the hangs with 4.1-1.06 from RedHat 8.0 are.

Hmmm, the missed SIGCHLD in rpm-4.1-1.06 and this behavior are very unlikely to be related. I'm gonna close WORKSFORME because I don't see any way to reproduce.
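The `rm -f /var/lib/rpm/__db*` suggestion above could be made a little less destructive by moving the region files aside instead of deleting them, so a copy survives for debugging. The following is only a sketch under assumptions: `stash_db_regions` and the stash directory name are hypothetical, nothing like this ships with rpm, and it assumes no rpm process is still running against the database.

```shell
# Hypothetical helper: move Berkeley DB region files (__db*) out of an
# rpm database directory rather than deleting them.
# $1 = database directory (assumed default: /var/lib/rpm)
# $2 = directory to stash the files in (assumed default: inside $1)
# Prints the number of files that were moved.
stash_db_regions() {
    dbdir=${1:-/var/lib/rpm}
    stash=${2:-"$dbdir/stale-regions.$$"}
    moved=0
    for f in "$dbdir"/__db*; do
        [ -e "$f" ] || continue        # glob matched nothing
        mkdir -p "$stash" || return 1  # create the stash only when needed
        mv -- "$f" "$stash"/ || return 1
        moved=$((moved + 1))
    done
    echo "$moved"
}
```

Run as root, `stash_db_regions` with no arguments would act on /var/lib/rpm. The Packages file and the other index files are left untouched because only names starting with `__db` are matched.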
Are you going to just ignore this bug? Will you reopen it if I can get others to reproduce it and report? As for suggestions on how to reproduce it: try running setiathome in the background. Do a big directory copy at the same time. In general, cause a high load. This is what has always seemed to help induce hangs. A theory I have never tested is that it is caused, or helped, by running the low-latency patch.

I'm not ignoring this bug. I can't fix what I can't see, however. If you can reproduce, I'll be happy to fix.

I've experienced a few cases of this hang with kernel-2.4.20-2.48, glibc-2.3.1-45, rpm-4.2-0.66. However, after I rebuilt my database with 'rpm --rebuilddb' I haven't been able to reproduce it.

glibc-2.3.1-46
rpm-4.2-0.68
db4-4.0.14-20.i386.rpm

rpm is not usable at all. I've deleted __db*.

rpm --rebuilddb gives -

    rpmdb: unable to join the environment
    error: db4 error(11) from dbenv->open: Resource temporarily unavailable
    error: cannot open Packages index

rpm -qa gives -

    rpmdb: unable to join the environment
    error: db4 error(11) from dbenv->open: Resource temporarily unavailable
    error: cannot open Packages index using db3 - Resource temporarily unavailable (11)
    error: cannot open Packages database in /var/lib/rpm
    no packages

The bug is still present, although it strikes only intermittently.

    # rpm -ivh --replacepkgs XFree86-devel_4.2.99.901-20030213.0_i386.rpm
    warning: XFree86-devel_4.2.99.901-20030213.0_i386.rpm: V3 DSA signature: NOKEY, key ID 897da07a
    Preparing... ########################################### [100%]
    1:XFree86-devel ########################################### [100%]
    [hang]
    # strace -f -p 1625
    futex(0x405bf1ec, FUTEX_WAIT, 0, NULL)
    [hang]
    # kill -KILL 1625
    (SIGTERM has no effect)

After this, the same rpm command will not do anything.

    # rpm -ivh --replacepkgs XFree86-devel_4.2.99.901-20030213.0_i386.rpm
    [hang]
    # strace -f -p 2363
    futex(0x4061803c, FUTEX_WAIT, 0, NULL <unfinished ...>

The LD_ASSUME_KERNEL=2.2.5 sledgehammer works.
Removing the __db.00X files also gets rpm going again. I have kept a copy of the __db.00X files so that I can reproduce this. System is stock Phoebe 5:
glibc-2.3.1-46
kernel-smp-2.4.20-2.48
rpm-4.2-0.66

Comment 17 says it all... I get exactly the same with:
glibc-2.3.1-38
kernel-2.4.20-2.54
rpm-4.2-0.56

If democracy means anything, I vote to reopen this bug as HONESTTHEREREALLYISABUG.

My offer of an rpmdb exhibiting the problem still stands, but I'm wary of uploading more than 40 MiB of data to Bugzilla if there isn't a demand.

Perhaps we need to open a new bug to make them listen?

Tried the new packages from here? ftp://people.redhat.com/jbj/test-4.2

Okay, upgraded to 4.2.0-71. Now it segfaults on my working database :)

    # rpm -vvqa
    [...]
    D: read h# 2541 Header V3 DSA signature: NOKEY, key ID e42d547b
    aalib-devel-1.4rc5-fr1
    Segmentation fault

This was the last package in the database. There is no debugging info in the RPM, so running it under GDB yields nothing. The tail of strace is

    futex(0x4212e028, FUTEX_WAKE, 2147483647, NULL) = 0
    futex(0x4212e028, FUTEX_WAKE, 2147483647, NULL) = 0
    rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    write(1, "aalib-devel-1.4rc5-fr1\n", 23) = 23
    --- SIGSEGV (Segmentation fault) @ 0 (0) ---
    +++ killed by SIGSEGV +++

Plugging in my broken database, it hangs as before.

This bug seems to be related to a corrupt Packages file in /var/lib/rpm/. I cannot uncorrupt the file and cannot install any rpms either. I am going to be very upset if I have to re-install the system due to this. I am also hanging on the message:

    futex(0x405b3940, FUTEX_WAIT, 0, NULL

I see the same hang on futex with an [internally developed] program we use that uses the db (Sleepycat) library... I noticed rpm uses it as well. When the db library is built without pthreads, specifically HAVE_MUTEX_PTHREADS & HAVE_MUTEX_X86_GCC_ASSEMBLY, everything works for our program. With those defined it *always* hangs on:

    ...
    time([1059434367]) = 1059434367
    futex(0x81eb1f8, FUTEX_WAIT, -606348324, NULL
    *hang*

So I think the problem is some interaction between the db library and the new thread stuff... Versions we are using: RH9 -
kernel-2.4.20-8
gcc-3.2.2-5
glibc-2.3.2-11.9
db4-4.0.14-20

This problem is not present on RH8 or previous...

[ I'm reopening this just to make a little bit more noise ]

Fully upgraded RH9 (rhn-applet-tui says 'Ignored. No updates available.'). It hung. Any call to rpm hangs on the futex syscall.

Bugs:
1. ^C/^Z doesn't work (Shame!!! And what am I supposed to do if this is the only terminal I have?)
2. It hangs
3. Cleaning files manually is pretty annoying.

Yes - after 'rm /var/lib/rpm/__db*' it works. But still - to have this kind of bug in this kind of tool... And yes - the files are recreated after any run of rpm as root - even rpm -qa.

P.S. Actually this is the second time this bug has beaten me: the first time was the first run of RH8.0 - rpm just hung. I spent more than half a day trying to figure out what was going wrong. I didn't know this magic recipe 'rm /../__db*', but a reboot helped then - I do not know why... :-(

Just to add another data point, I've seen the same intermittent futex() hang on RedHat 9 with the following:
kernel-smp-2.4.20-18.9
gcc-3.2.2-5 (not that it matters since I'm using the RedHat binaries)
glibc-2.3.2-27.9
db4-4.0.14-20
rpm-4.2-0.69

We can't make RedHat keep the bug open, but we can start recommending that our friends and businesses stay off RedHat 9. ;-)

For all you watching this bug, take a look at bug # 101062. It looks like the newest version of RPM (4.2-1 from ftp://ftp.rpm.org/pub/rpm/dist/rpm-4.2.x) fixes the stale lock problem that causes this hang. I haven't thoroughly tested it, but it seems there is some hope.
-- Steve

I'm on a fully updated RH9 and just experienced this bug.
kernel: 2.4.20-28.9
glibc: 2.3.2-27.9.7
rpm: 4.2-0.69
So it isn't fixed by rpm 4.2. For whatever that information is worth at this point.
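Several reports above note that ^C and SIGTERM have no effect on the hung process while `kill -KILL` does. On a current system, one way to avoid losing the only terminal would be to run rpm under a deadline that ends in SIGKILL. This is a rough sketch, assuming timeout(1) from GNU coreutils is installed (it postdates this report); `run_with_deadline` is a hypothetical name.

```shell
# Hypothetical wrapper: run a command, and send it SIGKILL if it has not
# finished after the given number of seconds.
# Assumes GNU coreutils timeout(1) is available on the system.
# $1 = deadline in seconds, remaining arguments = command to run
run_with_deadline() {
    secs=$1
    shift
    timeout -s KILL "$secs" "$@"
}
```

For example, `run_with_deadline 300 rpm -qa` would give the query five minutes. As reported earlier in this thread, an rpm that was killed this way may refuse to run again until the __db* files are removed, so this only frees the terminal; it does not fix the hang.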
Yes, the problem is not fixed by rpm-4.2-0.69 as in RHL 9. The problem *IS* fixed in rpm-4.2-1 available from ftp.rpm.org. No errata is planned, RHL 9 is already end-of-life.
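To check whether a given installation is at or past the rpm-4.2-1 build named above, a version-release string can be compared against `4.2-1`. The sketch below is an approximation only: it relies on GNU `sort -V` ordering, which is not rpm's own version comparison algorithm, and `version_at_least` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if version-release string $1 is at least $2.
# Uses GNU sort -V as a stand-in for rpm's real comparison (an assumption;
# the two can disagree on unusual version strings).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: the RHL 9 build reported above is older than the fixed build.
if version_at_least "4.2-0.69" "4.2-1"; then
    echo "at or past 4.2-1"
else
    echo "older than 4.2-1"
fi
```

On a live system the first argument could come from `rpm -q --qf '%{VERSION}-%{RELEASE}\n' rpm` rather than a literal string.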