Description of problem: strace: open("/var/lib/rpm/Packages", O_RDONLY|O_LARGEFILE) = 3 fcntl64(3, F_SETFD, FD_CLOEXEC) = 0 fstat64(3, {st_mode=S_IFREG|0644, st_size=29184000, ...}) = 0 futex(0xb7bed9c8, FUTEX_WAIT, 0, NULL Version-Release number of selected component (if applicable): Not sure, rpm doesn't work, remember? All packages from Rawhide How reproducible: First time this happens with this rpm version, now it's 100% reproducable. Something wrong in my rpmdb? Steps to Reproduce: 1. rpm -q rpm
Ok, after removing the famous __db* -files it magically works again. The versions of stuff I have: rpm-4.2.2-0.8 glibc-2.3.3-3 db4-4.2.52-1 popt-1.8.2-0.8 kernel-2.6.0-1.21 Of which rpm, popt, glibc and kernel were updated today, db4 last week.
Yup, rm -f /var/lib/rpm/__db* removes locks.
Not a bug? What is it then? Removing those db files manually every time it hangs is not correct solution to this problem.
Do you have reproducer for "every time" hang? If so, then this is bug against glibc or kernel probably. If not, then your problem is a usage problem. For example, the user is expected to clean up consequences of "kill -9", that are in __db* files, i.e. stale database locks will be left on "kill -9" termination. If you have reproducer for "every time" hang, please reopen and I will forward to appropriate glibc/kernel people. Otherwise, yes, notabug.
Here's one way to reproduce it on RHEL3 (reproduces similarly on other versions) - run the following commands in parallel: while [ 1 ]; do date; rpm -e kernel-source; date; rpm -i kernel-source-2.4.21-4.0.1.EL.i386.rpm ; done while [ 1 ]; do date; rpm -e tcpdump; date; rpm -i tcpdump-3.7.2-7.i386.rpm ; done After a few iterations both processes hang. Output from ps -aux: root 1248 3.4 3.1 16060 11896 pts/0 S 14:38 0:04 rpm -e kernel-source root 1264 96.9 0.9 7748 3600 pts/1 R 14:38 1:55 rpm -e tcpdump strace -p 1248: Process 1248 attached - interrupt to quit futex(0xb70c846c, FUTEX_WAIT, 2, NULL <unfinished ...> Process 1248 detached strace -p 1264: Process 1264 attached - interrupt to quit futex(0xb6fece58, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) futex(0xb6fece58, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable) (repeats infinitively) It may of course be glibc, but will let you reassign if you believe that is correct.
Um, that ins't exactly deterministically reproducible, nor is there known rpm application that needs that degree of concurrent access. Sure there may be deadlock problems at that degree of concurrency. How many is "few" in units like number of pkgs or wall clock time?
I'd say from 2-5 iterations on the kernel-source package; (tcpdump is much smaller and is reinstalled more times in the same time). I reproduced this problem 5 times yesterday, the last two on completely clean systems (RHL9 and RHEL3AS). And regarding the need for this degree of conrurrent access; the reason we detected the problem in the first place is because we see it happen all the time in a real-world application.
Hmmm, 2-5 iterations is at least 2 orders of magnitude more often than what I was expecting (and what I saw when I last checked). What kernel? What glibc? What rpm?
kernel-smp-2.4.20-28.9 glibc-2.3.2-27.9.7 rpm-4.2-0.69 It may be relevant that this was tested on an SMP-system?
The status of the bug doesn't match last info given.
Help help. I got myself into a situation where RPM doesn't work anymore. I even tried to remove the "famous" DB files without success. The following text show what I mean. Thanks. [root@localhost lib]# rpm -qa | grep http httpd-manual-2.0.49-2 system-config-httpd-1.2.0-3 httpd-devel-2.0.49-2 libghttp-1.0.9-9.1.1 httpd-2.0.49-2 [root@localhost lib]# rpm -e httpd-devel-2.0.49-2 error: can't create transaction lock [root@localhost lib]# slocate __db /usr/share/doc/rpm-devel-4.3.1/apidocs/html/union__dbswap.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiVec.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__dbi.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndex.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__db3.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexSet.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexItem.html /var/lib/rpm/__db.003 /var/lib/rpm/__db.002 /var/lib/rpm/__db.001 [root@localhost lib]# rm -f /var/lib/rpm/__db.003 /var/lib/rpm/__db.002 /var/lib/rpm/__db.001 [root@localhost lib]# slocate __db /usr/share/doc/rpm-devel-4.3.1/apidocs/html/union__dbswap.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiVec.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__dbi.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndex.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__db3.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexSet.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexItem.html /var/lib/rpm/__db.003 /var/lib/rpm/__db.002 /var/lib/rpm/__db.001 [root@localhost lib]# rpm -e httpd-devel error: can't create transaction lock
Try rebooting (instead of deleting the __db files), if you haven't tried that already. If that still doesn't work, then try something like this: cp -a /var/lib/rpm /var/lib/rpm.backup rpm --rebuilddb (i.e. back up the rpm database in case something goes bad, then try rebuilding it)
Hey, thanks. Everything works now. It looks like in a fit of panic, I also deleted "/var/lock/rpm", which only made things worse. [root@localhost lib]# cp -a /var/lib/rpm /var/lib/rpm.backup [root@localhost lib]# rpm --rebuilddb error: can't create transaction lock [root@localhost lib]# cd rpm [root@localhost rpm]# ls Basenames Group Providename Requireversion Conflictname Installtid Provideversion Sha1header Dirnames Name Pubkeys Sigmd5 Filemd5s Packages Requirename Triggername [root@localhost rpm]# ls -al total 61248 drwxr-xr-x 2 root root 4096 May 16 10:47 . drwxr-xr-x 26 root root 4096 May 16 10:48 .. -rw-r--r-- 1 rpm rpm 10452992 May 15 23:29 Basenames -rw-r--r-- 1 rpm rpm 12288 May 15 23:22 Conflictname -rw-r--r-- 1 rpm rpm 1716224 May 15 23:29 Dirnames -rw-r--r-- 1 rpm rpm 10543104 May 15 23:29 Filemd5s -rw-r--r-- 1 rpm rpm 16384 May 15 23:29 Group -rw-r--r-- 1 rpm rpm 16384 May 15 23:29 Installtid -rw-r--r-- 1 rpm rpm 45056 May 15 23:29 Name -rw-r--r-- 1 rpm rpm 43945984 May 15 23:29 Packages -rw-r--r-- 1 rpm rpm 335872 May 15 23:29 Providename -rw-r--r-- 1 rpm rpm 94208 May 15 23:29 Provideversion -rw-r--r-- 1 rpm rpm 12288 May 15 23:29 Pubkeys -rw-r--r-- 1 rpm rpm 241664 May 15 23:29 Requirename -rw-r--r-- 1 rpm rpm 167936 May 15 23:29 Requireversion -rw-r--r-- 1 rpm rpm 90112 May 15 23:29 Sha1header -rw-r--r-- 1 rpm rpm 45056 May 15 23:29 Sigmd5 -rw-r--r-- 1 rpm rpm 12288 May 15 23:29 Triggername [root@localhost rpm]# slocate __db /usr/share/doc/rpm-devel-4.3.1/apidocs/html/union__dbswap.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiVec.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__dbi.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndex.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/group__db3.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexSet.html /usr/share/doc/rpm-devel-4.3.1/apidocs/html/struct__dbiIndexItem.html /var/lib/rpm/__db.003 /var/lib/rpm/__db.002 /var/lib/rpm/__db.001 [root@localhost rpm]# rpm -ql rpm /bin/rpm *snip* /usr/share/man/sk/man8/rpm.8.gz /var/lib/rpm /var/lock/rpm /var/spool/repackage [root@localhost rpm]# cd /var/lock [root@localhost lock]# ls lvm subsys [root@localhost lock]# mkdir rpm [root@localhost lock]# ls -al total 20 drwxrwxr-x 5 root lock 4096 May 16 10:50 . drwxr-xr-x 22 root root 4096 May 11 23:47 .. drwx------ 2 root root 4096 May 11 23:52 lvm drwxr-xr-x 2 root root 4096 May 16 10:50 rpm drwxr-xr-x 2 root root 4096 May 16 10:44 subsys [root@localhost lock]# chmod a+rx rpm [root@localhost lock]# ls -l total 12 drwx------ 2 root root 4096 May 11 23:52 lvm drwxr-xr-x 2 root root 4096 May 16 10:50 rpm drwxr-xr-x 2 root root 4096 May 16 10:44 subsys [root@localhost lock]# rpm --rebuilddb
We have also just experienced this problem on RedHat Enterprise 3; although I have experienced it multiple times on Fedora installs doing nothing more complicated than a "yum update", so much so that my personal system is structured to allow me to reinstall in under 10 minutes. The presence on RHEL is however a whole order of magnitude more irritating. The "offending" applications were rhnsd (running rhn_check) and "up2date". Both RedHat applications. I don't think "notabug" is an appropriate classification for this, since the RedHat update mechanism (for which we pay) just destroyed a production server - since the futex hang happened in the middle of an up2date, the system now thinks it has e.g. two version of glibc installed, and god knows what else. Given the role of the server, a reinstall to known setup is mandated by our operational procedures, and by good system administration sense in general. Relevant versions: kernel-smp-2.4.21-9.EL kernel-smp-2.4.21-15.EL kernel-2.4.21-9.EL kernel-pcmcia-cs-3.1.31-13 kernel-2.4.21-15.EL kernel-2.4.21-15.0.2.EL kernel-2.4.21-9.0.1.EL kernel-source-2.4.21-15.0.2.EL kernel-smp-2.4.21-9.0.1.EL kernel-utils-2.4-8.37.5 kernel-smp-2.4.21-15.0.2.EL glibc (note these are broken - the actual files for the newer packages are installed, but the RPM entries for the older ones are still there) glibc-2.3.2-95.20 glibc-2.3.2-95.27 glibc-common-2.3.2-95.20 glibc-common-2.3.2-95.27 glibc-devel-2.3.2-95.20 glibc-headers-2.3.2-95.20 glibc-kernheaders-2.4-8.34 glibc-profile-2.3.2-95.20 glibc-utils-2.3.2-95.20 RPM: redhat-rpm-config-8.0.28-2 rhnlib-1.3-12 rhnlib-1.8-6.p22 rpm-4.2.2-0.14 rpm-build-4.2.2-0.14 rpmdb-redhat-3-0.20040505 rpm-devel-4.2.2-0.14 rpm-python-4.2.2-0.14 up2date-4.2.33-1 It hung on a futex; further examination revealed that an "rhn_check" was running. "kill -9" of that allowed up2date to proceed to one more package (chkconfig) - that's even more broken: chkconfig-1.3.8-3 ...but files for chkconfig-1.3.11-0.3 are installed (as determined by MD5 checking). There was no way to persude the up2date to complete, hence a "kill -9" and "rm -f /var/lib/rpm/__*" was required. Relevant diagnostics: [nsg@mr3 nsg]$ uname -a Linux mr3.cc.ic.ac.uk 2.4.21-15.0.2.ELsmp #1 SMP Wed Jun 16 22:52:07 EDT 2004 i686 i686 i386 GNU/Linux [root@mr3 root]# gdb --pid 25681 (gdb) bt #0 0xb75d51fb in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0 #1 0xb6098930 in ?? () #2 0xb6098918 in ?? () #3 0x0895fd68 in ?? () #4 0xb73ef204 in ?? () from /usr/lib/librpmdb-4.2.so #5 0xb60bef48 in ?? () #6 0xb6098918 in ?? () #7 0xbfffb058 in ?? () #8 0xb73afee1 in __lock_get_rpmdb () from /usr/lib/librpmdb-4.2.so Previous frame identical to this frame (corrupt stack?) [root@mr3 root]# strace -d -v -p 25681 Process 25681 attached - interrupt to quit [wait(0x137f) = 25681] pid 25681 stopped, [SIGSTOP] [wait(0x57f) = 25681] pid 25681 stopped, [SIGTRAP] futex(0xb6098940, FUTEX_WAIT, 2, NULL [root@mr3 root]# ltrace -S -p 25681 SYS_futex(0xb6098940, 0, 2, 0, 2) = -4 "kill -CONT 25681" did not work (I would not have expected it to, but worth a go). [root@mr3 root]# cat /proc/25681/maps 08048000-080ee000 r-xp 00000000 68:16 571375 /usr/bin/python 080ee000-0810b000 rw-p 000a5000 68:16 571375 /usr/bin/python 0810b000-0a94f000 rw-p 00000000 00:00 0 b503d000-b503e000 rw-p 0004c000 00:00 0 b5313000-b5627000 rw-p 00005000 00:00 0 b56b8000-b59cc000 rw-p 0017a000 00:00 0 b5a48000-b5a7a000 rw-p 00027000 00:00 0 b5b16000-b5b42000 rw-p 00000000 00:00 0 b5b43000-b5b8e000 rw-p 0002d000 00:00 0 b5b8f000-b5bc7000 rw-p 0004c000 00:00 0 b5bc8000-b5bf5000 rw-p 00039000 00:00 0 b5bf6000-b5d72000 rw-p 0002e000 00:00 0 b604f000-b60bf000 rw-s 00000000 68:12 65556 /var/lib/rpm/__db.003 b60bf000-b6201000 rw-s 00000000 68:12 65555 /var/lib/rpm/__db.002 b6201000-b6205000 rw-s 00000000 68:12 65554 /var/lib/rpm/__db.001 b62d1000-b635a000 rw-p 0002e000 00:00 0 b6434000-b64c6000 rw-p 00029000 00:00 0 <snip - lots more info, let me know if needed> Note that the futex is within the /var/lib/rpm/__db.003 file. Please explain, because I don't understand, why a lock that disappears with the locking process is not used? I have an "lsof" of the process as well, lots of RPMs open etc. which is also available if needed. I cannot provide more diagnosis as the system is now in reinstall, but I would very much like to ascertain the cause, and method of prevention of this problem, before we lose another server. One thing that's possibly of interest - the "up2date" was run from within a "screen" session. I doubt it's relevant but I recall SIGPIPE and similar RPM bugs in the past. (Grin - just use postgres guys, it can't possibly be worse than Berkeley UnreliableDB)
Berkeley DB is quite reliable and high performing, thank you, and runs rings around postgress in performance. OTOH, diagnosing problems is perhaps harder, particularly with the complexity of NPTL. There are several issues here, let me try to take them one-by-one. First of all, one needs to distinguish stale locks from active locks, it's more than just "hung" that needs to be identified. Berkely DB concurrent locks can be displayed by doing cd /var/lib/rpm /usr/lib/rpm/rpmdb_stat -CA The quiescent behavior (i.e. no other processes accessing) should be no locks present. So run that command to identify stal locks, i.e. locks that persist a long time are "stale". Stale locks occur under exceptional conditions, including a) use of kill -9 b) rpm (or other application using rpmdb) segfaults c) machine reboots etc. If you have a reproducer that generates a stale lock from a (arguably) non-exceptional event then you will have my instant attention. The fix for "stale" locks is rm -f /var/lib/rpm/__db* which can be done (in practice, there's a possibility of a small lock race, certainly an acceptable risk if clearly "hung") any time. Stale locks are persistent because the name (i.e. the futex address) is saved in the __db* file to be shared by processes concurrently (I believe). Certainly the most desirable locking scheme is one that evaporates when the running process exits, that was the intent in choosing (as recommended by Sleepycat) posix mutexes based on futexes. If locks are actually persisting, well, reality has to be lived with, shared posix mutexes and rm -f /var/lib/rpm/__db* are as onerous as, say, ipcs. The other locks that may be present are active, and you may well be seeing an interaction between rhnsd, rhnapplet, up2date, and perhaps rpm itself that might be tuned better. Usually, however, the cause of "hung" problems is stale locks. If that is incorrect, then please control for stale locks and report as a reliable reproducer. Next are the unpleasant side effects of "hung", like multiply (and perhaps partially) installed packages. For starters, there is a watchdog deadlock detector that can be run that will break deadlocks after a period of time. Yes not pretty, but perhaps easier than cleaning up the aftermath again and again. I will dig out details if absolutely necessary. I have also seen problems with rpm+berkeleydb on ppc smp. Judging from the sophistication and the concern, your machine sounds like smp. The symptom that I saw was segfaults within minutes running, say, 2 rpm -qa loops. on a 4-way ppc smp. The cure was to change to db-4.2.52 within rpm, which has been built for the next RHEL update, but I'd be happt to make the packages available if necessary. I know of no current issues like SIGPIPE with the current RHEL3 rpm based on the number of concurrent bug reports. The issue with the 4-way ppc was the last known issue (which I suspect was lack of patch.4.1.25.2 as cause), and db-4.2.52 was the fix. I can't think of anything that screen might add to the puzzle. I can describe howto fix multiple rpmdb entries more easily than resinstalling if you wish, but perhaps that is info overload. So, in summary, 1) Are you seeing "stale" locks? Nuke with rm -f /var/lib/rpm/__db* if so (but certainly try to identify why they are happening.) 2) If active locks are causing the hang, what processes are involved? 3) If there is other cleanup necessary, I can either assist with instructions on what to do, or can even attempt to remove the duplicate entries (often easier than explaining). Because of the nature of the code paths within rpm, partially installed packages are very rare, usually it's just duplcate entries in the rpm database, and perhaps a few files left on the file system. Try examining rpm -qa --last output to assess what needs fixing. And a request please: It's a little bit easier to handle one problem report per person, bug pile ups, particularly rpmdb "hung" problems, are difficult to sort out in a crowd. So try to open up seperate bug reports please.
Closing due to inactivity. If this issue still occurs with current releases, please reopen and set the release in which you've encountered the problem.
Doing rpm installs/uninstalls in parallell does seem to work on RHEL4.. at least, it's been going OK for 20 minutes.