Description of Problem: The rpm command hangs frequently when run as root, whatever the operation (anything from a simple rpm -q to rpm -e). The process cannot be stopped with CTRL-C or kill -15, only with kill -9. rpm behaves normally again only after a reboot or after running rm -f /var/lib/rpm/__db*.
Version-Release number of selected component (if applicable): rpm-4.1-1.06
How Reproducible: It is not reproducible on demand. It happens at random, especially after several hours of using the system.
Steps to Reproduce: 1. 2. 3.
Actual Results:
Expected Results:
Additional Information: This bug seems to be related to https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75553
I have the exact same problem. I have tried a few different rpm commands and each one hangs. In case it's useful, pstack on the stuck process looks like this:
0x420d3b2e: __GI_select + 0x1e (0, 0, f4240, 4054b818, 401453b0, 40169b88)
0x401453f1: __os_yield_rpmdb + 0x41 (0, f4240, 400d4a4b, 40169b88, 40587f30, 405727d8)
0x400d4aaf: __db_tas_mutex_lock_rpmdb + 0x6f (805b760, 405727d8, 4054b858, 0, bfffe494, 40169b88) + 50
0x40130bc7: __lock_get_internal + 0x757 (805ba60, 37, 0, 805beac, 1, 0) + 10
0x401303ff: __lock_get_rpmdb + 0xef (805b760, 37, 0, 805beac, 1, 805bee0) + 20
0x400fc988: __db_cursor_rpmdb + 0x158 (805bb00, 0, bfffe558, 0, 400ac000, 40169cd4) + 30
0x4011dcaa: __ham_open_rpmdb + 0x7a (805bb00, bfffe628, 0, 10, 0, bfffe598) + 20
0x400f9b0f: __db_dbopen_rpmdb + 0xef (805bb00, bfffe628, 10, 1a4, 0, 4000a190) + 40
0x400f986e: __db_open_rpmdb + 0x32e (805bb00, bfffe628, 0, 2, 10, 1a4) + 80
0x400d3f5d: db3open + 0x40d (8053270, 0, bfffe6e8, ffffffff, 400ac000, 40169ef4) + 30
0x400cab70: dbiOpen + 0xf0 (8053270, 0, 0, 1a4, 0, bfffe728) + 20
0x400cc0ae: openDatabase + 0x11e (8052f60, 0, 3, 80530dc, 0, 1a4) + 10
0x400cc296: rpmdbOpen + 0x56 (8052f60, 80530dc, 0, 1a4, bfffe7c8, 4000a190) + 10
0x4008bee9: rpmtsOpenDB + 0x59 (80530a0, 0, 804ad2c, 804a2e0, bffff898, 4000a190) + 10
0x4008c063: rpmtsInitIterator + 0x33 (80530a0, 2, bffffaa1, 0, 0, 0) + 10b0
0x400796c3: rpmQueryVerify + 0x443 (804a2e0, 80530a0, bffffaa1, 804ab90, bffff918, 0) + 10
0x4007a413: rpmcliQuery + 0xa3 (80530a0, 804a2e0, 804ad28, 804a020, 0, 4212a2d0) + 30
0x080497ed: main + 0x37d (4, bffff964, bffff978, 400124b8, 4, 8049240)
0x420158d4: __libc_start_main + 0xa4 (8049470, 4, bffff964, 8048f0c, 8049c08, 4000a950) + 400006a8
This smells like a stale lock, but I can't tell for sure. What was going on prior to the command that hung? For example, if rpm was previously killed with "kill -9", that will leave a stale lock.
It currently hangs on any command, so your stale lock idea seems likely. The first time it hung was at the end of an "rpm -Uvh" command on a previously uninstalled rpm (kernel-source, in case it's important). The progress meter reached 100% and then hung. Unfortunately I didn't get a stack trace then. If you tell me how to remove the stale lock, I can try to reproduce and see where it hangs the first time.
OK, it seems you killed rpm as it was closing the database, leaving a stale lock that caused subsequent commands to hang. Stale locks are removed by
rm -f /var/lib/rpm/__db*
rpm itself cannot do that automagically without opening lock race windows. Well, I might be able to detect stale locks, but that algorithm is very race prone indeed.
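For anyone applying the workaround by hand, here is a minimal sketch of a slightly more careful version of that rm. It assumes pgrep is installed; the function name clear_stale_rpm_locks and its two optional arguments (database directory, process name to look for) are illustrative and not part of rpm. It only refuses to delete while an rpm process is visible, so it does not close the race window described above.

```shell
#!/bin/sh
# Sketch only: remove the Berkeley DB region files (__db*) when no rpm
# process appears to be running. Function name and arguments are
# hypothetical; the rm itself is the command quoted in this thread.

clear_stale_rpm_locks() {
    dbdir="${1:-/var/lib/rpm}"   # directory holding the __db* files
    procname="${2:-rpm}"         # process name whose presence blocks removal

    # If a matching process is alive, its locks are live, not stale.
    # pgrep -x matches the process name exactly. If pgrep is missing,
    # the check is skipped and the caller is trusted.
    if command -v pgrep >/dev/null 2>&1 && pgrep -x "$procname" >/dev/null 2>&1; then
        echo "$procname is still running; leaving $dbdir untouched" >&2
        return 1
    fi

    # Only the region files are removed; Packages and the indexes stay.
    rm -f "$dbdir"/__db*
    echo "removed lock files in $dbdir"
}
```

Called with no arguments it acts on /var/lib/rpm, so it needs root. There is still a gap between the pgrep check and the rm, which is exactly the kind of race that makes automatic cleanup hard.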
I have seen this problem on 5 systems so far. As originally said, it happens after prolonged usage, and also when the system is under heavy load; for example, running seti at nice 19 will cause it to happen fairly frequently.
This actually happened to me once. I'm pretty sure it was caused by my nscd instance crashing, though. The only way to stop rpm was to 'kill -9' it, and then the only way to get subsequent rpm runs to work was to reboot. During the reboot, I noticed that nscd failed to shut down. I use nscd to query a pair of ldaps servers for all my user information. Since upgrading to Red Hat 8 it seems to crash an awful lot. Are the people who are affected by this problem running nscd? Perhaps even nscd with ldap(s)? I hope this helps. P.S. If my nscd problem continues, I will open a bugzilla ticket for it in the next few days...
I can confirm this bug. rpm deadlocks at every other run, no matter what options are used. After executing rm -f /var/lib/rpm/__db* I can use rpm once, then it deadlocks again.
I have discovered something interesting about this bug: it happens when installing or managing packages created with rpm-4.0.x, especially large ones. Since I started managing only new packages created with rpm-4.1, the rpm command has not hung in the last week. I haven't had problems with any RPM created with version 4.1.
The problem disappeared after upgrading apt. I usually use apt to install software. The old apt version was linked against rpm 4.0.4, the new one against rpm 4.1. The version of the RPM file itself doesn't seem to matter in my case.
I can confirm that this seems to happen on high load machines. I run 3 RH 8.0 servers, 2 of them running background distributed clients (FAH or Distributed.net). The third machine is sitting idle. The two machines under load both hung and required a force kill, then deletion of the /var/lib/rpm/__db.00? files, followed by an rpm --rebuilddb. Subsequent (and --force'd) installations went just fine. The RPM package was logwatch-4.1-3.noarch.rpm from www.logwatch.org (unsure of the version of RPM used to generate it).
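The recovery sequence from this comment (after the hung rpm has been killed) might be scripted roughly as below. This is a sketch under assumptions: the function name recover_rpmdb is made up, and the second argument, which lets another command stand in for rpm, exists only so the steps can be dry-run. --dbpath and --rebuilddb are standard rpm options.

```shell
#!/bin/sh
# Sketch only: delete the __db.00? region files, then rebuild the rpm
# database. recover_rpmdb and its arguments are hypothetical names.

recover_rpmdb() {
    dbdir="${1:-/var/lib/rpm}"   # rpm database directory
    rpm_cmd="${2:-rpm}"          # command used for the rebuild step

    # Step 1: drop the region files left behind by the killed process.
    rm -f "$dbdir"/__db.00?

    # Step 2: rebuild the database indexes; stop if that fails.
    "$rpm_cmd" --dbpath "$dbdir" --rebuilddb || return 1
    echo "rebuilt database in $dbdir"
}
```

For example, recover_rpmdb /var/lib/rpm run as root performs both steps on the system database. Killing the hung rpm process is deliberately left out, since that is the step that creates the stale lock in the first place.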
Hi all. Been seeing the same, to the extent that I hosed my database and decided to reinstall. Thought it was happening to me only when installing packages until this happened. Here comes the strace:
[root@heroditus root]# strace rpm -e apmd
execve("/bin/rpm", ["rpm", "-e", "apmd"], [/* 22 vars */]) = 0
fcntl64(0, F_GETFD) = 0
fcntl64(1, F_GETFD) = 0
fcntl64(2, F_GETFD) = 0
uname({sys="Linux", node="heroditus", ...}) = 0
geteuid32() = 0
getuid32() = 0
getegid32() = 0
getgid32() = 0
getrlimit(0x3, 0xbffff4f8) = 0
setrlimit(RLIMIT_STACK, {rlim_cur=2044*1024, rlim_max=RLIM_INFINITY}) = 0
getpid() = 1218
rt_sigaction(SIGRTMIN, {0x8141594, [], SA_RESTORER, 0x8161198}, NULL, 8) = 0
rt_sigaction(SIGRT_1, {0x8140b20, [], SA_RESTORER, 0x8161198}, NULL, 8) = 0
rt_sigaction(SIGRT_2, {0x8141600, [], SA_RESTORER, 0x8161198}, NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [RTMIN], NULL, 8) = 0
_sysctl({{CTL_KERN, KERN_VERSION}, 2, 0xbffff500, 30, (nil), 0}) = 0
...
...
This carries on as normal until:
...
...
gettimeofday({1035062495, 950883}, NULL) = 0
gettimeofday({1035062495, 950941}, NULL) = 0
close(8) = 0
gettimeofday({1035062495, 951006}, NULL) = 0
gettimeofday({1035062495, 951047}, NULL) = 0
gettimeofday({1035062495, 951078}, NULL) = 0
munmap(0x4041a000, 8192) = 0
dup(1) = 8
gettimeofday({1035062495, 951211}, NULL) = 0
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[], ~[KILL STOP], 8) = 0
rt_sigaction(SIGCHLD, {0x8142ac4, [], SA_RESTORER, 0x8161198}, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, ~[KILL STOP], NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
fork() = 1219
--- SIGCHLD (Child exited) ---
wait4(0, [WIFEXITED(s) && WEXITSTATUS(s) == 0], WNOHANG, NULL) = 1219
sigreturn() = ? (mask now [RTMIN])
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
pause(
It sits there for a minute or two, I get impatient and then from another xterm I try 3 times to kill the rpm process, before kill -9'ing it:
) = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) ---
sigreturn() = ? (mask now [RTMIN])
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
pause() = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) ---
sigreturn() = ? (mask now [RTMIN])
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
pause() = ? ERESTARTNOHAND (To be restarted)
--- SIGTERM (Terminated) ---
sigreturn() = ? (mask now [RTMIN])
rt_sigprocmask(SIG_BLOCK, ~[], [RTMIN], 8) = 0
rt_sigprocmask(SIG_SETMASK, [RTMIN], NULL, 8) = 0
pause() = ? ERESTARTNOHAND (To be restarted)
+++ killed by SIGKILL +++
[root@heroditus root]#
Note, when next trying to remove this package things all worked flawlessly. The full strace is available on request. I hope this is of some use. Cheers,
Most of these seem to be stale lock problems; fix by doing
rm -f /var/lib/rpm/__db*
The last strace looks like a missing SIGCHLD; the fix is in the rpm-4.1-9 packages at ftp://people.redhat.com/jbj/test-4.1. Errata coming. Meanwhile I'm gonna close this bug, there are too many problems here to respond to effectively.