Bug 73097
Summary: | rpm-4.1 hangs, can't be killed: READ THIS FIRST | ||
---|---|---|---|
Product: | [Retired] Red Hat Linux | Reporter: | Peter van Egdom <p.van.egdom> |
Component: | rpm | Assignee: | Jeff Johnson <jbj> |
Status: | CLOSED WORKSFORME | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 8.0 | CC: | aleksey, barryn, bradd+redhat, ddaniels, dhaselho, dhollis, dts, edwardam, herrold, jari.oksanen, jason, jelly+redhatweb, joe, johan.sunnerstig, kekelley, k_wayne, lsof, marius.andreiana, melevittfl, menscher, mherrick, nerijus, nicku, per.starback, redhat-bugzilla, redhat-bugzilla, redhat.com, redhat, rivenburgh, roystgnr, simon, sjdavis, stk, valankar, yaneti, yiango |
Hardware: | i386 | ||
OS: | Linux | ||
Fixed In Version: | Doc Type: | Bug Fix | |
Last Closed: | 2003-03-08 17:48:55 UTC | Type: | --- |
Description
Peter van Egdom
2002-08-30 18:15:58 UTC
Just a couple of minutes after I filed this bug report, I succeeded in reproducing it on one of my 'null' machines. I managed to reproduce it with the following command:

    [root@localhost tmp]# while true; do echo -n de-Installing && rpm -ev redhat-config-kickstart && echo -n Installing && rpm -ivh redhat-config-kickstart-2.3.3-1.noarch.rpm ; done

P.S. Let this command (or something similar) repeat itself for about 10 minutes.
- Press CTRL-C a couple of times.
- Then start this command again. Sooner or later RPM will not do anything anymore (CTRL-C does not work, kill -9 does).
- Note that commands like "rpm -qa" don't work anymore in this state (until the /var/lib/rpm/__db* files are deleted).

I can confirm this bug in a fresh install of null. Removing the /var/lib/rpm/__db* files and rebuilding the db corrects the problem.

    rpm-4.1-0.81
    librpm404-4.0.4-8x.26
    rpm-devel-4.1-0.81
    rpm404-python-4.0.4-8x.26
    redhat-rpm-config-7.3.93-1
    rpm-build-4.1-0.81

OK, I'm gonna use this bug as an umbrella "rpm hangs" bug to try to sort out the 3 or 4 underlying issues. If you find yourself reading this text, try to figger which category of complaint you have, and then go look for other bugs that cover the 3-4 categories of problems/complaints/solutions. If none of these categories apply, please open another bug, not append text here, or we're all gonna go bananas :-)

There are (at least) 2 main issues here that need to be distinguished:
1) hanging
2) responsiveness to signals

There are also several types of "hanging" (one of which is not hanging at all) that need to be identified:
0) package erasure on upgrade is bunched at the end of the transaction, i.e. at 100%, and there is no erasure progress bar displayed. This is not a hang at all; use top to find out whether rpm is still executing.
1) hanging from stale locks (i.e. kill -9). Detect by attaching strace to the process; if you see a 1 per second select call, this is the likeliest explanation.
2) hanging from missing SIGCHLD (see #73134). The easiest way to detect is to use kill, but a fix is gonna be in rpm-4.1-1.04 shortly.
3) other hangs from concurrent access to the rpm database, new in rpm-4.1.

Here are the rules for rpm-4.1:
0) Don't kill rpm if it's at 100% and using cpu cycles. Yes, it's unfortunate that rpm does not provide progress on erasures during upgrade; feel free to open a bug report on that specific item.
1) If you do "kill -9", then you *will* hang on later executions (due to stale locks), and it's the user's responsibility to remove stale locks by doing
       rm -f /var/lib/rpm/__db*
   to fix, as rpm cannot perform this action without opening lock race windows. Ditto database corruption: it's the user's responsibility to fix by doing
       rpm --rebuilddb
   This also applies to reboots.
2) If rpm segfaults, or otherwise terminates because of exceptional and pathological behavior, you *will* hang later. The real problem is the segfault; the hang is just a derivative symptom. So go to 1), fix the hang, and report the segfault (or other abnormal pathology) as a separate problem.
3) If rpm is unresponsive to ^C, this is because rpm now runs with signals blocked, hoping to get to a point where it's safe to exit (and close the database) to avoid database corruption. If you choose to intervene with "kill -9", or a reboot, or something else, go to 1) right now. Otherwise, the problem (probably) has to do with the definition of a point where it's "safe" to exit.

Off to open up "rpm hangs" per-category bugs shortly; please figger your category and direct your problems and comments there.

Since you have closed all other rpm bug reports I decided to mention it here. I have tried 4.1-1.06, 4.1-9 (test release), and assorted 4.2 versions. 4.2 is just as bad as 4.1 when it comes to hangs. 4.1-9 is much better, but within days of doing a fresh install of RedHat 8.0 I have seen rpm-4.1-9 lock up with "NULL, NULL, NULL (Timeout)".
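
For readers landing here with a stale-lock hang: the recovery steps from the rules for rpm-4.1 earlier in this report (remove the /var/lib/rpm/__db* files, then run rpm --rebuilddb) can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function name remove_stale_db_files and its directory argument are made up for illustration, and it assumes no rpm process is still running when it is called.

```shell
# Sketch only: remove stale Berkeley DB environment files (__db*) from an
# rpm database directory. The function name and the directory argument are
# illustrative; /var/lib/rpm is the path quoted in this report.
remove_stale_db_files() (
    dbdir=${1:-/var/lib/rpm}
    count=0
    for f in "$dbdir"/__db*; do
        # If the glob matched nothing, the unexpanded pattern is returned.
        [ -e "$f" ] || continue
        rm -f -- "$f" && count=$((count + 1))
    done
    echo "removed $count stale file(s) from $dbdir"
)

# Typical use, as root and only when no rpm process is still running:
#   remove_stale_db_files /var/lib/rpm
#   rpm --rebuilddb
```

The helper only touches files matching __db* and leaves Packages and the other database files alone; the rebuild step is left as a comment so that nothing runs against the live database unless typed deliberately.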
It has been months since RedHat 8.0 was released. An official errata with something like 4.1-9 hasn't been released, which seems insane to me. I know it isn't a perfect fix, but it is a lot better than the shipped version. Are these issues going to be fixed any time soon?

Since this is the umbrella bug for 'rpm hangs' issues, I'll throw some more fuel on the fire. So far I haven't felt compelled to say anything (though I've experienced the problems described on all of the 8.0 machines I maintain at least once). But tonight I've stumbled on a new one: rpm cannot be killed at all.

    killall -9 rpm

returns instantly, and the rpm process remains.

    kill -9 <PID>

has the same result. I've even tried killing the shell and 'su -' under which it was running, with no impact. This machine has been up since installation of 8.0 (a fresh install, not an upgrade), and has had its rpmdb rebuilt twice due to previous hangs (it was possible to kill -9 in those cases).

    strace -p <PID>

produces no output at all (though this is after kill -9, so maybe that makes things behave differently, even if it doesn't get rid of it). The machine is a fully up2date 8.0, with rpm-4.1-1.06 (I'll see about getting 4.1-9 installed after I reboot to get rid of this devil of an rpm process). Pretty frustrating.

Your 4.1-9 rpms still have the same bug! Hang during install, upgrade, remove. No, I don't stop or kill the process; I wait for 2 minutes after the progress bar has finished (which should be enough on a 2.2GHz P4 with 1GB RAM). I'd be happy to test any newer test version of rpm :-))

This is not an RPM hang but may be related. I was trying to add the kernel-2.4.18-18.8.0.src.rpm when I encountered:

    [root]rpm -i kernel-2.4.18-18.8.0.src.rpm
    warning: kernel-2.4.18-18.8.0.src.rpm: V3 DSA signature: NOKEY, key ID db42a60e
    kernel-2.4.18-18.8.0

In looking deeper I found three locks in /var/lib/rpm/: __db001, __db002 and __db003. I removed them and did an rpm --rebuilddb. The system still refuses to install the source files.
I am using rpm-4.1.1.06

I'm not sure where I fall into the bug situations above but I'm pretty sure that it's related. I'm running into a problem that I've yet to be able to consistently reproduce whereby RPM simply stops responding to new requests. For example, doing a simple 'rpm -qa' just sits there and doesn't respond (same for -Uvh and -e). You can't kill it except with -9 as mentioned above. In some measure of tracking, I have noticed that the /etc/cron.daily/rpm job is also stalled during the overnight run. kill -9 is the only solution to that as well. The only way I've found to recover from this problem is to reboot the machine. Upon reboot, RPM responds to all requests until it happens again. I know it's not a lot to go on, but I'm sure it's related to what's going on. On a related note, is there somewhere that documents the changes to the new RPM? Specifically, what happened to rpm --rebuild?

Yes, there's a "rm -f /var/lib/rpm/__db*" during reboot. Simpler to type the command yourself, but reboot will do. "man rpm" describes --rebuild issues.

The good news is that rpm-4.2-0.28nptl (probably, untested as of this moment) fixes the stale lock problem. The bad news is that you need to run a kernel that supplies /dev/futex, and a version of glibc that uses NPTL.

Same RPM problems on vanilla RH 8.0 here. Last time RPM got into an infinite loop after I did an 'rpm -e <somepackage>' at the same time as an 'rpm -qa'. Had to kill it with 'killall -KILL'; the thing wouldn't stop otherwise. I removed the __db.00? files and some stale rpmrebuilddb.<deadpid> dirs and ran 'rpm --rebuilddb' before I noticed this thread. I got another infinite loop as a result. 'rpm --initdb' locked up too. Sick of it all, I just removed 'Packages' as it was obviously corrupt and did an 'rpm --initdb'. Obviously my old RPM DB contents went down the toilet, but at least I have no more lockups. Should I start to back up that binary 'Packages' file like Windows users back up their registry file?
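
On the question of backing up the 'Packages' file: a minimal sketch of a timestamped copy is shown below. It is not something rpm itself provides; the function name backup_rpm_packages, the default backup directory /var/backups and the file naming scheme are all assumptions made for illustration, and the copy is only meaningful if no rpm process is writing to the database at the time.

```shell
# Sketch only: copy the rpm "Packages" database file to a timestamped backup.
# The function name, both arguments and the naming scheme are illustrative
# assumptions, not part of rpm.
backup_rpm_packages() (
    dbdir=${1:-/var/lib/rpm}
    destdir=${2:-/var/backups}
    src=$dbdir/Packages
    # Fail early if there is nothing to copy.
    [ -f "$src" ] || { echo "no Packages file in $dbdir" >&2; exit 1; }
    mkdir -p "$destdir" || exit 1
    dest=$destdir/Packages.$(date +%Y%m%d-%H%M%S)
    # Preserve mode and timestamps; print the backup path on success.
    cp -p -- "$src" "$dest" && echo "$dest"
)

# Typical use, as root and while no rpm process is running:
#   backup_rpm_packages /var/lib/rpm /var/backups
```

The function prints the path of the copy it made, so a caller can log it or prune old backups separately.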
Storing those RPM DBs in binary format is IMHO a very bad idea. Why not use a nice easy-to-fix, machine-portable plain text file instead? Thanks, -vasc

This bug also created a problem on our server. The incident occurred when attempting to install perl-5.8.0-55.src.rpm. (This was being done in order to re-compile perl and make it more compatible with mod_perl. The compilation was to include -Dusethreads.) However, rpm hung during the install of the source code. Consequently, the package management utility, available through the display panel (Start -> System Settings -> Packages), would no longer run. It would start to 'Check system package status' and bomb before completing. In order to resolve this, the aforementioned procedures to repair the respective database(s) were completed. That is, a) rm -f /var/lib/rpm/__db* and b) rpm --rebuilddb. 3) etc. Yet, this didn't bring the package management utility back on line. However, it should be noted that it appears as if 'something' became corrupt in addition to these databases, or these databases couldn't be repaired. I can't be certain which without clarification about what's going on behind the scenes. And although it was apparent the rpm utility was offline (at least to some degree), I could still display the list of packages which were installed on the system via 'rpm -qa' from the command prompt. I tried to install rpm_python, believing there could be something wrong or corrupt with that module, but was unable to complete this operation. Next, I proceeded to install perl modules, all the rpm modules, and librpm*. Although perl was already on this system, it wasn't within the /usr/bin directory and so other packages wouldn't install properly. Dependency errors occurred. Hence, perl was the first among the modules which were re-installed. I was also able to re-install rpm-python.
Bottom line: upon re-installing all these modules, the Package Management utility function, available from the Panel (Start -> System Settings -> Packages), was finally brought back on line. The rpm databases were also rebuilt for good measure, although I don't know whether this had any benefit. I'm sharing this information in case someone else has problems upon executing the newest rpm module (rpm-4.1-x) and the package management goes offline. Perhaps the steps performed in this post can aid them. This post relates to the initial rpm bug report since it appears to have had repercussions elsewhere. I hope this can be helpful to someone, including the development team. Best regards. Steve.

For what it's worth, I have an strace for the first unkillable RPM that corrupts its database:

    # strace -p 17657
    select(0, NULL, NULL, NULL, {0, 83979}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    ...

Before RPM wedged with "rpm -e im-sdk", "rpm -e hwcrypto" worked just fine. Following the kill -9 of RPM and rpm --rebuilddb, "rpm -e im-sdk" worked fine.

Just to clarify, I deleted the temporary files in /var/lib/rpm before rpm --rebuilddb.

I have the same problem on RH 8.0 (upgraded from 7.3)! It sometimes hangs when installing rpm packages. I have to remove the /var/lib/rpm/__db.00* files and --rebuilddb, or reboot the machine.

Just to confirm, I too see the

    select(0, NULL, NULL, NULL, {0, *}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {...}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {...}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {...}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {...}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {...}) = 0 (Timeout)
    ...

loop in an strace. And I have experienced the hangs on at least two RH8.0 boxes with entirely different architectures.
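
The select/Timeout loop reported in the straces in this thread is easy to spot by eye, but for a saved trace it can also be counted. The following is a rough sketch, not an official diagnostic: the function name count_select_timeouts is made up here, and the pattern simply matches the lines as they are quoted in this report.

```shell
# Sketch only: count the "select(0, NULL, NULL, NULL, {...}) = 0 (Timeout)"
# polling lines in strace output read from stdin. The function name is
# illustrative; the pattern mirrors the traces quoted in this report.
count_select_timeouts() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status at 0.
    grep -c 'select(0, NULL, NULL, NULL, {.*}) *= 0 (Timeout)' || true
}

# Typical use: save a trace with "strace -p <PID> -o trace.txt", interrupt
# strace after a few seconds, then run:
#   count_select_timeouts < trace.txt
```

A count that keeps growing by roughly one per second of tracing matches the stale-lock case described earlier in this report; a trace without such lines points to one of the other categories.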
RPM version 4.1

Also a segmentation violation when refreshing libpng-1.2.2-8.i386.rpm. I am trying to decide where that fits in these individual bugs right now.

I can consistently reproduce a hung RPM process that is unresponsive to kill signals. It happens to me on removal of the plucker-desktop RPM. I'm not sure if this is a consistent problem borne out by the (very) bad packaging of plucker-desktop or whether it's unique to the state of my machine. It sounds like a fix is already in the pipe, but if you want a testbed for this one, it's reproducible here.

I am also seeing this on a customer's RH 8.0 server (Athlon 1800+ CPU, Epox AMD 761 chipset motherboard, software raid drive setup). This machine is a clean install. I did not strace the process (I should have, but didn't find this bug until later). I did notice a pattern of disk access at a regular, repeating interval.

Actually it's happening again and I see the same in strace now:

    select(0, NULL, NULL, NULL, {0, 78120}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
    .....

rpm hung twice with the phoebe2 beta too, on rpm -e. On a similar installation on a different system, rpm -e with those packages didn't hang at all.

Is it possible that interrupting "rpm -qa", which ought to be a read-only operation, could cause this problem? I hadn't run rpm since the last reboot, so the stale database files would have been removed. Then I ran "rpm -qa" forgetting to add "| less," interrupted that process *, ran "rpm -qa | less," then ran "rpm --upgrade krb5-libs-1.2.5-8.i386.rpm krb5-devel-1.2.5-8.i386.rpm" and the rpm process hung with the same select loop.

* rpm printed "warning: Exiting on signal ..." Why a warning? Interrupting processes that produce output is quite normal.

The previous comment is right on target.
I just confirmed several times in a row that interrupting an 'rpm -qa' command with Ctrl-C (stock RH 8.0 install) leaves 3 __db* files lying around in /var/lib/rpm (and the signal seems to take a long time to be recognized).

    bash-2.05b$ cd download/
    bash-2.05b$ ls
    ... openoffice-1.0.1-8.i386.rpm....
    bash-2.05b$ su
    go to root
    root@dhcp-1044-65 download# rpm -ivh openoffice-1.0.1-8.i386.rpm
    error: openoffice-1.0.1-8.i386.rpm: V3 DSA signature: BAD, key ID db42a60e
    error: openoffice-1.0.1-8.i386.rpm cannot be installed

rpm v 4.1

I've had an rpm lock so that even removing __* did not help, and rpm --rebuild hung in the select() loop too. "db_dump Packages" also hung in the same way. However, db4.0_dump from my Debian machine successfully dumped the file, so I rebuilt it and put it back in place. So I reckon there's a bug in the libdb used by rpm 4.1 in RH8, which is fixed in a new release of bdb or in a custom Debian patch. I can put that Packages file online if it helps.

Same problem here: RPM hung after installing the first of two packages:

    Preparing...     ########################################### [100%]
       1:vte-devel   ########################################### [ 50%]

After killing RPM, subsequent hangs occur:

    open("/var/lib/rpm/Packages", O_RDONLY|O_LARGEFILE) = 3
    fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
    fstat64(3, {st_mode=S_IFREG|0644, st_size=41865216, ...}) = 0
    brk(0x805d000) = 0x805d000
    select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
    select(0, NULL, NULL, NULL, {0, 64000}) = 0 (Timeout)

I've been seeing this problem off and on. I've gotten used to removing __db00* and then running rpm --rebuilddb and then carrying on. But this time it seems much more severe. I am unable to remove a particular package.
Each time I try to upgrade or remove mozilla, I get the hang. This has resulted in

    [root@snow root]# rpm -q mozilla
    mozilla-1.0.1-26
    mozilla-1.0.1-24
    mozilla-1.2.1-0_rh8_xft

That's right: three versions of mozilla. rpm -e on any of them and I get the hang. Does this count as a reproducible case? Can I provide any data to help track down this really, really annoying bug?

rpm stopped working a few days ago for me, and now that there's a sendmail exploit going around I wanted to check for an update but can't. I read and followed the above advice, but still cannot rebuild the database. Running strace rpm --rebuilddb yields (after several minutes of uninteresting data flying by):

    stat64("/var/lib/rpmrebuilddb.12565/__db.013", 0xbfffdd20) = -1 ENOENT (No such file or directory)
    stat64("/var/lib/rpmrebuilddb.12565/__db.014", 0xbfffdd20) = -1 ENOENT (No such file or directory)
    stat64("/var/lib/rpmrebuilddb.12565/__db.015", 0xbfffdd20) = -1 ENOENT (No such file or directory)
    futex(0x4212e028, FUTEX_WAKE, 2147483647, NULL) = 0
    rmdir("/var/lib/rpmrebuilddb.12565") = 0
    close(6) = 0
    open("/var/lib/rpm/Pubkeys", O_RDWR|O_LARGEFILE) = 3
    fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
    close(3) = 0
    futex(0x4212e028, FUTEX_WAKE, 2147483647, NULL) = 0
    close(5) = 0
    futex(0x4212e028, FUTEX_WAKE, 2147483647, NULL) = 0
    munmap(0x40838000, 458752) = 0
    munmap(0x406f6000, 1318912) = 0
    munmap(0x40025000, 16384) = 0
    futex(0x4212e028, FUTEX_WAKE, 2147483647, NULL) = 0
    rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
    rt_sigaction(SIGHUP, {SIG_DFL}, NULL, 8) = 0
    rt_sigaction(SIGINT, {SIG_DFL}, NULL, 8) = 0
    rt_sigaction(SIGTERM, {SIG_DFL}, NULL, 8) = 0
    rt_sigaction(SIGQUIT, {SIG_DFL}, NULL, 8) = 0
    rt_sigaction(SIGPIPE, {SIG_DFL}, NULL, 8) = 0
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    exit_group(0) = ?

And it leaves three __db* files around, which I manually remove.

potential workaround?
I've gone to apt-get and synaptic. rpm -ivh is giving me nothing but problems; for whatever reason, apt-get and synaptic get 'around' this problem.

FYI, apt-get (and synaptic) use rpm -Uvh under the hood.

The SIGCHLD and SIGPIPE issues in rpm-4.1 are fixed in packages at

    ftp://ftp.rpm.org/pub/rpm/test-4.2
    ftp://ftp.rpm.org/pub/rpm/test-4.1.1

so I'm gonna close this tracking bug. Yes, there will be an errata. *PLEASE* open individual bug reports rather than reopening this bug. There are far, far too many different problems here to solve any.

Is the errata out yet? I can't see it. It's been many months now...

Still no errata. Red Hat QA is doing their own thing. You can get a working rpm from rpm.org for 8.0 and 9.

Jeff, can I get a date when we can expect to see an errata for this issue? Thanks very much for your efforts.

I can't give you a date, only Red Hat can. FWIW, the most important bug was fixed last October, and the errata was queued 3/18/03. I've done my part; complain to Red Hat, not me.

Try the rpm-4.2-1 (RHL9) or rpm-4.1.1-1.8x (RHL8) packages from ftp://ftp.rpm.org/pub/rpm/dist. All known "hang" problems are fixed there.

Jeff said:
> I've done my part, complain to Red Hat, not me.
I thought that's what we were doing by filing a bugzilla entry...
Jeff, which is the proper way to complain to RedHat about their lack of action, if not here? On a related note, does anyone know if this bug exists in RHEL 2.1 (assuming not, since it's based on 7.1), or 3.0 (based on 9.0, right?)? Regards, Johan

VERIFIED WORKAROUND
RPM hangs on RH 8.0 from a fresh install. The rpm command hangs on every execution after the initial freeze. The keyboard interrupt does not kill the process, and the kill command must be used to remove the process. Followed the suggestion in comment #15 in this defect:
1. remove the /var/lib/rpm/__db.00* files
2. run: rpm --rebuilddb
The problem is cleared and the rpm command works fine.

Interesting... apparently *someone* is watching this, since further comments complaining about RedHat's lack of action are being removed. Wouldn't it be simpler to just release the errata? Is there any more official channel than bugzilla to lodge a complaint? Should I just open up an RFE for a working rpm?

I'm running RH9, which shows RPM 4.2, and I get the same symptoms as shown above. RedHat Network's support pages say to send a note if the documented way to rebuild the database fails. No response. RedHat wants me to upgrade to Enterprise Linux, but clearly I can't do that with a corrupt RPM database. Sadly, it sure is looking like Microsoft, of all companies, is more responsive to its customers than RedHat is to its own.

AFAIK you have to *reinstall* (not upgrade) to go from Red Hat Linux to Red Hat Enterprise Linux (at least if you want Red Hat to support you). So a corrupted RPM database shouldn't interfere with that. Also note the existence of Fedora Core: http://fedora.redhat.com/ In some sense, Red Hat Linux has branched out in two directions. Red Hat Enterprise Linux is one, and Fedora Core is the other. You may want to take a look at Fedora Core.
IMO the main drawback of Fedora Core is that you will have to upgrade much more often with it than with Red Hat Enterprise Linux; that may or may not be acceptable depending on your circumstances...

Is this really fixed? Seth's post indicates not: http://skvidal.wordpress.com/2007/06/15/common-problem-in-yum/