From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.6) Gecko/20011120

Description of problem:
rpm hangs (it sits in an endless select() loop) on two different x86 systems for me. Possibly the db is corrupted (although the last installed .rpms were different on both systems).

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
I am not able to reproduce the actual bug that now makes rpm hang on every operation, but the rpm hang *itself* is easy to reproduce: try rpm -qa or rpm -Fvh or rpm -Uvh... anything that accesses the db makes rpm hang.

Actual Results:
Please see "additional information".

Expected Results:
A clean run of -qa or -U.

Additional info:
I have now seen this on two different systems. Both are mostly RH70 systems that have been upgraded to RH72, ximian-GNOME (with de-installing RedHat-GNOME-RPMs first), and, as the last update a few days ago, evolution-1.0 and some GNOME RPM updates from Ximian's FTP server.

On both servers, "rpm" simply hung today on some operations. strace on the running processes showed that it was doing select()s with rising sleep times up to 1 second and then endlessly select()ing for 1 second until rpm was killed.

On the first server, where I observed it while trying to upgrade openssh packages (where it hung immediately, as well as on any further tries), it was sufficient to "just" kill rpm and run "rpm --rebuilddb".

On the second machine now, at the moment, I saw that a cron job was hanging:

# ps ax|grep rpm
22814 ?        S      0:00 /bin/sh /etc/cron.daily/rpm
22815 ?        S      0:00 awk -v progname=/etc/cron.daily/rpm progname {?????
22816 ?        S      0:00 /usr/lib/rpm/rpmq -q --all --qf %{name}-%{version}-%{
24049 pts/1    S      0:00 grep rpm
# strace -p 22816
select(0, NULL, NULL, NULL, {0, 860000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
# killall rpmq

for quite some time.
However, on this one, not even "rpm --rebuilddb -v" works. :-( At the moment, the command (after a minute of work) hangs and "strace -p" on this process says:

# strace -p 24110
select(0, NULL, NULL, NULL, {0, 310000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[...]

What can I do or better: what did rpm do wrong? It seems to have corrupted the database on both systems. There's plenty of space left on the hds:

Filesystem  1k-blocks     Used  Available  Use%  Mounted on
/dev/hda2     3075536   774796    2144508   27%  /
/dev/hda7    10080488  9608468     369608   97%  /home
/dev/hda9     1035660    34752     948300    4%  /tmp
/dev/hda6    12096724  3580360    7901880   32%  /usr
/dev/hda8     6103648  4735916    1057680   82%  /var

and a few hours ago, before the rpm cronjob started, it worked flawlessly (so the db was still intact). Any ideas?

rpm -qa runs here up to these lines:

freetype-devel-2.0.1-4
gqview-0.10.1-ximian.3
libgal7-0.8-ximian.1
gtop-1.0.13-ximian.1
xscreensaver-3.32-ximian.6

and then stops. However, on the other (first) system, it was a different package. So I suspect it's experiencing problems with the package *after*.
Here's a full strace -p for an "rpm -qa" (last lines):

write(1, "gtop-1.0.13-ximian.1\n", 21) = 21
pread(3, "\0\0\0\0\0\0\0\0\315\6\0\0\0\0\0\0\314\6\0\0\1\0\346\17"..., 4096, 7131136) = 4096
pread(3, "\0\0\0\0\0\0\0\0\314\6\0\0\315\6\0\0\313\6\0\0\1\0\346"..., 4096, 7127040) = 4096
pread(3, "\0\0\0\0\0\0\0\0\313\6\0\0\314\6\0\0\306\6\0\0\1\0\346"..., 4096, 7122944) = 4096
pread(3, "\0\0\0\0\0\0\0\0\306\6\0\0\313\6\0\0\307\6\0\0\1\0\346"..., 4096, 7102464) = 4096
pread(3, "\0\0\0\0\0\0\0\0\307\6\0\0\306\6\0\0\0\0\0\0\1\0\210\r"..., 4096, 7106560) = 4096
write(1, "xscreensaver-3.32-ximian.6\n", 27) = 27
select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 64000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 128000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 256000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 512000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)

So, according to the values it feeds to select(), rpm is waiting for *nothing*. Let's try to get rpm to dump core at that point:

This GDB was configured as "i386-redhat-linux".
(gdb) file /usr/lib/rpm/rpmq
Reading symbols from /usr/lib/rpm/rpmq...(no debugging symbols found)...done.
(gdb) set args -q --all
(gdb) run
Starting program: /usr/lib/rpm/rpmq -q --all
passwd-0.64.1-4
xpat2-1.06-5
[...]
gtop-1.0.13-ximian.1
xscreensaver-3.32-ximian.6
(no debugging symbols found)...(no debugging symbols found)...
Program received signal SIGSEGV, Segmentation fault.
0x402605de in __select () from /lib/i686/libc.so.6
(gdb) where
#0  0x402605de in __select () from /lib/i686/libc.so.6
#1  0x40135cb4 in __DTOR_END__ () from /usr/lib/librpmdb-4.0.3.so
#2  0x4011a4b7 in __os_yield_rpmdb (dbenv=0x0, usecs=1000000) at ../db/dist/../os/os_spin.c:108
#3  0x400baa9f in __db_tas_mutex_lock_rpmdb (dbenv=0x8075bc0, mutexp=0x4032da18) at ../db/dist/../mutex/mut_tas.c:150
#4  0x4011489a in memp_fget_rpmdb (dbmfp=0x80760e0, pgnoaddr=0xbfffe65c, flags=0, addrp=0xbfffe638) at ../db/dist/../mp/mp_fget.c:268
#5  0x400eb416 in __db_goff_rpmdb (dbp=0x8075e38, dbt=0xbfffe740, tlen=16584, pgno=2071, bpp=0x8076164, bpsz=0x807616c) at ../db/dist/../db/db_overflow.c:134
#6  0x400efa97 in __db_ret_rpmdb (dbp=0x8075e38, h=0x403d22cc, indx=239, dbt=0xbfffe740, memp=0x8076164, memsize=0x807616c) at ../db/dist/../db/db_ret.c:54
#7  0x400e353b in __db_c_get_rpmdb (dbc_arg=0x8076118, key=0xbfffe760, data=0xbfffe740, flags=21) at ../db/dist/../db/db_cam.c:805
#8  0x400b7bb9 in db3c_get () from /usr/lib/librpmdb-4.0.3.so
#9  0x400b80b8 in db3cget () from /usr/lib/librpmdb-4.0.3.so
#10 0x400b1a17 in dbiGet () from /usr/lib/librpmdb-4.0.3.so
#11 0x400b490a in rpmdbNextIterator () from /usr/lib/librpmdb-4.0.3.so
#12 0x4007fab9 in showMatches () from /usr/lib/librpm-4.0.3.so
#13 0x400805a4 in rpmQueryVerify () from /usr/lib/librpm-4.0.3.so
#14 0x40080629 in rpmQuery () from /usr/lib/librpm-4.0.3.so
#15 0x08049e8c in poptResetContext ()
#16 0x401966b7 in __libc_start_main (main=0x80495f0 <poptResetContext+604>, argc=3, ubp_av=0xbffffa64, init=0x804902c <_init>, fini=0x804a110 <_fini>, rtld_fini=0x4000db64 <_dl_fini>, stack_end=0xbffffa5c) at ../sysdeps/generic/libc-start.c:129
(gdb)

[I killed it using "kill -SEGV 24199" from another shell]

Hope this helps to locate the endless-loop bug in rpm. Please respond to me at <tevessen> if there is any way to rescue my existing RPM db. Many thanks.
Kind regards from Cologne/Germany
johannes

# rpm -qa|grep ^rpm
rpm-perl-4.0.3-0.79
rpm-build-4.0.3-0.79
[stops here]
# rpm --version
RPM Version 4.0.3
popt-1.6.3-0.79
# ls -l /var/lib/rpm
total 26768
-rw-r--r-- 1 rpm  rpm   5365760 Dec  4 11:50 Basenames
-rw-r--r-- 1 rpm  rpm     12288 Nov 16 19:09 Conflictname
-rw-r--r-- 1 root root     8192 Dec  3 23:46 __db.001
-rw-r--r-- 1 root root  1310720 Dec  3 23:46 __db.002
-rw-r--r-- 1 root root   262144 Dec  4 11:50 Dirnames
-rw-r--r-- 1 rpm  rpm     24576 Dec  4 11:50 Group
-rw-r--r-- 1 root root    16384 Dec  4 11:50 Installtid
-rw-r--r-- 1 rpm  rpm     45056 Dec  4 11:50 Name
-rw-r--r-- 1 rpm  rpm  20840448 Dec  4 11:50 Packages
-rw-r--r-- 1 rpm  rpm    167936 Dec  4 11:50 Providename
-rw-r--r-- 1 root root    24576 Dec  4 11:50 Provideversion
-rw-r--r-- 1 root root     8192 Aug 20 19:50 Removetid
-rw-r--r-- 1 rpm  rpm    249856 Dec  4 11:50 Requirename
-rw-r--r-- 1 root root    61440 Dec  4 11:50 Requireversion
-rw-r--r-- 1 rpm  rpm     12288 Dec  3 23:36 Triggername
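For readers trying to interpret the select() sequence in the strace output (1000, 2000, 4000 ... 512000 microseconds, then 1 second forever): it is consistent with a lock-retry loop that sleeps with a doubling backoff capped at one second, which is also what the __os_yield_rpmdb(usecs=1000000) frame under __db_tas_mutex_lock_rpmdb suggests. The following is only a minimal Python sketch of that pattern, modelled on the observed trace and not taken from the Berkeley DB sources. The names backoff_usecs and spin_until_locked are hypothetical, and max_attempts is an addition for illustration; the process in this report never gave up.

```python
from itertools import islice


def backoff_usecs(start=1_000, cap=1_000_000):
    """Yield sleep intervals in microseconds, doubling until capped."""
    delay = start
    while True:
        yield delay
        delay = min(delay * 2, cap)


def spin_until_locked(try_lock, sleep, max_attempts):
    """Retry try_lock(), sleeping between attempts with the backoff above.

    Returns True as soon as try_lock() succeeds, False once max_attempts
    is exhausted. max_attempts is hypothetical: it only keeps this sketch
    from looping forever the way the hung rpm process does.
    """
    for _, delay in zip(range(max_attempts), backoff_usecs()):
        if try_lock():
            return True
        sleep(delay / 1_000_000)  # seconds, like a select() with no fds
    return False


if __name__ == "__main__":
    # The first 13 intervals, for comparison with the strace output.
    print(list(islice(backoff_usecs(), 13)))
```

If the lock holder has died without releasing the lock, try_lock() never succeeds and a loop of this shape sleeps for one second at a time indefinitely.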
Sorry, typo. The systems weren't upgraded to RH72 but RH71.
Try

rm /var/lib/rpm/__db*

These are temporary cache files, and are in rpm preparatory to permitting concurrent database access.
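A cautious way to apply the suggestion above is to list the __db* files first and delete them only on request. This is a small sketch, not part of rpm: the function names find_region_files and remove_region_files are hypothetical, the default path is the one shown in this report, and it assumes no rpm process is running while the files are removed.

```python
from pathlib import Path


def find_region_files(dbpath="/var/lib/rpm"):
    """Return the __db* files in dbpath, sorted by name."""
    return sorted(p for p in Path(dbpath).glob("__db*") if p.is_file())


def remove_region_files(dbpath="/var/lib/rpm", dry_run=True):
    """List the __db* files and delete them unless dry_run is set.

    Assumes no rpm process is using the database. Returns the names of
    the files that were (or, in a dry run, would be) removed.
    """
    names = []
    for path in find_region_files(dbpath):
        if not dry_run:
            path.unlink()
        names.append(path.name)
    return names


if __name__ == "__main__":
    # Dry run by default: only report what would be deleted.
    print(remove_region_files())
```

The dry run is the default so that the database files proper (Packages, Basenames and so on) can be seen to be untouched before anything is deleted.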
"WORKSFORME" is a *good* $RESOLUTION if you know how to fix it and I don't. :-} However, yes, removing the files fixed it. I just wanted to update this bug as dupe of 55920. "KNOWNBUG" would be a better $RESOLUTION, IMHO, since this is what it is. But - why did I get it up and running again on the other system by just running "rpm --rebuilddb"?!
You don't need to "^C" rpm -qa as root to make the __db* files stay. It's already sufficient to do rpm -qa | less and quit less with 'q'. This is enough to leave __db* files lying around in /var/lib/rpm. And this is, I think, a quite common scenario.
WORKSFORME indicates that no action, other than supplying info, was taken. It's simply not possible to set up an enum of the union of all possible resolutions. rpm --rebuilddb removes those files, so the effect is the same. And yes, there are probably many scenarios in which the files are left. The fix will be to implement and permit concurrent access to the database.
No, "rpm --rebuilddb" _doesn't_ fix it. I had it running here as documented in this bug id, and it experienced the _same_ problem, staying in endless select() loops (see strace above). So "rpm --rebuilddb", at least in some cases, is not a solution, since it also hangs.
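To check whether another hung rpm process shows the same symptom, the saved strace output can be scanned for select() calls that have no file descriptors at all (pure sleeps) and that have settled at the 1-second timeout. This is a rough sketch that only understands the exact line format quoted in this report; the names sleep_usecs and looks_like_lock_spin, and the threshold of three repeats, are arbitrary choices for illustration.

```python
import re

# Matches strace lines such as:
#   select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
#   select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
PURE_SLEEP = re.compile(r"select\(0, NULL, NULL, NULL, \{(\d+), (\d+)\}")


def sleep_usecs(strace_text):
    """Return the timeout in microseconds of every fd-less select() call."""
    return [
        int(sec) * 1_000_000 + int(usec)
        for sec, usec in PURE_SLEEP.findall(strace_text)
    ]


def looks_like_lock_spin(strace_text, cap=1_000_000, min_repeats=3):
    """True if the trace ends with at least min_repeats sleeps at the cap."""
    tail = sleep_usecs(strace_text)[-min_repeats:]
    return len(tail) == min_repeats and all(d == cap for d in tail)


if __name__ == "__main__":
    sample = """\
select(0, NULL, NULL, NULL, {0, 860000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
"""
    print(sleep_usecs(sample), looks_like_lock_spin(sample))
```

A match only says the process is sleeping in a retry loop; it does not say which lock it is waiting on, so the gdb backtrace is still needed for that.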
I have had this problem too, but the database was left in a state where I had to both remove the __db* files *and* run rpm --rebuilddb before the hang disappeared. Annoyingly I had also done none of the things (like ^C during a query or whatever) that would normally cause such a problem, before the hang occurred (though I did interrupt it once it became obvious there was a problem). The hang occurred while running "rpm -qf /usr/lib/python1.5/*" if that's any use.
Narrowed this down a little. Hang currently appears on -qf of any file called README, at least all the ones I've tried. gdb trace looks similar to the one above, but running with "step" shows it looping around in the same bits of code over and over (I've detached, left it running awhile, done the same thing and it looks similar, just different numbers for some of the function arguments). Will attach a short section in a minute. *pure speculation* this could be a bug in db 3.x rather than rpm. Perhaps related to the fact that there are *lots* of files with the basename "README" ...
Created attachment 40567
Short section of gdb output for this "hang"
Would you prefer I open a new bug for this? It is reproducible at least for me; but I'm not seeing the select() in the trace so it may be a different bug. Please note that "rpm --rebuilddb" has not cleared this problem.
Oddly enough, a fix suggested in response to an older database-corruption problem does seem to have cured the problem for a while (--rebuilddb, twice, didn't):

[root@desktop rpm]# mv Basenames Basenames.orig
[root@desktop rpm]# db_dump !$ | db_load Basenames
db_dump Basenames.orig | db_load Basenames

However I'm now hanging on "rpm -qf /usr/lib/python2.2/site-packages" instead :o(
The fix for now is still

rm -f /var/lib/rpm/__db*

AFAICT the problem you are describing has only to do with ^C or other abnormal termination, a known problem, and will need either a setgid remove helper or a signal handler in rpmlib to fix. If ^C (or other abnormal termination) is not the root problem, then please file another bugzilla report.
No, this is happening
1) without any __db* files there
2) after a --rebuilddb
3) and continues to happen, at a different place in the database, after doing a db_dump ... db_load on the relevant database file/table (Basenames)
4) with the latest version:

[bill@desktop bill]$ rpm -q rpm
rpm-4.0.4-0.2

It looks like this might be a different bug, so I'll open a new one.