57196 – rpm 4.0.3 endless select() loops (backtrace included)

Status:	CLOSED WORKSFORME
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	rpm
Version:	7.1
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Assignee:	Jeff Johnson

Reported:	2001-12-06 19:12 UTC by Johannes Tevessen
Modified:	2008-05-01 15:38 UTC
Last Closed:	2001-12-06 19:21:31 UTC

Attachments
Short section of gdb output for this "hang" (13.15 KB, text/plain) 2001-12-14 00:25 UTC, Bill Crawford

Description Johannes Tevessen 2001-12-06 19:12:15 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:0.9.6) Gecko/20011120

Description of problem:
rpm hangs (is in an endless select() loop) on two different x86
systems for me. Possibly the db is corrupted (although the last
installed .rpms were different on both systems).


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
I cannot reproduce whatever originally put the database into this
state, but the rpm hang *itself* is easy to reproduce: try
rpm -qa, rpm -Fvh or rpm -Uvh. Anything that accesses the db
makes rpm hang.


Actual Results:  Please see "additional information".


Expected Results:  A clean run of -qa or -U.


Additional info:

I have now seen this on two different systems. Both are mostly RH70
systems that have been upgraded to RH72 and then to Ximian GNOME (after
first uninstalling the Red Hat GNOME RPMs). The last update, a few days
ago, was evolution-1.0 plus some GNOME RPM updates from Ximian's FTP
server.

On both servers, "rpm" simply hung today on some operations. strace
on the running processes showed select() calls with sleep times
increasing up to 1 second, followed by endless 1-second select() calls
until rpm was killed. On the first server, I noticed it while trying to
upgrade the openssh packages (rpm hung immediately, and again on every
further attempt); there it was sufficient to "just" kill rpm and
run "rpm --rebuilddb".

On the second machine, I have just found a hanging cron job:

# ps ax|grep rpm
22814 ?        S      0:00 /bin/sh /etc/cron.daily/rpm
22815 ?        S      0:00 awk -v progname=/etc/cron.daily/rpm progname
{?????  
22816 ?        S      0:00 /usr/lib/rpm/rpmq -q --all --qf
%{name}-%{version}-%{
24049 pts/1    S      0:00 grep rpm
# strace -p 22816
select(0, NULL, NULL, NULL, {0, 860000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
# killall rpmq

It had been hanging for quite some time. On this machine, however,
not even "rpm --rebuilddb -v" works. :-( At the moment,
the command hangs after about a minute of work, and
"strace -p" on the process says:

# strace -p 24110
select(0, NULL, NULL, NULL, {0, 310000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
[...]
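
The pattern in the strace output above (select() with no file descriptors, timing out at a fixed 1 second over and over) can be recognised mechanically. What follows is only a rough sketch: the function names are made up for illustration, and the regular expression assumes strace prints the call exactly as shown in this report.

```python
import re

# Matches lines such as:
#   select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
# i.e. a select() with no descriptors that is used purely as a sleep.
SELECT_SLEEP = re.compile(
    r"^select\(0, NULL, NULL, NULL, \{(\d+), (\d+)\}\)\s*= 0 \(Timeout\)"
)


def select_sleep_usecs(strace_lines):
    """Return the timeout (in microseconds) of every completed sleep-only select()."""
    sleeps = []
    for line in strace_lines:
        match = SELECT_SLEEP.match(line.strip())
        if match:
            sleeps.append(int(match.group(1)) * 1_000_000 + int(match.group(2)))
    return sleeps


def looks_like_capped_sleep_loop(strace_lines, min_repeats=3):
    """True if the trace ends in at least min_repeats consecutive 1-second sleeps."""
    sleeps = select_sleep_usecs(strace_lines)
    tail = sleeps[-min_repeats:]
    return len(tail) == min_repeats and all(s == 1_000_000 for s in tail)
```

Lines that strace marks as `<unfinished ...>` are ignored, because the call had not returned yet when the trace was cut off.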

What can I do, or rather: what did rpm do wrong? It seems
to have corrupted the database on both systems.

There's plenty of space left on the hds:

Filesystem           1k-blocks      Used Available Use% Mounted on
/dev/hda2              3075536    774796   2144508  27% /
/dev/hda7             10080488   9608468    369608  97% /home
/dev/hda9              1035660     34752    948300   4% /tmp
/dev/hda6             12096724   3580360   7901880  32% /usr
/dev/hda8              6103648   4735916   1057680  82% /var

and a few hours ago, before the rpm cronjob started, it
worked flawlessly (so the db was still intact).

Any ideas?

Here, rpm -qa runs up to these lines:

freetype-devel-2.0.1-4
gqview-0.10.1-ximian.3
libgal7-0.8-ximian.1
gtop-1.0.13-ximian.1
xscreensaver-3.32-ximian.6

and then stops. However, on the other (first) system, it stopped at a
different package. So I suspect it is having problems with the package
that comes *after* the last one printed.

Here's a full strace -p for an "rpm -qa" (last lines):

write(1, "gtop-1.0.13-ximian.1\n", 21)  = 21
pread(3, "\0\0\0\0\0\0\0\0\315\6\0\0\0\0\0\0\314\6\0\0\1\0\346\17"..., 4096, 7131136) = 4096
pread(3, "\0\0\0\0\0\0\0\0\314\6\0\0\315\6\0\0\313\6\0\0\1\0\346"..., 4096, 7127040) = 4096
pread(3, "\0\0\0\0\0\0\0\0\313\6\0\0\314\6\0\0\306\6\0\0\1\0\346"..., 4096, 7122944) = 4096
pread(3, "\0\0\0\0\0\0\0\0\306\6\0\0\313\6\0\0\307\6\0\0\1\0\346"..., 4096, 7102464) = 4096
pread(3, "\0\0\0\0\0\0\0\0\307\6\0\0\306\6\0\0\0\0\0\0\1\0\210\r"..., 4096, 7106560) = 4096
write(1, "xscreensaver-3.32-ximian.6\n", 27) = 27
select(0, NULL, NULL, NULL, {0, 1000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 2000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 4000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 8000})  = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 64000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 128000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 256000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {0, 512000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)

So, judging by the arguments it passes to select() (no file
descriptors, only a timeout), rpm is waiting for *nothing*: it is
simply sleeping and retrying.
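
The timeouts in the trace above double from 1 ms to 512 ms and then stay at 1 second, which is what a capped exponential back-off looks like. The gdb backtrace further down in this report shows `__db_tas_mutex_lock_rpmdb` calling `__os_yield_rpmdb` with `usecs=1000000`, so the sleeps appear to come from a lock-acquisition retry loop. The sketch below only reproduces that delay pattern. It is not Berkeley DB's code, and every name in it (`backoff_delays`, `acquire_with_backoff`, `try_lock`, `max_attempts`) is hypothetical.

```python
def backoff_delays(start_usec=1_000, cap_usec=1_000_000):
    """Yield sleep intervals in microseconds: double each time, then stay at the cap."""
    delay = start_usec
    while True:
        yield delay
        delay = min(delay * 2, cap_usec)


def acquire_with_backoff(try_lock, sleep, max_attempts=None):
    """Retry try_lock() with capped exponential back-off.

    Returns the number of sleeps needed before the lock was obtained, or
    None if max_attempts sleeps were used up first. With max_attempts=None
    (assumed here to mirror the observed behaviour) a lock that is never
    released makes this loop run forever.
    """
    attempts = 0
    for delay in backoff_delays():
        if try_lock():
            return attempts
        if max_attempts is not None and attempts >= max_attempts:
            return None
        sleep(delay)
        attempts += 1
```

Passing a `sleep` callback instead of sleeping directly keeps the sketch testable: with a `try_lock` that always fails, the recorded delays follow the same sequence as the select() timeouts in the strace.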

Let's try to get rpm to dump core at that point:

This GDB was configured as "i386-redhat-linux".
(gdb) file /usr/lib/rpm/rpmq    
Reading symbols from /usr/lib/rpm/rpmq...(no debugging symbols found)...done.
(gdb) set args -q --all
(gdb) run
Starting program: /usr/lib/rpm/rpmq -q --all
passwd-0.64.1-4
xpat2-1.06-5
[...]
gtop-1.0.13-ximian.1
xscreensaver-3.32-ximian.6
(no debugging symbols found)...(no debugging symbols found)...
Program received signal SIGSEGV, Segmentation fault.
0x402605de in __select () from /lib/i686/libc.so.6
(gdb) where
#0  0x402605de in __select () from /lib/i686/libc.so.6
#1  0x40135cb4 in __DTOR_END__ () from /usr/lib/librpmdb-4.0.3.so
#2  0x4011a4b7 in __os_yield_rpmdb (dbenv=0x0, usecs=1000000) at ../db/dist/../os/os_spin.c:108
#3  0x400baa9f in __db_tas_mutex_lock_rpmdb (dbenv=0x8075bc0, mutexp=0x4032da18)
    at ../db/dist/../mutex/mut_tas.c:150
#4  0x4011489a in memp_fget_rpmdb (dbmfp=0x80760e0, pgnoaddr=0xbfffe65c, flags=0, addrp=0xbfffe638)
    at ../db/dist/../mp/mp_fget.c:268
#5  0x400eb416 in __db_goff_rpmdb (dbp=0x8075e38, dbt=0xbfffe740, tlen=16584, pgno=2071, bpp=0x8076164,
    bpsz=0x807616c) at ../db/dist/../db/db_overflow.c:134
#6  0x400efa97 in __db_ret_rpmdb (dbp=0x8075e38, h=0x403d22cc, indx=239, dbt=0xbfffe740, memp=0x8076164,
    memsize=0x807616c) at ../db/dist/../db/db_ret.c:54
#7  0x400e353b in __db_c_get_rpmdb (dbc_arg=0x8076118, key=0xbfffe760, data=0xbfffe740, flags=21)
    at ../db/dist/../db/db_cam.c:805
#8  0x400b7bb9 in db3c_get () from /usr/lib/librpmdb-4.0.3.so
#9  0x400b80b8 in db3cget () from /usr/lib/librpmdb-4.0.3.so
#10 0x400b1a17 in dbiGet () from /usr/lib/librpmdb-4.0.3.so
#11 0x400b490a in rpmdbNextIterator () from /usr/lib/librpmdb-4.0.3.so
#12 0x4007fab9 in showMatches () from /usr/lib/librpm-4.0.3.so
#13 0x400805a4 in rpmQueryVerify () from /usr/lib/librpm-4.0.3.so
#14 0x40080629 in rpmQuery () from /usr/lib/librpm-4.0.3.so
#15 0x08049e8c in poptResetContext ()
#16 0x401966b7 in __libc_start_main (main=0x80495f0 <poptResetContext+604>, argc=3, ubp_av=0xbffffa64,
    init=0x804902c <_init>, fini=0x804a110 <_fini>, rtld_fini=0x4000db64 <_dl_fini>, stack_end=0xbffffa5c)
    at ../sysdeps/generic/libc-start.c:129
(gdb) 


[I killed it using "kill -SEGV 24199" from another shell]

Hope this helps to locate the endless-loop bug in rpm.
Please respond to me at <tevessen> if
there is any way to rescue my existing RPM db. Many
thanks.

Kind regards from Cologne/Germany
johannes

# rpm -qa|grep ^rpm
rpm-perl-4.0.3-0.79
rpm-build-4.0.3-0.79
[stops here]

# rpm --version
RPM Version 4.0.3

popt-1.6.3-0.79

# ls -l /var/lib/rpm
insgesamt 26768
-rw-r--r--    1 rpm      rpm       5365760 Dez  4 11:50 Basenames
-rw-r--r--    1 rpm      rpm         12288 Nov 16 19:09 Conflictname
-rw-r--r--    1 root     root         8192 Dez  3 23:46 __db.001
-rw-r--r--    1 root     root      1310720 Dez  3 23:46 __db.002
-rw-r--r--    1 root     root       262144 Dez  4 11:50 Dirnames
-rw-r--r--    1 rpm      rpm         24576 Dez  4 11:50 Group
-rw-r--r--    1 root     root        16384 Dez  4 11:50 Installtid
-rw-r--r--    1 rpm      rpm         45056 Dez  4 11:50 Name
-rw-r--r--    1 rpm      rpm      20840448 Dez  4 11:50 Packages
-rw-r--r--    1 rpm      rpm        167936 Dez  4 11:50 Providename
-rw-r--r--    1 root     root        24576 Dez  4 11:50 Provideversion
-rw-r--r--    1 root     root         8192 Aug 20 19:50 Removetid
-rw-r--r--    1 rpm      rpm        249856 Dez  4 11:50 Requirename
-rw-r--r--    1 root     root        61440 Dez  4 11:50 Requireversion
-rw-r--r--    1 rpm      rpm         12288 Dez  3 23:36 Triggername

Comment 1 Johannes Tevessen 2001-12-06 19:21:26 UTC

Sorry, typo. The systems weren't upgraded to RH72 but RH71.

Comment 2 Jeff Johnson 2001-12-06 19:41:45 UTC

Try
	rm /var/lib/rpm/__db*
These are temporary cache files; they are in rpm preparatory
to permitting concurrent database access.
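
Before removing anything, it can help to see which files the `__db*` glob actually matches and when they were last written. The following is a minimal read-only sketch. The helper name `list_db_region_files` is made up for illustration, and the default directory is simply the `/var/lib/rpm` path used in this report.

```python
import glob
import os


def list_db_region_files(dbdir="/var/lib/rpm"):
    """Return (path, size_in_bytes, mtime) for every file matching __db* in dbdir.

    This only inspects the directory; it never deletes or modifies anything.
    """
    found = []
    for path in sorted(glob.glob(os.path.join(dbdir, "__db*"))):
        st = os.stat(path)
        found.append((path, st.st_size, st.st_mtime))
    return found


if __name__ == "__main__":
    for path, size, _mtime in list_db_region_files():
        print(f"{size:>10}  {path}")
```

Only names starting with two underscores are matched, so the real database files (Packages, Basenames and so on) are never listed.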

Comment 3 Johannes Tevessen 2001-12-06 20:01:40 UTC

"WORKSFORME" is a *good* $RESOLUTION if you know how to fix it
and I don't. :-}

However, yes, removing the files fixed it. I just wanted
to update this bug as dupe of 55920.

"KNOWNBUG" would be a better $RESOLUTION, IMHO, since this is
what it is.

But - why did I get it up and running again on the other
system by just running "rpm --rebuilddb"?!

Comment 4 Johannes Tevessen 2001-12-06 20:11:24 UTC

You don't even need to "^C" an rpm -qa as root to leave the __db*
files behind. It is enough to run

rpm -qa | less

and quit less with 'q'. This is enough to leave __db* files
lying around in /var/lib/rpm. And this is, I think, a quite
common scenario.

Comment 5 Jeff Johnson 2001-12-06 20:20:02 UTC

WORKSFORME indicates that no action, other than supplying info,
was taken. It's simply not possible to set up an enum of the union
of all possible resolutions.

rpm --rebuilddb removes those files, so the effect is the same.

And yes, there are probably many scenarios in which the
files are left. The fix will be to implement and permit concurrent
access to the database.

Comment 6 Johannes Tevessen 2001-12-06 20:29:14 UTC

No, "rpm --rebuilddb" _doesn't_ fix it. I had it running here
as documented in this bug id, and it experienced the _same_
problem, staying in endless select() loops (see strace above).

So "rpm --rebuilddb", at least in some cases, is not a
solution, since it also hangs.

Comment 7 Bill Crawford 2001-12-09 00:05:57 UTC

I have had this problem too, but the database was left in a state where I had to
both remove the __db* files *and* run rpm --rebuilddb before the hang
disappeared.  Annoyingly I had also done none of the things (like ^C during a
query or whatever) that would normally cause such a problem, before the hang
occurred (though I did interrupt it once it became obvious there was a problem).

The hang occurred while running "rpm -qf /usr/lib/python1.5/*" if that's any
use.

Comment 8 Bill Crawford 2001-12-14 00:23:13 UTC

Narrowed this down a little.  Hang currently appears on -qf of any file called
README, at least all the ones I've tried.  gdb trace looks similar to the one
above, but running with "step" shows it looping around in the same bits of code
over and over (I've detached, left it running awhile, done the same thing and it
looks similar, just different numbers for some of the function arguments).  Will
attach a short section in a minute.

*pure speculation* this could be a bug in db 3.x rather than rpm.  Perhaps
related to the fact that there are *lots* of files with the basename "README"
...

Comment 9 Bill Crawford 2001-12-14 00:25:18 UTC

Created attachment 40567
Short section of gdb output for this "hang"

Comment 10 Bill Crawford 2001-12-14 00:28:21 UTC

Would you prefer I open a new bug for this?  It is reproducible at least for me;
but I'm not seeing the select() in the trace so it may be a different bug.

Please note that "rpm --rebuilddb" has not cleared this problem.

Comment 11 Bill Crawford 2001-12-14 00:41:48 UTC

Oddly enough, a fix suggested in response to an older database-corruption
problem does seem to have cured the problem for a while (--rebuilddb, run
twice, didn't):

[root@desktop rpm]# mv Basenames Basenames.orig
[root@desktop rpm]# db_dump !$ | db_load Basenames
db_dump Basenames.orig | db_load Basenames

However I'm now hanging on "rpm -qf /usr/lib/python2.2/site-packages" instead
:o(

Comment 12 Jeff Johnson 2001-12-14 16:16:27 UTC

The fix for now is still
	rm -f /var/lib/rpm/__db*
AFAICT the problem you are describing has only to do with ^C or
other abnormal termination. That is a known problem, and it will need
either a setgid remove helper or a signal handler in rpmlib to fix.

If ^C (or other abnormal termination) is not the root problem, then
please, another bugzilla report.

Comment 13 Bill Crawford 2001-12-14 22:47:50 UTC

No, this is happening

1) without any __db* files there
2) after a --rebuilddb
3) continues to happen, at a different place in the database, after doing a
db_dump ... db_load on the relevant database file/table (Basenames).
4) with the latest version:

[bill@desktop bill]$ rpm -q rpm
rpm-4.0.4-0.2

It looks like this might be a different bug, so I'll open a new one.
