Bug 77562

Summary:	rpm sometimes hangs
Product:	[Retired] Red Hat Linux	Reporter:	djh <djh>
Component:	rpm	Assignee:	Jeff Johnson <jbj>
Status:	CLOSED WORKSFORME	QA Contact:
Severity:	high	Docs Contact:
Priority:	high
Version:	8.0	CC:	binand, gczarcinski, jason, jr-redhatbugs2, jsightler, nerijus, per.starback, rivenburgh, tristan.hill, wtogami, yaneti, yiango
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2002-11-11 05:04:31 UTC	Type:	---

Description djh 2002-11-09 01:32:09 UTC

As requested in bug #74726, I'm opening a fresh bugreport.

RPM will sometimes hang when installing/upgrading a package.

e.g.
# rpm -Uvh foo.rpm
Preparing...                ########################################### [100%]
   1:foo                    ########################################### [100%]
<rpm hangs here>

Stracing the hung RPM process shows this:
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
...

A backtrace of the hung RPM process looks like this:
#0  0x08198b8e in select ()
#1  0x08212654 in _GLOBAL_OFFSET_TABLE_ ()
#2  0x080fcd61 in __os_yield_rpmdb ()
#3  0x080c3a2f in __db_tas_mutex_lock_rpmdb ()
#4  0x080f4547 in __lock_get_internal ()
#5  0x080f3d7f in __lock_get_rpmdb ()
#6  0x080ddd62 in __db_c_put_rpmdb ()
#7  0x0809331f in db3cput ()
#8  0x0809059c in rpmdbAdd ()
#9  0x08062d0a in rpmpsmStage ()
#10 0x08062410 in rpmpsmStage ()
#11 0x08062925 in rpmpsmStage ()
#12 0x0807d6d5 in rpmtsRun ()
#13 0x0806dd11 in rpmInstall ()
#14 0x08048e4d in main ()
#15 0x0815b7f0 in __libc_start_main ()

In order to use rpm again, I had to SIGKILL the RPM process (it was unresponsive
to all other signals), and delete stale locks (/var/lib/rpm/__*).

This bug has been seen with the RPM version from Psyche/8.0 (4.1-1.06), AND with
the 4.1-9 RPMs from ftp://people.redhat.com/jbj/test-4.1/ .  (also confirmed by
jr-redhatbugs)

It is not easy to reproduce.  With 4.1-1.06, I saw it randomly every few days. 
With 4.1-9, I've seen it once after 3 weeks of use.  

---

Please DO NOT add any comments to this bug unless you are seeing EXACTLY the
same bug (same symptoms and same backtrace).  If you're seeing something
slightly different, go read bug #73097.  Then look for other existing RPM
bugreports, or create a new one.

Comment 1 Jeff Johnson 2002-11-09 15:12:05 UTC

Once "every 3 weeks" or so indicates that the missed
SIGCHLD is fixed IMHO. That's all that's claimed in 4.1-9.

BTW, there's still a teensy race in -9. The final fix will
be to s/pause/sleep(1)/ in lib/psm.c. That isn't done
yet because adding sleep(1) makes a hard problem even
harder to sort out.
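
For readers unfamiliar with this kind of race, here is a minimal Python sketch of the general pattern. It is not rpm's lib/psm.c code; the function names and the polling interval are illustrative assumptions. A waiter that checks a flag and then calls pause() can block forever if SIGCHLD arrives between the check and the pause(), whereas polling with a bounded sleep notices the flag on the next pass.

```python
import os
import signal
import time

child_exited = False  # set by the SIGCHLD handler below


def on_sigchld(signum, frame):
    """Record that a child has exited."""
    global child_exited
    child_exited = True


def wait_for_child_polling(poll_interval=1.0):
    """Wait for the flag with a bounded sleep instead of pause()."""
    while not child_exited:
        # If signal.pause() were used here, a SIGCHLD delivered after the
        # check above but before pause() began would leave the process
        # blocked with nothing left to wake it. A bounded sleep re-checks
        # the flag after at most poll_interval seconds.
        time.sleep(poll_interval)


def run_child_and_wait(poll_interval=0.05):
    """Fork a trivial child, wait for it, and return its exit status."""
    global child_exited
    child_exited = False
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        wait_for_child_polling(poll_interval)
        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status)
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGCHLD,
                      previous if previous is not None else signal.SIG_DFL)
```

The cost of polling is latency of up to one interval per wait, which is presumably part of why the change makes timing problems harder to study.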

However, before I can make that claim, other sources of
hangs need to be excluded. Can you read #73097 and try to
figger out what else was happening on the machine when
rpm hung every 3 weeks?

And please, let's not start another bug pileon. You guys
got me outnumbered :-)

Comment 2 Jeff Johnson 2002-11-09 15:23:21 UTC

The back trace indicates that the "hang" is of the
"stale lock" variety, not the missed SIGCHLD. What I
need is an informed (guess if necessary) hint about what
else was happening around the time rpm "hung".

Comment 3 Jordan Russell 2002-11-10 18:06:00 UTC

For me, this type of hang appears to always occur prior to installing/upgrading 
a particular package. If I specify two packages on the command line, it can 
either hang before installing the first package, e.g.:

# rpm -ivh pkg1.rpm pkg2.rpm
Preparing...      ########################################### [100%]
<HANG>

OR after the first package is installed but before installing the second 
package:

# rpm -ivh pkg1.rpm pkg2.rpm
Preparing...      ########################################### [100%]
   1:pkg1         ########################################### [100%]
<HANG>

I don't recall seeing any hangs during the "erase" phase after all packages 
have been installed/upgraded.

Unfortunately, I don't know of any way to consistently reproduce the hang. If I 
make a shell script that rapidly uninstalls/installs/upgrades packages, it'll 
never hang. Yet, on a few occasions, I've seen this hang occur when installing 
a single package after not using rpm for a week.

FYI, I'm not running multiple instances of rpm simultaneously, I'm not using 
any programs other than rpm to access the rpm databases, and when I experience 
a hang it's almost always after the preceding rpm completed successfully.

Comment 4 Jeff Johnson 2002-11-10 18:32:26 UTC

Did you throw an strace on the <HANG>, or
just assume that since the progress bar sez
100%, rpm must be hung? See #73200
for details.

FWIW, I ran randomized install/upgrade transactions
in a loop for 24 hours, so I believe there is
no missed SIGCHLD anymore.

But, yes, the problem (if any) is gonna be a bear to
track down.

Comment 5 Jordan Russell 2002-11-10 18:53:43 UTC

Yes, I have strace'd many of the hangs, and every single time I've gotten the 
same results as djdave.au -- endless select() timeouts. Same 
backtrace results too when I've done that. 'ps' always shows rpm in an S state.

Comment 6 Jeff Johnson 2002-11-10 20:07:42 UTC

select timeouts (and the backtrace) are
from the database.
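
As a rough illustration of why a stale lock produces exactly this pattern, here is a toy Python model of a test-and-set mutex whose waiters yield with a one-second select(). It is a sketch, not Berkeley DB's implementation; the class, the function, and the max_tries escape hatch are all hypothetical. If the holder dies without clearing the flag, every later waiter loops in select() timeouts forever.

```python
import select


class TasMutex:
    """Toy test-and-set mutex.

    `held` stands in for a flag kept in a shared region file; it stays
    set if the process that acquired it is killed before releasing it.
    """

    def __init__(self):
        self.held = False

    def try_acquire(self):
        """Take the mutex if it is free; return True on success."""
        if self.held:
            return False
        self.held = True
        return True


def lock_with_yield(mutex, yield_seconds=1.0, max_tries=None):
    """Spin on the mutex, sleeping in select() between attempts.

    max_tries is an assumption added so the sketch can terminate; with
    max_tries=None a stale lock makes this loop run forever.
    """
    tries = 0
    while not mutex.try_acquire():
        tries += 1
        if max_tries is not None and tries >= max_tries:
            return False
        # select() with no descriptors acts purely as a timer, like the
        # repeated one-second select() timeouts in the strace output.
        select.select([], [], [], yield_seconds)
    return True
```

Calling lock_with_yield() on a mutex whose `held` flag was left set models the hang; clearing the flag first (the analogue of removing the stale lock) lets the same call succeed immediately.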

I need to characterize the condition somehow
in order to begin to attempt to fix. If there's
a problem within rpm, then the problem will track
with either certain packages, or with the state
of those packages. For example, the kernel package
is often installed multiple times, which is different
for that package. There's a screwy code path in rpm
associated with packages that both contain identical files,
and that path is only lightly tested.

Another condition that tends to vary is whether this is/was
an upgrade (i.e. previous version existed) or an install.

Then there's where the hang occurs: in the trace above it is
during rpmdbAdd(); the other common location when upgrading
is rpmdbRemove().

Can you remember what package names were involved,
or was it just random?

Comment 7 Jordan Russell 2002-11-10 20:47:48 UTC

> Another condition that tends to vary is whether this is/was
> an upgrade (i.e. previous version existed) or an install.

I've seen hangs with both upgrades (-U, -F) and new installs (-i) of packages.

> Can you remember what package names were involved,
> or was it just random?

It's been a couple of weeks since the last hang (though I haven't been using 
rpm much lately since the systems are mostly configured), but from what I 
remember it was completely random. I got hangs with large packages (like 
kernel), and small utility packages (like ipchains, IIRC). After killing rpm, 
issuing the very same rpm command again did not hang; it didn't seem to make 
any difference whether I removed the __* files first or not (although afterward 
I did for good measure, and also an 'rpm --rebuilddb').

Comment 8 Jeff Johnson 2002-11-11 12:42:41 UTC

Removing __db* files first would prevent the hang
(but opens a lock window). I claim that --rebuilddb
is unnecessary (only rm -f __db* is needed), but it won't
hurt.
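
A minimal Python sketch of that cleanup step follows, assuming the database lives in /var/lib/rpm as mentioned in the description. The function name and the dry_run option are hypothetical. It only removes files matching __db* and does nothing to check whether another rpm process is still running, so the lock window mentioned above still applies.

```python
import glob
import os


def remove_stale_db_files(dbdir="/var/lib/rpm", dry_run=False):
    """Remove __db* files from dbdir and return the paths affected.

    Equivalent in spirit to `rm -f __db*` run inside dbdir. Other files
    in the directory are left untouched. With dry_run=True nothing is
    deleted; the matching paths are only reported.
    """
    removed = []
    for path in sorted(glob.glob(os.path.join(dbdir, "__db*"))):
        if os.path.isfile(path):
            if not dry_run:
                os.remove(path)
            removed.append(path)
    return removed
```

Running it with dry_run=True first shows which files would go, which is a cheap sanity check before deleting anything as root.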

I don't see any way to reproduce your problem so
I'm gonna close this bug Yet Again. Feel free to reopen
if you need help or have more info.