Bug 206275

Summary:	rpmq running as root gets stuck
Product:	[Fedora] Fedora	Reporter:	Horst H. von Brand <vonbrand>
Component:	rpm	Assignee:	Panu Matilainen <pmatilai>
Status:	CLOSED WORKSFORME	QA Contact:
Severity:	high	Docs Contact:
Priority:	medium
Version:	rawhide	CC:	amk, bill-bugzilla.redhat.com, panagopoulosalexandrou, trevor
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-08-10 11:00:10 UTC	Type:	---

Description Horst H. von Brand 2006-09-13 13:22:06 UTC

Description of problem:
Ran "rpm -q redhat-artwork" after updating today (for BZ), and it didn't come
back. After I placed it in the background, rpmq was still running. From another
gnome-terminal as a normal user, it returned immediately. I killed off the rpmq
process (had to -KILL it, it wouldn't respond otherwise), and it now hangs:

[root@laptop13 ~]# ps -l -p 3712
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0  3712  3094  0  75   0 -  3532 futex  pts/0    00:00:00 rpmq
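
The "futex" entry in the WCHAN column above is what shows the process sleeping on a lock rather than doing work. A minimal sketch for picking such processes out of "ps -l" output follows. The function name waiting_on is made up for illustration, and it assumes the header line is present so that columns can be located by name.

```shell
# Print "PID CMD" for every process in `ps -l` output (read from stdin)
# whose WCHAN column equals the given name, e.g. "futex".
# waiting_on is a hypothetical helper, not part of ps or rpm.
waiting_on() {
    awk -v want="$1" '
        # Header line: remember which column holds each field we need.
        NR == 1 {
            for (i = 1; i <= NF; i++) {
                if ($i == "WCHAN") w = i
                if ($i == "PID")   p = i
                if ($i == "CMD")   c = i
            }
            next
        }
        # Data lines: report processes sleeping in the requested channel.
        w && $w == want { print $p, $c }
    '
}
```

For example, "ps -l -p 3712 | waiting_on futex" would print the PID and command name if that process is still blocked.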

Version-Release number of selected component (if applicable):
rpm-4.4.2-32

How reproducible:

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Jeff Johnson 2006-09-13 18:05:03 UTC

Look for stale locks by running (as root)
    cd /var/lib/rpm
    /usr/lib/rpm/rpmdb_stat -CA

Otherwise just do
    rm -f /var/lib/rpm/__db*

Comment 2 Trevor Cordes 2006-11-21 16:26:18 UTC

Not sure if this is related, but an FC5 box that I just noticed is running very
slowly has the following line in top:

 9639 root      25   0 11516 1128 1000 R 41.3  0.4  12667:18 rpmq

Ouch... must have been running for weeks.  kill SIGINT won't kill it.  -9 did.

I'm not sure if this has messed up the rpm db or not -- I will cross that bridge
when I come to it.

Comment 3 Jeff Johnson 2006-12-03 18:37:33 UTC

Segfaults and loss of data are likely due to removing an rpmdb environment
without correcting other problems in the rpmdb.

FYI: Most rpmdb "hangs" are now definitely fixed by purging stale read locks when opening
a database environment in rpm-4.4.8-0.4. There's more to do, but I'm quite sure that a
large class of problems with symptoms of "hang" is now corrected.

Detecting damage by verifying when needed is well automated in rpm-4.4.8-0.4. Automatically
correcting all possible damage is going to take more work, but a large class of problems is likely
already fixed in rpm-4.4.8-0.8 as well.

UPSTREAM

Comment 4 Horst H. von Brand 2007-01-02 17:47:48 UTC

rpmq from rpm-4.4.2-38.fc7 got stuck (shown as running, but no CPU usage IIRC)
when trying to run makewhatis (man-1.6e-1.fc7) recently. apropos(1) didn't know a
thing, so this might have happened a few times before. After rebooting and
successfully updating openmotif-->lesstif, makewhatis went through.

Comment 5 Bill McGonigle 2007-02-16 19:50:50 UTC

I frequently see fc5 and fc6 machines with rpmq hung as described above, which
also locks any other rpm operations from happening (yum, rpm, etc) and I hear
from colleagues that it's common. 

Is it possible to backport the stale read lock purge bugfix from 4.4.8 to 4.4.2?

Comment 6 Trevor Cordes 2007-02-16 19:58:26 UTC

The other day I had a yum update hang on a box that had 192MB of RAM and for
some reason had the swap space disabled (fstab labelling issue).  I'm sure the
above problems are something else, as I've had rpm/yum hangs on boxes with 2GB
of RAM and 5GB swap.  But if you have a crappy box, check if your swap is
enabled!  And maybe the tools should nicely die with "out of mem" errors rather
than hanging?
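
A quick way to run the swap check suggested above, as a sketch: on Linux, /proc/swaps has one header line followed by one line per active swap area, so any second line means swap is on. The function name swap_enabled and its optional file argument are illustrative only.

```shell
# Succeed (exit 0) if at least one swap area is active.
# Reads /proc/swaps by default; a different file can be passed for testing.
# swap_enabled is a hypothetical helper name.
swap_enabled() {
    awk 'NR > 1 { found = 1 } END { exit !found }' "${1:-/proc/swaps}"
}
```

Usage: swap_enabled || echo "warning: no active swap"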

Comment 7 Bill McGonigle 2007-02-16 20:15:15 UTC

The above info about stale locks was helpful.  Running 'rpmdb_stat -CA' shows,
during the period when nothing rpm-related works:

Locks grouped by object:
Locker   Mode      Count Status  ----------------- Object ---------------
      36 READ          2 HELD    0x353b8 len:  20 data: 0x11L0x06000x040x030000+.0xf20xbe0xf10x10000000000000

      35 READ          1 HELD    (64c11 304 bef22e2b 10f1 0) handle        0

Then when I kill the stuck processes and run db_recover in /var/lib/rpm,
'rpmdb_stat -CA' reports:

  db_stat: DB_ENV->open: No such file or directory

and then rpm transactions succeed as one would expect.
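
To compare the lock listing before and after recovery without reading it by eye, the HELD entries can be counted. This is only a sketch: count_held_locks is a made-up name, and it assumes the "Locker Mode Count Status" column layout quoted above, which may differ between Berkeley DB versions.

```shell
# Count lock entries whose Status column is HELD in `rpmdb_stat -CA`
# output read from stdin. Assumes Status is the fourth column, as in
# the listing quoted in this comment.
count_held_locks() {
    awk '$4 == "HELD" { n++ } END { print n + 0 }'
}
```

Usage, as root: cd /var/lib/rpm && /usr/lib/rpm/rpmdb_stat -CA | count_held_locks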

Comment 8 Jeff Johnson 2007-02-16 20:57:07 UTC

Technically, if there was a running process holding a lock, then the lock was not stale.

Stale locks is the term for locks that are not held by current processes.
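
The distinction can be sketched as a check on whether the lock's owner still exists. Note the assumption: the PID of the owning process is taken as given here. The rpmdb_stat output quoted in this bug shows locker IDs, and mapping those to PIDs is not covered by this sketch. Both function names are hypothetical.

```shell
# Succeed if a process with the given PID currently exists.
# kill -0 covers processes we may signal; /proc covers the rest on Linux.
pid_is_alive() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    kill -0 "$1" 2>/dev/null || [ -d "/proc/$1" ]
}

# Classify a lock by its owner: "live" if the owner is still running,
# "stale" if no such process exists any more.
lock_owner_state() {
    if pid_is_alive "$1"; then
        echo "live"
    else
        echo "stale"
    fi
}
```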

Comment 9 Bill McGonigle 2007-02-16 23:01:25 UTC

Jeff - you point out an error in my previous comment.  Thanks.

The true order of operations above was: find the stuck process, kill (-9) it,
view the locks with rpmdb_stat (still there), run db_recover (no longer there),
run the next process.  
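
That order of operations could be written down roughly as follows. This is a sketch, not a tested recovery tool: the helper names run and unstick_rpmdb and the RPMDB_DIR and DRY_RUN variables are invented for illustration. By default it only prints the commands; set DRY_RUN=0 and run as root to execute them.

```shell
RPMDB_DIR="${RPMDB_DIR:-/var/lib/rpm}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Argument: PID of the stuck rpmq/rpm/yum process (found with ps or top).
# The body runs in a subshell so the caller's working directory is unchanged.
unstick_rpmdb() (
    # 1. kill the stuck process (it may already be gone, so do not stop here)
    run kill -9 "$1"
    # 2. view the locks, which are still listed at this point
    # 3. run db_recover in the rpmdb directory
    # 4. run the next rpm operation
    run cd "$RPMDB_DIR" &&
    run /usr/lib/rpm/rpmdb_stat -CA &&
    run db_recover &&
    run rpm -q rpm
)
```

Calling "unstick_rpmdb 3712" with the defaults prints the five commands without touching the database.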

Please excuse the brainfart, comment #7 as written was inaccurate and non-useful.

Comment 10 Panu Matilainen 2007-07-17 19:57:39 UTC

*** Bug 213892 has been marked as a duplicate of this bug. ***

Comment 11 Panu Matilainen 2007-08-10 11:00:10 UTC

Considering the timing of these hangs and crashes, this is most likely yet another
manifestation of the kernel mmap() bug - see
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213963#c65 for details.
Feel free to reopen if this still happens with current kernels.