206275 – rpmq running as root gets stuck

Status:	CLOSED WORKSFORME
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	rpm
Version:	rawhide
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Panu Matilainen
Duplicates (1):	213892

Reported:	2006-09-13 13:22 UTC by Horst H. von Brand
Modified:	2007-11-30 22:11 UTC
CC List:	4 users
Last Closed:	2007-08-10 11:00:10 UTC
Type:	---

Description Horst H. von Brand 2006-09-13 13:22:06 UTC

Description of problem:
Ran "rpm -q redhat-artwork" as root after updating today (for BZ), and it
didn't come back. After I placed it into the background, rpmq was still
running. From another gnome-terminal, as a normal user, the same query
returned immediately. After killing off the rpmq process (I had to -KILL it;
it wouldn't respond otherwise), the query now hangs:

[root@laptop13 ~]# ps -l -p 3712
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0  3712  3094  0  75   0 -  3532 futex  pts/0    00:00:00 rpmq

Version-Release number of selected component (if applicable):
rpm-4.4.2-32

Comment 1 Jeff Johnson 2006-09-13 18:05:03 UTC

Look for stale locks by running (as root)
    cd /var/lib/rpm
    /usr/lib/rpm/rpmdb_stat -CA

Otherwise just do
    rm -f /var/lib/rpm/__db*
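
A more cautious variant of the second command, as a sketch only: instead of
deleting the __db* region files, move them aside so they can be inspected or
put back later. The helper name stash_rpmdb_regions and the default backup
location are made up for illustration and are not part of rpm. It assumes it
is run as root with no rpm or yum process active.

```shell
# Sketch only: stash_rpmdb_regions is a hypothetical helper, not an rpm tool.
# Moves Berkeley DB region files (__db*) from DBDIR into BACKUP instead of
# deleting them, then prints the backup directory.
stash_rpmdb_regions() {
    dbdir=${1:-/var/lib/rpm}
    backup=${2:-/var/tmp/rpmdb-regions.$$}   # assumed location, pick your own
    mkdir -p "$backup" || return 1
    for f in "$dbdir"/__db*; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        mv -- "$f" "$backup"/ || return 1
    done
    echo "$backup"
}
```

Calling stash_rpmdb_regions with no arguments would act on /var/lib/rpm; the
Packages file and the indexes are left where they are.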

Comment 2 Trevor Cordes 2006-11-21 16:26:18 UTC

Not sure if this is related, but an FC5 box that I just noticed is running very
slowly has the following line in top:

 9639 root      25   0 11516 1128 1000 R 41.3  0.4  12667:18 rpmq

Ouch... must have been running for weeks.  kill SIGINT won't kill it.  -9 did.

I'm not sure if this has messed up the rpm db or not -- I will cross that bridge
when I come to it.
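
For scale, assuming the TIME+ column above reads as minutes:seconds of
accumulated CPU time (top's layout once the value is too wide to show
hundredths), the figure works out to about 211 hours, close to nine days of
CPU. A small sketch of the arithmetic; top_time_to_hours is a made-up helper
name:

```shell
# Sketch: convert a top TIME+ value of the form MINUTES:SECONDS into whole
# hours of CPU time. top_time_to_hours is a hypothetical helper name.
top_time_to_hours() {
    minutes=${1%%:*}            # text before the first colon
    echo $(( minutes / 60 ))    # integer division, fractions are dropped
}

top_time_to_hours 12667:18      # prints 211
```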

Comment 3 Jeff Johnson 2006-12-03 18:37:33 UTC

Segfaults and loss of data are likely due to removing an rpmdb environment
without correcting other problems in the rpmdb.

FYI: Most rpmdb "hangs" are now definitely fixed in rpm-4.4.8-0.4, which purges
stale read locks when opening a database environment. There's more to do, but
I'm quite sure that a large class of problems with "hang" symptoms is now
corrected.

Detecting damage by verifying when needed is well automated in rpm-4.4.8-0.4.
Automatically correcting all possible damage is going to take more work, but a
large class of problems is likely already fixed in rpm-4.4.8-0.8 as well.

UPSTREAM

Comment 4 Horst H. von Brand 2007-01-02 17:47:48 UTC

rpmq from rpm-4.4.2-38.fc7 got stuck (shown as running, but no CPU usage IIRC)
when trying to run makewhatis (man-1.6e-1.fc7) recently. apropos(1) didn't know
a thing, so this might have happened a few times before. After rebooting and
successfully updating openmotif-->lesstif, makewhatis went through.

Comment 5 Bill McGonigle 2007-02-16 19:50:50 UTC

I frequently see fc5 and fc6 machines with rpmq hung as described above, which
also blocks any other rpm operations (yum, rpm, etc.), and I hear from
colleagues that it's common.

Is it possible to backport the stale read lock purge bugfix from 4.4.8 to 4.4.2?

Comment 6 Trevor Cordes 2007-02-16 19:58:26 UTC

The other day I had a yum update hang on a box that had 192MB of RAM and for
some reason had the swap space disabled (fstab labelling issue).  I'm sure the
above problems are something else, as I've had rpm/yum hangs on boxes with 2GB
of RAM and 5GB swap.  But if you have a crappy box, check if your swap is
enabled!  And maybe the tools should nicely die with "out of mem" errors rather
than hanging?
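
A quick way to check for active swap on Linux is to look at /proc/swaps, which
has one header line followed by one line per active swap area. A minimal
sketch; count_swap_areas is a made-up helper name:

```shell
# Sketch: count active swap areas listed in a /proc/swaps-style file.
# The first line is a header; every further non-empty line is one swap area.
# count_swap_areas is a hypothetical helper name.
count_swap_areas() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "${1:-/proc/swaps}"
}
```

If count_swap_areas prints 0, no swap is enabled and the fstab entry is worth
a look.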

Comment 7 Bill McGonigle 2007-02-16 20:15:15 UTC

The above info about stale locks was helpful.  Running 'rpmdb_stat -CA' shows,
during the period when nothing rpm-related works:

Locks grouped by object:
Locker   Mode      Count Status  ----------------- Object ---------------
      36 READ          2 HELD    0x353b8 len:  20 data:
0x11L0x06000x040x030000+.0xf20xbe0xf10x10000000000000

      35 READ          1 HELD    (64c11 304 bef22e2b 10f1 0) handle        0

Then when I kill the stuck processes and run db_recover in /var/lib/rpm,
'rpmdb_stat -CA' reports:

  db_stat: DB_ENV->open: No such file or directory

and then rpm transactions succeed as one would expect.

Comment 8 Jeff Johnson 2007-02-16 20:57:07 UTC

Technically, if there was a running process holding a lock, then the lock was not stale.

"Stale locks" is the term for locks that are not held by any current process.
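
The distinction comes down to whether the lock holder is still alive. A
minimal sketch of that liveness test on Linux, assuming the holder's PID is
already known (the rpmdb_stat output quoted in comment 7 shows locker IDs, and
mapping those to PIDs is not covered here). The name pid_is_alive is invented
for illustration:

```shell
# Sketch: succeed if PID still belongs to a live process, judged by the
# presence of its /proc entry (Linux only). pid_is_alive is a hypothetical
# helper name, not part of rpm or Berkeley DB.
pid_is_alive() {
    [ -n "$1" ] && [ -d "/proc/$1" ]
}
```

A lock whose holder fails this test would be stale in the sense above; a lock
whose holder passes it is merely held, however long the holder has been stuck.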

Comment 9 Bill McGonigle 2007-02-16 23:01:25 UTC

Jeff - you point out an error in my previous comment.  Thanks.

The true order of operations above was: find the stuck process, kill (-9) it,
view the locks with rpmdb_stat (still there), run db_recover (no longer there),
run the next process.  

Please excuse the brainfart; comment #7 as written was inaccurate and not useful.
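
The corrected order of operations can be sketched as a helper that only
prints the commands, so they can be reviewed and then run by hand as root.
rpmdb_recovery_plan is a made-up name, and the PID has to come from ps or top
first:

```shell
# Sketch: print (do not run) the sequence from this comment for a stuck
# process PID. rpmdb_recovery_plan is a hypothetical helper name.
rpmdb_recovery_plan() {
    pid=$1
    dbdir=${2:-/var/lib/rpm}
    cat <<EOF
kill -9 $pid
cd $dbdir
/usr/lib/rpm/rpmdb_stat -CA
db_recover
/usr/lib/rpm/rpmdb_stat -CA
EOF
}
```

For example, rpmdb_recovery_plan 9639 prints the five steps for the process
from comment 2. The first rpmdb_stat call is there to see the leftover locks,
the second to confirm they are gone before retrying rpm or yum.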

Comment 10 Panu Matilainen 2007-07-17 19:57:39 UTC

*** Bug 213892 has been marked as a duplicate of this bug. ***

Comment 11 Panu Matilainen 2007-08-10 11:00:10 UTC

Considering the timing of these hangs and crashes, most likely yet another
manifestation of the kernel mmap() bug - see
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213963#c65 for details.
Feel free to reopen if this still happens with current kernels.
