Bug 206275

Summary: rpmq running as root gets stuck
Product: [Fedora] Fedora Reporter: Horst H. von Brand <vonbrand>
Component: rpm Assignee: Panu Matilainen <pmatilai>
Status: CLOSED WORKSFORME QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: rawhide CC: amk, bill-bugzilla.redhat.com, panagopoulosalexandrou, trevor
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-08-10 11:00:10 UTC Type: ---

Description Horst H. von Brand 2006-09-13 13:22:06 UTC
Description of problem:
Ran "rpm -q redhat-artwork" as root after updating today (for BZ), and it didn't
come back. After placing it into the background, rpmq was still shown as running.
From another gnome-terminal, as a normal user, the same query returned immediately.
After killing off the rpmq process (had to -KILL it, it wouldn't respond
otherwise), a new run now hangs:

[root@laptop13 ~]# ps -l -p 3712
F S   UID   PID  PPID  C PRI  NI ADDR SZ WCHAN  TTY          TIME CMD
4 S     0  3712  3094  0  75   0 -  3532 futex  pts/0    00:00:00 rpmq

Version-Release number of selected component (if applicable):
rpm-4.4.2-32

How reproducible:

Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Jeff Johnson 2006-09-13 18:05:03 UTC
Look for stale locks by running (as root)
    cd /var/lib/rpm
    /usr/lib/rpm/rpmdb_stat -CA

Otherwise just do
    rm -f /var/lib/rpm/__db*
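
A slightly more cautious sketch of the second option: copy the __db* environment files aside before removing them, so they can be inspected later. The function name clear_rpmdb_env and the backup directory are hypothetical, not part of rpm, and this assumes no rpm or yum process is running at the time.

```shell
# Hypothetical helper: back up, then remove, the Berkeley DB environment
# files (__db*) in an rpm database directory. Packages and the index
# files are left untouched.
clear_rpmdb_env() {
    dbdir="${1:-/var/lib/rpm}"
    backup="${2:-$dbdir/env-backup}"   # assumed backup location
    mkdir -p "$backup" || return 1
    for f in "$dbdir"/__db*; do
        [ -e "$f" ] || continue        # glob matched nothing
        cp -p "$f" "$backup"/ || return 1
        rm -f "$f"
    done
}
```

Called without arguments as root, it operates on /var/lib/rpm, matching the rm -f command above.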

Comment 2 Trevor Cordes 2006-11-21 16:26:18 UTC
Not sure if this is related, but an FC5 box that I just noticed running very
slowly has the following line in top:

 9639 root      25   0 11516 1128 1000 R 41.3  0.4  12667:18 rpmq

Ouch... must have been running for weeks.  kill SIGINT won't kill it.  -9 did.

I'm not sure if this has messed up the rpm db or not -- I will cross that bridge
when I come to it.
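
For reference, a minimal sketch of escalating signals instead of going straight to -9: try SIGINT, then SIGTERM, and only then SIGKILL. The function name kill_escalate and the grace period are assumptions for illustration. Note that a process killed with SIGKILL cannot clean up after itself, so any locks it held in the rpmdb may be left behind.

```shell
# Hypothetical helper: kill_escalate PID [GRACE_SECONDS]
# Sends INT, then TERM, waiting up to GRACE_SECONDS after each,
# and falls back to KILL if the process is still there.
kill_escalate() {
    pid="$1"
    grace="${2:-5}"
    for sig in INT TERM; do
        kill -s "$sig" "$pid" 2>/dev/null || return 0   # already gone
        i=0
        while [ "$i" -lt "$grace" ]; do
            kill -0 "$pid" 2>/dev/null || return 0      # exited after the signal
            sleep 1
            i=$((i + 1))
        done
    done
    kill -s KILL "$pid" 2>/dev/null || true
    return 0
}
```

For the process in the top output above, that would be `kill_escalate 9639`.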

Comment 3 Jeff Johnson 2006-12-03 18:37:33 UTC
Segfaults and loss of data are likely due to removing an rpmdb environment
without correcting other problems in the rpmdb.

FYI: Most rpmdb "hangs" are now definitely fixed by purging stale read locks when opening
a database environment in rpm-4.4.8-0.4. There's more to do, but I'm quite sure that a
large class of problems with symptoms of "hang" is now corrected.

Detecting damage by verifying when needed is well automated in rpm-4.4.8-0.4. Automatically
correcting all possible damage is going to take more work, but a large class of problems is likely
already fixed in rpm-4.4.8-0.8 as well.

UPSTREAM

Comment 4 Horst H. von Brand 2007-01-02 17:47:48 UTC
rpmq from rpm-4.4.2-38.fc7 got stuck (shown as running, but no CPU usage IIRC)
when trying to run makewhatis (man-1.6e-1.fc7) recently. apropos(1) didn't know a
thing, so this might have happened a few times before. After rebooting and
successfully updating openmotif-->lesstif, makewhatis went through.

Comment 5 Bill McGonigle 2007-02-16 19:50:50 UTC
I frequently see fc5 and fc6 machines with rpmq hung as described above, which
also blocks any other rpm operations (yum, rpm, etc.), and I hear from
colleagues that it's common.

Is it possible to backport the stale read lock purge bugfix from 4.4.8 to 4.4.2?

Comment 6 Trevor Cordes 2007-02-16 19:58:26 UTC
The other day I had a yum update hang on a box that had 192MB of RAM and for
some reason had the swap space disabled (fstab labelling issue).  I'm sure the
above problems are something else, as I've had rpm/yum hangs on boxes with 2GB
of RAM and 5GB swap.  But if you have a crappy box, check if your swap is
enabled!  And maybe the tools should nicely die with "out of mem" errors rather
than hanging?
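
A quick way to do the swap check suggested above, as a minimal sketch. It assumes a Linux /proc/swaps with one header line followed by one line per active swap area; the function name swap_enabled is hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) if at least one swap area is active.
# The optional argument is the file to read; it defaults to /proc/swaps.
swap_enabled() {
    swaps="${1:-/proc/swaps}"
    # Skip the header line; any further non-empty line is an active swap area.
    awk 'NR > 1 && NF > 0 { found = 1 } END { exit !found }' "$swaps"
}
```

Usage: `swap_enabled || echo "no swap is active"`.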


Comment 7 Bill McGonigle 2007-02-16 20:15:15 UTC
The above info about stale locks was helpful.  Running 'rpmdb_stat -CA' shows,
during the period when nothing rpm-related works:

Locks grouped by object:
Locker   Mode      Count Status  ----------------- Object ---------------
      36 READ          2 HELD    0x353b8 len:  20 data:
0x11L0x06000x040x030000+.0xf20xbe0xf10x10000000000000

      35 READ          1 HELD    (64c11 304 bef22e2b 10f1 0) handle        0

Then when I kill the stuck processes and run db_recover in /var/lib/rpm,
'rpmdb_stat -CA' reports:

  db_stat: DB_ENV->open: No such file or directory

and then rpm transactions succeed as one would expect.

Comment 8 Jeff Johnson 2007-02-16 20:57:07 UTC
Technically, if there was a running process holding a lock, then the lock was not stale.

"Stale locks" is the term for locks that are not held by any current process.
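
To illustrate the distinction, a minimal sketch that filters a list of PIDs down to those with no running process behind them. It assumes the PIDs of the candidate lock holders are already known; the rpmdb_stat output quoted in this report shows locker IDs, and how to map those to PIDs is not covered here. The function name stale_pids is hypothetical, and the check relies on the Linux /proc layout.

```shell
# Hypothetical helper: read PIDs (one per line) on stdin and print those
# that do not belong to any current process. By the definition above, a
# lock owned by such a PID would be stale.
# The optional argument is the proc directory; it defaults to /proc.
stale_pids() {
    procdir="${1:-/proc}"
    while read -r pid; do
        case "$pid" in
            ''|*[!0-9]*) continue ;;   # skip anything that is not a number
        esac
        [ -d "$procdir/$pid" ] || printf '%s\n' "$pid"
    done
}
```

Usage: `printf '%s\n' 3712 9639 | stale_pids` prints only the PIDs that no longer exist.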



Comment 9 Bill McGonigle 2007-02-16 23:01:25 UTC
Jeff - you point out an error in my previous comment.  Thanks.

The true order of operations above was: find the stuck process, kill (-9) it,
view the locks with rpmdb_stat (still there), run db_recover (no longer there),
run the next process.  

Please excuse the brainfart, comment #7 as written was inaccurate and non-useful. 
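
The corrected order of operations can be sketched as a small wrapper. This is only an illustration under assumptions: the function names run and recover_rpmdb and the DRY_RUN and RPMDB_DIR variables are hypothetical, the rpmdb_stat path is the one given in comment 1, and db_recover is assumed to be on the PATH. The PIDs of the stuck processes have to be found beforehand.

```shell
# With DRY_RUN set, print each command instead of executing it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: recover_rpmdb PID...
recover_rpmdb() {
    dbdir="${RPMDB_DIR:-/var/lib/rpm}"
    for pid in "$@"; do
        run kill -9 "$pid"                                  # 1. kill the stuck process
    done
    ( cd "$dbdir" && run /usr/lib/rpm/rpmdb_stat -CA )      # 2. locks are still listed
    ( cd "$dbdir" && run db_recover )                       # 3. recover the environment
    ( cd "$dbdir" && run /usr/lib/rpm/rpmdb_stat -CA )      # 4. check the locks again
}
```

Running `DRY_RUN=1 recover_rpmdb 3712` first shows the commands without touching the database.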


Comment 10 Panu Matilainen 2007-07-17 19:57:39 UTC
*** Bug 213892 has been marked as a duplicate of this bug. ***

Comment 11 Panu Matilainen 2007-08-10 11:00:10 UTC
Considering the timing of these hangs and crashes, this is most likely yet
another manifestation of the kernel mmap() bug; see
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213963#c65 for details.
Feel free to reopen if this still happens with current kernels.