Bug 680508

Summary: rpm/yum cause mutex & rpmdb error messages
Product: Fedora
Component: rpm
Version: 19
Hardware: i386
OS: Unspecified
Severity: unspecified
Priority: unspecified
Status: CLOSED WORKSFORME
Reporter: Felix Miata <mrmazda>
Assignee: Packaging Maintenance Team <packaging-team-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: collura, ffesti, jones.peter.busi, pknirsch, pmatilai
Last Closed: 2014-11-14 08:52:58 UTC

Description Felix Miata 2011-02-25 18:29:19 UTC
Description of problem:
error: rpmdb: unable to allocate memory for mutex: resize mutex region

How reproducible:
Use yum to install packages

Steps to Reproduce:
1. Install minimal from alt.fedoraproject.org via HTTP to a previously prepared / partition
2. Reboot
3. Install various packages (mc, xorg, kde, firefox, seamonkey, etc.) with yum
 
Actual results:
1-lots of messages like "error: rpmdb: unable to allocate memory for mutex: resize mutex region"
2-packages reported as "installed" by yum are later reported as "not installed"
3-further attempts to install produce more of these messages than the few seen the first time it happens

Expected results:
1-no such errors
2-all packages selected for installation get installed

Additional info:
This has eventually happened on each of the past several installs over a period of more than two weeks. Today it was only a bit over an hour between installation completing and the first occurrence. So far today the errors have appeared in only one instance, and no "installed packages are not installed" messages have appeared yet. I've stopped attempting to proceed pending this bug. So far I only see these messages on screen, not anywhere I've found in any logs. The systems are older, with single-core CPUs such as K7 and P4, and with 1G or less RAM. Today's instance is a Sempron K7 (f6 m8 s1) @ 2000MHz using about half of 1G of total RAM, 68 of 923700 swap. / is EXT3, currently 48% full, 4875929 total 1K blocks.

Comment 1 Panu Matilainen 2011-02-28 05:15:04 UTC
If your filesystem blocksize is 1KB, as your report seems to indicate, that's almost certainly the problem. It's a long-standing bug that something about a 1KB block size upsets the rpmdb pretty badly; whether it's rpm code, something in Berkeley DB, the kernel, or a combination of the three, nobody has ever really figured out AFAIK.

So if 'stat -f /var/lib/rpm' says block size 1024, reformat with a 4KB block size, or put /var on a separate filesystem with a 4KB block size, and that should fix it.
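For reference, a rough command sketch of that check and workaround (the device name /dev/sdXN is a placeholder, and reformatting destroys the filesystem contents, so copy the data off first):

    stat -f /var/lib/rpm         # check the "Block size:" field; 1024 is the problem case
    mkfs.ext3 -b 4096 /dev/sdXN  # recreate the filesystem with a 4KB block size
                                 # (or make a separate 4KB-block filesystem and mount it on /var)
    rpm --rebuilddb              # rebuild the rpm database after restoring the data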

Comment 2 Felix Miata 2011-02-28 05:48:32 UTC
Efficient small-partition blocksize is only a problem on Fedora & Mandriva AFAIK. I know it's longstanding in Mandriva <https://qa.mandriva.com/show_bug.cgi?id=32547>, but AFAICR it has never before happened to me with Fedora. This has never happened to me using Knoppix, openSUSE or *buntu. Virtually all my / partitions are 4000M, 4400M or 4800M, where a larger-than-minimal blocksize is too inefficient to stay compatible with package managers that download all to-be-installed packages before installing any. A separate /var is too complex for system HDs that already have upwards of 20 partitions each.

Comment 3 Panu Matilainen 2011-02-28 07:51:18 UTC
I've seen it on Fedora numerous times; it's not a new thing by any means. Knoppix and *buntu use dpkg, which doesn't use Berkeley DB as the backend, and Suse patches their rpm to use a different BDB access style, which probably avoids the issue (but has other issues).

Obviously it IS a bug that should be fixed, wherever it lies: the db shouldn't care what the filesystem block size is. For now, however, the known workaround is to use a bigger block size.

Comment 4 Felix Miata 2011-03-30 18:29:50 UTC
This happened today on host gx260, an Intel P4 box, while trying to yum install system-config-printer from Konsole. I exited X, ran 'rpm --rebuilddb', rebooted to runlevel 3 to purge RAM, and then 'yum install system-config-printer' succeeded.
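A condensed sketch of that recovery sequence, using only the commands named above:

    rpm --rebuilddb                     # rebuild the rpmdb indexes under /var/lib/rpm
    # reboot to runlevel 3 (no X) to purge RAM, then:
    yum install system-config-printer   # the previously failing install then succeeded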

Comment 5 Felix Miata 2011-05-01 23:08:39 UTC
I installed F15 "fresh" (deleting files & dirs only, not reformatting) on the comment 0 system today. So far, packages have only been installed, none removed or upgraded, and this bug has yet to appear.

Comment 6 Felix Miata 2011-08-06 20:32:22 UTC
I installed F15 to an EXT3 partition with 2048 block size in June on host gx27b. Last night I did updates and got far fewer DB error messages than I remember having had with 1024 block size installations, and so far no apparent outright yum/rpm failures have occurred. I then copied the whole filesystem content to a freshly formatted EXT3 partition with 4096 block size, and to one with 1024 block size. df reports:

1024 part: 4876508   2991580   1639136  65% /disks/fedora
    /dev/sda16      4.7G  2.9G  1.6G  65% /disks/fedora
4096 part: 4877032   3290728   1340512  72% /mnt
    /dev/sda17      4.7G  3.2G  1.3G  72% /mnt
2048 part: 4875929   3006298   1623839  65% /mnt2
    /dev/sda18      4.7G  2.9G  1.6G  65% /mnt2

Clearly the extra overhead from using 4096 (4096 uses 109.46% of what 2048 uses) is excessive when small / partition sizes are required.
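(For reference, the 109.46% figure follows from the used-block counts in the df output above: 3290728 / 3006298 ≈ 1.0946, i.e. the 4096 block size copy occupies roughly 9.5% more space than the 2048 block size copy.)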

Next I enabled the 4096 partition to be booted. Once that worked, I installed fedora-release-16. Then I yum removed some packages I'd never have use for, then yum installed the unnecessarily removed deps, like kdebase-workspace, kdebase-session & firefox. All of that went fine. Only after that did I yum upgrade, and that produced a _lot_ of rpmdb errors. Afterward, simply doing rpm -q produced a huge error stream. Even rpm --rebuilddb produced one such error, but at least it stopped rpm queries from producing errors.

Realizing that not rebuilding the db first was a mistake, I redid the exercise: I purged the 4096 filesystem, recopied from the 2048 filesystem, and ran rpm --rebuilddb before installing fedora-release-16. Next I updated only yum* and *rpm*, then removed, then reinstalled the unnecessarily removed packages, then yum upgraded, all of which produced no apparent rpm db errors.
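A condensed sketch of the sequence that produced no errors (the remove step's package list is only a placeholder; the reinstalled names are the ones mentioned above):

    rpm --rebuilddb                                         # rebuild the db before anything else
    # install the fedora-release-16 package, then:
    yum update yum\* \*rpm\*                                # update the packaging stack first
    yum remove <packages never needed>                      # placeholder for the unwanted packages
    yum install kdebase-workspace kdebase-session firefox   # reinstall the unnecessarily removed deps
    yum upgrade                                             # full upgrade; no apparent rpmdb errors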

At conclusion of exercise, disk space was:
F15 on 2048 block size (992 from rpm -qa | wc -l):
  /dev/sda16       4876508   3032606   1598110  66% /disks/fedora
  /dev/sda16      4.7G  2.9G  1.6G  66% /disks/fedora
F16 on 4096 block size (1023 from rpm -qa | wc -l):
  /dev/sda17       4877032   3290096   1341144  72% /mnt
  /dev/sda17      4.7G  3.2G  1.3G  72% /mnt

Comment 7 Felix Miata 2011-08-07 02:10:11 UTC
Direct comparison of F16 content on 4096 copied to 1024 with neither mounted as /:
F16 on 4096 blocksize
    /dev/sda17       4877032 3310128   1321112  72% /mnt
    /dev/sda17      4.7G  3.2G  1.3G  72% /mnt
F16 on 1024 blocksize
    /dev/sda18       4875929 3001496   1628641  65% /mnt2
    /dev/sda18      4.7G  2.9G  1.6G  65% /mnt2
    
Disk /dev/sda: 120.0 GB, 120000000000 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda17       99474543   109306259     4915858+  83  Linux
/dev/sda18      109306323   119138039     4915858+  83  Linux

I originally did this last comparison to see if F16's newer rpm behaves better, but I can't tell any difference. rpm --rebuilddb generates a huge list of "unable to allocate memory for mutex" errors, which may be responsible for this dmesg tail:
[   43.337795] avahi-daemon[736]: avahi-daemon 0.6.30 starting up.
[   43.566860] systemd[1]: Cannot add dependency job for unit hwclock-load.service, ignoring: Unit hwclock-load.service failed to load: No such file or directory. See system logs and 'systemctl status hwclock-load.service' for details.
[   44.160256] multipathd (782): /proc/782/oom_adj is deprecated, please use /proc/782/oom_score_adj instead.
[   44.176723] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[   44.354510] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   44.549811] 
[   44.549814] =============================================
[   44.558273] [ INFO: possible recursive locking detected ]
[   44.558273] 3.0.0-1.fc16.i686.PAE #1
[   44.558273] ---------------------------------------------
[   44.558273] systemd-logind/758 is trying to acquire lock:
[   44.558273]  (&ep->mtx){+.+.+.}, at: [<c05333d5>] ep_scan_ready_list+0x32/0x154
[   44.558273] 
[   44.558273] but task is already holding lock:
[   44.558273]  (&ep->mtx){+.+.+.}, at: [<c0533874>] sys_epoll_ctl+0x103/0x481
[   44.558273] 
[   44.558273] other info that might help us debug this:
[   44.558273]  Possible unsafe locking scenario:
[   44.558273] 
[   44.558273]        CPU0
[   44.558273]        ----
[   44.558273]   lock(&ep->mtx);
[   44.558273]   lock(&ep->mtx);
[   44.558273] 
[   44.558273]  *** DEADLOCK ***
[   44.558273] 
[   44.558273]  May be due to missing lock nesting notation
[   44.558273] 
[   44.558273] 2 locks held by systemd-logind/758:
[   44.558273]  #0:  (epmutex){+.+.+.}, at: [<c053382f>] sys_epoll_ctl+0xbe/0x481
[   44.558273]  #1:  (&ep->mtx){+.+.+.}, at: [<c0533874>] sys_epoll_ctl+0x103/0x481
[   44.558273] 
[   44.558273] stack backtrace:
[   44.558273] Pid: 758, comm: systemd-logind Not tainted 3.0.0-1.fc16.i686.PAE #1
[   44.558273] Call Trace:
[   44.558273]  [<c084611e>] ? printk+0x2d/0x2f
[   44.558273]  [<c0473187>] __lock_acquire+0x805/0xb57
[   44.558273]  [<c040e02b>] ? sched_clock+0x8/0xb
[   44.558273]  [<c0465034>] ? sched_clock_local+0x10/0x18b
[   44.558273]  [<c05333d5>] ? ep_scan_ready_list+0x32/0x154
[   44.558273]  [<c04738ed>] lock_acquire+0xad/0xe4
[   44.558273]  [<c05333d5>] ? ep_scan_ready_list+0x32/0x154
[   44.558273]  [<c084d855>] __mutex_lock_common+0x49/0x2ee
[   44.558273]  [<c05333d5>] ? ep_scan_ready_list+0x32/0x154
[   44.558273]  [<c043a7a8>] ? __might_sleep+0x29/0xfb
[   44.558273]  [<c04727b6>] ? mark_lock+0x26/0x1f2
[   44.558273]  [<c084dc14>] mutex_lock_nested+0x43/0x49
[   44.558273]  [<c05333d5>] ? ep_scan_ready_list+0x32/0x154
[   44.558273]  [<c05333d5>] ep_scan_ready_list+0x32/0x154
[   44.558273]  [<c0532faf>] ? ep_remove+0x9b/0x9b
[   44.558273]  [<c053350b>] ep_poll_readyevents_proc+0x14/0x16
[   44.558273]  [<c05331ba>] ep_call_nested.constprop.2+0x6d/0x9a
[   44.558273]  [<c05334f7>] ? ep_scan_ready_list+0x154/0x154
[   44.558273]  [<c05332b6>] ep_eventpoll_poll+0x45/0x55
[   44.558273]  [<c0533970>] sys_epoll_ctl+0x1ff/0x481
[   44.558273]  [<c05330df>] ? ep_send_events_proc+0xd5/0xd5
[   44.558273]  [<c084f344>] syscall_call+0x7/0xb

Comment 8 Felix Miata 2012-05-29 15:51:58 UTC
I've been using mostly 2048 blocksize for a while to try to minimize this, and while that doesn't guarantee complete avoidance, I have so far avoided it. The last time it occurred was on Mandriva, and I may have found a workaround. After the encounter via 'urpmi --auto-select' (akin to yum upgrade), I did 'swapoff -a', then repeated 'urpmi --auto-select', which completed without errors. My next Fedora installs, as time permits, will be transferred to 1024 blocksize partitions to follow up here.
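A minimal transcript-style sketch of that workaround (urpmi --auto-select being Mandriva's rough equivalent of 'yum upgrade'):

    urpmi --auto-select   # hit the mutex/rpmdb errors
    swapoff -a            # turn off all swap
    urpmi --auto-select   # the repeated run completed without errors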

Comment 9 Fedora End Of Life 2013-04-03 17:06:04 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could also affect pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during the Fedora 19 End Of Life. Thank you.)

More information and the reason for this action are here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 10 Felix Miata 2014-11-14 08:52:58 UTC
I've not experienced this in over two years, in any of F19, F20, F21, Rawhide or Mageia. -> worksforme