| Summary: | rpm/yum cause mutex & rpmdb error messages | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Felix Miata <mrmazda> |
| Component: | rpm | Assignee: | Packaging Maintenance Team <packaging-team-maint> |
| Status: | CLOSED WORKSFORME | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 19 | CC: | collura, ffesti, jones.peter.busi, pknirsch, pmatilai |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i386 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-11-14 08:52:58 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Felix Miata
2011-02-25 18:29:19 UTC
If your filesystem blocksize is 1KB as your report seems to indicate, that's almost certainly the problem. It's a long-standing bug that something about 1KB block size upsets the rpmdb pretty badly - whether it's rpm code, something in Berkeley DB, the kernel, or the combination of the three, nobody has ever really figured out AFAIK. So if 'stat -f /var/lib/rpm' says block size 1024, reformat with a 4KB block size, or put /var on a separate file system with a 4KB block size, and that should fix the thing.

An efficient small-partition blocksize is only a problem on Fedora & Mandriva AFAIK. I know it's longstanding in Mandriva <https://qa.mandriva.com/show_bug.cgi?id=32547>, but AFAICR it's never before happened to me with Fedora. This has never happened to me using Knoppix, openSUSE or *buntu. Virtually all my / partitions are 4000M, 4400M or 4800M, where a larger-than-minimal blocksize is too inefficient for compatibility with package managers that download all to-be-installed packages before installing any. A separate /var is too complex for system HDs with upwards of 20 partitions each already. I've seen it on Fedora numerous times; it's not a new thing by any means.

Knoppix and *buntu use dpkg, which doesn't use Berkeley DB as the backend, and SUSE patches their rpm to use a different BDB access style, which probably avoids the issue (but has other issues). Obviously it IS a bug that should be fixed - wherever it is - the db shouldn't care what the filesystem block size is. For now, however, the known workaround is to use a bigger block size.

This happened today on host gx260, an Intel P4 box, trying to yum install system-config-printer from Konsole. I exited X, ran 'rpm --rebuilddb', rebooted to runlevel 3 to purge RAM, and yum install system-config-printer succeeded.

I installed F15 "fresh" (deleting files & dirs only, not reformatting) on the comment 0 system today. So far, packages have only been installed, none removed or upgraded, and this bug has yet to appear.

I installed F15 to an EXT3 partition with 2048 block size in June on host gx27b. Last night I did updates and got far fewer DB error messages than I remember having had with 1024 block size installations, and so far no apparent outright yum/rpm failures have occurred. I then copied the whole filesystem content to a freshly formatted EXT3 partition with 4096 block size, and to one with 1024 block size. df reports:
1024 part: 4876508 2991580 1639136 65% /disks/fedora
/dev/sda16 4.7G 2.9G 1.6G 65% /disks/fedora
4096 part: 4877032 3290728 1340512 72% /mnt
/dev/sda17 4.7G 3.2G 1.3G 72% /mnt
2048 part: 4875929 3006298 1623839 65% /mnt2
/dev/sda18 4.7G 2.9G 1.6G 65% /mnt2
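For reference, the check-and-reformat workaround suggested above looks roughly like this; the device name is a placeholder, so treat it as a sketch rather than a tested recipe:

stat -f /var/lib/rpm          # look for "Block size: 1024"
mkfs.ext3 -b 4096 /dev/sdXN   # reformat a spare partition with 4KB blocks (destroys its contents)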
Clearly the extra overhead from using 4096 is excessive when small / partition sizes are required: 3290728 KB used versus 3006298 KB on the 2048 partition, i.e. 4096 uses 109.46% of what 2048 uses.
Next I enabled the 4096 partition to be booted. Once that worked, I installed fedora-release-16. Then I yum removed some packages I'd never have use for, then yum installed the unnecessarily removed deps, like kdebase-workspace, kdebase-session & firefox. All that went fine. Only after that did I yum upgrade, and that produced a _lot_ of rpmdb errors. Afterward, simply doing rpm -q produced a huge error stream. Even rpm --rebuilddb produced one such error, but at least it stopped rpm queries from producing errors. Realizing that not rebuilding the db first was a mistake, I redid the exercise: I purged the 4096 filesystem, recopied from the 2048 filesystem, and ran rpm --rebuilddb before installing fedora-release-16. Next I updated only yum* and *rpm*, then removed, then reinstalled the unnecessarily removed packages, then yum upgraded, all of which produced no apparent rpm db errors. Reconstructed as a command sequence, the second, error-free order was roughly as follows (the fedora-release install step is a hypothetical rendering of "installed fedora-release-16"):
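rpm --rebuilddb                 # rebuild the db before touching the package set
yum install fedora-release-16\* # hypothetical spec for the release package
yum update 'yum*' '*rpm*'       # update the packaging tools themselves first
yum upgrade                     # then the full upgrade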
At conclusion of exercise, disk space was:
F15 on 2048 block size (992 from rpm -qa | wc -l):
/dev/sda16 4876508 3032606 1598110 66% /disks/fedora
/dev/sda16 4.7G 2.9G 1.6G 66% /disks/fedora
F16 on 4096 block size (1023 from rpm -qa | wc -l):
/dev/sda17 4877032 3290096 1341144 72% /mnt
/dev/sda17 4.7G 3.2G 1.3G 72% /mnt
Direct comparison of F16 content on 4096 copied to 1024 with neither mounted as /:
F16 on 4096 blocksize
/dev/sda17 4877032 3310128 1321112 72% /mnt
/dev/sda17 4.7G 3.2G 1.3G 72% /mnt
F16 on 1024 blocksize
/dev/sda18 4875929 3001496 1628641 65% /mnt2
/dev/sda18 4.7G 2.9G 1.6G 65% /mnt2
Disk /dev/sda: 120.0 GB, 120000000000 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
Device Boot Start End Blocks Id System
/dev/sda17 99474543 109306259 4915858+ 83 Linux
/dev/sda18 109306323 119138039 4915858+ 83 Linux
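For anyone wanting to repeat the copy-between-blocksizes comparison, a minimal sketch using the equal-sized partitions from the fdisk listing above (mount points are illustrative, and mkfs destroys existing contents):

mkfs.ext3 -b 4096 /dev/sda17          # same partition size, different block sizes
mkfs.ext3 -b 1024 /dev/sda18
mount /dev/sda17 /mnt
mount /dev/sda18 /mnt2
cp -a /disks/fedora/. /mnt/           # copy the installed tree, preserving metadata
cp -a /disks/fedora/. /mnt2/
df /mnt /mnt2 && df -h /mnt /mnt2     # compare usage in 1K blocks and human-readable form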
I originally did this last comparison to see if F16's newer rpm behaves better, but I can't tell any difference. rpm --rebuilddb generates a huge list of "unable to allocate memory for mutex" errors, which may be responsible for this dmesg tail:
[ 43.337795] avahi-daemon[736]: avahi-daemon 0.6.30 starting up.
[ 43.566860] systemd[1]: Cannot add dependency job for unit hwclock-load.service, ignoring: Unit hwclock-load.service failed to load: No such file or directory. See system logs and 'systemctl status hwclock-load.service' for details.
[ 44.160256] multipathd (782): /proc/782/oom_adj is deprecated, please use /proc/782/oom_score_adj instead.
[ 44.176723] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
[ 44.354510] ip6_tables: (C) 2000-2006 Netfilter Core Team
[ 44.549811]
[ 44.549814] =============================================
[ 44.558273] [ INFO: possible recursive locking detected ]
[ 44.558273] 3.0.0-1.fc16.i686.PAE #1
[ 44.558273] ---------------------------------------------
[ 44.558273] systemd-logind/758 is trying to acquire lock:
[ 44.558273] (&ep->mtx){+.+.+.}, at: [<c05333d5>] ep_scan_ready_list+0x32/0x154
[ 44.558273]
[ 44.558273] but task is already holding lock:
[ 44.558273] (&ep->mtx){+.+.+.}, at: [<c0533874>] sys_epoll_ctl+0x103/0x481
[ 44.558273]
[ 44.558273] other info that might help us debug this:
[ 44.558273] Possible unsafe locking scenario:
[ 44.558273]
[ 44.558273] CPU0
[ 44.558273] ----
[ 44.558273] lock(&ep->mtx);
[ 44.558273] lock(&ep->mtx);
[ 44.558273]
[ 44.558273] *** DEADLOCK ***
[ 44.558273]
[ 44.558273] May be due to missing lock nesting notation
[ 44.558273]
[ 44.558273] 2 locks held by systemd-logind/758:
[ 44.558273] #0: (epmutex){+.+.+.}, at: [<c053382f>] sys_epoll_ctl+0xbe/0x481
[ 44.558273] #1: (&ep->mtx){+.+.+.}, at: [<c0533874>] sys_epoll_ctl+0x103/0x481
[ 44.558273]
[ 44.558273] stack backtrace:
[ 44.558273] Pid: 758, comm: systemd-logind Not tainted 3.0.0-1.fc16.i686.PAE #1
[ 44.558273] Call Trace:
[ 44.558273] [<c084611e>] ? printk+0x2d/0x2f
[ 44.558273] [<c0473187>] __lock_acquire+0x805/0xb57
[ 44.558273] [<c040e02b>] ? sched_clock+0x8/0xb
[ 44.558273] [<c0465034>] ? sched_clock_local+0x10/0x18b
[ 44.558273] [<c05333d5>] ? ep_scan_ready_list+0x32/0x154
[ 44.558273] [<c04738ed>] lock_acquire+0xad/0xe4
[ 44.558273] [<c05333d5>] ? ep_scan_ready_list+0x32/0x154
[ 44.558273] [<c084d855>] __mutex_lock_common+0x49/0x2ee
[ 44.558273] [<c05333d5>] ? ep_scan_ready_list+0x32/0x154
[ 44.558273] [<c043a7a8>] ? __might_sleep+0x29/0xfb
[ 44.558273] [<c04727b6>] ? mark_lock+0x26/0x1f2
[ 44.558273] [<c084dc14>] mutex_lock_nested+0x43/0x49
[ 44.558273] [<c05333d5>] ? ep_scan_ready_list+0x32/0x154
[ 44.558273] [<c05333d5>] ep_scan_ready_list+0x32/0x154
[ 44.558273] [<c0532faf>] ? ep_remove+0x9b/0x9b
[ 44.558273] [<c053350b>] ep_poll_readyevents_proc+0x14/0x16
[ 44.558273] [<c05331ba>] ep_call_nested.constprop.2+0x6d/0x9a
[ 44.558273] [<c05334f7>] ? ep_scan_ready_list+0x154/0x154
[ 44.558273] [<c05332b6>] ep_eventpoll_poll+0x45/0x55
[ 44.558273] [<c0533970>] sys_epoll_ctl+0x1ff/0x481
[ 44.558273] [<c05330df>] ? ep_send_events_proc+0xd5/0xd5
[ 44.558273] [<c084f344>] syscall_call+0x7/0xb
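As an aside on the "unable to allocate memory for mutex" messages themselves: Berkeley DB reads a DB_CONFIG file from its environment directory (/var/lib/rpm for the rpmdb) when the environment opens, and the mutex region can be enlarged there. A sketch only, assuming the stock BDB-backed rpmdb; the value is an arbitrary illustration, not a tested tuning for this bug:

echo 'mutex_set_max 300000' >> /var/lib/rpm/DB_CONFIG   # enlarge the BDB mutex region
rpm --rebuilddb   # the setting takes effect when the db environment is recreated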
I've been using mostly 2048 blocksize for a while to minimize this; it doesn't guarantee complete avoidance, but so far it has avoided the problem. The last time it occurred was on Mandriva, and I may have found a workaround. After the encounter via 'urpmi --auto-select' (akin to yum upgrade), I did 'swapoff -a', then repeated 'urpmi --auto-select', which completed without errors. My next Fedora installs, as time permits, will be transferred to 1024 blocksize partitions to follow up here.

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle. Changing version to '19'. (As we did not run this process for some time, it could also affect pre-Fedora 19 development cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.) More information and the reason for this action is here: https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

I've not experienced this in over two years, in any of F19, F20, F21, Rawhide or Mageia. -> worksforme