Bug 468437
Summary: | rpmdb environment corruption on ext4
---|---
Product: | Fedora
Component: | kernel
Version: | 10
Hardware: | All
OS: | Linux
Status: | CLOSED WONTFIX
Severity: | high
Priority: | high
Reporter: | Thomas J. Baker <tjb>
Assignee: | Eric Sandeen <esandeen>
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
CC: | aquarichy, archit.shah, bennet, bugzilla.redhat.com, dew, dyoung, ffesti, guitarheadritz, jfrieben, jnovy, kernel-maint, mark, mcepl, me, mishu, n3npq, placeholder, pmatilai, rhughes
Doc Type: | Bug Fix
Bug Blocks: | 438944
Last Closed: | 2009-12-18 06:38:42 UTC
The 2nd error is dependent on the first. Any abnormal exit (in this case the yum segfault is an abnormal exit) while actively using an rpmdb will leave a stale lock. The 2nd message is just informing you of the stale lock.

Please enable coredumps on the systems where you're seeing this; since it happens so irregularly, it's next to impossible to catch otherwise. When we have a coredump and a backtrace there's a chance of looking into it. One possibility is PackageKit going wild with threads - I know it is threaded to some extent, but whether it does so in a way that would affect rpm I dunno.
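A minimal sketch of enabling core dumps for this kind of irregular crash; the core_pattern value and limits.conf line are illustrative assumptions, not settings taken from this report:

# remove the per-process core size limit in the shell that reproduces the crash
ulimit -c unlimited
# name cores by executable and pid and put them somewhere findable (pattern assumed)
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
# to make the limit stick for all users across logins:
echo '* soft core unlimited' >> /etc/security/limits.conf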
Created attachment 322478 [details]
bzip'd core file
Created attachment 322481 [details]
bzip'd core file from rpm
I just added an rpm core file from the i686 machine (the smaller second one) and a get-updates core file from the x86_64 machine. Thanks.

It's dying while doing what rpm does all the time: walking over the rpmdb. While it's not impossible that there's some obscure bug in that code, my suspicion is that PackageKit is accessing the rpmdb from different threads, which is not going to work (rpm isn't thread-safe). Adding Richard to the CC list.

PackageKit spawns a new process to talk to yum, using stdin and stdout as IPC. It does not use threads at all when using a spawned backend like yum.

Created attachment 322856 [details]
record of the gnome-terminal session

(In reply to comment #8)
> PackageKit spawns a new process to talk to yum using stdin and stdout as IPC.
> It does not use threads at all when using a spawned backend like yum.

Moreover, I got this beauty just with debuginfo-install.

Created attachment 323311 [details]
backtrace of yum crashing with yum upgrade
So, I've got this beauty just with yum upgrade.
rpm-4.6.0-0.rc1.7.x86_64
compat-db45-4.5.20-5.fc10.x86_64
yum-3.2.20-3.fc10.noarch
compat-db46-4.6.21-5.fc10.x86_64
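Since several of the attachments here are backtraces pulled from core files, for reference this is roughly how such a backtrace is produced; the core path below is a placeholder and the debuginfo package names are assumptions for an F10-era system:

# pull in debugging symbols for the involved components (package names assumed)
debuginfo-install rpm yum compat-db45
# load the core against the binary that produced it (yumBackend.py runs under python)
gdb /usr/bin/python /path/to/core
# then, inside gdb, dump the stack of every thread:
#   (gdb) thread apply all bt full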
Created attachment 323329 [details]
output of package-cleanup --problems
not all that successful either
The package-cleanup thing looks unrelated to me. What filesystem is used for /var/lib/rpm on the systems that are seeing these crashes, and what does 'stat -f /var/lib/rpm' say?

On both my systems, I'm using ext4 root filesystems. The stat on i386:

[root@katratzi tjb]# stat -f /var/lib/rpm
  File: "/var/lib/rpm"
    ID: 8836a5c2ae603191 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 2580302    Free: 1521700    Available: 1390628
Inodes: Total: 655360     Free: 504638

On x86_64:

[root@continuity tjb]# stat -f /var/lib/rpm
  File: "/var/lib/rpm"
    ID: a06270a9a5aa7028 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 2580302    Free: 661026    Available: 529954
Inodes: Total: 655360     Free: 421439

[matej@viklef redhat]$ stat -f /var/lib/rpm/
  File: "/var/lib/rpm/"
    ID: 2d11fc8cad475559 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 7435466    Free: 697310    Available: 313515
Inodes: Total: 7677920    Free: 7321099

Yes, ext4 as well.
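A side note on the output above: 'stat -f' prints "Type: ext2/ext3" even for ext4, since the three filesystems share the same statfs magic number, so it cannot answer the filesystem question by itself. A quick way to see the real mounted type (the paths are just examples):

# ask the mount table rather than statfs for the actual filesystem type
df -T /var/lib/rpm
# or read it straight from the kernel's mount list
grep ' / ' /proc/mounts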
*** Bug 471411 has been marked as a duplicate of this bug. ***

Seems we have a fairly clear pattern to this: everybody having these issues is using ext4. In addition to what's reported here, the thread at https://www.redhat.com/archives/fedora-test-list/2008-November/msg00529.html shows a couple more users having similar issues, all with ext4. Rpm (or actually Berkeley DB) is rather allergic to subtle filesystem bugs which almost nothing else seems to hit; the symptoms here aren't entirely unlike the mmap bug around 2.6.18-2.6.19, and similar issues happen even on ext3 with a blocksize of 1024. One of the users on the test-list thread above mentions the environment corruption starting around kernel-2.6.27.4-26.fc9.x86_64; maybe that provides some hints where to start looking. And over to kernel folks...

... but I'll try to see if I can come up with some sort of reproducer for this. Anyone who can hit this regularly, if you'd like to step back through older kernels looking for a regression, that could be a big help. Thanks, -Eric

Actually, first, can folks experiencing this try the very latest F10 kernels (2.6.27.5-90 or later) and see if the problem persists? It contains this fix:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=ed9b3e3379731e9f9d2f73f3d7fd9e7d2ce3df4a

    ext4: Mark the buffer_heads as dirty and uptodate after prepare_write

    We need to make sure we mark the buffer_heads as dirty and uptodate so that
    block_write_full_page writes them correctly. This fixes mmap corruptions
    that can occur in low memory situations.

Thanks, -Eric

My last core dump is from 11/10. I installed the -94 kernel on 11/11 (and others since) and haven't had a core dump since then. Before that I was getting them nearly once a day. Probably too early to call it fixed, but it's looking good so far.

Unfortunately, I just got another instance of breakage. I didn't find a core file though.

Thomas, and this was on the newer kernel, I guess... hm. I'll try to reproduce. I wonder if this can be lingering effects from a corrupted db under the older kernels... or if we still have the bug.

FWIW I don't get this problem anymore with the new kernels.

# rpm -q kernel
kernel-2.6.27.5-104.fc10.i686
kernel-2.6.27.5-109.fc10.i686
kernel-2.6.27.5-113.fc10.i686

This is an HP Compaq nx8220 laptop.

FWIW, I just had another instance on a different system. It is possible that the effects are lingering from previous non-patched kernels, but I have been running >-90 kernels for over a week.

Yeah, happened to me yesterday again -- with -110 or -113 (not sure).

Thomas, Matej - has either of you rebuilt the db since updating to the newer kernel? I guess the best course of action (for someone... :) is to do a fresh install under a new kernel, and see if it persists. Thanks, -Eric

It happened to me twice on two different systems. I believe in both cases it was the first time it happened since running >=90 kernels, so the db hadn't been rebuilt post fixed kernel. It is possible that there were problems already present from before. My dedicated rawhide system is game for a reinstall. Hopefully, rawhide is installable today.

(In reply to comment #26)
> Thomas, Matej - has either of you rebuilt the db since updating to the newer
> kernel? I guess the best course of action (for someone... :) is to do a fresh
> install under a new kernel, and see if it persists.

I believe I did. Rebuilding again (rpm --rebuilddb is enough?) and let's see what happens (or what doesn't happen).

I just got another corruption on the same system I had it on on Saturday. That means a rebuilt db under a >90 kernel got corrupted again. I hope to get to a reinstall on my other system on Friday.

Saw it again on kernel 2.6.27.5-113.fc10.x86_64 as well. It seems to be much better than it was but still around.

And another corruption on a laptop whose rpmdb was rebuilt on Saturday and again yesterday. Either the rebuild is not fixing things or something is going on with the kernel. The problem seems worse in the last week after going away for about a week before that.

Happened to me just now -- 2.6.27.5-120.fc10.i686

If anyone is game for more testing, could you try

tune2fs -O ^uninit_bg /dev/blah

run e2fsck... and see if it persists? (This should be safe to do, although I'm not sure it's the most tested fsck codepath...) There are also some more upstream patches we can try in rawhide. -Eric (wondering why he's the only one who can't hit this) :/

Issue observed after a fresh install formatting w/ ext4 using the boot.iso image from 2008-11-21 and updating to "rawhide" plus updates including kernel-2.6.27.5-120.fc10.x86_64.

A suggestion & a question for those who can hit this: can you boot a rescue disk & run e2fsck on the root fs to see if anything has gone badly, after you see this corruption? You could also test with "nodelalloc" as a mount option to see if that's a differentiator. Thanks, -Eric
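A consolidated sketch of the tests Eric suggests above; /dev/blah is his placeholder, and note that a filesystem originally created as ext3 may not have the uninit_bg feature set at all, so it's worth checking first:

# see which features the filesystem actually carries
tune2fs -l /dev/blah | grep -i features
# drop uninit_bg (do this with the filesystem unmounted, e.g. from a rescue disk)
tune2fs -O ^uninit_bg /dev/blah
# force a full check afterwards
e2fsck -f /dev/blah
# separately, test with delayed allocation disabled via an fstab entry like:
#   /dev/blah  /  ext4  defaults,nodelalloc  1 1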
Created attachment 324579 [details]
stdout/stderr of fsck.ext4dev -v -f -c <device with root>
Tried first just to reboot to runlevel 1, mount -o ro,remount <dev with root>, and run fsck -- got two inconsistent inode counts, but didn't keep a log (which might be because mount -o ro,remount was not enough). Then rebooted from the Fedora 10 Beta DVD (unfortunately, I don't have any newer DVD at hand). After playing with tune2fs (and setting test_fs) I managed to run fsck.ext4dev from the hard drive on root and got the attached results (everything is OK). Just to be sure, the output of tune2fs -l:
tune2fs 1.41.0 (10-Jul-2008)
Filesystem volume name: root
Last mounted on: <not available>
Filesystem UUID: 1c6e1f3d-7d96-4e98-9092-0e1024c30935
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent sparse_super large_file
Filesystem flags: signed_directory_hash test_filesystem
Default mount options: user_xattr acl
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 7677920
Block count: 7675904
Reserved block count: 383795
Free blocks: 1249252
Free inodes: 7321336
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1022
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 32672
Inode blocks per group: 1021
Filesystem created: Thu Aug 23 20:58:07 2007
Last mount time: Tue Nov 25 08:18:00 2008
Last write time: Tue Nov 25 08:35:54 2008
Mount count: 572
Maximum mount count: -1
Last checked: Thu Aug 23 20:58:07 2007
Check interval: 0 (<none>)
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: 56971d2b-ed00-4520-afdd-2adc8df45cf8
Journal backup: inode blocks
Created attachment 324580 [details]
stdout of dmesg
This bug appears to have been reported against 'rawhide' during the Fedora 10 development cycle. Changing version to '10'. More information and the reason for this action is here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Just to be sure, is there anyone hitting this who does *not* have the rpm database on ext4? Thanks, -Eric

The other thing that might be interesting is whether the problem/corruption persists across a reboot, to see if this is maybe only a problem in memory and not on disk...

I ran into a similar problem on ext3. I installed from the release DVD, followed the updates, and ran into rpm database corruption on 12/8. /var/log/PackageKit contains:

TI:19:27:43 TH:0x176c4f0 FI:pk-spawn.c FN:pk_spawn_argv,472 trying to set timeout when already set

I am also willing to concede that my problem is unrelated. This is a T60 laptop that had problems with suspend/resume for the first few days, so I had a few hard reboots that could have been the source of the problem.

Claiming rpmdb "corruption" without supplying at least a hint of what failure symptom was seen is not going to help identify the flaw, or fix a (claimed, we'll see) ext4 file system problem. Databases are already hard to debug, but diagnosing a file system flaw with a database on top is even harder. Can you at least identify whether a PANIC message was seen with "corruption" reports, and (for extra credit) whether rpmdb_verify saw problems? What would be most interesting is which files in /var/lib/rpm are reported to have problems.

Eric asks:
> Just to be sure, is there anyone hitting this who does *not* have the rpm
> database on ext4?
Yes, I'm having exactly the same error.
Bennet, and you're on ext3 then, I guess?

Yes, I'm on ext3; sorry for not mentioning ;-) If I can get you any debug information, let me know, I'd gladly help.

I thought ext3 but wanted to be explicit about it. :) Just for the record, can you attach any relevant error messages from either rpm invocations, dmesg, /var/log/messages, whatever?

Yes, of course, but you'll have to wait for tomorrow. I'm not at home yet.

Hmm, having looked at my /var/log/* directories and grep'd for "rpmdb", there are no entries. The error, however, that I got during a 'yum update' was "rpmdb: Thread/Process ... failed: Thread died in Berkeley DB library". It could be fixed via 'rm /var/lib/rpm/__db*' and 'rpmdb --rebuilddb'. However, there might be an interesting entry in /var/log/PackageKit:

# cat /var/log/PackageKit
TI:02:41:51 TH:0x11c9510 FI:pk-spawn.c FN:pk_spawn_argv,488 trying to set timeout when already set

Here is the information about /var:

# stat -f /var
  File: "/var"
    ID: 7e0a6f3a23304d65 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 28310700    Free: 10602935    Available: 9164830
Inodes: Total: 7192576     Free: 6858268

Eric, if you could give me a hand on which further information would be of interest and how to obtain it, that would be great. (I'm really not that familiar with grep'ing around in /var/log directories, and I'm not experienced enough to know what to look at.)

If the only failure symptom was

rpmdb: Thread/Process ... failed: Thread died in Berkeley DB library

then that does not necessarily correlate with the (alleged) ext4 fs corruption being tracked here. The message will be displayed on the next rpmdb dbenv open for __ALL__ exceptional exits from rpmlib where premature termination is undertaken with a locked db cursor; in short, a "dirty exit". OTOH, the (alleged) ext4 corruption will usually cause multiple PANIC: ... DB_RUNRECOVERY messages.
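Pulling the recovery steps from this thread together, a sketch that first checks for genuine on-disk damage before rebuilding; the rpmdb_verify path is an assumption and has varied between releases:

cd /var/lib/rpm
# verify the primary database file; complaints here suggest real corruption,
# not just a dirty exit (tool path is an assumption)
/usr/lib/rpm/rpmdb_verify Packages
# remove the stale Berkeley DB environment files left by a dirty exit
rm -f /var/lib/rpm/__db*
# rebuild the indexes from the Packages file
rpm --rebuilddb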
(In reply to comment #39)
> Just to be sure, is there anyone hitting this who does *not* have the rpm
> database on ext4?

Hmm... just saw this on kernel-2.6.29-0.43.rc2.git1.fc11.x86_64, on ext3. No ominous messages in the logs, but it did start after I tried updating through PackageKit. 'rm /var/lib/rpm/__db*' and 'rpmdb --rebuilddb' fixed things for me.

btw, uname -a:
Linux bluesun 2.6.27.12-170.2.5.fc10.i686.PAE #1 SMP Wed Jan 21 01:54:56 EST 2009 i686 i686 i386 GNU/Linux
and using ext3.

I haven't seen this issue for months. Is it still relevant?

I had to do this recently to a couple of Fedora 10 boxes, but once done, it's no longer a problem.

(In reply to comment #52)
> I haven't seen this issue for months. Is it still relevant?

Me neither.

(In reply to comment #54)
> (In reply to comment #52)
> > I haven't seen this issue for months. Is it still relevant?
>
> Me neither.

Me neither.

A fresh install of Fedora 11 Beta replicated the others' descriptions of this bug exactly on the first post-install yum update. This was after the first full successful boot. Yum indicated that the update completed successfully, and then subsequently all rpm-related tools failed with Berkeley DB related messages. --rebuilddb recovery was successful, however. No other hardware or software issues since first boot. No crashes of anything else of any kind.

(In reply to comment #56)
> A fresh install of Fedora 11 Beta replicated the others' descriptions of this
> bug exactly on the first post-install yum update.

Sigh. And I thought maybe we were out of the woods on this one. :) Anything interesting about your hardware? Maybe memory & CPU count, for starters?

System is an Intel Core 2 Duo P8600 on Centrino 2 (Montevina chipset) with 2GB of DDR3 1333. I sent in the installation report when asked in the installer, so perhaps you have that information somewhere...

Jason, thanks. The hw profile stuff is anonymous, so unless you share the UUID for the profile we won't know. I guess the main thing I wondered was whether it's dual core, and it is... hm.

Do note that the error can come up in a number of ways; just crashing with (write) locks held is sufficient to trigger this, and I can fairly reliably reproduce it by putting enough concurrent pressure on the rpmdb even on ext3. I do think there *was* something in ext4 back when this was reported that made rpm rather unhappy on ext4, but whether that's still the case... dunno.

Panu, should this be an RPM bug, then? :)

Created attachment 340596 [details]
error messages when yum tries to update
I'm not sure if this is the same bug, or a different bug.
I have a mildly old Dell GX270 that just had this happen for the second time in as many months. It's currently running F10, 512MB (minus video mem), 2-core P4 2.60GHz, available disk space just over 30GB, kernel 2.6.27.21-170.2.56.fc10.i686. Not sure what the file system is (nor how to check), but it would be whatever I would get by default.

@#63 you can check your filesystem type by simply typing "mount" in a terminal; there should be an entry like (e.g.) "/dev/mapper/VolGroup00 on / type ext3 (rw)". (May depend on how your filesystem is built up and screwed together ;-).) Haven't got this bug for months after the last time. The above steps "'rm /var/lib/rpm/__db*' && 'rpmdb --rebuilddb'" fixed the issue, so this might have been a one-time corruption. (Might it be reasonable to automatically run these commands if a "failed: Thread died in Berkeley DB library" error occurs, or at least point the user to these commands? Can these commands be harmful in any circumstances?)

I don't know much; this is the error I get when I try to update. It started 24 hrs back. Says PackageManager error:
-------------
Error Type: <type 'exceptions.TypeError'>
Error Value: rpmdb open failed
File : /usr/share/PackageKit/helpers/yum/yumBackend.py, line 2316, in <module>
    main()
File : /usr/share/PackageKit/helpers/yum/yumBackend.py, line 2312, in main
    backend = PackageKitYumBackend('', lock=True)
File : /usr/share/PackageKit/helpers/yum/yumBackend.py, line 182, in __init__
    self.yumbase = PackageKitYumBase(self)
File : /usr/share/PackageKit/helpers/yum/yumBackend.py, line 2255, in __init__
    self.repos.confirm_func = self._repo_gpg_confirm
File : /usr/lib/python2.5/site-packages/yum/__init__.py, line 589, in <lambda>
    repos = property(fget=lambda self: self._getRepos(),
File : /usr/lib/python2.5/site-packages/yum/__init__.py, line 395, in _getRepos
    self._getConfig() # touch the config class first
File : /usr/lib/python2.5/site-packages/yum/__init__.py, line 192, in _getConfig
    self._conf = config.readMainConfig(startupconf)
File : /usr/lib/python2.5/site-packages/yum/config.py, line 774, in readMainConfig
    yumvars['releasever'] = _getsysver(startupconf.installroot, startupconf.distroverpkg)
File : /usr/lib/python2.5/site-packages/yum/config.py, line 844, in _getsysver
    idx = ts.dbMatch('provides', distroverpkg)

This message is a reminder that Fedora 10 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 10. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '10'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 10's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 10 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 10 changed to end-of-life (EOL) status on 2009-12-17. Fedora 10 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.
Original description:

This happens on the two systems I'm running rawhide on, one i386, one x86_64. I notice the CPU is pinned and it's the yumBackend.py process that's stuck. I look in the logs and see something like this:

Oct 24 14:05:08 katratzi kernel: yumBackend.py[13268]: segfault at 6babacd4 ip 007fe3a1 sp bffb17f0 error 4 in libdb-4.5.so[70f000+12e000]

or this on x86_64:

Oct 19 17:13:26 localhost kernel: yumBackend.py[8076] general protection ip:3e894e8815 sp:7ffff4d3dd40 error:0 in libdb-4.5.so[3e89400000+12d000]

I kill the yumBackend.py process, run yum clean, and get another error:

> yum clean
Loaded plugins: refresh-packagekit
rpmdb: Thread/process 11534/3086698176 failed: Thread died in Berkeley DB library

where it hangs without exiting. I then have to kill that process. To clean up, I rm -f /var/lib/rpm/__* and rpm --rebuilddb. It doesn't happen that often; logs indicate October 2, 3, 5, 19 on one machine and twice in the last month on the other.
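As a rough way to see how often these crashes recur, one could grep the rotated logs for the two signatures quoted above; a sketch, assuming the F10 default log location:

# list kernel-logged crashes of the PackageKit yum backend inside libdb
grep -hE 'yumBackend\.py.*(segfault|general protection)' /var/log/messages*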