Bug 479818
Summary: | RPM/Yum commands report "Thread died in Berkeley DB library"
---|---
Product: | [Fedora] Fedora
Component: | rpm
Reporter: | Gavin Brown <gavin.brown>
Assignee: | Panu Matilainen <pmatilai>
Status: | CLOSED ERRATA
QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | high
Priority: | low
Version: | 10
CC: | brad.longo, brian.vowell, comcast.really.sucks, diego.ml, dtimms, fc-bugzilla, ffesti, james, jnovy, julian.fedora, linuxerianer, louizatakk, meejah, mefoster, pmatilai
Target Milestone: | ---
Target Release: | ---
Hardware: | i386
OS: | Linux
Doc Type: | Bug Fix
Story Points: | ---
Last Closed: | 2009-06-18 17:23:34 UTC
Type: | ---
Description
Gavin Brown
2009-01-13 11:29:06 UTC
What filesystem is /var on?

/var doesn't have its own partition; it's on the root partition, which is formatted as ext3:

$ mount
/dev/sda3 on / type ext3 (rw)
/proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /export type ext3 (rw)
/dev/sda2 on /home type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/dm-0 on /media/disk type ext3 (rw,nosuid,nodev,uhelper=hal)
gvfs-fuse-daemon on /home/gavin/.gvfs type fuse.gvfs-fuse-daemon (rw,nosuid,nodev,user=gavin)

Ok, this is the first report of this error where the filesystem is not ext4. Can you also add 'stat -f /' output just in case?

The error basically means that a former run crashed while inside the Berkeley DB and automatic cleanup is not able to sort it out. The quick "fix" is to:
1) ensure no rpm-related processes are running
2) "rm -f /var/lib/rpm/__*"

Where and how the first crash occurred (that caused this situation) is another question though... Rpm's handling of the situation is buggy anyway; it should abort instead of hanging.

Here's the output of stat -f /:

$ stat -f /
File: "/"
ID: b2bc73600ffdff7c Namelen: 255 Type: ext2/ext3
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 6079813 Free: 4905699 Available: 4844002
Inodes: Total: 1548288 Free: 1358581

Removing the __ files resolved the issue. I seem to recall having to do that a lot back on Red Hat 8, should have remembered to try it this time! My memory's not what it used to be :p.

I just had the same problem on Fedora 10. The error appeared when trying to install webmin 1.441-1 from sourceforge. My filesystem is XFS. Kernel version 2.6.27.9-159.fc10.x86_64. Yum version 3.2.20-5.fc10.

Same error here (not the first time I've encountered it) on an ext3 filesystem. "yum -update", "yum -clean all" gives the "thread died in DB" error.
/var/ is on the / partition (ext3). I followed P. Matilainen's fix and now everything works fine. Fedora 10 x86_64, kernel-2.6.28.2.x86_64, yum-3.2.20-5.fc10.noarch, rpm-4.6.0-0.rc3.1.fc10.x86_64

Seeing this today on a purely ext3 system. Fedora 10 i386, kernel-2.6.27.12-170.2.5.fc10.i686, yum-3.2.21-2.fc10.noarch, rpm-4.6.0-0.rc3.1.fc10.i386

I'm doing a rebuilddb and it's taking quite a long time (~8 minutes so far) -- I'm going to have to leave work soon and kill it if it doesn't complete, but I can try again tomorrow. Things were happy this morning and I did an update through kpackagekit this morning that didn't contain anything suspicious-looking (I use updates-testing).

Regarding my rebuilddb: it completed after 10 minutes or so and everything looks good again. Should it take that long?

Created attachment 331053 [details]
some environment values on my machine
[just a me too]: Fedora 10 i386
Anyway: is there any information that could be provided that would assist in learning more about this issue (before manually nuking __db files) ? Perhaps a complete copy of the on-disk rpmdb ?
ps: Seems others have been having similar trouble, and using the manual fix workaround, e.g.: http://forums.fedoraforum.org/showthread.php?t=209092

Mine is on ext3 over LVM on the / filesystem:

/dev/mapper/vg1-lvslash on / type ext3 (rw)
/dev/mapper/vg1-lvhome on /home type ext3 (rw)

Another 'me too', please :-) I've tried to strace 'yum check-update' and it seems like it tries to open the rpm database and then sends a signal to a process that doesn't exist. The process number didn't change during my tests, so it is probably stored somewhere in those __db files:

getgid32() = 0
getuid32() = 0
stat64("/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/lib/rpm", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
access("/var/lib/rpm", W_OK) = 0
access("/var/lib/rpm/__db.001", F_OK) = 0
access("/var/lib/rpm/Packages", F_OK) = 0
open("/var/lib/rpm/DB_CONFIG", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/var/lib/rpm/__db.001", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
fstat64(4, {st_mode=S_IFREG|0644, st_size=24576, ...}) = 0
close(4) = 0
open("/var/lib/rpm/__db.001", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
mmap2(NULL, 24576, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0xb7d4e000
close(4) = 0
open("/proc/stat", O_RDONLY) = 4
fstat64(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d4d000
read(4, "cpu 14782 951 5676 268213 28105 "..., 1024) = 694
read(4, ""..., 1024) = 0
close(4) = 0
munmap(0xb7d4d000, 4096) = 0
open("/var/lib/rpm/__db.002", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
mmap2(NULL, 180224, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0xb7d22000
close(4) = 0
open("/var/lib/rpm/__db.003", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
mmap2(NULL, 1318912, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0xb7be0000
close(4) = 0
open("/var/lib/rpm/__db.004", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC) = 0
mmap2(NULL, 352256, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0xb7b8a000
close(4) = 0
kill(4550, SIG_0) = -1 ESRCH (No such process)
write(2, "rpmdb: "..., 7rpmdb: ) = 7
write(2, "Thread/process 4550/3087615680 fa"..., 73Thread/process 4550/3087615680 failed: Thread died in Berkeley DB library) = 73
write(2, "\n"..., 1
) = 1

The run that reports "thread died..." is not actually interesting in itself at all; this is just the BDB safeguard saying essentially "a previous (rpm-related) process crashed in a way that cannot be automatically cleaned up".

But ok, clearly this is not limited to ext4 despite early reports indicating that. Somebody reported that they were seeing this daily on some older kernels but only occasionally now, so maybe there was something in ext4 causing rpm crashes (and thus triggering this) far more easily than otherwise.

Grep for "segfault" in /var/log/messages*, let's see if there's some sort of pattern there. Any non-default yum plugins activated on these systems (unlikely cause perhaps but just collecting information...)?

Also do enable core dumps on systems seeing this issue to catch the actual crash instead of the after-the-fact "something went bust on previous run" message that people are seeing here. Backtraces (http://fedoraproject.org/wiki/StackTraces) are preferred over core dumps, but core dumps are certainly more useful than just "it crashed".

My yum run didn't crash. It just locked up after reading the whole rpm database.
<tons of pread64's>
pread64(13, "\0\0\0\0\1\0\0\0\352\r\0\0\0\0\0\0\351\r\0\0\1\0\346\17\0\7\0\0\0K\0\0#"..., 4096, 14589952) = 4096
pread64(13, "\0\0\0\0\1\0\0\0\351\r\0\0\352\r\0\0\350\r\0\0\1\0\346\17\0\7\0root\0c"..., 4096, 14585856) = 4096
pread64(13, "\0\0\0\0\1\0\0\0\350\r\0\0\351\r\0\0\0\0\0\0\1\0\24\10\0\7e\0bsmi0"..., 4096, 14581760) = 4096
gettimeofday({1234170960, 962562}, NULL) = 0
gettimeofday({1234170960, 962716}, NULL) = 0
futex(0xb7d2984c, FUTEX_WAIT, 9, NULL <unfinished ...>
<after some time I've manually killed it>
+++ killed by SIGKILL +++

File descriptor 13 was '/var/lib/rpm/Packages'

[root@aphrael rpm]# rpm -q yum
yum-3.2.21-2.fc10.noarch
[root@aphrael rpm]# rpm -q rpm
rpm-4.6.0-0.rc3.1.fc10.i386

*** Bug 484995 has been marked as a duplicate of this bug. ***

I get this bug too with Fedora 10 on an ext3 filesystem. And also not the first time. Seems to occur whenever I cancel a search in PackageKit before it completes. A

rm -rf /var/lib/rpm/__db.*

resolves the error for me.

Have you also used PackageKit before this happened? I just get this error when cancelling searches (with the "Cancel" button), but maybe it's related to something different.

Thanks Julian, I don't much use PackageKit and this does start to explain both why I never see this and why you get this error:

[root@localhost ~]# pkcon search file /usr/bin/foo
^C
[root@localhost ~]# rpm -qf /usr/bin/foo
Freeing read locks for locker 0x867: 18182/139967018858224
Freeing read locks for locker 0x869: 18182/139967018858224
Freeing read locks for locker 0x86a: 18182/139967018858224
error: file /usr/bin/foo: No such file or directory

Turns out PackageKit terminates the backend by SIGKILL when it doesn't respond to SIGQUIT; this pretty much by definition leaves stale locks behind. The stale locks are normally wiped out automatically as above, but if that fails for whatever reason, you'll get "Thread died in Berkeley DB library".
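For readers puzzling over the kill(4550, SIG_0) = -1 ESRCH line in the strace output earlier in this thread: sending signal 0 is the usual POSIX way to ask whether a PID still exists without actually signalling it. The sketch below illustrates only that probe in Python; the function name process_exists is made up for this example, and this is not the code Berkeley DB or rpm actually runs.

```python
import os


def process_exists(pid: int) -> bool:
    """Probe for a live process using signal 0 (hypothetical helper).

    Signal 0 performs the kernel's existence and permission checks
    but delivers nothing to the target process.
    """
    if pid <= 0:
        # 0 and negative values address process groups, not one process.
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH: no such process, as seen for PID 4550 in the strace.
        return False
    except PermissionError:
        # EPERM: the process exists but belongs to another user.
        return True
    return True
```

A lock owner whose PID fails this probe can only have died without releasing its lock, which matches the explanation above of why a SIGKILLed backend leaves stale locks behind.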
(In reply to comment #16)
> I get this bug too with Fedora 10 on an ext3 filesystem. And also not the first
> time. Seems to occur whenever I cancel a search in PackageKit before it
> completes.
> A
>
> rm -rf /var/lib/rpm/__db.*
>
> resolves the error for me.
>
> Have you also used PackageKit before this happened? I just get this error when
> cancelling searches (with the "Cancel" button), but maybe it's related to
> something different.

I also ended up in this problem after cancelling searches in the gpk-application. Suggestion also worked for me. (FC10, ext3)

Doesn't occur anymore in Fedora 11. gpk lost its cancel button and pkcon can be killed without problems.

Seems it's been fixed in F10 PackageKit too:

* Wed May 13 2009 Richard Hughes <rhughes> - 0.3.15-3
- Apply a patch from upstream to disallow SIGKILL.
- Fixes #487924

The error message can be caused by anything killing rpm (or a librpm API user) or crashing inside it at an unfortunate time. PK seems to have been the most common cause of it, so considering this closed due to the above PK update.

I just recreated this problem on Fedora 15 x86_64 by hitting Ctrl-C "at the wrong time" during a yum search.

# stat -f /
File: "/"
ID: c81eed21b79d360 Namelen: 255 Type: ext2/ext3
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 2520122 Free: 370017 Available: 242000
Inodes: Total: 640848 Free: 378241

The fix was 'rm -f /var/lib/rpm/__*' as posted in Comment 3 above.

I got into the situation described above by kill -9'ing a yum process that I'd suspended. It was doing a "yum search" at the time, so likely similar to Comment 21. Removing the __* files from Comment 3 worked fine for me.
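The workaround repeated throughout this report (make sure no rpm-related process is running, then remove the __* files under /var/lib/rpm) could be scripted roughly as follows. This is only a sketch: the function name remove_stale_db_regions and its is_busy and dry_run parameters are invented for illustration, the caller has to supply the "is anything rpm-related still running?" check, and the default is a dry run that deletes nothing.

```python
import glob
import os


def remove_stale_db_regions(db_dir, is_busy=lambda: False, dry_run=True):
    """List, and optionally delete, the __* files in db_dir.

    db_dir  -- rpm database directory (/var/lib/rpm in this report)
    is_busy -- caller-supplied callable; returns True while any
               rpm-related process is still running (an assumption
               of this sketch, not something rpm provides)
    dry_run -- when True, only report what would be removed
    """
    if is_busy():
        raise RuntimeError("an rpm-related process is still running")
    # Same pattern as the shell workaround: rm -f <db_dir>/__*
    pattern = os.path.join(glob.escape(db_dir), "__*")
    targets = sorted(p for p in glob.glob(pattern) if os.path.isfile(p))
    if not dry_run:
        for path in targets:
            os.remove(path)
    return targets
```

Calling remove_stale_db_regions("/var/lib/rpm") as root would print nothing and delete nothing; it just returns the candidate paths. Passing dry_run=False performs the removal, leaving Packages and the other database files untouched.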