Bug 479818

Summary:

RPM/Yum commands report "Thread died in Berkeley DB library"

Product:

[Fedora] Fedora

Reporter:

Gavin Brown <gavin.brown>

Component:

rpm

Assignee:

Panu Matilainen <pmatilai>

Status:

CLOSED ERRATA

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

high

Priority:

low

CC:

brad.longo, brian.vowell, comcast.really.sucks, diego.ml, dtimms, fc-bugzilla, ffesti, james, jnovy, julian.fedora, linuxerianer, louizatakk, meejah, mefoster, pmatilai

Hardware:

i386

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2009-06-18 17:23:34 UTC

Attachments:

Description	Flags
some environment values on my machine	none

Description Gavin Brown 2009-01-13 11:29:06 UTC

Description of problem:

Any attempt to install or erase an RPM package via rpm or yum results in an error message of the form:

rpmdb: Thread/process 12921/3087406864 failed: Thread died in Berkeley DB library

This also occurs if I try a command like rpm --rebuilddb.

Version-Release number of selected component (if applicable):

compat-db45-4.5.20-5.fc10.i386
rpm-4.6.0-0.rc3.1.fc10.i386
yum-3.2.20-5.fc10.noarch

How reproducible:

Every time.

Steps to Reproduce:

1. Try: rpm -ivh [package file]
2. Or: rpm -e [package name]
3. Or: yum install [package name or file]
4. Or: yum remove [package name]

Actual results:

An error of the above format. The process hangs until killed with a SIGKILL.

Expected results:

The command should successfully install or remove the package.

Additional info:

Query commands (e.g. rpm -qi rpm or rpm -qa) work fine.

I manually copied the contents of the rpm-debuginfo package onto my filesystem and ran rpm inside gdb - when the above error occurs, gdb itself hangs until killed.

Comment 1 Panu Matilainen 2009-01-13 16:00:30 UTC

What filesystem is /var on?

Comment 2 Gavin Brown 2009-01-13 16:39:14 UTC

/var doesn't have its own partition; it's on the root partition, which is formatted as ext3:

$ mount
/dev/sda3 on / type ext3 (rw)
/proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /export type ext3 (rw)
/dev/sda2 on /home type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/dm-0 on /media/disk type ext3 (rw,nosuid,nodev,uhelper=hal)
gvfs-fuse-daemon on /home/gavin/.gvfs type fuse.gvfs-fuse-daemon (rw,nosuid,nodev,user=gavin)

Comment 3 Panu Matilainen 2009-01-14 08:51:18 UTC

Ok, this is the first report of this error where the filesystem is not ext4. Can you also add 'stat -f /' output just in case?

The error basically means that a former run crashed while inside the Berkeley DB and the automatic cleanup is not able to sort it out. The quick "fix" is to:
1) ensure no rpm-related processes are running
2) run "rm -f /var/lib/rpm/__*"

Where and how the first crash occurred (that caused this situation) is another question though... Rpm's handling of the situation is buggy anyway; it should abort instead of hanging.
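
For anyone who wants to script step 2, here is a minimal Python sketch of the same cleanup. It is not part of rpm: the function name remove_stale_env_files is hypothetical, the default directory is simply the /var/lib/rpm path used above, and it does not check for running rpm processes, so step 1 still has to be done by hand first.

```python
import glob
import os


def remove_stale_env_files(dbdir="/var/lib/rpm"):
    """Delete the __* files in dbdir, roughly what 'rm -f dbdir/__*' does.

    Returns the sorted list of paths that were removed.
    """
    removed = []
    # glob.escape keeps special characters in dbdir from being read as wildcards.
    for path in sorted(glob.glob(os.path.join(glob.escape(dbdir), "__*"))):
        if os.path.isfile(path):  # like rm without -r, leave directories alone
            os.unlink(path)
            removed.append(path)
    return removed
```

Calling remove_stale_env_files() as root with no argument targets /var/lib/rpm; passing another directory makes it easy to try on a copy of the database first.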

Comment 4 Gavin Brown 2009-01-14 10:27:37 UTC

Here's the output of stat -f /:

$ stat -f /
  File: "/"
    ID: b2bc73600ffdff7c Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 6079813    Free: 4905699    Available: 4844002
Inodes: Total: 1548288    Free: 1358581

Removing the __ files resolved the issue. I seem to recall having to do that a lot back on Red Hat 8, should have remembered to try it this time! My memory's not what it used to be :p.

Comment 5 Brian Vowell 2009-01-22 10:55:03 UTC

I just had the same problem on Fedora 10.  The error appeared when trying to install the webmin 1.441-1 from sourceforge.  My filesystem is XFS.  Kernel version 2.6.27.9-159.fc10.x86_64.  Yum version 3.2.20-5.fc10.

Comment 6 Florent Le Coz 2009-01-25 00:51:00 UTC

Same error here (not the first time I've encountered it) on an ext3 filesystem.

Comment 7 Kaloyan Petrov 2009-01-25 21:38:34 UTC

"yum -update", "yum -clean all" gives the "thread died in DB" error. /var/ is on the / partition (ext3). I followed P. Matilainen's fix and now everything works fine.

Fedora 10 x86_64, kernel-2.6.28.2.x86_64, yum-3.2.20-5.fc10.noarch, rpm-4.6.0-0.rc3.1.fc10.x86_64

Comment 8 Mary Ellen Foster 2009-01-27 16:52:39 UTC

Seeing this today on a purely ext3 system.

Fedora 10 i386, kernel-2.6.27.12-170.2.5.fc10.i686, yum-3.2.21-2.fc10.noarch, rpm-4.6.0-0.rc3.1.fc10.i386

I'm doing a rebuilddb and it's taking quite a long time (~8 minutes so far) -- I'm going to have to leave work soon and kill it if it doesn't complete, but I can try again tomorrow.

Things were happy this morning, and the update I did through kpackagekit then didn't contain anything suspicious-looking (I use updates-testing).

Comment 9 Mary Ellen Foster 2009-01-27 16:53:59 UTC

Regarding my rebuilddb: it completed after 10 minutes or so and everything looks good again. Should it take that long?

Comment 10 David Timms 2009-02-05 21:05:27 UTC

Created attachment 331053
some environment values on my machine

[just a me too]: Fedora 10 i386

Anyway: is there any information that could be provided that would assist in learning more about this issue (before manually nuking the __db files)? Perhaps a complete copy of the on-disk rpmdb?

Comment 11 David Timms 2009-02-05 21:21:38 UTC

PS: It seems others have been having similar trouble and using the manual workaround, e.g.:
http://forums.fedoraforum.org/showthread.php?t=209092

Mine is ext3 over LVM on the / filesystem:
/dev/mapper/vg1-lvslash on / type ext3 (rw)
/dev/mapper/vg1-lvhome on /home type ext3 (rw)

Comment 12 Pavel Urban 2009-02-09 10:12:47 UTC

Another 'me too', please :-)

I've tried to strace 'yum check-update', and it seems that it opens the rpm database and then sends a signal to a process that doesn't exist. The process number didn't change during my tests, so it is probably stored somewhere in those __db files.

getgid32()                              = 0
getuid32()                              = 0
stat64("/var", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/lib/rpm", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
access("/var/lib/rpm", W_OK)            = 0
access("/var/lib/rpm/__db.001", F_OK)   = 0
access("/var/lib/rpm/Packages", F_OK)   = 0
open("/var/lib/rpm/DB_CONFIG", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/var/lib/rpm/__db.001", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
fstat64(4, {st_mode=S_IFREG|0644, st_size=24576, ...}) = 0
close(4)                                = 0
open("/var/lib/rpm/__db.001", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
mmap2(NULL, 24576, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0xb7d4e000
close(4)                                = 0
open("/proc/stat", O_RDONLY)            = 4
fstat64(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7d4d000
read(4, "cpu  14782 951 5676 268213 28105 "..., 1024) = 694
read(4, ""..., 1024)                    = 0
close(4)                                = 0
munmap(0xb7d4d000, 4096)                = 0
open("/var/lib/rpm/__db.002", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
mmap2(NULL, 180224, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0xb7d22000
close(4)                                = 0
open("/var/lib/rpm/__db.003", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
mmap2(NULL, 1318912, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0xb7be0000
close(4)                                = 0
open("/var/lib/rpm/__db.004", O_RDWR|O_LARGEFILE) = 4
fcntl64(4, F_SETFD, FD_CLOEXEC)         = 0
mmap2(NULL, 352256, PROT_READ|PROT_WRITE, MAP_SHARED, 4, 0) = 0xb7b8a000
close(4)                                = 0
kill(4550, SIG_0)                       = -1 ESRCH (No such process)
write(2, "rpmdb: "..., 7rpmdb: )               = 7
write(2, "Thread/process 4550/3087615680 fa"..., 73Thread/process 4550/3087615680 failed: Thread died in Berkeley DB library) = 73
write(2, "\n"..., 1
)                    = 1
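
The kill(4550, SIG_0) call near the end of the trace is a liveness probe: signal 0 delivers nothing and only reports whether the target process exists. A minimal Python sketch of the same probe follows. The function name process_exists is hypothetical, and this illustrates the system call only, not how Berkeley DB itself is implemented.

```python
import os


def process_exists(pid):
    """Probe pid with signal 0, the same check as kill(pid, SIG_0) in the trace."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        # ESRCH, the result seen in the trace: no such process.
        return False
    except PermissionError:
        # EPERM: the process exists but belongs to another user.
        return True
    return True
```

In the trace above the probe for PID 4550 fails with ESRCH, and the "Thread died" message is written immediately afterwards.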

Comment 13 Panu Matilainen 2009-02-09 12:03:43 UTC

The run that reports "thread died..." is not actually interesting in itself; it is just a BDB safeguard saying essentially "a previous (rpm-related) process crashed in a way that cannot be automatically cleaned up".

But ok, clearly this is not limited to ext4 despite early reports indicating that. Somebody reported that they were seeing this daily on some older kernels but only occasionally now, so maybe there was something in ext4 causing rpm crashes (and thus triggering this) far easier than otherwise.

Grep for "segfault" in /var/log/messages*; let's see if there's some sort of pattern there. Are any non-default yum plugins activated on these systems (an unlikely cause perhaps, but just collecting information...)?
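
If a plain grep is inconvenient, the same search can be sketched in Python. This is only an illustration: the function name find_segfaults is hypothetical, and it reads the logs as plain text, so compressed rotated logs are not handled.

```python
import glob


def find_segfaults(pattern="/var/log/messages*"):
    """Return (path, line) pairs for every log line containing 'segfault'."""
    hits = []
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
        with open(path, errors="replace") as log:
            for line in log:
                if "segfault" in line:
                    hits.append((path, line.rstrip("\n")))
    return hits
```

Reading /var/log/messages* normally requires root.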

Also, do enable core dumps on systems seeing this issue, to catch the actual crash instead of the after-the-fact "something went bust on previous run" message that people are seeing here. Backtraces (http://fedoraproject.org/wiki/StackTraces) are preferred over core dumps, but core dumps are certainly more useful than just "it crashed".
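
One way to enable core dumps for a single session, sketched in Python under stated assumptions: the function name enable_core_dumps is hypothetical, and it only raises the soft RLIMIT_CORE limit of the calling process up to its hard limit, which child processes started afterwards inherit. If the hard limit is already 0, this alone will not help.

```python
import resource


def enable_core_dumps():
    """Raise the soft core-size limit to the hard limit for this process.

    Children started afterwards inherit the new limit.
    Returns the resulting (soft, hard) pair.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

A wrapper script would call enable_core_dumps() and then launch the rpm or yum command as a child process, so that a crash in the child can leave a core file.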

Comment 14 Pavel Urban 2009-02-09 12:13:30 UTC

My yum run didn't crash. It just locked up after reading the whole rpm database.

<tons of pread64's>

pread64(13, "\0\0\0\0\1\0\0\0\352\r\0\0\0\0\0\0\351\r\0\0\1\0\346\17\0\7\0\0\0K\0\0#"..., 4096, 14589952) = 4096
pread64(13, "\0\0\0\0\1\0\0\0\351\r\0\0\352\r\0\0\350\r\0\0\1\0\346\17\0\7\0root\0c"..., 4096, 14585856) = 4096
pread64(13, "\0\0\0\0\1\0\0\0\350\r\0\0\351\r\0\0\0\0\0\0\1\0\24\10\0\7e\0bsmi0"..., 4096, 14581760) = 4096
gettimeofday({1234170960, 962562}, NULL) = 0
gettimeofday({1234170960, 962716}, NULL) = 0
futex(0xb7d2984c, FUTEX_WAIT, 9, NULL <unfinished ...>
<after some time I've manually killed it>
+++ killed by SIGKILL +++

File descriptor 13 was '/var/lib/rpm/Packages'

[root@aphrael rpm]# rpm -q yum
yum-3.2.21-2.fc10.noarch
[root@aphrael rpm]# rpm -q rpm
rpm-4.6.0-0.rc3.1.fc10.i386

Comment 15 Panu Matilainen 2009-02-11 08:32:23 UTC

*** Bug 484995 has been marked as a duplicate of this bug. ***

Comment 16 Julian Aloofi 2009-02-12 17:52:38 UTC

I get this bug too with Fedora 10 on an ext3 filesystem. And also not the first time. Seems to occur whenever I cancel a search in PackageKit before it completes.
A

rm -rf /var/lib/rpm/__db.*

resolves the error for me.

Have you also used PackageKit before this happened? I only get this error when cancelling searches (with the "Cancel" button), but maybe it's related to something different.

Comment 17 Panu Matilainen 2009-02-13 07:07:21 UTC

Thanks Julian. I don't use PackageKit much, and this starts to explain both why I never see this and why you get this error:
[root@localhost ~]# pkcon search file /usr/bin/foo
^C
[root@localhost ~]# rpm -qf /usr/bin/foo
Freeing read locks for locker 0x867: 18182/139967018858224
Freeing read locks for locker 0x869: 18182/139967018858224
Freeing read locks for locker 0x86a: 18182/139967018858224
error: file /usr/bin/foo: No such file or directory

It turns out PackageKit terminates the backend with SIGKILL when it doesn't respond to SIGQUIT, which pretty much by definition leaves stale locks behind. The stale locks are normally wiped out automatically as above, but if that fails for whatever reason, you'll get "Thread died in Berkeley DB library".
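
The difference between the two signals can be demonstrated outside rpm. The following is a minimal sketch under stated assumptions, not PackageKit or rpm code: a hypothetical child process creates a lock file standing in for a database lock and installs a SIGQUIT handler that removes it. SIGKILL cannot be caught, so the handler never runs and the file stays behind. The names CHILD and lock_survives are invented for illustration.

```python
import os
import signal
import subprocess
import sys
import tempfile

# Source of the stand-in child: take a "lock", clean it up on SIGQUIT.
CHILD = r"""
import os, signal, sys, time
lock = sys.argv[1]
def cleanup(signum, frame):
    os.unlink(lock)
    sys.exit(0)
signal.signal(signal.SIGQUIT, cleanup)
open(lock, "w").close()
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def lock_survives(sig):
    """Start the child, send it sig, and report whether its lock file remains."""
    with tempfile.TemporaryDirectory() as tmpdir:
        lock = os.path.join(tmpdir, "stale.lock")
        child = subprocess.Popen(
            [sys.executable, "-c", CHILD, lock], stdout=subprocess.PIPE
        )
        child.stdout.readline()  # block until the child holds the lock
        child.send_signal(sig)
        child.wait()
        child.stdout.close()
        return os.path.exists(lock)
```

Here lock_survives(signal.SIGQUIT) reports that the lock was cleaned up, while lock_survives(signal.SIGKILL) reports that it was left behind, which is the stale-lock situation described above.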

Comment 18 Roman 2009-02-21 12:20:49 UTC

(In reply to comment #16)
> I get this bug too with Fedora 10 on an ext3 filesystem. And also not the first
> time. Seems to occur whenever I cancel a search in PackageKit before it
> completes.
> A
> 
> rm -rf /var/lib/rpm/__db.*
> 
> resolves the error for me.
> 
> Have you also used PackageKit before this happened? I just get this error when
> cancelling searches (with the "Cancel" button), but maybe its related to
> something different.

I also ended up in this problem after cancelling searches in the gpk-application. Suggestion also worked for me. (FC10, ext3)

Comment 19 Julian Aloofi 2009-06-11 14:04:48 UTC

This doesn't occur anymore in Fedora 11. gpk lost its cancel button and pkcon can be killed without problems.

Comment 20 Panu Matilainen 2009-06-18 17:23:34 UTC

Seems it's been fixed in F10 PackageKit too:

* Wed May 13 2009 Richard Hughes  <rhughes> - 0.3.15-3
- Apply a patch from upstream to disallow SIGKILL.
- Fixes #487924

The error message can be caused by anything killing rpm (or a librpm API user) or crashing inside it at an unfortunate time. PK seems to have been the most common cause, so I'm considering this closed due to the above PK update.

Comment 21 Need Real Name 2011-09-03 20:37:15 UTC

I just recreated this problem on Fedora 15 x86_64 by hitting Ctrl-C "at the wrong time" during a yum search.


# stat -f /
  File: "/"
    ID: c81eed21b79d360 Namelen: 255     Type: ext2/ext3
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 2520122    Free: 370017     Available: 242000
Inodes: Total: 640848     Free: 378241


The fix was 'rm -f /var/lib/rpm/__*' as posted in Comment 3 above.

Comment 22 meejah 2012-07-10 20:49:48 UTC

I got into the situation described above by kill -9'ing a yum process that I'd suspended. It was doing a "yum search" at the time, so likely similar to Comment 21. Removing the __* files from Comment 3 worked fine for me.