Description of problem:
When using RPM (and yum, too), seemingly at random times it will exit with "Segmentation fault", and from that point on I can't use any rpm command anymore. RPM will just hang and I can't use <ctrl+c> to kill it; I have to open another terminal and type $ killall -SIGKILL rpm to kill the process. The funny part is that once I reboot, it's perfectly fine again until it decides to segfault...

Version-Release number of selected component (if applicable):
rpm-libs-4.4.2-31
rpm-4.4.2-31

How reproducible:
Sometimes (it's random)

Steps to Reproduce:
1. Start using yum, rpmbuild or rpm a lot
2. Wait for a segmentation fault to occur
3. Watch how, from now on, any command using RPM (such as yum) or rpm itself will hang until a reboot is performed

Actual results:
RPM or the program using it hangs after these seemingly random segfaults

Expected results:
RPM, and therefore the programs using it, function as normal

Additional info:
It's been occurring since the recent rpm and rpm-libs update I did. It hasn't happened this boot yet, but when it does I'll provide a strace of the hung rpm command.
OK here's a yum output when it segfaults:

[user@host ~]$ sudo yum install scribes gnome-translate
Password:
Setting up Install Process
Setting up repositories
development               100% |=========================| 1.1 kB    00:00
livna-development         100% |=========================|  951 B    00:00
extras-development        100% |=========================| 1.1 kB    00:00
rpmforge                  100% |=========================|  951 B    00:00
Reading repository metadata in from local files
primary.xml.gz            100% |=========================| 770 kB    00:02
################################################## 2211/2211
Segmentation fault
can you actually get a core, by setting either ulimit -c unlimited or editing /etc/security/limits.conf appropriately so we can check with gdb. Really we need a core to debug this.
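For reference, a minimal sketch of the usual way to get one (the package name below is just an example - use whatever command segfaults for you):

$ ulimit -c unlimited        # allow core files of any size in this shell
$ sudo yum install scribes   # re-run a command that segfaults
$ ls core*                   # the core should be dropped in the current directory

To make it stick across logins, a line like this in /etc/security/limits.conf (then log in again) should do it:

*    soft    core    unlimited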
Alright - How would I get one? Is simply running that command and then strace-ing it enough?
Well, I was fooling around strace-ing yum and managed to reproduce something very similar to what I see when I strace the hanging RPM command:

getuid32() = 0
stat64("/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/lib/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/lib/rpm", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
access("/var/lib/rpm", W_OK) = 0
access("/var/lib/rpm/__db.001", F_OK) = 0
access("/var/lib/rpm/Packages", F_OK) = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/stat", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7b19000
read(3, "cpu 1169177 2 42030 394711 2982"..., 4096) = 695
read(3, "", 4096) = 0
close(3) = 0
munmap(0xb7b19000, 4096) = 0
stat64("/var/lib/rpm/DB_CONFIG", 0xbfcc638c) = -1 ENOENT (No such file or directory)
open("/var/lib/rpm/DB_CONFIG", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
stat64("/var/lib/rpm/__db.001", {st_mode=S_IFREG|0644, st_size=24576, ...}) = 0
open("/var/lib/rpm/__db.001", O_RDWR|O_LARGEFILE) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
fstat64(3, {st_mode=S_IFREG|0644, st_size=24576, ...}) = 0
close(3) = 0
open("/var/lib/rpm/__db.001", O_RDWR|O_LARGEFILE) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
mmap2(NULL, 24576, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0xb7b14000
close(3) = 0
stat64("/var/lib/rpm/__db.002", {st_mode=S_IFREG|0644, st_size=1318912, ...}) = 0
open("/var/lib/rpm/__db.002", O_RDWR|O_LARGEFILE) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
mmap2(NULL, 1318912, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0xb79d2000
close(3) = 0
futex(0xb7b19ef8, FUTEX_WAIT, 2, NULL

And it just stops there at futex.
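In case it's useful, a quick way to confirm a hung rpm/yum process really is parked on that futex (<PID> below is a placeholder for whatever ps reports):

$ cat /proc/<PID>/wchan   # should print the kernel function it's blocked in, something like futex_wait
$ strace -p <PID>         # attaching should show it sitting in futex(..., FUTEX_WAIT, ...)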
Created attachment 132948 [details] The output of an strace on RPM
Created attachment 132949 [details] The 'core' file requested - It's from this command: yum install test
There are 2 problems here: a segfault and the subsequent hang. The hang is fixed by doing rm -f /var/lib/rpm/__db* to remove stale locks. Stale locks must be removed after *EVERY* unusual event, like a segfault or a kill -9, or a hang will occur.
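To spell that out, a sketch of the recovery sequence (assuming the default rpmdb location /var/lib/rpm; the query step is only there to recreate the environment files):

# killall -SIGKILL rpm yum     # only if something is still hung on the lock
# rm -f /var/lib/rpm/__db*     # remove the stale Berkeley DB environment/lock files
# rpm -qa > /dev/null          # any query should recreate a clean environment
# rpm --rebuilddb              # optional, if queries still misbehave afterwards

No reboot should be needed after that.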
That seems to have solved it. Now that I remember it, there was an issue once with Yumex where it closed unexpectedly in the middle of updating packages - that must have been the stale locks it left around that made everything else freeze... The segfaults seem to have gone away after updating my yum and python, and now there are no more hangs. Thanks! Stewart
Hello, I just did a clean install of Fedora Core 6 Test 2, and there have been no incidents of bad yum, yumex or RPM runs, yet it's happened many times now that RPM freezes and I have to rebuild the DB. Is it possible that there's a minor bug in this version (rpm-4.4.2-31)?
Stewart - please document the exact steps you took to get from a FC6T2 install to seeing the error; can you provide a reproducible test case? Can you also run memtest86+ on your machine and check dmesg for drive errors?
I checked dmesg, nothing came up... I'll run memtest ASAP, but I've run it before and it came up clean... I feel bad taking your time because I can't reproduce it - it's just random lockups... Do you think I should continue rebuilding my DB for now (it works fine after a rebuild) and see if it still occurs in FC6 final?
I've run memtest for about 15 minutes, nothing came up. It's odd, because I do python programming and even when using other python programs - no segfaults. It only happens with RPM and/or yum. It's already happened twice today - and both times it was for installing the same 2 packages. Maybe it's triggered by input?
The last couple of yum updates to devel which included glibc-common have halted immediately after updating that package. It's done it again. The yum process is currently waiting on futex. The only way to stop those processes accessing this futex is to kill -9. Even after killing yum, neither yum nor rpm can gain the futex. The only way to reset this condition is to reboot the system.

# ps l 15921
F   UID   PID  PPID PRI  NI   VSZ   RSS WCHAN STAT TTY     TIME COMMAND
4     0 15921  2604  18   0 91684 60872 futex S+   pts/10 23:30 /usr/bin/python /usr/bin/yum update

I'm running this on a 300 MHz Pentium II notebook. I've also seen seg faults on successive rpm database access attempts, but not this time. My hypothesis includes either kernel problems or glibs update residue. I've done an rpm --rebuilddb, but the problem continues. Currently running kernel-2.6.17-1.2571.fc6. I saw seg faults in 2566 and one earlier (possibly 2548).
"glibs update residue" should read "glibc update residue". The "rm -f /var/lib/rpm/__db*" allowed me to access the rpm database and start the yum update again, all without rebooting.
Yes - That's what I'm experiencing, and everything still works after removing stale locks (the __db* files) and rebuilding the database - It's just annoying to keep having to do it.
I am still seeing these problems on my dual core CPU; I never saw them on my AMD64 single core CPU. RPM will crash with random segfaults, and after that the database cannot be accessed anymore. The single-threaded RPM will hang waiting for a mutex, which is a bit strange when it has only 1 thread. As described, removing the __db* files fixes the problem until the next crash.

This is with:
kernel-fc6PAE 2.6.17-1.2617.2.1.fc6
glibc 2.4.90-28
rpm 4.4.2-31
I noticed that it happens more often when yum's updating 'glibc' - I've stayed away from glibc updates for a while, I'm on version glibc-2.4.90-26 and it's been pretty stable. Mind you, I also switched to the FC5 kernel to fix a Wine bug I was experiencing.
Yup, I just updated to development's glibc, and I've had two hangs in a row since. I'm using FC6's rpm and kernel, but development's glibc. The first time it hung, however, something different happened... The terminal went into a weird character set, like what happens when you 'cat' a binary file - but before that I saw a quick error that went something like:

-3 ERROR_DB_PANIC: run DB recovery
-3 ERROR_DB_PANIC: run DB recovery
-3 ERROR_DB_PANIC: run DB recovery

The second time was just 'Segmentation fault'.
Segfaults and loss of data are likely due to removing an rpmdb environment without correcting other problems in the rpmdb.

FYI: Most rpmdb "hangs" are now definitely fixed by purging stale read locks when opening a database environment in rpm-4.4.8-0.4. There's more to do, but I'm quite sure that a large class of problems with symptoms of "hang" are now corrected.

Detecting damage by verifying when needed is well automated in rpm-4.4.8-0.4. Automatically correcting all possible damage is going to take more work, but a large class of problems is likely already fixed in rpm-4.4.8-0.8 as well.

UPSTREAM
I hope Fedora decides to use the upstream RPM soon - I've been reading around, it keeps getting better and better...