Description of problem:
When using RPM (and yum, too), seemingly at random times it will exit with "Segmentation fault", and from that point on I can't use any rpm command anymore. RPM will just hang and I can't use <ctrl+c> to kill it; I have to open another terminal and type $ killall -SIGKILL rpm to kill the process. The funny part is that once I reboot, it's perfectly fine again until it decides to segfault...

Version-Release number of selected component (if applicable):
rpm-libs-4.4.2-31
rpm-4.4.2-31

How reproducible:
Sometimes (it's random)

Steps to Reproduce:
1. Start using yum, rpmbuild or rpm a lot
2. Wait for a segmentation fault to occur
3. Watch how, from now on, any command using RPM (such as yum) or rpm itself will hang until a reboot is performed

Actual results:
RPM or the program using it hangs after these seemingly random segfaults

Expected results:
RPM, and therefore the programs using it, function as normal

Additional info:
It's been occurring since the recent rpm and rpm-libs update I did. It hasn't happened this boot yet, but when it does I'll provide a strace of the hung rpm command.
OK here's a yum output when it segfaults:

[user@host ~]$ sudo yum install scribes gnome-translate
Password:
Setting up Install Process
Setting up repositories
development               100% |=========================| 1.1 kB    00:00
livna-development         100% |=========================|  951 B    00:00
extras-development        100% |=========================| 1.1 kB    00:00
rpmforge                  100% |=========================|  951 B    00:00
Reading repository metadata in from local files
primary.xml.gz            100% |=========================| 770 kB    00:02
################################################## 2211/2211
Segmentation fault
can you actually get a core, by setting either ulimit -c unlimited or editing /etc/security/limits.conf appropriately so we can check with gdb. Really we need a core to debug this.
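For reference, a minimal sketch of the usual way to get one (the package name below is just an example - use whatever command segfaults for you):

$ ulimit -c unlimited        # allow core files of any size in this shell
$ sudo yum install scribes   # re-run a command that segfaults
$ ls core*                   # the core should be dropped in the current directory

To make it stick across logins, a line like this in /etc/security/limits.conf (then log in again) should do it:

*    soft    core    unlimited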
Alright - How would I get one? Is simply running that command and then strace-ing it enough?
Well, I was fooling around strace-ing yum and managed to reproduce something very similar to what I see when I strace the hanging RPM command:

getuid32() = 0
stat64("/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/lib/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat64("/var/lib/rpm", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
access("/var/lib/rpm", W_OK) = 0
access("/var/lib/rpm/__db.001", F_OK) = 0
access("/var/lib/rpm/Packages", F_OK) = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/messages.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/proc/stat", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7b19000
read(3, "cpu 1169177 2 42030 394711 2982"..., 4096) = 695
read(3, "", 4096) = 0
close(3) = 0
munmap(0xb7b19000, 4096) = 0
stat64("/var/lib/rpm/DB_CONFIG", 0xbfcc638c) = -1 ENOENT (No such file or directory)
open("/var/lib/rpm/DB_CONFIG", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
stat64("/var/lib/rpm/__db.001", {st_mode=S_IFREG|0644, st_size=24576, ...}) = 0
open("/var/lib/rpm/__db.001", O_RDWR|O_LARGEFILE) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
fstat64(3, {st_mode=S_IFREG|0644, st_size=24576, ...}) = 0
close(3) = 0
open("/var/lib/rpm/__db.001", O_RDWR|O_LARGEFILE) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
mmap2(NULL, 24576, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0xb7b14000
close(3) = 0
stat64("/var/lib/rpm/__db.002", {st_mode=S_IFREG|0644, st_size=1318912, ...}) = 0
open("/var/lib/rpm/__db.002", O_RDWR|O_LARGEFILE) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
mmap2(NULL, 1318912, PROT_READ|PROT_WRITE, MAP_SHARED, 3, 0) = 0xb79d2000
close(3) = 0
futex(0xb7b19ef8, FUTEX_WAIT, 2, NULL

And it just stops there at futex.
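In case it's useful, a quick way to confirm a hung rpm/yum process really is parked on that futex (<PID> below is a placeholder for whatever ps reports):

$ cat /proc/<PID>/wchan   # should print the kernel function it's blocked in, something like futex_wait
$ strace -p <PID>         # attaching should show it sitting in futex(..., FUTEX_WAIT, ...)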
Created attachment 132948 [details] The output of an strace on RPM
Created attachment 132949 [details] The 'core' file requested - It's from this command: yum install test
There are 2 problems here: a segfault and the subsequent hang. The hang is fixed by doing rm -f /var/lib/rpm/__db* to remove stale locks. Stale locks must be removed after *EVERY* unusual event, like a segfault or a kill -9, or a hang will occur.
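To spell that out, a sketch of the recovery sequence (assuming the default rpmdb location /var/lib/rpm; the query step is only there to recreate the environment files):

# killall -SIGKILL rpm yum     # only if something is still hung on the lock
# rm -f /var/lib/rpm/__db*     # remove the stale Berkeley DB environment/lock files
# rpm -qa > /dev/null          # any query should recreate a clean environment
# rpm --rebuilddb              # optional, if queries still misbehave afterwards

No reboot should be needed after that.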
That seems to have solved it. Now that I remember it, there was an issue once with Yumex where it closed unexpectedly in the middle of updating packages - that must have been the stale locks it left around that made everything else freeze... The segfaults seem to have gone away after updating my yum and python, and now there are no more hangs. Thanks! Stewart
Hello, I just did a clean install of Fedora Core 6 Test 2, and there have been no incidents of bad yum, yumex or RPM runs, yet it's happened many times now that RPM freezes and I have to rebuild the DB. Is it possible that there's a minor bug in this version (rpm-4.4.2-31)?
Stewart - please document the exact steps you took to get from a FC6T2 install to seeing the error; can you provide a reproducible test case? Can you also run memtest86+ on your machine and check dmesg for drive errors?
I checked dmesg, nothing came up... I'll run memtest ASAP, but I've run it before and it came up clean... I feel bad taking your time because I can't reproduce it - it's just random lockups... Do you think I should continue rebuilding my DB for now (it works fine after a rebuild) and see if it still occurs in FC6 final?
I've run memtest for about 15 minutes, nothing came up. It's odd, because I do python programming and even when using other python programs - no segfaults. It only happens with RPM and/or yum. It's already happened twice today - and both times it was for installing the same 2 packages. Maybe it's triggered by input?
The last couple of yum updates to devel which included glibc-common have halted immediately after updating that package. It's done it again. The yum process is currently waiting on futex. The only way to stop those processes accessing this futex is to kill -9. Even after killing yum, neither yum nor rpm can gain the futex. The only way to reset this condition is to reboot the system.

# ps l 15921
F   UID   PID  PPID PRI  NI   VSZ   RSS WCHAN STAT TTY     TIME COMMAND
4     0 15921  2604  18   0 91684 60872 futex S+   pts/10 23:30 /usr/bin/python /usr/bin/yum update

I'm running this on a 300 MHz Pentium II notebook. I've also seen seg faults on successive rpm database access attempts, but not this time. My hypothesis includes either kernel problems or glibs update residue. I've done an rpm --rebuilddb, but the problem continues. Currently running kernel-2.6.17-1.2571.fc6. I saw seg faults in 2566 and one earlier (possibly 2548).
"glibs update residue" should read "glibc update residue". The "rm -f /var/lib/rpm/__db*" allowed me to access the rpm database and start the yum update again, all without rebooting.
Yes - That's what I'm experiencing, and everything still works after removing stale locks (the __db* files) and rebuilding the database - It's just annoying to keep having to do it.
I am still seeing these problems on my dual core CPU; I never saw them on my AMD64 single core CPU. RPM will crash with random segfaults, and after that the database cannot be accessed anymore. The single-threaded RPM will hang waiting for a mutex, which is a bit strange when it has only 1 thread. As described, removing the __db* files fixes the problem until the next crash.

This is with:
kernel-fc6PAE 2.6.17-1.2617.2.1.fc6
glibc 2.4.90-28
rpm 4.4.2-31
I noticed that it happens more often when yum's updating 'glibc' - I've stayed away from glibc updates for a while, I'm on version glibc-2.4.90-26 and it's been pretty stable. Mind you, I also switched to the FC5 kernel to fix a Wine bug I was experiencing.
Yup, I just updated to development's glibc, and I've had two hangs in a row since. I'm using FC6's rpm and kernel, but development's glibc. The first time it hung, however, something different happened... The terminal went into a weird character set, like what happens when you 'cat' a binary file - but before that I saw a quick error that went something like:

-3 ERROR_DB_PANIC: run DB recovery
-3 ERROR_DB_PANIC: run DB recovery
-3 ERROR_DB_PANIC: run DB recovery

The second time was just 'Segmentation fault'.
Segfaults and loss of data are likely due to removing an rpmdb environment without correcting other problems in the rpmdb.

FYI: Most rpmdb "hangs" are now definitely fixed by purging stale read locks when opening a database environment in rpm-4.4.8-0.4. There's more to do, but I'm quite sure that a large class of problems with symptoms of "hang" are now corrected.

Detecting damage by verifying when needed is well automated in rpm-4.4.8-0.4. Automatically correcting all possible damage is going to take more work, but a large class of problems is likely already fixed in rpm-4.4.8-0.8 as well.

UPSTREAM
I hope Fedora decides to use the upstream RPM soon - I've been reading around, it keeps getting better and better...