Bug 211254

Summary:

apt-get segfault

Product:

[Fedora] Fedora

Reporter:

Eli Wapniarski <eli>

Component:

apt

Assignee:

Axel Thimm <Axel.Thimm>

Status:

CLOSED DUPLICATE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Priority:

medium

CC:

davej, extras-qa, imtiaz.rahi, jerome.benoit, pierre-bugzilla, pmatilai

Keywords:

Reopened

Hardware:

i686

OS:

Linux

Fixed In Version:

0.5.15lorg3.2-9

Doc Type:

Bug Fix

Last Closed:

2007-02-05 17:29:11 UTC

Attachments:

Description	Flags
apt sources	none
/var/lib/apt and /var/cache/apt	none
Band-aid patch for the cache corruption	none

Description Eli Wapniarski 2006-10-18 05:22:40 UTC

Description of problem:

apt-get update or upgrade segfaults

Version-Release number of selected component (if applicable):

0.5.15lorg3.2-7.fc5

How reproducible:

Always happens

Steps to Reproduce:
1. run apt-get update
  
Actual results:

segfault

Expected results:

List of repositories updating

Additional info:

Rebuilding the source code seems to have fixed the problem.

Comment 1 Eli Wapniarski 2006-10-18 05:43:40 UTC

I've rebuilt the source RPMs for 4 computers. Three of them seem to have no problem
with apt-get; the fourth one continues to segfault.

I expect that in the not too distant future the others will segfault again. I will let you
know.

Comment 2 Axel Thimm 2006-10-18 06:17:35 UTC

Please add some more information about the segfault, for example backtraces with
the debuginfo package installed. For self-rebuilt packages you need to use the
debuginfo package that you built, otherwise use that from the repos.

Comment 3 Eli Wapniarski 2006-10-18 08:06:06 UTC

OK.. I have the debug package installed. But when running apt-get update I still
get a segmentation fault, but without additional info. How do I generate or find
the backtrace info?

Comment 4 panu.matilainen 2006-10-18 09:44:45 UTC

With debuginfo package installed, you need to run it under gdb:

# gdb apt-get 
(gdb) run update
... and when it crashes, get the backtrace:
(gdb) bt

Then copy-paste the full gdb session here. Oh and btw, I'd prefer getting the
backtrace from FE built package rather than self-rebuilt versions to eliminate
unnecessary variables from the equation.
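
For anyone who would rather capture the same session non-interactively, the gdb steps above can be sketched as a small shell wrapper. This is only a sketch: the function name capture_backtrace and the GDB override variable are made up for illustration, and it assumes a gdb that supports the -batch, -ex and --args options.

```shell
# Hypothetical helper: run a command under gdb in batch mode and save the
# whole session, including the backtrace, to a log file.
# GDB can be overridden (an assumption made here so the wrapper can be
# dry-run with GDB=echo).
capture_backtrace() {
    log=$1
    shift
    # -batch: exit when the commands are done
    # -ex run: start the program
    # -ex bt:  print the backtrace once the program has stopped (e.g. SIGSEGV)
    "${GDB:-gdb}" -batch -ex run -ex bt --args "$@" >"$log" 2>&1
}

# Example (needs the matching debuginfo package installed):
#   capture_backtrace /tmp/apt-get-bt.log apt-get update
```

The resulting log file can then be pasted into the bug in place of a hand-copied session.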

Comment 5 Eli Wapniarski 2006-10-18 09:49:57 UTC

I would be very happy to comply with the request to get debug info from the
prebuilt FE packages, but the debug packages are not supplied. I will let you
know soon how things go.

Comment 7 Eli Wapniarski 2006-10-18 10:01:48 UTC

Here is the backtrace

Reading Package Lists... 0%
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1208555024 (LWP 24180)]
0x0054eb43 in strlen () from /lib/libc.so.6
(gdb) bt
#0  0x0054eb43 in strlen () from /lib/libc.so.6
#1  0x0095d7d6 in std::string::compare () from /usr/lib/libstdc++.so.6
#2  0x00c4a4e1 in rpmPkgListIndex::FindInCache (this=0x9526f38,
    Cache=@0xbf8a761c)
    at
/usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../include/c++/4.1.1/bits/basic_string.h:2200
#3  0x00cb4fa0 in CheckValidity (CacheFile=Variable "CacheFile" is not available.
) at pkgcachegen.cc:654
#4  0x00cb55b9 in pkgMakeStatusCache (List=@0xbf8a7884, Progress=@0xbf8a7918,
    OutMap=0xbf8a7a18, AllowMem=false) at pkgcachegen.cc:789
#5  0x00c9e1e1 in pkgCacheFile::BuildCaches (this=0xbf8a7a18,
    Progress=@0xbf8a7918, WithLock=false) at cachefile.cc:74
#6  0x00c9e304 in pkgCacheFile::Open (this=0xbf8a7a18, Progress=@0xbf8a7918,
    WithLock=true) at cachefile.cc:94
#7  0x0806543a in CacheFile::Open (this=0xbf8a7a18, WithLock=true)
    at apt-get.cc:100
#8  0x080566c1 in DoUpdate (CmdL=@0xbf8a829c) at apt-get.cc:1748
#9  0x00c2506b in CommandLine::DispatchArg (this=0xbf8a829c, Map=0xbf8a8210,
    NoMatch=true) at contrib/cmndline.cc:340
#10 0x0805dbc5 in main (argc=2, argv=Cannot access memory at address 0x4
) at apt-get.cc:3312
#11 0x004fa4e4 in __libc_start_main () from /lib/libc.so.6
#12 0x0804d221 in _start ()

Comment 8 Eli Wapniarski 2006-10-20 19:10:15 UTC

OK.... Just blind as a bat. I did not see the debug folder in extras. So I reinstalled all 
packages from FE including the debug folder. I got exactly the same results.

Comment 9 Eli Wapniarski 2006-10-21 07:50:26 UTC

OK --- I just downloaded and compiled and installed the latest development release 
apt-0.5.15lorg3.90 from apt-rpm.org.

On the machines that were giving me trouble, apt-get seemed to work as expected at
least once. I will let you know next weekend if it continues to run trouble free, unless
new packages are provided. If that happens, I will report on how apt-get works with
the new packages when provided.

I appreciate very much the work that you are doing

Comment 10 Axel Thimm 2006-10-21 08:33:52 UTC

I cannot reproduce it, neither on FC5, nor on FC6. I wonder if your metadata
under /var has some trouble/corruption. Could you nuke that? Best done by
uninstalling apt, removing /var/cache/apt and /var/state/apt and reinstalling apt.

Comment 11 Panu Matilainen 2006-10-21 09:34:38 UTC

I can't reproduce it either, and the traceback suggests, as Axel says, that
there's something very strange about the metadata.

What repositories are in use on the systems where apt crashes? (sources.list and
sources.list.d contents)

Comment 12 Eli Wapniarski 2006-10-21 09:49:56 UTC

Funny... Axel, I did as suggested, and everything seemed to work at least once. I'm sure
there will be ample opportunity to test things out with kde-redhat updating from 3.5.4 to
3.5.5. I will let you know by the end of next week if things continue to function as
expected. The question remains, how do three computers out of five have problematic
metadata? hmmm.

Panu - It doesn't crash while downloading package data. It crashes either at the very
beginning (before reading the data), or during the "Reading Package Lists" stage or the
"Building Dependency Tree" stage.

Comment 13 Eli Wapniarski 2006-10-21 09:51:58 UTC

Oh.. one other thing maybe I should mention: I am getting a few packages from Livna
as required by kde-redhat. They use repomd exclusively, having dropped support for
traditional apt-get.

Comment 14 Panu Matilainen 2006-10-21 10:38:18 UTC

Eli, can you attach (or make somehow accessible) the exact contents of
/etc/apt/sources.list and sources.list.d directory on a system where apt crashes?
According to the traceback, the crash occurs on an old-style apt-rpm repository,
so that rules out all the repomd repositories such as FC+FE, Livna and kde-redhat.

Comment 15 Eli Wapniarski 2006-10-21 10:46:04 UTC

Created attachment 139053 [details]
apt sources

Here you go

Comment 16 Axel Thimm 2006-10-21 11:10:24 UTC

Panu, don't you also need a tarball of Eli's /var/*/apt contents to reproduce
it? If yes, then Eli please make them available through some URL instead of
attaching them to the bug :)
(but wait to see if Panu really needs them, maybe he doesn't)

Comment 17 Panu Matilainen 2006-10-21 11:51:50 UTC

I was basically hoping for an easy reproducer with just the info about
repositories :) Alas, no such luck.

So yes, to further track this I'd need the following bits from a system that crashes:
/var/cache/apt/*.bin
/var/lib/apt/ (can be at /var/state/apt in some cases) contents in their
entirety, although the problem is most likely in the *.bin files.

The *.bin files are the key to pretty much everything in apt and quite often
just removing them fixes various more-or-less mysterious problems, would seem to
be the case here as well according to comment #12. So, after backing up the
current cached files, do the cleanup steps on each problematic box and let's see
if that helps. Even if that cures the problem, the corrupted cache data is
interesting to me as garbage data shouldn't segfault, just error out cleanly.
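
The "back up first, then clean up" step above can be sketched roughly as follows. This is a sketch only: the function name backup_and_clear_apt_cache and its arguments are hypothetical, the default directories are the ones named in this comment (pass /var/state/apt as the third argument on systems that use it), and it only removes the *.bin files rather than doing the full uninstall/reinstall from comment #10.

```shell
# Hypothetical helper: archive the apt cache and state directories, then
# remove only the *.bin cache files so apt rebuilds them on its next run.
# Directory arguments default to the paths named in this comment; pass
# other directories to try it on a scratch copy first.
backup_and_clear_apt_cache() {
    backup=$1
    cache_dir=${2:-/var/cache/apt}
    state_dir=${3:-/var/lib/apt}
    # Keep a copy of the possibly corrupted data for later analysis.
    tar -czf "$backup" "$cache_dir" "$state_dir" || return 1
    # Only the binary caches are removed; downloaded lists are left alone.
    rm -f "$cache_dir"/*.bin
}

# Example (as root):
#   backup_and_clear_apt_cache /root/apt-cache-backup.tar.gz
```

Keeping the archive outside the two directories avoids tar trying to include its own output.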

Comment 18 Eli Wapniarski 2006-10-21 12:12:03 UTC

Sorry Panu. Those files were cleaned way back in comment #12. If this should
happen again, I will indeed send the files.

Axel, could you keep this open for a week? Like I said, in about a week's time, if everything
is OK, I will post a comment to indicate whether all is well or not.

Comment 19 Axel Thimm 2006-10-21 13:55:37 UTC

No, I won't close it. I'm lowering the severity as it seems to only affect your
system (and there are many apt users still). If you find that there is no issue
anymore you can close it, too. If the issue doesn't pop up again or only pops up
on one system I suspect that you may have some bad ram somewhere corrupting the
cache data.

BTW some people sync their contents of /var/*/apt to save some bandwidth. Don't
do it; otherwise, if something eats up your bits on one system, you mirror the
issue to the other healthy systems, too.

Comment 20 Eli Wapniarski 2006-10-22 10:00:09 UTC

Created attachment 139074 [details]
/var/lib/apt and /var/cache/apt

It happened again

Comment 21 Eli Wapniarski 2006-10-24 05:19:30 UTC

I'm getting pretty convinced that this is a memory issue and it may not be an apt
problem per se. However, I'm seeing this happen only with apt, and it's only been
happening since the last kernel update. So, I suspect that there is something buggy
about the latest kernel release. But how do I figure out where?

I'm dealing with 4 FC5 boxes. Three of them have 512 MBytes of RAM; the 4th has
1 Gig. On the 1 Gig box I have yet to see this problem. On the other three boxes
this is a continuing and ongoing nag. Running RPM commands by themselves does
not seem to be a problem, only running apt.

Comment 22 Panu Matilainen 2006-10-31 11:28:47 UTC

Apt uses quite a bit of memory, especially with repomd repositories, so it's
possible it triggers something more easily than others. Do you remember at which
kernel update this started happening and if you downgrade the kernel back to
some older version does it actually stop this from happening? Also, is there
anything even remotely relevant in /var/log/messages from the time when this
corruption happens?

Sorry, I haven't had a chance to look at your cache data yet, been a bit busy
with other things :-/

Comment 23 Eli Wapniarski 2006-10-31 20:07:18 UTC

It'll be a while before I can get back to this. I'm trying to upgrade my main desktop from
fc5 i386 to fc6 x86_64. It's a real nightmare. I will try to get back to this within a week,
that is if I don't need to reinstall my desktop.

The kernel that is currently installed on the machines that are giving me trouble is
2.6.18-1.2200.fc5 i686.

One correction to make is that one of the computers has not 1/2 Gig but rather
256 MBytes. And before I can use apt-get successfully on this machine, I have to restart
the computer and run

rm -f /var/lib/rpm/__db*
rpm --rebuilddb

because the rpmdb gets corrupted.

also, I have to remove the bin files from /var/cache/apt and sometimes the files 
in /var/lib/apt/lists

This can't be good for this machine.

Eli

Comment 24 Eli Wapniarski 2006-11-23 05:32:38 UTC

OK... I'm back. Had a bit of trouble getting the upgraded machine to function properly 
but that's another bug.

In the meantime I've had a little time to do some research, and found that the problem
does not affect only apt-get, but rpm and yum as well. A cursory search of this Bugzilla
site for "rpm segfault" or "yum segfault" reveals several reports.

An example would be https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213963

I really would like to help getting this problem solved.

Thanks for your patience

Comment 25 Axel Thimm 2006-11-23 10:32:29 UTC

There are known bugs in yum on FC6 due to opening/closing rpmdb too often, or
one can argue that yum only exhibits bugs that were previously in the rpmlib
code but undetected. Either way you look at it, one would have to uninstall yum,
fix up any broken rpm metadata (e.g. rm -f /var/lib/rpm/__db*; rpm --rebuilddb)
and only use apt for a while.

Note that yum is automatically invoked by applets, cron jobs, daemons and the
like if present, so you really need to uninstall it for the sake of testing. Can
you please do so for say a week and report back with positive or negative
results? Thanks!

Comment 26 Eli Wapniarski 2006-11-24 06:10:34 UTC

OK... Here's the situation. I don't like yum and if I can avoid using it I do. And I have
been avoiding it for several years. I don't invoke the automatic scripts, because I like to
do kernel updates manually. And, on occasion I like to change runlevels in order to
ensure that I get trouble free updates (another story).

I deal with 5 fc computers. Three of them are fc6 and 2 are fc5. The fc6 computers
function as desktops while the fc5 function as servers.

FC6 computers Memory Composition
------------------------------------------------
1 Gig
512 MBytes
256 MBytes

FC5 computers Memory Composition
----------------------------------------------
512 MBytes
256 MBytes

The only computer that is giving me consistent headaches is the 256 MByte FC5 machine. It is
running several services. It provides gateway, dns, mail gateway, ftp, http, and ssh
services. So, it uses a considerable amount of memory, but until recently I had never
had a problem with apt-get. In order to fix the problems, for the last month or
so, I've had to:

rm -f /var/lib/rpm/__db*
rm -f /var/cache/apt/*.bin
rm -f /var/lib/apt/lists/*.*

Then reboot the computer, run rpm --rebuilddb,

then

apt-get update
apt-get upgrade

Usually worked but a real pain in the butt.

A couple of days ago, that didn't work either. So, I tried yum, because I need to keep
the system patched, especially with security updates since this computer constantly
faces the internet.

First invocation, yum segfaulted. After simply running

rm -f /var/lib/rpm/__db*

and running yum update again, everything worked the way it was supposed to. However, I
started getting packages from repos I did not want the packages to come from. And
this is why I prefer apt-get to yum. I love the pinning feature available in apt-get. It
saves my brain a great deal of confusion.

Anyway, on the other machines I am still able to run apt-get. When I see the segfault,
simply running

rm -f /var/cache/apt/*.bin

fixes the problem, after which

apt-get update
apt-get upgrade

work.

I have never had a problem like this until recently, which makes me suspect the
upgrading of critical libraries as the main culprit. This brings me back to my suspicion
that the main instigator is a kernel update. Why? Because, if I recall correctly, one of the
updates to the 2.6.18 kernel changed the way the kernel manages memory and, if I'm
not mistaken, a segfault is some kind of screwup in memory.

Comment 27 Eli Wapniarski 2006-11-24 06:13:36 UTC

Oh... I've copied this message over to the yum bug.

Comment 28 Eli Wapniarski 2006-11-24 07:10:07 UTC

OK .... I just installed  apt 0.5.15lorg3.2-8 for fc5 and fc6 so far so good. On the 
machine giving me the most trouble, apt-get worked on the first run. If I continue to run 
trouble free, I will let you know. If of course I continue to get segfaults, then I will let you 
know sooner.

Thank you so much for your work.

Comment 29 Eli Wapniarski 2006-11-24 07:14:40 UTC

Oh... one other thing. Looks like synaptic needs to be rebuilt on fc6 platforms.

Comment 30 Axel Thimm 2006-11-24 11:02:13 UTC

(In reply to comment #28)
> OK .... I just installed  apt 0.5.15lorg3.2-8 for fc5 and fc6 so far so good. 

There were only ppc related changes in this release, this will not work
better/worse in your context than the previous one. :/

Please do the testing with only rpm involved as mentioned in bug #213963.

Comment 31 Eli Wapniarski 2006-11-24 11:45:06 UTC

Sorry, Axel, but this is about as far as I can go with the previous version. I have 
provided a traceback, requested information and all the observations that I know how 
to provide. There is nothing for me left to test. The previous version under current 
circumstances is useless for me.

I have to move on. Hopefully the current version fixes the problem since, no doubt it 
was compiled against the most recent libraries. Like I said, I will let you know how 
things go.

Comment 32 Axel Thimm 2006-11-24 12:55:52 UTC

You indicated in this and other reports that this seems to stem from other
non-apt related parts of the system (yum and/or rpm/kernel), which suggested
that apt may never have been at fault. Therefore the isolated testing is
necessary to see whether apt was ever at fault.

I completely understand lack of time, so if you can't proceed with debugging,
let's close this as WORKSFORME, as there hasn't been anyone (including Panu and
myself) that could reproduce it in the sense of apt being responsible for
corrupting rpmdb. I'll put it into NEEDINFO for now.

Comment 33 Eli Wapniarski 2006-11-24 18:51:03 UTC

Well Axel,

No joy. I attempted a second apt-get run on the "more trouble than it's worth" machine
and I get segfaults and the appearance of a corrupt rpm database. I guess I will have
to use yum (yech) on that computer.

Please, you, or anyone else, let me know if you have anything else to suggest that I 
can do to troubleshoot or help to resolve this issue.

Comment 34 Axel Thimm 2006-11-24 19:02:02 UTC

I already suggested a way to help in bug #213963 comment #4.

Should the rpm-stress test fail, then the bug is in rpm/kernel. If it succeeds,
then one needs to check one depsolver at a time (e.g. only apt-get installed or
only yum installed) by a similar stress test.

Also did you check your memory hardware?

Comment 35 Eli Wapniarski 2006-11-25 05:58:11 UTC

What I mean to suggest, is that there may be a relationship between the new memory
management scheme in the newer kernels and the problems that I have experienced.
They appear to be significantly more severe when using apt-get.

I did not test memory, because I find it improbable that I am experiencing a hardware
difficulty when the problem appears on five separate computers. All of which
experience the same difficulty to varying degrees depending on the amount of memory
and the load on memory. The most severe is a gateway server with 256 MBytes RAM
running FC5.

I have already sent all the relevant information, and they are attached to this bug
report. I have yet to hear about the results from Panu.

I do not believe that the RPM database is corrupted by apt-get but it appears that way
to apt-get. The reason that I say this, is that if I experience the rare segfault with yum
then removing the lock files (__db*) as suggested in the link I provided about the
problem in yum and then run yum update everything works OK. Only, I'm getting
packages from repos that I don't want to get them from. This is the reason that I prefer
apt-get (pinning).

I have tried to run apt-get with only headers coming from Freshrpms (core, updates,
extras, freshrpms). Same results. If I use the repomd configuration linking to Fedora
itself, then I may as well use yum, because then all pinning goes out the window,
because, as far as I know, pinning is not supported with repomd data.

It's not that I don't have the time to continue. I'm terrified that I may actually do
irreparable damage to the RPM database if I continue to use apt-get on that machine.

I'm willing to continue to troubleshoot, but let's not go around in circles, making me
write the same detailed report over and over again. I do not have the time for that :).

Comment 36 Axel Thimm 2006-11-25 13:13:34 UTC

Eli, it is important to remove yum, if you want to do any testing with rpm
and/or apt-get. yum is called by applets, daemons, cron jobs and who knows what
else. So you may think that you're not using it, but in fact you do. That's why
the instructions on bug #213963 comment #4 asked you to remove even both yum and
apt.

And grepping this bug report for yum shows that you've been using it to
cross-check the results/failures you have been encountering all along, so the
results you quote are a mixed use of rpm/apt/yum and we can't put a finger on
one of them. In order to see which component is at fault, rpm/kernel, apt or
yum, you need to isolate the problem.

You suspect kernel/rpm, then please follow the instructions and start with a
good rpmdb and w/o any apt/yum/etc. tools around. If you manage to break the
rpmdb, then you'll have proved that it's rpm/kernel that is the rogue character.
Otherwise you would have to add apt to it and repeat. If it breaks then, it's apt.
And if it doesn't it was yum all along.

Comment 37 Eli Wapniarski 2006-11-25 13:33:29 UTC

Please explain to me how in the world will I be able to maintain the computer without 
yum or apt-get?

How in the world am I to determine if the rpmdb is good or not?

I have used apt-get exclusively for several years. I have never had yum-updateonboot
and yum does not exist in any of the cron jobs. On the most troublesome computer, no
other process that I'm aware of does an automatic update. I don't like and don't use
gnome except for a few applications, primarily synaptic.

Which daemons call yum? Maybe I can configure them out of running automatically?

Comment 38 Axel Thimm 2006-11-25 15:04:08 UTC

Eli, I didn't imply staying w/o yum/apt for the rest of your life :)

The rpmdb stress tests wouldn't take longer than 5 minutes each in any testing
of yours, since the bug seems to hit you so often with regular updates.

Anyway, we're not really pushing this any further, maybe Panu will have
something to say when he looks at the apt cache, or maybe the bug will vanish
once the pure yum bugs elsewhere in this bugzilla get fixed. If a yum/kernel
update makes your problem vanish, please note this in this bug.

Comment 39 Axel Thimm 2006-11-25 15:21:03 UTC

For reference here are some bugs in rpm/yum of which this may be a duplicate:

bug #203233
bug #206275
bug #211254
bug #212504
bug #213963
bug #214129

Comment 40 Panu Matilainen 2006-11-26 12:24:19 UTC

The long and the short of this is that this is not an easy "ahhah, there's the
NULL pointer dereference" type of bug, don't expect it to be fixed "just like
that". Something is causing corruption in apt's main data structure (which is a
memory-mapped cache file) and whether it's the combination of 2.6.18 kernel + low
memory + apt-rpm's mmap() usage patterns or something else remains to be seen.

I'm reinstalling my 32bit testbox now and will try to see if I can (eventually)
reproduce it by limiting available memory.

Comment 41 Eli Wapniarski 2006-11-27 04:41:18 UTC

Thank you Panu. I'm looking forward to hearing your results.

Comment 42 Axel Thimm 2006-11-29 19:31:46 UTC

*** Bug 214846 has been marked as a duplicate of this bug. ***

Comment 43 Axel Thimm 2006-11-29 19:32:14 UTC

*** Bug 217707 has been marked as a duplicate of this bug. ***

Comment 44 panu.matilainen 2006-11-30 07:09:26 UTC

Since this is now the central bug for tracking this issue, here are my findings
so far:

I managed to reproduce the second backtrace here. The steps:
- install fresh FC6-i386, pretty much default installation
- boot with mem=256M, disable swap
- # apt-get update
- # apt-get dist-upgrade
- the dist-upgrade died in middle of transaction after first package upgrade
- subsequent apt-get dist-upgrade runs crash in FindInCache

After rebuilding apt cache it's not segfaulting anymore (and can't reproduce
that at will, so it doesn't happen *always*) but dist-upgrade with over hundred
packages keeps exiting "normally" after just one package upgrade (rpmlib calls
exit(0) at some signal apparently, I'll need to talk to JBJ about that). After
re-enabling swap dist-upgrade appears to continue normally now.

So, it would appear that this is at least related to systems being tight on
memory. Why this has only appeared now ... is it a matter of repositories
getting bigger, kernel changes or what remains to be seen. I should have time to
look properly at this today with wife and the kid out for the evening :)

Now, couple of things that *might* help, and on which I'd like to hear test
results (after clearing the various potentially corrupted caches):
1) Try adding (temporarily) more swap to the system, for example just double
what you have now. Swapfile will do just fine as it's intended for just a
temporary check/bandaid.
2) Try setting 'RPM::PM "external";' in /etc/apt/apt.conf. That causes apt to
use external rpm process to run the transactions which has the side-effect of
essentially splitting the memory usage between two processes, making kernel's
OOM killer less trigger happy to terminate the upgrade process.

1) is the test I'm more interested in.
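
For test 1), adding a temporary swap file could look roughly like the sketch below. The function name add_temp_swap, the RUN dry-run variable and the example path and size are all made up for illustration; the underlying dd, chmod, mkswap and swapon commands need root when run for real.

```shell
# Hypothetical helper: create and enable a temporary swap file.
# Set RUN=echo to print the commands instead of executing them;
# executing them for real requires root.
add_temp_swap() {
    file=$1
    size_mb=$2
    # Allocate the file, restrict its permissions, format it as swap,
    # then enable it. Each step runs only if the previous one succeeded.
    ${RUN:-} dd if=/dev/zero of="$file" bs=1M count="$size_mb" &&
    ${RUN:-} chmod 600 "$file" &&
    ${RUN:-} mkswap "$file" &&
    ${RUN:-} swapon "$file"
}

# Example (as root; path and size are placeholders):
#   add_temp_swap /var/tmp/swapfile 512
# To undo afterwards:
#   swapoff /var/tmp/swapfile && rm -f /var/tmp/swapfile
```

Running it once with RUN=echo shows exactly what would be executed before anything touches the system.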

Comment 45 Panu Matilainen 2006-11-30 07:27:49 UTC

(duh, previous post while logged in to "wrong" account, sorry about that)

One thing I forgot to mention: try to keep an eye on how apt installs/upgrades
finish - you should always see "Done." at the end of "Commiting changes..."
output, if you don't, then it has died abnormally in middle of transaction
(because rpmlib has called it quits without giving a chance for apt to do
anything about it). That abnormal exit is at least one possible cause for this
problem.
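
One way to keep an eye on this without watching every run is to save apt's output and check how it ended. The sketch below is an assumption-laden illustration: the function name transaction_finished is hypothetical, and it relies only on the description above that a clean run ends with a line containing "Done.".

```shell
# Hypothetical helper: succeed only if the saved output of an apt run ends
# with a line containing "Done.". The exact output format is an assumption
# based on the description in this comment.
transaction_finished() {
    # Drop blank lines, then inspect the last remaining line.
    grep -v '^[[:space:]]*$' "$1" | tail -n 1 | grep -q 'Done\.'
}

# Example:
#   apt-get dist-upgrade 2>&1 | tee /tmp/apt-run.log
#   transaction_finished /tmp/apt-run.log || echo "apt may have died mid-transaction"
```

If the progress output uses carriage returns rather than newlines, the last "line" may need further splitting before this check is reliable.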

Comment 46 Eli Wapniarski 2006-11-30 18:56:14 UTC

Thanks Panu for what sounds like very reasonable suggestions and tests. I will be 
trying these things first thing in the morning. Its been a very long week and I need 
some food and sleep. I will let you know.

Comment 47 Eli Wapniarski 2006-12-01 07:22:45 UTC

I just ran test 1) as per comment 44. Apt-get worked and exited normally as done. I 
added the temporary swap file (512MBytes) to fstab for the time being. This was a first 
run.

One more bit of behavior that I have noticed; after cleaning out the caches and lock 
files, then rebuilding the rpm database, I usually get things to work for one run. After 
that run, if I immediately run apt-get update / upgrade no segfaults. Mind you, there is 
nothing to upgrade. However when the next set of packages are ready to be 
upgraded, then apt-get will segfault.

Comment 48 Eli Wapniarski 2006-12-02 06:00:37 UTC

As per test 1). I had the opportunity last night to make the second consecutive run with 
apt-get. The problem continues. So providing more swap memory didn't help. I will be 
testing 2) next.

Comment 49 Eli Wapniarski 2006-12-02 06:13:27 UTC

Test 2) better. I first had to

rm -f /var/lib/rpm/__db*
rm -f /var/lib/apt/lists/*.*
rm -f /var/cache/apt/*.bin

I was able to run apt-get update / upgrade without a segfault and without having to
reboot and run rpm --rebuilddb

Mind you, there were no packages to upgrade. So, the test is incomplete. I will let you 
know how things go once there are new packages available.

Comment 50 Eli Wapniarski 2006-12-03 05:33:18 UTC

As per test 2), The problem persists.

Comment 51 Panu Matilainen 2006-12-03 12:42:21 UTC

Ok, pretty much expected, I have been able to reproduce the problem with gobs of
memory available so it apparently wasn't related to that after all. I've gotten
a bit further in my investigations now, it IS related to rpmdb, but just exactly
how is a bit of a mystery. The crash occurs because something causes a pointer
in apt's cache to what should be a string containing rpm database path to be
NULL, but the rpm database itself seems to be intact. That's a side-effect of
*something* - what exactly I dunno yet.

The nasty thing here is that the segmentation fault happens on the run *after*
the damage has been done already, so debugging it is somewhat like post-mortem
analysis :)

Comment 52 Panu Matilainen 2006-12-03 14:20:12 UTC

Well well well, this also seems to be happening on Debian apt:
http://ubuntuforums.org/showthread.php?t=266566
https://launchpad.net/distros/ubuntu/+source/apt/+bug/61708
http://bugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=383223
http://bugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=355047

All of those are reasonably recent and crash occurs in the very same place (if
you ignore the used repository type) - something has corrupted (one of) the
index file names in cache. Could well be a long standing bug in apt cache
handling, only triggered by some of the changes in latest kernels. One
possibility could be some of the address-space randomization things, just a wild
guess though.

Comment 53 Eli Wapniarski 2006-12-03 16:26:17 UTC

Looking forward to hearing that you've found and fixed the problem. As always, I 
remain available for testing purposes.

Comment 54 Axel Thimm 2006-12-11 20:50:59 UTC

*** Bug 219134 has been marked as a duplicate of this bug. ***

Comment 55 Panu Matilainen 2006-12-21 19:59:47 UTC

Dunno, but this sounds eerily familiar:
http://article.gmane.org/gmane.linux.kernel/477324

- Linus' test program on FC6 2.6.18-1.2849.fc6 kernel behaves the way he expects
2.6.19 to work
- For 2.6.18 you need to be unlucky and under memory pressure
- "Some data on mmaped file appears zeroed" is exactly the kind of corruption
that triggers this crash

I would really appreciate it if the people who are hitting this with any sort of
regularity could try downgrading their kernel to something older (for example
2.6.17-1.2187_FC5 with Linus' test doesn't exhibit zeroes in the middle) for a
while and see if you can still reproduce the crash eventually. 

Meanwhile I'll have a look at apt's mmap code and see if there's anything
resembling the "trigger pattern."

Comment 56 Eli Wapniarski 2006-12-22 04:41:01 UTC

Sorry Panu, downgrading the kernel is the one thing that I cannot comply with :( The 
computer that's giving me the most problems is a server facing the internet and I 
cannot, in good faith to the company that I'm working for, compromise kernel security 
for this.

Comment 57 Panu Matilainen 2006-12-22 06:31:20 UTC

Sure, I'm not expecting anybody to mess around with production environment to get
this sorted out. If others can test older kernels that would be much appreciated.

Comment 58 Imtiaz Rahi 2006-12-22 16:14:13 UTC

Hi Panu,
I am going to test this issue with an older kernel.
I am using FC6 and going to use the 2.6.17-1.2187_FC5 kernel as you have mentioned.
I hope to come up with some results within a few days.

Comment 59 Panu Matilainen 2006-12-22 22:01:42 UTC

Ok, thanks.

I've been reading through the long, long thread on linux-kernel mailing list
about the mmap file corruption issue I mentioned in comment #55, and oh boy is
it hairy. Nobody really knows whether it's really just an application bug, only
triggered by recent kernels, a kernel bug triggered by some rare application
usage patterns or combination of both.

The short summary however is that there are basically two applications people
have seen corruption with: (Debian) apt and a bittorrent client. The mmap code
is identical in Debian apt and apt-rpm (it's been unchanged for years AFAIK), so
that kind of confirms that this has indeed been triggered by something in recent
kernels like Eli suspected early on.

Comment 60 Panu Matilainen 2006-12-22 22:43:28 UTC

Created attachment 144319 [details]
Band-aid patch for the cache corruption

While looking for the real cause and solution, here's a band-aid patch to help
the situation. The patch does NOT fix the real issue, it only detects a
specific symptom and forces a cache rebuild when corruption is detected and
issues warnings. That effectively cures the segfault unless you're very
unlucky.

The corruption seems to always happen in the area regarding rpmdb itself, which
is special in the sense that it's always the last one of all "repositories" to
be processed. That's another hint towards some of the findings/speculations on
lkml.

BTW if somebody can capture a full strace of an 'apt-get update' run where the
segfault *initially* happens (the crashes afterwards aren't that interesting)
that might have some interesting data in it.

Comment 61 Axel Thimm 2006-12-22 23:11:58 UTC

Do you want this to be added to the package?

Perhaps the mmap issue is also present in rpm itself? That could explain the
yum/rpm bug in fc6.

Comment 62 Panu Matilainen 2006-12-23 11:00:38 UTC

Might as well add it to the package. Besides avoiding crashes in rpm-related
code (which is always a bit nasty), it should help collect people seeing the
problem to this bug :)

Berkeley DB does use mmap so I suppose it's at least possible the same thing
affects rpm itself as well.

Comment 63 Imtiaz Rahi 2006-12-24 16:02:38 UTC

Today I used apt-get and synaptic on my kernel "2.6.18-1.2798.fc6" and
astonishingly they did not crash.
Looks like the extras and updates repo files of Fedora are good (ok !!!).
Previously, apt-get update crashed while working on extras / updates repo.
So, now I am going to wait for a new crash to happen on "2.6.18-1.2798.fc6" and
then will test with the "2.6.17-1.2187_FC6". That kernel is ready now.

Comment 64 Eli Wapniarski 2006-12-24 19:26:32 UTC

So far so good. The bandaid patch allowed me to get at least one run with apt-get. I will 
let you know if things don't work out.

Comment 65 Panu Matilainen 2006-12-24 20:22:56 UTC

Do note that with the bandaid patch you'll get loud warnings if the corruption
triggers; it just doesn't (or shouldn't ;) crash anymore because of it. So
people who used to see the crash should now see "Cache corruption detected,
band-aid applied" just as often as they used to see the crashes.

Yet another thing people can try: some folks on lkml reported that mounting the
filesystem in question (in apt's case wherever /var is located) with
data=writeback option (assuming ext3 filesystem is used) seems to cure the
corruption issue. If people can try that and see if they still get crashes (or
with the bandaid patch, warnings about corruption) or not, that'd be a useful
data point as well. Check 'man mount' for what the option does in detail and, if
on a production environment, whether the implications matter to you or not.

Oh and remember, a single successful run (meaning no crashes and no warnings)
doesn't mean anything at all. This doesn't trigger anywhere near 100% reliably,
so it's going to take quite a bit of time to be convinced it (be it the mount
option or whatever) made a real difference.

Comment 66 Jérôme Benoit 2006-12-30 00:08:01 UTC

Ok, the band-aid patch helped to get apt-rpm running on my AMD Duron test box, but
maybe this bug is related to a VM kernel bug (I don't have the knowledge to
evaluate this).

See
http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7658cc289288b8ae7dd2c2224549a048431222b3

Thanks.

Comment 67 Panu Matilainen 2006-12-30 14:16:13 UTC

Yup, I'm fairly convinced by now it's that kernel VM bug that's been hitting
apt(-rpm). Now we just need to verify the above kernel patch cures the crashes
(or with the bandaid patch to apt-rpm, the warnings).

If somebody can test it, that'd be great :)

Comment 68 Panu Matilainen 2007-01-05 07:42:15 UTC

Just FYI, there's a kernel update coming out "next week or so" including a fix
for the mmap file corruption issue we have here, so this should get resolved
soonish:
https://www.redhat.com/archives/fedora-devel-list/2007-January/msg00084.html

Comment 69 Panu Matilainen 2007-01-24 07:05:32 UTC

I haven't been able to reproduce this since updating to the latest kernel
(2.6.19-1.2895.fc6). Dunno if that's available for FC5 though.

Mind you, it's possible you'll see the warning *once* after rebooting to the
updated kernel: if the previous run on old kernel has corrupted the cache it'll
hit you the next time you run apt, fixed kernel or no. I'd say it's best to
force the cache rebuild ('rm -f /var/cache/apt/*.bin') after booting to the new
kernel just in case.
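
The forced rebuild above can be wrapped in a small helper that reports what it removed. This is a sketch only: the function name clear_apt_cache is hypothetical, the default directory is the one quoted in the rm command above, and it assumes the next apt-get run regenerates the .bin files on its own.

```shell
# Hypothetical helper: remove apt's binary cache files so that the next
# apt-get run has to rebuild them from scratch.
# $1: cache directory (default /var/cache/apt, as in the command above)
clear_apt_cache() {
    cache_dir="${1:-/var/cache/apt}"
    removed=0
    for f in "$cache_dir"/*.bin; do
        # If the glob matched nothing, $f is the literal pattern; skip it.
        [ -e "$f" ] || continue
        rm -f -- "$f" && removed=$((removed + 1))
    done
    echo "removed $removed cache file(s) from $cache_dir"
}
```

Run as root with no argument (clear_apt_cache) right after booting the new kernel, then run apt-get update as usual. Only files ending in .bin are touched.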

From my POV I consider this case closed. Axel, I suggest you leave the bandaid
patch in place for FC5 and 6 as there could be lots and lots of people running
those with older kernels, for rawhide it can go at this point I think.

Comment 70 Eli Wapniarski 2007-01-24 10:37:17 UTC

I can pretty much confirm that the kernel fixed the problem on my fc6 machines. 
However, as Panu pointed out, there is yet to be a kernel update for fc5. And one in 
particular is giving me no end of headaches.

Anyone have any idea when there will be a kernel update for fc5 incorporating the fix?

Comment 71 Panu Matilainen 2007-01-24 11:41:37 UTC

Dave, any idea when FC5 will get an updated kernel fixing the mmap corruption
thingy (which this bug is all about)?

Comment 72 Axel Thimm 2007-01-24 12:22:20 UTC

> From my POV I consider this case closed. Axel, I suggest you leave the bandaid
> patch in place for FC5 and 6 as there could be lots and lots of people running
> those with older kernels, for rawhide it can go at this point I think.

I'll keep the patch. Wouldn't it even be nice to keep it upstream? It's a
failsafe path that is usually not taken unless something screws up, and such a
net is nice :)

Now, how do I close this bug? It wasn't FIXED in apt, but it's also not NOTABUG.
Since it was fixed elsewhere, it's also not WONTFIX or CANTFIX. Technically
I'd move it to the kernel and close it there, but I don't want the kernel guys
to be confused.

I'll try fixed in CURRENTRELEASE, since there is the bandaid fix for FC5, too.

Comment 73 Eli Wapniarski 2007-01-24 15:30:40 UTC

Axel. The "not a bug" shouldn't be closed quite yet. Maybe it should be transferred over 
to the kernel boys since we still do not have a fix for fc5. Which is why I opened the 
bug in the first place.

Comment 74 Axel Thimm 2007-01-24 15:36:59 UTC

Doesn't the bandaid patch fix any issues with FC5? Agreed, it is not fixing the
cause, but it is a workaround fixing the outcome, i.e. the bug is dealt with.

Comment 75 Eli Wapniarski 2007-01-24 16:05:13 UTC

As Panu wrote in an earlier post, it depends on how memory stressed the system is. 
And I can confirm that this indeed is the case. One of my fc5 boxes sometimes requires 
several

rm -f /var/cache/apt/*.bin
rm -f /var/lib/apt/lists/*.*
rm -f /var/lib/apt/lists/lock
rm -f /var/lib/rpm/__db*

before apt-get will complete its cycle successfully. The thing about the band aid, is 
eventually, apt-get will work on the box giving me my biggest headache without me 
having to reboot the system (most of the time).

Comment 76 Axel Thimm 2007-01-24 16:53:25 UTC

In other words, you still get the bug on your FC5 system even though the bandaid
is supposed to work around/fix that on the fly? Perhaps the bandaid patch does
not always detect the corruption. Panu, can something really slip past the
bandaid fix?

If you have installed the latest apt (that contains the bandaid) and the system
still gets chewed (which comment #75 suggests) please reopen the bug.

Comment 77 Panu Matilainen 2007-01-24 19:28:48 UTC

Whether the bandaid patch reliably detects and corrects the problem is
irrelevant (it's called bandaid for a reason :) There's a real fix to the
problem; getting an updated kernel to the users is the only thing that matters
anymore. That's what I meant with the "from my POV this case is closed" comment:
no amount of bandaid in apt is going to make it reliable if the kernel can't be
trusted to keep our data intact.

Eli, either the bandaid simply isn't working for you or there's a
misunderstanding here: you only need to do the rm -f stuff if you get segfaults
(which means the bandaid didn't help), otherwise the warning is just that: a
warning about this issue being present on the system.

Comment 78 Eli Wapniarski 2007-02-03 06:09:21 UTC

Hi guys, I made a request at

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=227194

to find out when a new kernel release will be available for fc5.

Comment 79 Axel Thimm 2007-02-05 17:29:11 UTC


*** This bug has been marked as a duplicate of 214495 ***

Comment 80 Eli Wapniarski 2007-02-10 05:42:13 UTC

OK... I just noticed that there are 2.6.19 kernels in fc5 update testing. I just installed one 
on one of my fc5 boxes and it booted OK. I will be seeing how it goes for a couple of 
days before I try to get it installed in the pain in the ass box unless of course new 
kernels are made generally available.

For those of you wanting to install it, remember: the kernel is a "testing" kernel, so 
rpm -ivh is in order just in case you need to fall back to the older kernel.

Comment 81 Eli Wapniarski 2007-02-14 19:21:24 UTC

Even better, installed new, generally available 2.6.19 kernel for fc5. Things seem to be 
working. I'd give it a couple more real updates, and if there are no more problems 
then I think that we can call this genuinely done.

Comment 82 Marco Nadal 2007-04-09 04:05:57 UTC

(In reply to comment #36)
> Eli, it is important to remove yum, if you want to do any testing with rpm
> and/or apt-get. yum is called by applets, daemons, cron jobs and who knows what
> else. So you may think that you're not using it, but in fact you do. That's why
> the instructions on bug #213963 comment #4 asked you to remove even both yum and
> apt.

If I try to remove yum using synaptic, it says that the following are dependent
on yum and need to be removed also:

docbook-dtds
ekiga
gdm
gnome-panel
gnome-pilot
kyum
pirut
scrollkeeper
synaptic
yum-utils
yumex

Since I want to keep synaptic, how should I go about this? Remove synaptic with
the apt CLI, then add it back later?

I'm switching to synaptic because yum has never worked properly in FC5, and I
have had good experiences with synaptic in the past.