Bug 211254
| Field | Value |
|---|---|
| Summary | apt-get segfault |
| Product | [Fedora] Fedora |
| Reporter | Eli Wapniarski <eli> |
| Component | apt |
| Assignee | Axel Thimm <axel.thimm> |
| Status | CLOSED DUPLICATE |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| Severity | medium |
| Priority | medium |
| Version | 5 |
| CC | davej, extras-qa, imtiaz.rahi, jerome.benoit, pierre-bugzilla, pmatilai |
| Keywords | Reopened |
| Hardware | i686 |
| OS | Linux |
| Fixed In Version | 0.5.15lorg3.2-9 |
| Doc Type | Bug Fix |
| Last Closed | 2007-02-05 17:29:11 UTC |

Description
Eli Wapniarski
2006-10-18 05:22:40 UTC
I've rebuilt the source RPMs for 4 computers. Three of them seem to have no problem with apt-get; the fourth one continues to segfault. I expect that in the not too distant future, the others will continue to segfault. I will let you know.

Please add some more information about the segfault, for example backtraces with the debuginfo package installed. For self-rebuilt packages you need to use the debuginfo package that you built, otherwise use that from the repos.

OK.. I have the debug package installed. But when running apt-get update I still get a segmentation fault, but without additional info. How do I generate or find the backtrace info?

With the debuginfo package installed, you need to run it under gdb:
# gdb apt-get
(gdb) run update
...
and when it crashes, get the backtrace:
(gdb) bt
Then copy-paste the full gdb session here.

Oh and btw, I'd prefer getting the backtrace from the FE built package rather than self-rebuilt versions to eliminate unnecessary variables from the equation.

I would be very happy to comply with the request to get debug info from the prebuilt FE packages, but the debug packages are not supplied. I will let you know soon how things go.

Here is the backtrace

Reading Package Lists... 0%
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1208555024 (LWP 24180)]
0x0054eb43 in strlen () from /lib/libc.so.6
(gdb) bt
#0 0x0054eb43 in strlen () from /lib/libc.so.6
#1 0x0095d7d6 in std::string::compare () from /usr/lib/libstdc++.so.6
#2 0x00c4a4e1 in rpmPkgListIndex::FindInCache (this=0x9526f38, Cache=@0xbf8a761c) at /usr/lib/gcc/i386-redhat-linux/4.1.1/../../../../include/c++/4.1.1/bits/basic_string.h:2200
#3 0x00cb4fa0 in CheckValidity (CacheFile=Variable "CacheFile" is not available.
) at pkgcachegen.cc:654
#4 0x00cb55b9 in pkgMakeStatusCache (List=@0xbf8a7884, Progress=@0xbf8a7918, OutMap=0xbf8a7a18, AllowMem=false) at pkgcachegen.cc:789
#5 0x00c9e1e1 in pkgCacheFile::BuildCaches (this=0xbf8a7a18, Progress=@0xbf8a7918, WithLock=false) at cachefile.cc:74
#6 0x00c9e304 in pkgCacheFile::Open (this=0xbf8a7a18, Progress=@0xbf8a7918, WithLock=true) at cachefile.cc:94
#7 0x0806543a in CacheFile::Open (this=0xbf8a7a18, WithLock=true) at apt-get.cc:100
#8 0x080566c1 in DoUpdate (CmdL=@0xbf8a829c) at apt-get.cc:1748
#9 0x00c2506b in CommandLine::DispatchArg (this=0xbf8a829c, Map=0xbf8a8210, NoMatch=true) at contrib/cmndline.cc:340
#10 0x0805dbc5 in main (argc=2, argv=Cannot access memory at address 0x4
) at apt-get.cc:3312
#11 0x004fa4e4 in __libc_start_main () from /lib/libc.so.6
#12 0x0804d221 in _start ()

OK.... Just blind as a bat. I did not see the debug folder in extras. So I reinstalled all packages from FE including the debug folder. I got exactly the same results.

OK --- I just downloaded, compiled and installed the latest development release apt-0.5.15lorg3.90 from apt-rpm.org. On the machines that were giving me trouble, apt-get seemed to work as expected at least once. I will let you know next weekend if it continues to run trouble free, unless new packages are provided. If that happens, I will report on how apt-get works with the new packages when provided. I appreciate very much the work that you are doing.

I cannot reproduce it, neither on FC5, nor on FC6. I wonder if your metadata under /var has some trouble/corruption. Could you nuke that? Best done by uninstalling apt, removing /var/cache/apt and /var/state/apt and reinstalling apt.

I can't reproduce it either, and the traceback suggests, like Axel says, that there's something very strange about the metadata. What repositories are in use on the systems where apt crashes? (sources.list and sources.list.d contents)

Funny...
Axel, I did as suggested, and everything seemed to work at least once. I'm sure there will be ample opportunity to test things out with kde-redhat updating from 3.5.4 to 3.5.5. I will let you know by the end of next week if things continue to function as expected. The question remains: how do three computers out of five have problematic metadata? Hmmm.

Panu - It doesn't crash while downloading package data. It crashes at the very beginning (before reading the data), during the "Reading Package Lists" stage, or during the "Building Dependency Tree" stage.

Oh.. one other thing maybe I should mention: I am getting a few packages from Livna as required by kde-redhat. They use repomd exclusively, having dropped support for traditional apt-get.

Eli, can you attach (or make somehow accessible) the exact contents of /etc/apt/sources.list and the sources.list.d directory on a system where apt crashes? According to the traceback, the crash occurs on an old-style apt-rpm repository, so that rules out all the repomd repositories such as FC+FE, Livna and kde-redhat.

Created attachment 139053 [details]
apt sources
Here you go
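
For anyone else asked for the same information, the repository configuration requested above can be gathered in one go. This is only a minimal sketch, not something posted in this bug: the function name `dump_apt_sources` and the header format are made up for illustration, and it assumes the configuration lives under /etc/apt as on the systems discussed here.

```shell
#!/bin/sh
# Hypothetical helper (not part of apt): print every apt source definition
# under a config directory (default /etc/apt), each preceded by a header
# naming the file it came from.
dump_apt_sources() {
    confdir="${1:-/etc/apt}"
    for f in "$confdir/sources.list" "$confdir"/sources.list.d/*; do
        # Skip an unexpanded glob and anything that is not a regular file.
        [ -f "$f" ] || continue
        printf '==> %s <==\n' "$f"
        cat "$f"
        printf '\n'
    done
}
```

Running `dump_apt_sources > apt-sources.txt` would produce a single file that can be attached to a report.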
Panu, don't you also need a tarball of Eli's /var/*/apt contents to reproduce it? If yes, then Eli please make them available through some URL instead of attaching them to the bug :) (but wait to see if Panu really needs them, maybe he doesn't)

I was basically hoping for an easy reproducer with just the info about repositories :) Alas, no such luck. So yes, to further track this I'd need the following bits from a system that crashes:
/var/cache/apt/*.bin
/var/lib/apt/ (can be at /var/state/apt in some cases)
contents in their entirety, although the problem is most likely in the *.bin files.

The *.bin files are the key to pretty much everything in apt, and quite often just removing them fixes various more-or-less mysterious problems, which would seem to be the case here as well according to comment #12. So, after backing up the current cached files, do the cleanup steps on each problematic box and let's see if that helps. Even if that cures the problem, the corrupted cache data is interesting to me, as garbage data shouldn't segfault, just error out cleanly.

Sorry Panu. Those files were cleaned way back by comment #12. If this should happen again, I will indeed send the files. Axel, could you keep this open for a week? Like I said, in about a week's time, if everything is OK, I will post a comment to indicate whether all is well or not.

No, I won't close it. I'm lowering the severity as it seems to only affect your system (and there are many apt users still). If you find that there is no issue anymore you can close it, too. If the issue doesn't pop up again or only pops up on one system, I suspect that you may have some bad RAM somewhere corrupting the cache data. BTW some people sync their contents of /var/*/apt to save some bandwidth; don't do it, otherwise if something eats up your bits on one system you mirror the issue to the other healthy systems, too.

Created attachment 139074 [details]
/var/lib/apt and /var/cache/apt
It happened again
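
The "back up the cached files before cleaning" step requested earlier could look roughly like the following. This is a sketch under assumptions rather than anything from the thread: `backup_apt_state` is a hypothetical name, the default paths are the ones mentioned above (/var/cache/apt and /var/lib/apt), and systems that keep state in /var/state/apt would pass that directory as the third argument.

```shell
#!/bin/sh
# Hypothetical helper: archive apt's cache and state directories into one
# tarball so the possibly corrupted files survive the cleanup.
# Usage: backup_apt_state DEST.tar.gz [CACHE_DIR] [STATE_DIR]
backup_apt_state() {
    dest="$1"
    cache_dir="${2:-/var/cache/apt}"
    state_dir="${3:-/var/lib/apt}"
    # Rebuild the argument list with only the directories that exist.
    set --
    for d in "$cache_dir" "$state_dir"; do
        if [ -d "$d" ]; then
            set -- "$@" "$d"
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "backup_apt_state: nothing to back up" >&2
        return 1
    fi
    tar -czf "$dest" "$@"
}
```

Taking the backup right after a crash, and before any rm -f, matters here because the interesting data is the corrupted cache itself.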
I'm getting pretty convinced that this is a memory issue and it may not be an apt problem per se. However, I'm seeing this happen only with apt. And it's only been happening since the last kernel update. So, I suspect that there is something buggy about the latest kernel release. But how do I figure out where? I'm dealing with 4 FC5 boxes. Three of them have 512 MBytes of RAM; the 4th has 1 GByte. On the 1 GByte box I have yet to see this problem. On the other three boxes, this is a continuing and ongoing nag. Running RPM commands by themselves does not seem to be a problem, only running apt.

Apt uses quite a bit of memory, especially with repomd repositories, so it's possible it triggers something more easily than others. Do you remember at which kernel update this started happening, and if you downgrade the kernel back to some older version does it actually stop this from happening? Also, is there anything even remotely relevant in /var/log/messages from the time when this corruption happens?

Sorry, I haven't had a chance to look at your cache data yet, been a bit busy with other things :-/

It'll be a while before I can get back to this. I'm trying to upgrade my main desktop from fc5 i386 to fc6 x86_64. It's a real nightmare. I will try to get back to this within a week, that is if I don't need to reinstall my desktop. The kernel that is currently installed on the machines that are giving me trouble is 2.6.18-1.2200.fc5 i686. One correction to make: one of the computers has not 512 MBytes but rather 256 MBytes. And before I can use apt-get successfully on this machine, I have to restart the computer and run
rm -f /var/lib/rpm/__db*
rpm --rebuilddb
because the rpmdb gets corrupted. Also, I have to remove the bin files from /var/cache/apt and sometimes the files in /var/lib/apt/lists. This can't be good for this machine.
Eli

OK... I'm back. Had a bit of trouble getting the upgraded machine to function properly but that's another bug.
In the meantime I've had a little time to do some research, and found that the problem does not affect only apt-get, but rpm and yum as well. A cursory search at this bugzilla site on rpm segfault or yum segfault reveals several reports. An example would be https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=213963 I really would like to help getting this problem solved. Thanks for your patience.

There are known bugs in yum on FC6 due to opening/closing rpmdb too often, or one can argue that yum only exhibits bugs that were previously in the rpmlib code but undetected. Either way you look at it, one would have to uninstall yum, fix up any broken rpm metadata (e.g. rm -f /var/lib/rpm/__db*; rpm --rebuilddb) and only use apt for a while. Note that yum is automatically invoked by applets, cron jobs, daemons and the like if present, so you really need to uninstall it for the sake of testing. Can you please do so for say a week and report back with positive or negative results? Thanks!

OK... Here's the situation. I don't like yum and if I can avoid using it I do. And I have been avoiding it for several years. I don't invoke the automatic scripts, because I like to do kernel updates manually. And, on occasion I like to change runlevels in order to ensure that I get trouble free updates (another story). I deal with 5 fc computers. Three of them are fc6 and 2 are fc5. The fc6 computers function as desktops while the fc5 function as servers.

FC6 computers, memory composition
------------------------------------------------
1 GByte
512 MBytes
256 MBytes

FC5 computers, memory composition
----------------------------------------------
512 MBytes
256 MBytes

The only computer that is giving me consistent headaches is the 256 MByte FC5 machine. It is running several services: gateway, dns, mail gateway, ftp, http, and ssh. So, it uses a considerable amount of memory, but I have never, until recently, had a problem with apt-get.
In order to fix the problems, for the last month or so, I've had to:
rm -f /var/lib/rpm/__db*
rm -f /var/cache/apt/*.bin
rm -f /var/lib/apt/lists/*.*
Then, reboot the computer, run
rpm --rebuilddb
then
apt-get update
apt-get upgrade
Usually worked, but a real pain in the butt.

A couple of days ago, that didn't work either. So, I tried yum, because I need to keep the system patched, especially with security updates, since this computer constantly faces the internet. On the first invocation, yum segfaulted. After simply running
rm -f /var/lib/rpm/__db*
and running yum update again, everything worked the way it was supposed to. However, I started getting packages from repos I did not want the packages to come from. And this is why I prefer apt-get to yum. I love the pinning feature available in apt-get. It saves my brain a great deal of confusion.

Anyway. On the other machines I am still able to run apt-get. When I see the segfault, simply running
rm -f /var/cache/apt/*.bin
fixes the problem, and then
apt-get update
apt-get upgrade
works. I have never had a problem like this until recently, which makes me suspect the upgrading of critical libraries as the main culprit. This brings me back to my suspicion that the main instigator is a kernel update. Why? Because, if I recall correctly, one of the updates to the 2.6.18 kernel changed the way the kernel manages memory and, if I'm not mistaken, a segfault is some kind of screwup in memory.

Oh... I've copied this message over to the yum bug.

OK .... I just installed apt 0.5.15lorg3.2-8 for fc5 and fc6; so far so good. On the machine giving me the most trouble, apt-get worked on the first run. If I continue to run trouble free, I will let you know. If of course I continue to get segfaults, then I will let you know sooner. Thank you so much for your work. Oh... one other thing. Looks like synaptic needs to be rebuilt on fc6 platforms.

(In reply to comment #28)
> OK .... I just installed apt 0.5.15lorg3.2-8 for fc5 and fc6 so far so good.
There were only ppc related changes in this release, so this will not work better/worse in your context than the previous one. :/ Please do the testing with only rpm involved as mentioned in bug #213963.

Sorry, Axel, but this is about as far as I can go with the previous version. I have provided a traceback, requested information and all the observations that I know how to provide. There is nothing left for me to test. The previous version under current circumstances is useless for me. I have to move on. Hopefully the current version fixes the problem since, no doubt, it was compiled against the most recent libraries. Like I said, I will let you know how things go.

You indicated in this and other reports that this seems to stem from other non-apt related parts of the system (yum and/or rpm/kernel), which suggested that apt may never have been at fault. Therefore the isolated testing is necessary to see whether apt was ever at fault. I completely understand lack of time, so if you can't proceed with debugging, let's close this as WORKSFORME, as there hasn't been anyone (including Panu and myself) that could reproduce it in the sense of apt being responsible for corrupting rpmdb. I'll put it into NEEDINFO for now.

Well Axel, no joy. I attempted a second run on the "more trouble than it's worth" machine running apt-get, and I get segfaults and the appearance of a corrupt rpm database. I guess I will have to use yum (yech) on that computer. Please, you, or anyone else, let me know if you have anything else to suggest that I can do to troubleshoot or help to resolve this issue.

I already suggested a way to help in bug #213963 comment #4. Should the rpm-stress test fail, then the bug is in rpm/kernel. If it succeeds, then one needs to check one depsolver at a time (e.g. only apt-get installed or only yum installed) by a similar stress test. Also, did you check your memory hardware?
What I mean to suggest is that there may be a relationship between the new memory management scheme in the newer kernels and the problems that I have experienced. They appear to be significantly more severe when using apt-get. I did not test memory, because I find it improbable that I am experiencing a hardware difficulty when the problem appears on five separate computers, all of which experience the same difficulty to varying degrees depending on the amount of memory and the load on memory. The most severe is a gateway server with 256 MBytes RAM running FC5. I have already sent all the relevant information, and it is attached to this bug report. I have yet to hear about the results from Panu.

I do not believe that the RPM database is corrupted by apt-get, but it appears that way to apt-get. The reason that I say this is that if I experience the rare segfault with yum, then removing the lock files (__db*) as suggested in the link I provided about the problem in yum and then running yum update makes everything work OK. Only, I'm getting packages from repos that I don't want to get them from. This is the reason that I prefer apt-get (pinning).

I have tried to run apt-get with only headers coming from Freshrpms (core, updates, extras, freshrpms). Same results. If I use the repomd configuration linking to Fedora itself, then I may as well use yum, because then all pinning goes out the window since, as far as I know, pinning is not supported with repomd data.

It's not that I don't have the time to continue. I'm terrified that I may actually do irreparable damage to the RPM database if I continue to use apt-get on that machine. I'm willing to continue to troubleshoot, but let's not go around in circles, making me write the same detailed report over and over again. I do not have the time for that :).

Eli, it is important to remove yum if you want to do any testing with rpm and/or apt-get. yum is called by applets, daemons, cron jobs and who knows what else.
So you may think that you're not using it, but in fact you do. That's why the instructions on bug #213963 comment #4 asked you to remove both yum and apt. And grepping this bug report for yum shows that you've been using it to cross-check the results/failures you have been encountering all along, so the results you quote are a mixed use of rpm/apt/yum and we can't put a finger on one of them.

In order to see which component is at fault (rpm/kernel, apt or yum), you need to isolate the problem. You suspect kernel/rpm, so please follow the instructions and start with a good rpmdb and w/o any apt/yum/etc. tools around. If you manage to break the rpmdb, then you'll have proved that it's rpm/kernel that is the rogue character. Otherwise you would have to add apt to it and repeat. If it breaks now, it's apt. And if it doesn't, it was yum all along.

Please explain to me how in the world I will be able to maintain the computer without yum or apt-get? How in the world am I to determine if the rpmdb is good or not? I have used apt-get exclusively for several years. I have never had yum-updateonboot, and yum does not exist in any of the cron jobs on the most troublesome computer. No other process that I'm aware of does an automatic update. I don't like and don't use gnome except for a few applications, primarily synaptic. Which daemons call yum? Maybe I can configure them out of running automatically?

Eli, I didn't imply staying w/o yum/apt for the rest of your life :) The rpmdb stress tests wouldn't take longer than 5 minutes each in any testing of yours, since the bug seems to hit you so often with regular updates. Anyway, we're not really pushing this any further; maybe Panu will have something to say when he looks at the apt cache, or maybe the bug will vanish once the pure yum bugs elsewhere in this bugzilla get fixed. If a yum/kernel update makes your problem vanish, please note this in this bug.
For reference here are some bugs in rpm/yum of which this may be a duplicate:
bug #203233
bug #206275
bug #211254
bug #212504
bug #213963
bug #214129

The long and the short of this is that this is not an easy "ahhah, there's the NULL pointer dereference" type of bug, so don't expect it to be fixed "just like that". Something is causing corruption in apt's main data structure (which is a memory-mapped cache file), and whether it's the combination of 2.6.18 kernel + low memory + apt-rpm's mmap() usage patterns or something else remains to be seen. I'm reinstalling my 32bit testbox now and will try to see if I can (eventually) reproduce it by limiting available memory.

Thank you Panu. I'm looking forward to hearing your results.

*** Bug 214846 has been marked as a duplicate of this bug. ***

*** Bug 217707 has been marked as a duplicate of this bug. ***

Since this is now the central bug for tracking this issue, here are my findings so far: I managed to reproduce the second backtrace here. The steps:
- install fresh FC6-i386, pretty much default installation
- boot with mem=256M, disable swap
- # apt-get update
- # apt-get dist-upgrade
- the dist-upgrade died in the middle of the transaction after the first package upgrade
- consecutive apt-get dist-upgrade runs crash in FindInCache

After rebuilding the apt cache it's not segfaulting anymore (and I can't reproduce that at will, so it doesn't happen *always*), but dist-upgrade with over a hundred packages keeps exiting "normally" after just one package upgrade (rpmlib calls exit(0) at some signal apparently, I'll need to talk to JBJ about that). After re-enabling swap, dist-upgrade appears to continue normally now.

So, it would appear that this is at least related to systems being tight on memory. Why this has only appeared now (is it a matter of repositories getting bigger, kernel changes or what) remains to be seen.
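
When comparing a box against the low-memory reproduction setup above (mem=256M, swap disabled), it may help to record RAM and swap totals alongside each test run. The following is a minimal sketch only; `mem_summary` is a hypothetical name, and it assumes the usual Linux /proc/meminfo layout with `MemTotal:` and `SwapTotal:` lines given in kB.

```shell
#!/bin/sh
# Hypothetical helper: summarize total RAM and swap in MB from a
# meminfo-style file (default /proc/meminfo), for pasting into a report.
mem_summary() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        /^MemTotal:/  { mem = $2 }   # value is in kB
        /^SwapTotal:/ { swap = $2 }  # value is in kB
        END { printf "ram_mb=%d swap_mb=%d\n", mem / 1024, swap / 1024 }
    ' "$meminfo"
}
```

A swap_mb of 0 would indicate the "disable swap" condition from the steps above is in effect.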
I should have time to look properly at this today with wife and the kid out for the evening :)

Now, a couple of things that *might* help, and on which I'd like to hear test results (after clearing the various potentially corrupted caches):
1) Try adding (temporarily) more swap to the system, for example just double what you have now. A swapfile will do just fine as it's intended for just a temporary check/bandaid.
2) Try setting 'RPM::PM "external";' in /etc/apt/apt.conf. That causes apt to use an external rpm process to run the transactions, which has the side-effect of essentially splitting the memory usage between two processes, making the kernel's OOM killer less trigger-happy to terminate the upgrade process.
1) is the test I'm more interested in.

(duh, previous post while logged in to "wrong" account, sorry about that) One thing I forgot to mention: try to keep an eye on how apt installs/upgrades finish - you should always see "Done." at the end of the "Commiting changes..." output. If you don't, then it has died abnormally in the middle of the transaction (because rpmlib has called it quits without giving apt a chance to do anything about it). That abnormal exit is at least one possible cause for this problem.

Thanks Panu for what sound like very reasonable suggestions and tests. I will be trying these things first thing in the morning. It's been a very long week and I need some food and sleep. I will let you know.

I just ran test 1) as per comment 44. Apt-get worked and exited normally with "Done." I added the temporary swap file (512 MBytes) to fstab for the time being. This was a first run. One more bit of behavior that I have noticed: after cleaning out the caches and lock files, then rebuilding the rpm database, I usually get things to work for one run. After that run, if I immediately run apt-get update / upgrade, there are no segfaults. Mind you, there is nothing to upgrade. However, when the next set of packages is ready to be upgraded, then apt-get will segfault.

As per test 1).
I had the opportunity last night to make the second consecutive run with apt-get. The problem continues. So providing more swap memory didn't help. I will be testing 2) next.

Test 2) went better. I first had to
rm -f /var/lib/rpm/__db*
rm -f /var/lib/apt/lists/*.*
rm -f /var/cache/apt/*.bin
I was able to run apt-get update / upgrade without a segfault and without having to reboot and run rpm --rebuilddb. Mind you, there were no packages to upgrade. So, the test is incomplete. I will let you know how things go once there are new packages available.

As per test 2), the problem persists.

Ok, pretty much expected; I have been able to reproduce the problem with gobs of memory available, so it apparently wasn't related to that after all. I've gotten a bit further in my investigations now: it IS related to rpmdb, but just exactly how is a bit of a mystery. The crash occurs because something causes a pointer in apt's cache, which should point to a string containing the rpm database path, to be NULL, while the rpm database itself seems to be intact. That's a side-effect of *something* - what exactly I dunno yet. The nasty thing here is that the segmentation fault happens on the run *after* the damage has been done already, so debugging it is somewhat like post-mortem analysis :)

Well well well, this also seems to be happening on Debian apt:
http://ubuntuforums.org/showthread.php?t=266566
https://launchpad.net/distros/ubuntu/+source/apt/+bug/61708
http://bugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=383223
http://bugs.donarmstrong.com/cgi-bin/bugreport.cgi?bug=355047
All of those are reasonably recent and the crash occurs in the very same place (if you ignore the used repository type) - something has corrupted (one of) the index file names in the cache. Could well be a long-standing bug in apt cache handling, only triggered by some of the changes in the latest kernels. One possibility could be some of the address-space randomization things, just a wild guess though.
Looking forward to hearing that you've found and fixed the problem. As always, I remain available for testing purposes.

*** Bug 219134 has been marked as a duplicate of this bug. ***

Dunno, but this sounds eerily familiar: http://article.gmane.org/gmane.linux.kernel/477324
- Linus' test program on the FC6 2.6.18-1.2849.fc6 kernel behaves the way he expects 2.6.19 to work
- For 2.6.18 you need to be unlucky and under memory pressure
- "Some data on mmaped file appears zeroed" is exactly the kind of corruption that triggers this crash
I would really appreciate it if the people who are hitting this with any sort of regularity could try downgrading their kernel to something older (for example 2.6.17-1.2187_FC5 with Linus' test doesn't exhibit zeroes in the middle) for a while and see if you can still reproduce the crash eventually. Meanwhile I'll have a look at apt's mmap code and see if there's anything resembling the "trigger pattern."

Sorry Panu, downgrading the kernel is the one thing that I cannot comply with :( The computer that's giving me the most problems is a server facing the internet and I cannot, in good faith to the company that I'm working for, compromise kernel security for this.

Sure, I'm not expecting anybody to mess around with a production environment to get this sorted out. If others can test older kernels that would be much appreciated.

Hi Panu, I am going to test this issue with an older kernel. I am using FC6 and going to use the 2.6.17-1.2187_FC5 kernel as you have mentioned. I hope to come up with some results within a few days.

Ok, thanks. I've been reading through the long, long thread on the linux-kernel mailing list about the mmap file corruption issue I mentioned in comment #55, and oh boy is it hairy. Nobody really knows whether it's really just an application bug only triggered by recent kernels, a kernel bug triggered by some rare application usage patterns, or a combination of both.
The short summary however is that there are basically two applications people have seen corruption with: (Debian) apt and a bittorrent client. The mmap code is identical in Debian apt and apt-rpm (it's been unchanged for years AFAIK), so that kind of confirms that this has indeed been triggered by something in recent kernels, like Eli suspected early on.

Created attachment 144319 [details]
Band-aid patch for the cache corruption
While looking for the real cause and solution, here's a band-aid patch to help
the situation. The patch does NOT fix the real issue, it only detects a
specific symptom and forces a cache rebuild when corruption is detected and
issues warnings. That effectively cures the segfault unless you're very
unlucky.
The corruption seems to always happen in the area regarding rpmdb itself, which
is special in the sense that it's always the last one of all "repositories" to
be processed. That's another hint towards some of the findings/speculations on
lkml.
BTW if somebody can capture a full strace of an 'apt-get update' run where the
segfault *initially* happens (the crashes afterwards aren't that interesting)
that might have some interesting data in it.
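
A capture along the lines requested above could be wrapped as follows. This is a rough sketch, not a command given in this bug: `trace_apt_update` and the `DRY_RUN` switch are made up for illustration, while `-f` (follow child processes), `-tt` (timestamps) and `-o` (write the trace to a file) are standard strace options. Since the interesting run is the one where the corruption first happens, the wrapper would have to be used for every update until a crash follows.

```shell
#!/bin/sh
# Hypothetical wrapper: run 'apt-get update' under strace and keep the log.
# With DRY_RUN=1 it only prints the command it would run.
trace_apt_update() {
    log="${1:-apt-get-update.strace}"
    # -f follows forked children, -tt timestamps each call,
    # -o writes the trace to the given file instead of stderr.
    set -- strace -f -tt -o "$log" apt-get update
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Using a log file name that includes the date would keep one run from overwriting the trace of an earlier one.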
Do you want this to be added to the package? Perhaps the mmap issue is also present in rpm itself? That could explain the yum/rpm bug in fc6.

Might as well add it to the package; besides avoiding crashes in rpm-related code (which is always a bit nasty) it should help collect people seeing the problem to this bug :) Berkeley DB does use mmap, so I suppose it's at least possible the same thing affects rpm itself as well.

Today I used apt-get and synaptic on my kernel "2.6.18-1.2798.fc6" and astonishingly they did not crash. Looks like the extras and updates repo files of Fedora are good (ok !!!). Previously, apt-get update crashed while working on the extras / updates repo. So, now I am going to wait for a new crash to happen on "2.6.18-1.2798.fc6" and then will test with the "2.6.17-1.2187_FC6". That kernel is ready now.

So far so good. The bandaid patch allowed me to get at least one run with apt-get. I will let you know if things don't work out.

Do note that with the bandaid patch, you'll get loud warnings if the corruption triggers; it just doesn't (or shouldn't ;) crash anymore because of it. So people who used to see the crash should see "Cache corruption detected, band-aid applied" now just as often as they did see the crashes.

Yet another thing people can try: some folks on lkml reported that mounting the filesystem in question (in apt's case wherever /var is located) with the data=writeback option (assuming an ext3 filesystem is used) seems to cure the corruption issue. If people can try that and see if they still get crashes (or with the bandaid patch, warnings about corruption) or not, that'd be a useful datapoint as well. Check 'man mount' for what the option does in detail and, if on a production environment, whether the implications matter to you or not.
Oh and remember, a single successful run (meaning no crashes and no warnings) doesn't mean anything at all; this doesn't trigger anywhere near 100% reliably, so it's going to take quite a bit of time to be convinced it (be it the mount option or whatever) made a real difference.

Ok, the band-aid patch helped to get apt-rpm running on my AMD Duron test box, but maybe this bug is related to a VM kernel bug (I don't have the knowledge to evaluate this). See http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7658cc289288b8ae7dd2c2224549a048431222b3 Thanks.

Yup, I'm fairly convinced by now it's that kernel VM bug that's been hitting apt(-rpm). Now we just need to verify the above kernel patch cures the crashes (or with the bandaid patch to apt-rpm, the warnings). If somebody can test it, that'd be great :)

Just FYI, there's a kernel update coming out "next week or so" including a fix for the mmap file corruption issue we have here, so this should get resolved soonish: https://www.redhat.com/archives/fedora-devel-list/2007-January/msg00084.html

I haven't been able to reproduce this since updating to the latest kernel (2.6.19-1.2895.fc6). Dunno if that's available for FC5 though. Mind you, it's possible you'll see the warning *once* after rebooting to the updated kernel: if the previous run on the old kernel has corrupted the cache, it'll hit you the next time you run apt, fixed kernel or no. I'd say it's best to force the cache rebuild ('rm -f /var/cache/apt/*.bin') after booting to the new kernel just in case. From my POV I consider this case closed. Axel, I suggest you leave the bandaid patch in place for FC5 and 6 as there could be lots and lots of people running those with older kernels; for rawhide it can go at this point I think.

I can pretty much confirm that the kernel fixed the problem on my fc6 machines. However, as Panu pointed out, there is yet to be a kernel update for fc5. And one in particular is giving me no end of headaches.
Anyone have any idea when there will be a kernel update for fc5 incorporating the fix?

Dave, any idea when FC5 will get an updated kernel fixing the mmap corruption thingy (which this bug is all about)?

> From my POV I consider this case closed. Axel, I suggest you leave the bandaid
> patch in place for FC5 and 6 as there could be lots and lots of people running
> those with older kernels, for rawhide it can go at this point I think.
I'll keep the patch, wouldn't it even be nice to keep it upstream? It's a
failsafe path that is usually not taken unless something screws up and such a
net is nice :)
Now, how do I close this bug? It wasn't FIXED in apt, but it's also not NOTABUG.
Since it was fixed elsewhere, it's also not WONTFIX or CANTFIX. Technically
I'd move it to the kernel and close it there, but I don't want the kernel guys
to be confused.
I'll try fixed in CURRENTRELEASE, since there is the bandaid fix for FC5, too.
Axel. The "not a bug" shouldn't be closed quite yet. Maybe it should be transferred over to the kernel boys since we still do not have a fix for fc5. Which is why I opened the bug in the first place.

Doesn't the bandaid patch fix any issues with FC5? Agreed, it is not fixing the cause, but it is a workaround fixing the outcome, i.e. the bug is dealt with.

As Panu wrote in an earlier post, it depends on how memory stressed the system is. And I can confirm that this indeed is the case. One of my fc5 boxes sometimes requires several rounds of

rm -f /var/cache/apt/*.bin
rm -f /var/lib/apt/lists/*.*
rm -f /var/lib/apt/lists/lock
rm -f /var/lib/rpm/__db*

before apt-get will complete its cycle successfully. The thing about the bandaid is that eventually apt-get will work on the box giving me my biggest headache without me having to reboot the system (most of the time).

In other words, you still get the bug on your FC5 system even though the bandaid is supposed to work around/fix that on the fly? Perhaps the bandaid patch does not always detect the corruption. Panu, can something really slip past the bandaid fix? If you have installed the latest apt (that contains the bandaid) and the system still gets chewed (which comment #75 suggests), please reopen the bug.

Whether the bandaid patch reliably detects and corrects the problem is irrelevant (it's called bandaid for a reason :) There's a real fix to the problem; getting an updated kernel to the users is the only thing that matters anymore. That's what I meant with the "from my POV this case is closed" comment: no amount of bandaid in apt is going to make it reliable if the kernel can't be trusted to keep our data intact. Eli, either the bandaid simply isn't working for you or there's a misunderstanding here: you only need to do the rm -f stuff if you get segfaults (which means the bandaid didn't help); otherwise the warning is just that: a warning about this issue being present on the system.
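
For anyone who has to repeat the rm -f cleanup quoted in this thread, it can be wrapped in a small shell function. This is only a sketch under assumptions, not something shipped with apt: the name reset_apt_caches and its optional root-prefix argument are made up here for illustration (the prefix lets you try it on a scratch directory first). The four paths are exactly the ones from the comment. It should be run as root while no apt, synaptic, yum or rpm process is active.

```shell
#!/bin/sh
# Sketch: force apt-rpm and rpm to rebuild their caches after a segfault
# caused by suspected cache corruption.
# reset_apt_caches and its ROOT argument are hypothetical helpers, not part
# of apt; the paths are the ones quoted in this bug report.
reset_apt_caches() {
    root="${1:-}"   # optional prefix, e.g. a scratch directory for a trial run

    # apt's binary package caches
    rm -f "$root"/var/cache/apt/*.bin

    # downloaded package lists and the stale lock file next to them
    rm -f "$root"/var/lib/apt/lists/*.*
    rm -f "$root"/var/lib/apt/lists/lock

    # Berkeley DB environment files of the rpm database
    # (the Packages database itself is left alone)
    rm -f "$root"/var/lib/rpm/__db*
}
```

Calling reset_apt_caches with no argument operates on the real /var, after which apt-get update should rebuild everything from scratch. Per the comment above, this is only needed after an actual segfault, not after a mere bandaid warning.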
Hi guys, I made a request at https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=227194 to find out when a new kernel release will be available for fc5.

*** This bug has been marked as a duplicate of 214495 ***

OK... I just noticed that there are 2.6.19 kernels in fc5 updates-testing. I just installed one on one of my fc5 boxes and it booted OK. I will be seeing how it goes for a couple of days before I try to get it installed on the pain in the ass box, unless of course new kernels are made generally available. For those of you wanting to install it, remember that the kernel is a "testing" kernel, so rpm -ivh is in order just in case you need to fall back to the older kernel.

Even better, installed the new, generally available 2.6.19 kernel for fc5. Things seem to be working. I'd give it a couple more real updates, and if there are no more problems then I think that we can call this genuinely done.

(In reply to comment #36)
> Eli, it is important to remove yum, if you want to do any testing with rpm
> and/or apt-get. yum is called by applets, daemons, cron jobs and who knows what
> else. So you may think that you're not using it, but in fact you do. That's why
> the instructions on bug #213963 comment #4 asked you to remove even both yum and
> apt.
If I try to remove yum using synaptic, it says that the following are dependent on yum and need to be removed also: docbook-dtds ekiga gdm gnome-panel gnome-pilot kyum pirut scrollkeeper synaptic yum-utils yumex
Since I want to keep synaptic, how should I go about this?

Remove synaptic with the apt CLI, then add it back later.

I'm switching to synaptic because yum has never worked properly in FC5, and I have had good experiences with synaptic in the past.