Bug 476203
Field | Value
---|---
Summary | Filesystem/Memory corruption (broken shared libs after a while)
Product | [Fedora] Fedora
Component | xorg-x11-drv-ati
Version | 10
Hardware | All
OS | Linux
Status | CLOSED CURRENTRELEASE
Severity | urgent
Priority | high
Reporter | Tim Niemueller <tim>
Assignee | X/OpenGL Maintenance List <xgl-maint>
QA Contact | Fedora Extras Quality Assurance <extras-qa>
CC | airlied, awilliam, bugzilla.redhat.com, camilo, fbijlsma, fdc, jan.kratochvil, jfrieben, kernel-maint, mcepl, pizza, quintela, savelov, xgl-maint
Fixed In Version | xorg-x11-drv-ati-6.9.0-62.fc10
Doc Type | Bug Fix
Last Closed | 2008-12-21 01:13:28 UTC
Description
Tim Niemueller
2008-12-12 13:30:30 UTC
How do you know it's filesystem corruption? Do you get broken checksums when verifying RPM packages or some other indication that files are corrupted?

I am seeing the same now. Have not tracked it down though...
2.6.27.7-134.fc10.i686
Dec 14 12:35:26 aachen kernel: metacity[3128]: segfault at 0 ip 080aa093 sp bfd196a0 error 4 in metacity[8048000+82000]
Dec 14 12:35:35 aachen gnome-session[3120]: WARNING: Application 'metacity.desktop' failed to register before timeout

It's getting more confusing... I switched to runlevel 1 and ran fsck. It didn't show a single problem. I've also tried the -117 kernel and it has the same problem. The last time, Thunderbird would not start anymore (glibc detected a double free while scanning through the dictionaries or so; I just had a short strace but then needed the system...) and yum would not work ("Bad marshal data" or so; again, I only had a quick glance). A reboot fixed both of these problems.
My new suspicion is that it is not the filesystem that is corrupted, but rather the memory. I'm not sure, but I think I have seen this after a suspend/resume cycle. Could that have something to do with it? Additionally, my system often wakes up twice: it wakes up, goes back to sleep by itself, and stays on only when I wake it up again (it is put to sleep with Fn-F4). Another symptom, or unrelated?

After a wakeup I just had crashes of Firefox and Thunderbird. Additionally, a kernel oops was reported (see http://www.kerneloops.org/submitresult.php?number=140467). Is there a way I can recover all the oopses I sent? Is that logged somewhere? I had more in the recent past, different ones.

I'm also seeing frequent panics, shared library corruption and other symptoms similar to comment 3 on my x86_64 install (upgraded from F9). No suspend/resume involved, although they appeared after I shut down for the first time since upgrading (after running with no problems for about a week).
It's gotten to a point where the desktop is unusable: I can't run for more than 10 minutes without running into some corruption. The first sign that things are going wrong is usually that characters on the X desktop become corrupted, which also supports the suggestion that memory is being corrupted. Once this starts, even shutting down is a questionable process, as there are errors and stack traces all over the console.

Now that you say it: check boxes and radio buttons are corrupted in Firefox on my machine. Coincidence? It's also getting worse here. It's not every 10 minutes, but usually multiple times a day, and it seems to be almost certain after wakeup. Updating summary to the new hypothesis so others can find it more easily.

The newest one in the series of kernel oopses: http://www.kerneloops.org/submitresult.php?number=141735. Now this looks like it could cause trouble, with ext3 and kswapd involved. Any suggestions on this? What can be done to narrow it down?

Just after I posted the recent comment #7 my system froze. No more mouse movement, no nothing. The very last entry in the log was the following:
Dec 16 10:54:03 evilgenius kernel: mtrr: base(0xd9d1c000) is not aligned on a size(0xb48000) boundary
Any idea?

*** Bug 476403 has been marked as a duplicate of this bug. ***

I also experienced similar symptoms (shared library corruption, detected by rpm -V), and I got an oops at ext3_discard_reservation+0x27/0x8b
http://www.kerneloops.org/submitresult.php?number=139504

Confirming crashes on F-10.x86_64. I had 30+ days of uptime before and now it survives only several hours. It is not the kernel: it crashes even with the long-term stable/verified kernel-2.6.25.10-86.fc9.x86_64 on my box (and it also crashes with F-10 kernels). The machine is a T-60; I ran memtest86+ (F-10 CD) on it with no problem (3 passes OK).
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon Mobility X1400
Using LUFS as in the Comment 0.
Using T-60 as in the Comment 0, suspecting the ati driver, trying vesa now.
xorg-x11-drv-ati-6.9.0-61.fc10.x86_64

*** Bug 476668 has been marked as a duplicate of this bug. ***

On a fully updated F10 x86_64 system plus kernel-2.6.27.9-155.fc10.x86_64, I have noted random crashes, kernel oopses, and system freezes which I hadn't seen with recent "rawhide". Affected applications include epiphany, tomboy, and the clock applet. The system / desktop froze a couple of times or even restarted. I am using Adobe's x86_64 flash plugin, but I had problems after removing it, too. After updating to xorg-x11-server-Xorg-1.5.3-6.fc10.x86_64, the problems seem to have ceased though (maybe a coincidence ..). Interestingly, this system is also sporting an ATI video card, namely an X800 (R430) PCIe. Except for /boot and swap, all partitions/volumes are formatted as ext4.

Crashes happened here on xorg-x11-server-Xorg-1.5.3-6.fc10.x86_64 (Bug 476668). So far no crash with xorg-x11-drv-vesa-2.0.0-1.fc10.x86_64, although the testing period is not representative enough.

I've seen strange problems that I haven't found a root cause for yet. This is on a Dell Inspiron 6000 with Radeon x300 graphics. At boot time and when unsuspending it sometimes reports:
Dec 14 19:27:56 pricey kernel: Uhhuh. NMI received for unknown reason b0 on CPU 0.
Dec 14 19:27:56 pricey kernel: You have some hardware problem, likely on the PCI bus.
Dec 14 19:27:56 pricey kernel: Dazed and confused, but trying to continue
Firefox shows corruption of graphics (some character glyphs have 'noise'); other apps seem OK. Twice the disk subsystem seems to have locked up, with any attempt to load new commands giving IO error messages, although running applications like Firefox and vncviewer have continued working.
Sometimes the wireless seems more temperamental than usual (having to rmmod and modprobe the ipw2200 module and restart NM to get a connection). Because suspend / resume wasn't working I have been booting with 'nomodeset' and 'noapic', which have improved things slightly.

xorg-x11-drv-vesa-2.0.0-1.fc10.x86_64 works really reliably now; xorg-x11-drv-ati-6.9.0-61.fc10.x86_64 very randomly crashes the machine. I have now checked that xorg-x11-drv-ati-6.9.0-38.fc10.x86_64 ran stably for me for a month+, but it is a pretty old release. According to my tests, kernel was definitely not the right component.

I'm seeing this problem on my Acer Ferrari 4000 laptop (x86_64); it is rare that the machine survives more than a day of use without random app crashes. Even xterms and gnome applets are dying on me. The last cycle involved rpm/yum corruption, but the cycle before that involved anything that used sound. Rebooting always solves the crashes. So far nothing on disk (except the rpm database/tmpfiles) has ended up corrupted that I can tell.
I have two other F10 desktop machines (one x86_64, the other i686) that have been up for nearly two weeks with no problems at all. They're both running the proprietary nvidia driver, while the crashy laptop uses the xorg radeon driver for its ATI x700 PCIe adapter. Kernel modesetting is disabled due to the mass instability and slowdowns it caused.

With thanks to FD Cami, who pointed me in the direction of Bug 475555 (font corruption in Firefox) and the xorg-x11-drv-ati-6.9.0-62.fc10 build in Koji (http://koji.fedoraproject.org/koji/buildinfo?buildID=73971). That package seems to cure the problems of font corruption and suspend / resume (fast reliable resume again!). With that package and the kernel 2.6.27.9-155 (http://koji.fedoraproject.org/koji/packageinfo?packageID=8) kernel modesetting seems to work well on my hardware.

Hi Tim,
Could you test both kernel-2.6.27.9-159.fc10.x86_64 and xorg-x11-drv-ati-6.9.0-62.fc10 ?
Those are available in Koji :
http://koji.fedoraproject.org/koji/buildinfo?buildID=73971
http://koji.fedoraproject.org/koji/buildinfo?buildID=8
Thanks

I've tried the mentioned kernel, but it wouldn't boot. It seems that after LVM initialization it couldn't mount my (crypto) root. I'm trying -155 now as this was reported working.

No luck with -155 either. I get:
mount: error mounting /dev/root on /sysroot as ext3: Invalid Argument
I have to investigate what is going wrong here. Any idea?

Kernel is booting now. The problem was caused by the relatime option that I added to /etc/fstab for the root entry (does anyone know why that breaks?).
* Suspend/resume with KMS does not work.
* Suspend/resume w/o KMS worked. Will now test for a day to see if I get more crashes.
* Still see screen corruption around checkboxes. As soon as I try to take a screenshot the screen is refreshed and the corruption is gone. It usually shows up after scrolling. Only seen in Firefox up to now.

Tim, Solomon, Jan,
Could you try the xorg-x11-drv-ati update at :
http://koji.fedoraproject.org/koji/buildinfo?buildID=73971
and report back ? It contains a fix for font problems and is stable here on both my Radeons.

Just to chime in - it may be a "well, duh, OBVIOUSLY I did that!" kinda thing, but on a skim read of the initial report and all comments, I don't see any indications that:
a) you're all actually definitely suffering the same bug
b) any of you have done anything to eliminate hardware problems as a possible cause: for instance, running a basic memtestx86 check
c) any of you have done a test as simple as booting a Fedora 9 or Ubuntu or something live CD to see whether the issue is in fact truly something caused specifically by Fedora 10
In cases as vague and asymptomatic as this, I think the above set of tests is a fairly sensible thing to do.

Addressing items b) and c) in comment 24: at the time of my comment I had run several full passes of memtest86+, replaced the video card (with another of the same model), and run various hardware diagnostics for the CPU and other components in an attempt to rule out the more straightforward hardware problems. My system was upgraded from Fedora 9, which had been working flawlessly; I've since abandoned F10 and reinstalled F9, which has returned the OS to its former stable state.
Addressing item a): the two common elements among most of the "crashy" reports in this bug appear to be ATI graphics cards (which I also have but didn't note in my comment, since it appeared to be a kernel issue at the time) and power-related operations. While they may not be caused by the same bug, the shared library corruption and other symptoms appear to be similar enough that they've led to the suggestion of kernel and driver patches which may have addressed the issues.
Because I've reverted to Fedora 9 I'm not in a situation where I can try the patched RPMs, unfortunately, but if it will help I can provide details of the system hardware.

That definitely suggests it's something genuinely in the software and specific to F10, then - but up till this point, there was nothing to confirm that :). Which is why I asked.

Where can I get kernel-2.6.27.9-159.fc10.x86_64? I could not find it at the link that was provided.
thanks

(In reply to comment #24)
> b) any of you have done anything to eliminate hardware problems as a possible
> cause: for instance, running a basic memtestx86 check
Tested in my comment #11:
> ran there memtest86+ (F-10 CD) and no problem (3 passes OK).
> c) any of you have done a test as simple as booting a Fedora 9 or Ubuntu or
> something live CD to see whether the issue is in fact truly something caused
> specifically by Fedora 10
I can keep the same F-10; it was sufficient for me to change /etc/X11/xorg.conf from
Section "Device" :: Driver "ati"
to
Section "Device" :: Driver "vesa"
which was described in other words in my comment #16:
> xorg-x11-drv-vesa-2.0.0-1.fc10.x86_64 works really reliably now,
> xorg-x11-drv-ati-6.9.0-61.fc10.x86_64 very randomly crashes the machine.

Re #24 -- This system worked fine with F9 for many months. memtest86+ turned up no errors, and the common thread seems to be that we all use an ATI graphics adapter. Mine is an M26/Mobility X700/RV410 PCIe.
I'm typing this now with the new xorg-x11-drv-ati-6.9.0-62.fc10.x86_64 and kernel-2.6.27.9-159.fc10.x86_64 packages. Kernel modesetting is disabled. So far so good, but it's only been about an hour of uptime with a single suspend/resume cycle.
Amusingly enough, while installing these packages the memory corruption hit and trashed both the rpmdb and the filesystem itself (with 6.9.0-61 and 2.6.27.7-134 running at the time). The system uptime was approx 20 hours, with three suspend/resume cycles.

Updated the kernel to the kernel-2.6.27.9-159.fc10.x86_64 packages (http://koji.fedoraproject.org/koji/buildinfo?buildID=74993) and the radeon driver to xorg-x11-drv-ati-6.9.0-62.fc10.x86_64.
Working stably so far (radeon xpress 1100-RS485). Previously had memory/libs corruption.

Solomon, Eugene,
If kernel-2.6.27.9-159.fc10.x86_64 works for you, please say so in the comment box there :
https://admin.fedoraproject.org/updates/kernel-2.6.27.9-159.fc10
and choose the "works for me" checkbox.
Please also report here so that we can properly triage or close the bug. I tend to agree with Adam Williamson: we're probably looking at two different bugs there. I'd like to know if that kernel and the -62 xorg-x11-drv-ati build work for everyone with nomodeset, at least.
Thanks

Francois, is it a bug or a feature of the new radeon driver that I cannot switch to a text console (e.g. Ctrl-Alt-F1)? I tried both with kernel modesetting and with nomodeset.

After 29 hours of uptime, the kernel-2.6.27.9-159 + xorg-x11-drv-ati-6.9.0-62 combo has held up pretty well. I can't say that the problem is resolved, but it has usually bombed by this time. (Kernel modesetting is still disabled.)
Also, Eugene, in F10 X is on VT1 by default -- try one of the others, and you should have a text console.

I can confirm comment #33. Uptime is now almost two days, with several sleep cycles, and it is working fine (KMS disabled). I think this bug can be closed once the updates have gone to stable.

53 hours of uptime and a good half-dozen sleep cycles later, things are still stable. This is the longest this system has survived since the F10 upgrade.
I think once the kernel-2.6.27.9-159 and xorg-x11-drv-ati-6.9.0-62 updates are pushed to stable, we can consider this problem resolved.

All is OK now, but I had so much library corruption before this update (even the /etc/logrotate.d directory was corrupted) that I am afraid this bug needs to be documented in the release notes/known bugs for radeon hardware, with some details on how to check for and fix the system corruption afterwards.

(In reply to comment #35)
> I think once the kernel-2.6.27.9-159 and xorg-x11-drv-ati-6.9.0-62 updates are
> pushed to stable, we can consider this problem resolved.
Happened for the ati driver. Closing as CURRENTRELEASE (the kernel is still in testing, but this bug is against the ATI driver).

Created attachment 328084 [details]
test application
I see a similar corruption, with
kernel-2.6.27.9-159.fc10.x86_64
xorg-x11-drv-ati-6.9.0-63.fc10.x86_64
Could one of you try to compile & run the attached test app?
It creates a multi-MB file in /tmp and continuously reads & verifies the file.
I get a reliable crash within seconds when I run ./iotest 500 and then move the mouse.
- memtest86: ok
- test app from console (runlevel 3): ok
- test app without mouse movements: ok
See also https://bugzilla.redhat.com/show_bug.cgi?id=478622

Ran ioload here for more than 500 loops without a problem under active system usage (desktop, compiling etc., mouse was moved), with the very same package versions and architecture you reported.
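
For readers who cannot get at attachment 328084, here is a minimal Python sketch of the kind of check the test application is described as doing: write a multi-MB file to a temporary directory, then re-read and verify it repeatedly. It is not the attached program. The function names, the SHA-256 comparison, and the `size_mb` / `loops` parameters are assumptions made for illustration, and the report does not say what the `500` in `./iotest 500` stands for.

```python
import hashlib
import os
import tempfile

BLOCK = 1024 * 1024  # 1 MiB


def write_pattern_file(path, size_mb):
    """Write size_mb MiB of a fixed byte pattern; return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    block = bytes(i % 256 for i in range(BLOCK))
    with open(path, "wb") as fh:
        for _ in range(size_mb):
            fh.write(block)
            digest.update(block)
        fh.flush()
        os.fsync(fh.fileno())  # push the data to disk before verifying
    return digest.hexdigest()


def file_digest(path):
    """Read the whole file back in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(BLOCK), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_loop(path, expected, loops):
    """Re-read the file `loops` times; return the first failing pass index, or None."""
    for i in range(loops):
        if file_digest(path) != expected:
            return i
    return None


def run_io_check(size_mb=4, loops=10, directory=None):
    """Create a scratch file, verify it repeatedly, and always clean it up."""
    fd, path = tempfile.mkstemp(prefix="iocheck-", dir=directory)
    os.close(fd)
    try:
        expected = write_pattern_file(path, size_mb)
        return verify_loop(path, expected, loops)
    finally:
        os.remove(path)


if __name__ == "__main__":
    bad_pass = run_io_check()
    if bad_pass is None:
        print("no mismatch detected")
    else:
        print(f"digest mismatch on pass {bad_pass}")
```

Repeated reads of a file this small are normally served from the page cache, so a mismatch would point at corrupted memory rather than at the disk, which matches the hypothesis in this bug. Larger `size_mb` and `loops` values, combined with mouse movement under X as described in the comment above, would be closer to the reported reproduction.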