Bug 476203

Summary: Filesystem/Memory corruption (broken shared libs after a while)
Product: [Fedora] Fedora
Reporter: Tim Niemueller <tim>
Component: xorg-x11-drv-ati
Assignee: X/OpenGL Maintenance List <xgl-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent
Docs Contact:
Priority: high
Version: 10
CC: airlied, awilliam, bugzilla.redhat.com, camilo, fbijlsma, fdc, jan.kratochvil, jfrieben, kernel-maint, mcepl, pizza, quintela, savelov, xgl-maint
Target Milestone: ---
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: xorg-x11-drv-ati-6.9.0-62.fc10
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2008-12-21 01:13:28 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
  Description: test application
  Flags: none

Description Tim Niemueller 2008-12-12 13:30:30 UTC
Description of problem:
I've seen file system corruption three times now on two systems running F-10. Every time, shared libraries or binaries started segfaulting randomly. The ones that I saw crashing are sqlite (causing crashes of yum and firefox, for instance), metacity and gtk2, the latter just a few minutes ago. Any keypress issued outside my terminal would crash the application (like nautilus, see bug #476200), the gnome-screensaver dialog would accept my fingerprint but not a password, and gdm login was not possible because the username could not be typed in.

Since this has now happened a couple of times, on different machines (IBM ThinkPad T40 and T60) and architectures (i386 and x86_64), and with different packages, I suspect something corrupts the files on the disk.

The system uses a crypto root device with an LVM inside (what anaconda created for F-9).

Version-Release number of selected component (if applicable):
kernel-2.6.27.7-134.fc10.x86_64

How reproducible:
From time to time.

Steps to Reproduce:
Unfortunately I cannot reproduce this reliably. It just keeps happening every second day or so.

Comment 1 Chuck Ebbert 2008-12-13 18:28:31 UTC
How do you know it's filesystem corruption? Do you get broken checksums when verifying RPM packages or some other indication that files are corrupted?

Comment 2 Frederik Bijlsma 2008-12-14 11:44:14 UTC
I am seeing the same now. Have not tracked it down though...

2.6.27.7-134.fc10.i686

Dec 14 12:35:26 aachen kernel: metacity[3128]: segfault at 0 ip 080aa093 sp bfd196a0 error 4 in metacity[8048000+82000]
Dec 14 12:35:35 aachen gnome-session[3120]: WARNING: Application 'metacity.desktop' failed to register before timeout
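
For anyone reading along who is not used to the kernel's segfault line, here is a minimal Python sketch that decodes it. The field layout and the x86 page-fault error-code bits are the standard ones; the function name decode_segfault and the dictionary keys are made up for illustration and are not part of any tool mentioned in this bug.

```python
import re

# Pattern for the x86 "segfault at ..." line the kernel logs for user-space faults.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault log line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    info = {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        # x86 page-fault error code bits: bit 0 = protection fault (otherwise
        # page not present), bit 1 = write access (otherwise read),
        # bit 2 = fault raised in user mode.
        "protection_fault": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }
    if m.group("obj"):
        base = int(m.group("base"), 16)
        info["object"] = m.group("obj")
        # Offset of the faulting instruction inside the mapped object.
        info["ip_offset"] = int(m.group("ip"), 16) - base
    return info


line = ("Dec 14 12:35:26 aachen kernel: metacity[3128]: segfault at 0 "
        "ip 080aa093 sp bfd196a0 error 4 in metacity[8048000+82000]")
info = decode_segfault(line)
print(info["user_mode"], info["write"], hex(info["ip_offset"]))
```

For the line above this reports a user-mode read of address 0 (i.e. a NULL pointer dereference) at offset 0x62093 into the metacity binary. That by itself does not say whether the bad pointer came from a bug or from corrupted memory.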

Comment 3 Tim Niemueller 2008-12-14 16:08:39 UTC
It's getting more confusing...

I switched to runlevel 1 and did a fsck. It didn't show a single problem. I've also tried the -117 kernel and this has the same problem. The last time thunderbird would not start anymore (glibc detected a double free while scanning through the dictionaries or so, just had a short strace but then needed the system...) and yum would not work (Bad marshal data or so, again, only had a quick glance).

A reboot fixed both of these problems. My new suspicion is that it is not the filesystem that is corrupted, but rather the memory. I'm not sure, but I think I have seen this after a suspend/resume cycle. Could that have something to do with it? Additionally, my system often wakes up twice: it wakes up, goes back to sleep by itself, and when I wake it up again it stays on (put to sleep with Fn-F4). Another symptom, or unrelated?

Comment 4 Tim Niemueller 2008-12-15 10:54:33 UTC
After a wakeup I just had crashes of Firefox and Thunderbird. Additionally a kernel oops was reported (see http://www.kerneloops.org/submitresult.php?number=140467). Is there a way I can recover all the oopses I sent, is that logged somewhere? I had more in the recent past, different ones.

Comment 5 Peter Janes 2008-12-15 16:49:55 UTC
I'm also seeing frequent panics, shared library corruption and other symptoms similar to comment 3 on my x86_64 install (upgraded from F9).  No suspend/resume involved, although they appeared after I shut down for the first time since upgrading (after running with no problems for about a week).

It's gotten to a point where the desktop is unusable: I can't run for more than 10 minutes without running into some corruption.  The first sign that things are going wrong is usually that characters on the X desktop become corrupted, which also lends itself to the suggestion that memory is being corrupted.  After this starts even shutting down is a questionable process, as there are errors and stack traces all over the console.

Comment 6 Tim Niemueller 2008-12-15 16:55:11 UTC
Now that you say it: check boxes and radio buttons are corrupted in Firefox on my machine. Coincidence?

It's also getting worse here. It's not every 10 minutes, but usually multiple times a day, and it seems to be almost certain after wakeup. Updating summary to the new hypothesis so others can find it more easily.

Comment 7 Tim Niemueller 2008-12-16 09:48:46 UTC
The newest one in the series of kernel oopses: http://www.kerneloops.org/submitresult.php?number=141735. Now this looks like it could cause trouble, ext3 and kswapd involved. Any suggestions on this? What can be done to narrow it down?

Comment 8 Tim Niemueller 2008-12-16 10:07:15 UTC
Just after I posted the recent comment #7 my system froze. No more mouse movement, no nothing. The very last entry in the log was the following:

Dec 16 10:54:03 evilgenius kernel: mtrr: base(0xd9d1c000) is not aligned on a size(0xb48000) boundary

Any idea?
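
As background on the mtrr message: a variable-range MTRR can only describe a region whose size is a power of two and whose base is aligned to that size. Below is a minimal Python sketch of that constraint. It is a simplification of what the kernel actually checks, and the function name mtrr_range_ok is made up for illustration.

```python
def mtrr_range_ok(base, size):
    """Simplified check: size is a power of two and base is a multiple of size."""
    power_of_two = size > 0 and (size & (size - 1)) == 0
    return power_of_two and base % size == 0


# Values taken from the log line above: this request is rejected.
rejected = mtrr_range_ok(0xD9D1C000, 0xB48000)
# Hypothetical request for contrast: a 16 MiB region on a 16 MiB boundary.
accepted = mtrr_range_ok(0xD9000000, 0x1000000)
print(rejected, accepted)
```

The size 0xb48000 from the log is not a power of two, so the request fails this check regardless of the base address.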

Comment 9 Eugene Savelov 2008-12-16 17:54:38 UTC
*** Bug 476403 has been marked as a duplicate of this bug. ***

Comment 10 Eugene Savelov 2008-12-16 18:00:24 UTC
I also experienced similar symptoms (shared libraries corruption, detected by rpm -V), and I got an oops at ext3_discard_reservation+0x27/0x8b

http://www.kerneloops.org/submitresult.php?number=139504
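
Since rpm -V came up as the way to detect the corruption, here is a small Python sketch that picks the files with a content digest mismatch out of its output. It assumes the usual output shape: a flag string where "." means a check passed and "5" marks a digest difference, an optional one-letter file-type marker, then the path. The sample lines are invented for illustration and are not output from any machine in this bug; digest_failures is a hypothetical helper name.

```python
import re

# One `rpm -V` result line: a flag string, an optional file-type marker
# (c, d, g, l, r) and the path. Each flag position is "." when the check passed.
VERIFY_RE = re.compile(r"^(?P<flags>\S+)\s+(?:(?P<kind>[cdglr])\s+)?(?P<path>/.*)$")


def digest_failures(verify_output):
    """Return the paths whose content digest differs (flag "5" set)."""
    failed = []
    for line in verify_output.splitlines():
        m = VERIFY_RE.match(line.strip())
        # Files reported as "missing" have no flag string to inspect.
        if m is None or m.group("flags") == "missing":
            continue
        if "5" in m.group("flags"):
            failed.append(m.group("path"))
    return failed


# Hypothetical sample output, made up for illustration.
sample = """\
S.5....T.    /usr/lib64/libexample.so.0
.......T.  c /etc/example.conf
missing      /usr/share/doc/example/README
"""
print(digest_failures(sample))
```

Configuration files legitimately change after installation, so a mismatch on a "c" entry is less telling than one on a shared library or binary.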

Comment 11 Jan Kratochvil 2008-12-16 18:16:41 UTC
Confirming crashes on F-10.x86_64, I had 30+ days uptime before and now it survives only several hours.
It is not the kernel: it crashes even with the long-term stable/verified kernel-2.6.25.10-86.fc9.x86_64 on my box (and it also crashes with the F-10 kernels).
Machine is T-60, ran there memtest86+ (F-10 CD) and no problem (3 passes OK).
01:00.0 VGA compatible controller: ATI Technologies Inc Radeon Mobility X1400

Using LUKS as in comment 0.
Using a T-60 as in comment 0; suspecting the ati driver, trying vesa now.
xorg-x11-drv-ati-6.9.0-61.fc10.x86_64

Comment 12 Jan Kratochvil 2008-12-16 18:17:36 UTC
*** Bug 476668 has been marked as a duplicate of this bug. ***

Comment 13 Joachim Frieben 2008-12-16 20:08:13 UTC
For a fully updated F10 x86_64 system plus kernel-2.6.27.9-155.fc10.x86_64, I have noted random crashes, kernel oopses, and system freezes which I hadn't seen with recent "rawhide".
Affected applications include epiphany, tomboy, and clock applet. The system / desktop froze a couple of times or even started over. I am using Adobe's x86_64 flash plugin, but I had problems after removing it, too.
After updating to xorg-x11-server-Xorg-1.5.3-6.fc10.x86_64, problems seem to have ceased though (maybe a coincidence ..).
Interestingly, this system is also sporting an ATI video card, namely X800 (R430) PCIe. Except for /boot and swap, all partitions/volumes are formatted as ext4.

Comment 14 Jan Kratochvil 2008-12-16 20:17:03 UTC
Crashes happened here on xorg-x11-server-Xorg-1.5.3-6.fc10.x86_64 (Bug 476668).
So far no crash with xorg-x11-drv-vesa-2.0.0-1.fc10.x86_64 although the testing period is not representative enough.

Comment 15 cam 2008-12-16 23:55:20 UTC
I've seen strange problems that I haven't found a root cause for yet. This is on a Dell Inspiron 6000 with Radeon x300 graphics. At boot time and when unsuspending it sometimes reports:

Dec 14 19:27:56 pricey kernel: Uhhuh. NMI received for unknown reason b0 on CPU 0.
Dec 14 19:27:56 pricey kernel: You have some hardware problem, likely on the PCI bus.
Dec 14 19:27:56 pricey kernel: Dazed and confused, but trying to continue

Firefox shows corruption of graphics (some character glyphs have 'noise') other apps seem OK.

Twice the disk subsystem seems to have locked up, with any attempt to load new commands giving IO error messages, although running applications like Firefox and vncviewer have continued working. Sometimes the wireless seems more temperamental than usual (having to rmmod and modprobe the ipw2200 module and restart NM to get a connection)

Because suspend / resume wasn't working I have been booting with 'nomodeset' and 'noapic' which have improved things slightly.

Comment 16 Jan Kratochvil 2008-12-17 16:23:38 UTC
xorg-x11-drv-vesa-2.0.0-1.fc10.x86_64 works really reliably now,
xorg-x11-drv-ati-6.9.0-61.fc10.x86_64 very randomly crashes the machine.

Checked now that xorg-x11-drv-ati-6.9.0-38.fc10.x86_64 was running for me stable for a month+ but it is a pretty old release.

According to my tests, kernel was definitely not the right component.

Comment 17 Solomon Peachy 2008-12-17 16:55:47 UTC
I'm seeing this problem on my Acer Ferrari 4000 laptop (x86_64); it is rare the machine survives more than a day of use without random app crashes.  Even xterms and gnome applets are dying on me.  The last cycle involved rpm/yum corruption but the cycle before that involved anything that used sound.  Rebooting always solves the crashes.  So far nothing on disk (except the rpm database/tmpfiles) has ended up corrupted that I can tell.

I have two other F10 desktop machines (one x86_64, the other i686) that have been up for nearly two weeks with no problems at all.  They're both running the proprietary nvidia driver, while the crashy laptop uses the xorg radeon driver for its ATI x700 PCIe adapter.  Kernel modesetting is disabled due to the mass instability and slowdowns it caused.

Comment 18 cam 2008-12-17 17:29:04 UTC
With thanks to FD Cami who pointed me in the direction of Bug 475555 (font corruption in Firefox) and the xorg-x11-drv-ati-6.9.0-62.fc10 build in Koji (http://koji.fedoraproject.org/koji/buildinfo?buildID=73971). That package seems to cure the problems of font corruption and suspend / resume (fast reliable resume again!). With that package and the kernel 2.6.27.9-155 (http://koji.fedoraproject.org/koji/packageinfo?packageID=8) kernel modesetting seems to work well on my hardware.

Comment 19 François Cami 2008-12-17 20:12:32 UTC
Hi Tim,

Could you test both kernel-2.6.27.9-159.fc10.x86_64 and xorg-x11-drv-ati-6.9.0-62.fc10 ?
Those are available in Koji :
http://koji.fedoraproject.org/koji/buildinfo?buildID=73971
http://koji.fedoraproject.org/koji/buildinfo?buildID=8
Thanks

Comment 20 Tim Niemueller 2008-12-17 22:20:14 UTC
I've tried the mentioned kernel, but it wouldn't boot. It seems that after LVM initialization it couldn't mount my (crypto) root. I'm trying -155 now as this was reported working.

Comment 21 Tim Niemueller 2008-12-17 22:31:47 UTC
No luck with -155 either, I get:

mount: error mounting /dev/root on /sysroot as ext3: Invalid Argument

I have to investigate what strikes me here. Any idea?

Comment 22 Tim Niemueller 2008-12-17 22:57:51 UTC
Kernel is booting now. The problem was caused by the relatime option that I had added to /etc/fstab for the root entry (does anyone know why that breaks?).

* Suspend/resume with KMS does not work.
* Suspend/resume w/o KMS worked. Will now test for a day if I see more crashes.
* Still see screen corruption around checkboxes. As soon as I try to take a screenshot the screen is refreshed and the corruption is gone. It usually shows up after scrolling. Only seen in Firefox up to now.

Comment 23 François Cami 2008-12-18 00:34:58 UTC
Tim, Solomon, Jan,

Could you try the xorg-x11-drv-ati update at :
http://koji.fedoraproject.org/koji/buildinfo?buildID=73971
and report back ?
It contains a fix for font problems and is stable here on both my Radeons.

Comment 24 Adam Williamson 2008-12-18 01:51:11 UTC
Just to chime in - it may be a "well, duh, OBVIOUSLY I did that!" kinda thing, but on a skim read of the initial report and all comments, I don't see any indications that:

a) you're all actually definitely suffering the same bug
b) any of you have done anything to eliminate hardware problems as a possible cause: for instance, running a basic memtestx86 check
c) any of you have done a test as simple as booting a Fedora 9 or Ubuntu or something live CD to see whether the issue is in fact truly something caused specifically by Fedora 10

in cases as vague and asymptomatic as this, I think the above set of tests is a fairly sensible thing to do.

Comment 25 Peter Janes 2008-12-18 02:14:59 UTC
Addressing items b) and c) in comment 24: at the time of my comment I had run several full passes of memtest86+, as well as replacing the video card (with another of the same model) and running various hardware diagnostics for CPU and other components in an attempt to rule out the more straightforward hardware problems.  My system was upgraded from Fedora 9, which had been working flawlessly; I've since abandoned F10 and reinstalled F9, which has returned the OS to its former stable state.

Addressing item a): the two common elements among most of the "crashy" reports in this bug appear to be ATI graphics cards (which I also have but didn't note in my comment, since it appeared to be a kernel issue at the time) and power-related operations.  While they may not be caused by the same bug, the shared library corruption and other symptoms appear to be similar enough that they've led to the suggestion of kernel and driver patches which may have addressed the issues.

Because I've reverted to Fedora 9 I'm not in a situation where I can try the patched RPMs, unfortunately, but if it will help I can provide details of the system hardware.

Comment 26 Adam Williamson 2008-12-18 02:28:40 UTC
That definitely suggests it's something genuinely in the software and specific to F10, then - but up till this point, there was nothing to confirm that :). Which is why I asked.

Comment 27 Eugene Savelov 2008-12-18 07:54:17 UTC
where to get kernel-2.6.27.9-159.fc10.x86_64? I could not find it on the link that was provided

thanks

Comment 28 Jan Kratochvil 2008-12-18 08:24:47 UTC
(In reply to comment #24)
> b) any of you have done anything to eliminate hardware problems as a possible
> cause: for instance, running a basic memtestx86 check
Tested in my comment #11:
> ran there memtest86+ (F-10 CD) and no problem (3 passes OK).

> c) any of you have done a test as simple as booting a Fedora 9 or Ubuntu or
> something live CD to see whether the issue is in fact truly something caused
> specifically by Fedora 10
I can keep the same F-10; it was sufficient for me to change /etc/X11/xorg.conf:
  Section "Device" :: Driver "ati"
to
  Section "Device" :: Driver "vesa"
which was described in other words in my comment #16:
> xorg-x11-drv-vesa-2.0.0-1.fc10.x86_64 works really reliably now,
> xorg-x11-drv-ati-6.9.0-61.fc10.x86_64 very randomly crashes the machine.
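
For others who want to repeat the ati/vesa comparison, the xorg.conf change described above could be scripted roughly as follows. This is only a sketch in Python: the function name set_device_driver and the sample configuration are illustrative assumptions, and a real xorg.conf will contain more sections than shown here.

```python
import re


def set_device_driver(conf_text, new_driver):
    """Replace the Driver entry inside every Section "Device" block."""
    out, in_device = [], False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if re.match(r'(?i)Section\s+"Device"', stripped):
            in_device = True
        elif re.match(r"(?i)EndSection\b", stripped):
            in_device = False
        elif in_device and re.match(r"(?i)Driver\s", stripped):
            # Keep the original indentation, swap only the driver name.
            indent = line[: len(line) - len(line.lstrip())]
            line = f'{indent}Driver "{new_driver}"'
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical minimal xorg.conf fragment, for illustration only.
conf = '''Section "Device"
    Identifier "Videocard0"
    Driver "ati"
EndSection
'''
print(set_device_driver(conf, "vesa"))
```

The function only transforms text; writing the result back to /etc/X11/xorg.conf (after keeping a backup) and restarting X is left to the reader.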

Comment 29 Solomon Peachy 2008-12-18 14:36:24 UTC
Re #24 -- This system worked fine with F9 for many months.  memtest86+ turned up no errors, and the common thread seems to be that we all use an ATI graphics adapter.  Mine is an M26/Mobility X700/RV410 PCIe.

I'm typing this now with the new xorg-x11-drv-ati-6.9.0-62.fc10.x86_64 and kernel-2.6.27.9-159.fc10.x86_64 packages.  kernel modesetting is disabled.  So far so good but it's only been about an hour of uptime with a single suspend/resume cycle.

Amusingly enough, while installing these packages the memory corruption hit and trashed both the rpmdb and the filesystem itself.  (with 6.9.0-61 and 2.6.27.7-134 running at the time)  The system uptime was approx 20 hours, with three suspend/resume cycles.

Comment 30 Eugene Savelov 2008-12-18 18:42:40 UTC
updated kernel to kernel-2.6.27.9-159.fc10.x86_64 packages (http://koji.fedoraproject.org/koji/buildinfo?buildID=74993) and radeon driver to xorg-x11-drv-ati-6.9.0-62.fc10.x86_64
working stable so far (radeon xpress 1100-RS485). previously had memory/libs corruption

Comment 31 François Cami 2008-12-18 18:59:36 UTC
Solomon, Eugene,
If kernel-2.6.27.9-159.fc10.x86_64 works for you, please say so in the comment box there : 
https://admin.fedoraproject.org/updates/kernel-2.6.27.9-159.fc10
and choose the "works for me" checkbox.
Please also report here so that we can properly triage or close the bug.
I tend to agree with Adam Williamson, we're probably looking at two different bugs there. I'd like to know if that kernel and the -62 xorg-x11-drv-ati build work for everyone with nomodeset, at least.
Thanks

Comment 32 Eugene Savelov 2008-12-19 04:38:12 UTC
Francois, is it a bug or a feature of the new radeon driver that I cannot switch to a text console (e.g. Ctrl-Alt-F1)? I tried both with kernel modesetting and with nomodeset.

Comment 33 Solomon Peachy 2008-12-19 18:39:55 UTC
After 29 hours of uptime, the kernel-2.6.27.9-159 + xorg-x11-drv-ati-6.9.0-62 combo has held up pretty well.  I can't say that the problem is resolved, but it has usually bombed by this time.  

(Kernel modesetting is still disabled)

Also, Eugene, in F10 X is on VT1 by default -- try one of the others, and you should have a text console.

Comment 34 Tim Niemueller 2008-12-19 21:22:41 UTC
I can confirm comment #33. Uptime is now almost two days, with several sleep cycles, working fine (KMS disabled). I think this bug can be closed once the updates have gone to stable.

Comment 35 Solomon Peachy 2008-12-20 16:44:41 UTC
53 hours of uptime and a good half-dozen sleep cycles later, things are still stable.  This is the longest this system has survived since the F10 upgrade.  I think once the kernel-2.6.27.9-159 and xorg-x11-drv-ati-6.9.0-62 updates are pushed to stable, we can consider this problem resolved.

Comment 36 Eugene Savelov 2008-12-20 18:15:03 UTC
all is ok now, but I had so much library corruption before this update (even the /etc/logrotate.d directory was corrupted) that I am afraid this bug needs to be documented in the release notes/known bugs for radeon hardware, with some details on how to check for and fix the system corruption afterwards

Comment 37 Matěj Cepl 2008-12-21 01:13:28 UTC
(In reply to comment #35)
> I think once the kernel-2.6.27.9-159 and xorg-x11-drv-ati-6.9.0-62 updates are
> pushed to stable, we can consider this problem resolved.

Happened for ati driver. Closing as CURRENTRELEASE (kernel is still in testing, but this bug is against ATI driver).

Comment 38 Manfred Spraul 2009-01-02 17:39:03 UTC
Created attachment 328084 [details]
test application

I see a similar corruption, with

  kernel-2.6.27.9-159.fc10.x86_64
  xorg-x11-drv-ati-6.9.0-63.fc10.x86_64

Could one of you try to compile&run the attached test app?
It creates a multi-MB file in /tmp and continuously reads&verifies the file.

I get a reliable crash within seconds when I run

  ./iotest 500

and then move the mouse.

- memtest86: ok
- test app from console (initlevel 3): ok
- test app without mouse movements: ok
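
The attachment itself is not reproduced in this report. To give a rough idea of what a read-and-verify test of this kind can look like, here is a Python sketch. It is not the attached program: the file size, loop count, file name prefix and the function name read_verify are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile


def read_verify(size_mb=4, loops=10, directory=None):
    """Write a file of random data, then re-read it `loops` times.

    Returns the number of reads whose SHA-256 digest differed from the
    digest of the data that was originally written.
    """
    block = 1024 * 1024
    expected = hashlib.sha256()
    fd, path = tempfile.mkstemp(prefix="ioverify-", dir=directory)
    mismatches = 0
    try:
        # Write phase: remember the digest of exactly what was written.
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb):
                data = os.urandom(block)
                expected.update(data)
                f.write(data)
        want = expected.hexdigest()
        # Verify phase: read the file back repeatedly and compare digests.
        for _ in range(loops):
            got = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(block), b""):
                    got.update(chunk)
            if got.hexdigest() != want:
                mismatches += 1
    finally:
        os.unlink(path)
    return mismatches


print("mismatching reads:", read_verify(size_mb=4, loops=10))
```

With a file this small the repeated reads are most likely served from the page cache rather than from disk, so a mismatch would point at memory being overwritten, which is the hypothesis in this bug. On a healthy system the count should stay at zero.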

See also https://bugzilla.redhat.com/show_bug.cgi?id=478622

Comment 39 Tim Niemueller 2009-01-03 11:02:24 UTC
Ran iotest here for more than 500 loops without a problem under active system usage (desktop, compiling etc., mouse was moved), with the very same package versions and architecture you reported.