Bug 733376
| Summary: | Reboots lead to ATA drive errors and ext3 file system corruption |
|---|---|
| Product: | [Fedora] Fedora |
| Component: | kernel |
| Version: | 15 |
| Hardware: | i386 |
| OS: | Linux |
| Status: | CLOSED WORKSFORME |
| Severity: | high |
| Priority: | unspecified |
| Reporter: | ell1e <el> |
| Assignee: | Kernel Maintainer List <kernel-maint> |
| QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| CC: | gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda |
| Doc Type: | Bug Fix |
| Last Closed: | 2012-06-06 22:48:12 UTC |
(In reply to comment #1)
> [ 1418.029152] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> [ 1418.032714] ata1.00: BMDMA stat 0x25
> [ 1418.036078] ata1.00: failed command: READ DMA EXT
> [ 1418.039440] ata1.00: cmd 25/00:06:56:03:2d/00:00:12:00:00/e0 tag 0 dma 3072 in
> [ 1418.039443] res 51/40:00:56:03:2d/40:00:12:00:00/e0 Emask 0x9 (media error)
> [ 1418.046305] ata1.00: status: { DRDY ERR }
> [ 1418.049716] ata1.00: error { UNC }
> [ 1418.062550] end_request: I/O error, dev sda, sector 304939862
> [ 1418.066023] Buffer I/O error on device dm-2, logical block 122683401
> [ 1418.069467] Buffer I/O error on device dm-2, logical block 122683402
> [ 1418.072831] Buffer I/O error on device dm-2, logical block 122683403
>

It is reporting an uncorrectable read error. Does the error always happen on the same sector?

The smartctl output is mostly good, except that the drive was operated over maximum temperature sometime in the past and the seek error rate is marginal (it's within the limits, but healthy drives usually have rates much lower than that).

I actually did not check back then, sorry. I would have checked now, but I have already rebooted twice _without_ those read errors or any file system corruption turning up again. So strangely enough, it seems to have disappeared around the time I wrote this bug report.

This is my current kernel (maybe a kernel update fixed this issue?):

bash-4.2$ uname -a
Linux jth 2.6.40.3-0.fc15.i686 #1 SMP Tue Aug 16 04:24:09 UTC 2011 i686 i686 i386 GNU/Linux

I really hope it wasn't caused by my hard disk struggling under heat or something like that. I would hate to see it die (and I cannot do much about the high temperature either; it's a netbook, not an easily moddable desktop computer).

I have now given up the practice of suspending instead of rebooting, to test more thoroughly whether this is really gone, but it seems it is for now.

It came back on me now with kernel 3.1.0 on Fedora 16 beta.
The sector was now 309667757, so apparently it's changing. Also, I had not run into this issue for some months, as stated above; no idea why it worked fine for such a long time.

The moment I usually notice this bug has hit me again is when the system suddenly stalls/hangs completely for 10 seconds or so. I then check dmesg and see those read errors again, and I assume that on the next reboot and filesystem check I might also see some new corruption. I will report later whether that was actually the case this time.

Did this ever resolve itself? Still seeing it with 2.6.43/3.3?

For me, it has again disappeared for many months. Also, the hard drive had a rather gross seek error count in SMART (although not many read errors), so I cannot rule out hardware issues.

OK. We'll close this out for now. If you see it again, please reopen.
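The "same sector?" question above can be answered from the kernel log itself rather than from memory. The following is a minimal sketch, not a tool used in this report: the function name `failing_sectors` is made up for illustration, and the regular expression assumes the block-layer message format quoted in this bug ("I/O error, dev sda, sector N").

```python
import re
from collections import Counter

# Assumed format, taken from the log in this report:
#   "end_request: I/O error, dev sda, sector 304939862"
# "Buffer I/O error on device ..." lines are deliberately not matched,
# because they report logical blocks of the dm device, not disk sectors.
SECTOR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def failing_sectors(log_text):
    """Count how often each (device, sector) pair appears in kernel log text."""
    counts = Counter()
    for match in SECTOR_RE.finditer(log_text):
        counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Two lines copied from the dmesg excerpt in this report.
SAMPLE = (
    "[ 1418.062550] end_request: I/O error, dev sda, sector 304939862\n"
    "[ 1418.066023] Buffer I/O error on device dm-2, logical block 122683401\n"
)

for (dev, sector), hits in sorted(failing_sectors(SAMPLE).items()):
    print(f"{dev} sector {sector}: {hits} error(s)")
```

Feeding it saved dmesg output from several incidents would show whether the same sector keeps failing, or whether the sector moves around, as it did here (304939862 first, 309667757 later).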
Created attachment 519905
smartctl output

Description of problem:

Recently I started getting errors similar to this one in dmesg, often when starting a particular program, Mozilla Firefox. When forcing fsck on the next reboot, it diagnosed a broken ext3 file system (this has already happened twice). After fixing it (which included some data loss in both cases), the ATA errors always went away until some more reboots made the problem recur.

[ 1418.029152] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 1418.032714] ata1.00: BMDMA stat 0x25
[ 1418.036078] ata1.00: failed command: READ DMA EXT
[ 1418.039440] ata1.00: cmd 25/00:06:56:03:2d/00:00:12:00:00/e0 tag 0 dma 3072 in
[ 1418.039443] res 51/40:00:56:03:2d/40:00:12:00:00/e0 Emask 0x9 (media error)
[ 1418.046305] ata1.00: status: { DRDY ERR }
[ 1418.049716] ata1.00: error { UNC }
[ 1418.062550] end_request: I/O error, dev sda, sector 304939862
[ 1418.066023] Buffer I/O error on device dm-2, logical block 122683401
[ 1418.069467] Buffer I/O error on device dm-2, logical block 122683402
[ 1418.072831] Buffer I/O error on device dm-2, logical block 122683403

The ATA errors always come together with heavy system lag and freezes. I suspected the file system corruption was caused by a failing unmount and filed #728723, but it seems that the unmount issue is not directly involved.

Regarding the filesystem corruption, please note that rebooting *always* gives me some journal notes on the file systems, so I guess there might still be some unidentified problem with cleanly flushing the file system on shutdown.

Also, it seems possible to fully avoid this issue (the filesystem corruption and the apparently subsequent drive errors above) by not rebooting and using suspend instead. I have already tried that with a few weeks of uptime and it seems to work fine.

I attached smartctl output, which might help judge whether this is a driver issue or a dying hard disk.
Version-Release number of selected component (if applicable):

bash-4.2$ uname -a
Linux jth 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386 GNU/Linux

How reproducible:

100% when rebooting often (tried twice; needs maybe 2+ reboots to actually trigger the drive errors and break the filesystem)

Steps to Reproduce:
1. Reboot multiple times
2. Avoid fsck checks on reboot
3. Open Firefox, notice system freezes, and find the ATA errors spammed in dmesg
4. Reboot with /forcefsck on all file systems
5. Discover the filesystem is badly broken and repair it with data loss
6. After a fresh boot with the repaired file system, open Firefox again: it works fine now without any drive errors

Actual results:

Rebooting eventually triggers the drive errors and a corrupted file system (no idea how exactly those two are related)

Expected results:

Rebooting works fine and doesn't eventually make me run into drive errors or a corrupted file system

Additional info:

Please note that I tested this only twice, and more or less involuntarily. Since this is a machine I use daily, I was not exactly keen on totally ruining the file system by triggering this explicitly.
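For anyone reading the dmesg excerpt in this report, the `cmd` and `res` strings can be decoded by hand. The sketch below assumes the libata field order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device (with status and error in the first two positions of `res`). The helper names `parse_taskfile`, `lba48` and `sector_count` are made up for illustration. Under that assumption the decoded address is 304939862, which agrees with the sector the kernel printed, and the 6-sector length agrees with "dma 3072" at 512 bytes per sector.

```python
# Register names for the five colon-separated fields in each group.
FIELD_NAMES = ("feature", "nsect", "lbal", "lbam", "lbah")

# Only the bits the kernel named in this report are listed here.
STATUS_BITS = {0x40: "DRDY", 0x01: "ERR"}
ERROR_BITS = {0x40: "UNC"}


def parse_taskfile(text):
    """Split a 'cmd' or 'res' taskfile string into named register values."""
    command, low, high, device = text.split("/")
    regs = {"command": int(command, 16), "device": int(device, 16)}
    for name, value in zip(FIELD_NAMES, low.split(":")):
        regs[name] = int(value, 16)
    for name, value in zip(FIELD_NAMES, high.split(":")):
        regs["hob_" + name] = int(value, 16)  # "high order byte" registers
    return regs


def lba48(regs):
    """Combine the six LBA bytes into one 48-bit sector address."""
    return (regs["hob_lbah"] << 40 | regs["hob_lbam"] << 32
            | regs["hob_lbal"] << 24 | regs["lbah"] << 16
            | regs["lbam"] << 8 | regs["lbal"])


def sector_count(regs):
    """Number of sectors requested by an LBA48 command."""
    return regs["hob_nsect"] << 8 | regs["nsect"]


def flag_names(value, table):
    """List the names of the known bits that are set in value."""
    return [name for bit, name in table.items() if value & bit]


cmd = parse_taskfile("25/00:06:56:03:2d/00:00:12:00:00/e0")
res = parse_taskfile("51/40:00:56:03:2d/40:00:12:00:00/e0")

print("command opcode:", hex(cmd["command"]))
print("requested LBA:", lba48(cmd), "sectors:", sector_count(cmd))
# In 'res', the first field is the status register and the second the error register.
print("status:", flag_names(res["command"], STATUS_BITS))
print("error:", flag_names(res["feature"], ERROR_BITS))
print("failing LBA:", lba48(res))
```

Since the decoded address matches the `end_request` sector, the error is reported by the drive for that exact location, which is consistent with the "media error" classification in the log.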