Bug 733376

Summary: Reboots lead to ATA drive errors and ext3 file system corruption
Product: [Fedora] Fedora
Reporter: ell1e <el>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WORKSFORME
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: unspecified
Version: 15
CC: gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda
Target Milestone: ---   
Target Release: ---   
Hardware: i386   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2012-06-06 22:48:12 UTC
Attachments:
smartctl output (flags: none)

Description ell1e 2011-08-25 15:17:15 UTC
Created attachment 519905 [details]
smartctl output

Description of problem:
Recently I started getting errors similar to the one below in dmesg, often when starting one particular program, Mozilla Firefox. When I forced fsck on the next reboot, it diagnosed a broken ext3 file system (this has already happened twice). Repairing it involved some data loss in both cases, and afterwards the ATA errors always went away until a few more reboots made the problem recur.

[ 1418.029152] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[ 1418.032714] ata1.00: BMDMA stat 0x25
[ 1418.036078] ata1.00: failed command: READ DMA EXT
[ 1418.039440] ata1.00: cmd 25/00:06:56:03:2d/00:00:12:00:00/e0 tag 0 dma 3072 in
[ 1418.039443] res 51/40:00:56:03:2d/40:00:12:00:00/e0 Emask 0x9 (media error)
[ 1418.046305] ata1.00: status: { DRDY ERR }
[ 1418.049716] ata1.00: error { UNC }
[ 1418.062550] end_request: I/O error, dev sda, sector 304939862
[ 1418.066023] Buffer I/O error on device dm-2, logical block 122683401
[ 1418.069467] Buffer I/O error on device dm-2, logical block 122683402
[ 1418.072831] Buffer I/O error on device dm-2, logical block 122683403
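
As a reading aid, the cmd line above can be decoded by hand. The sketch below assumes the field order libata uses when it prints a taskfile (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and 512-byte sectors. The function name decode_taskfile is made up for illustration and is not part of any existing tool.

```python
def decode_taskfile(tf):
    """Decode a libata taskfile string like '25/00:06:56:03:2d/00:00:12:00:00/e0'.

    Assumes an LBA48 command and 512-byte sectors. A sector count of 0
    (which means 65536 in LBA48) is not handled in this sketch.
    """
    head, low, high, device = tf.split("/")
    command = int(head, 16)
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # The 48-bit LBA is split across the low bytes and the "high order bytes" (hob).
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    sectors = (hob_nsect << 8) | nsect
    return {
        "command": command,
        "lba": lba,
        "sectors": sectors,
        "bytes": sectors * 512,  # assumption: 512-byte sectors
    }


info = decode_taskfile("25/00:06:56:03:2d/00:00:12:00:00/e0")
print(info["lba"], info["sectors"], info["bytes"])  # 304939862 6 3072
```

The decoded LBA matches the sector in the end_request line, and 6 sectors of 512 bytes match the "dma 3072" transfer size. Command 0x25 is READ DMA EXT, as the log itself states.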

The ATA errors always come together with heavy system lag and freezes.

I suspected the file system corruption to be caused by a failing unmount and filed #728723 but it seems that unmount issue is not directly involved.

For the filesystem corruption, please note that rebooting *always* gives me some journal notes on the file systems, so I guess there might be still some sort of unidentified problem with cleanly flushing the file system or something on shutdown.

Also, it seems possible to fully avoid this issue (the filesystem corruption and the apparently subsequent drive errors above) by not rebooting and using suspend and other things. I already tried that with a few weeks of uptime and it seems to work fine.

I attached smartctl output which might help judging whether this is a driver issue or a dying hard disk I suppose.

Version-Release number of selected component (if applicable):
bash-4.2$ uname -a
Linux jth 2.6.40-4.fc15.i686 #1 SMP Fri Jul 29 18:54:39 UTC 2011 i686 i686 i386 GNU/Linux
bash-4.2$

How reproducible:
100% when rebooting often (tried twice, needs maybe 2+ reboots to actually trigger the drive errors and break the filesystem)

Steps to Reproduce:
1. Reboot multiple times
2. Avoid fsck checks on reboot
3. Open Firefox, notice the system freezes, and find the ATA error spammed in dmesg
4. Reboot with /forcefsck on all file systems
5. Discover filesystem is badly broken and repair it with data loss
6. After fresh boot with repaired file system, open firefox again: works fine now without any drive errors
  
Actual results:
Rebooting eventually triggers the drive errors and a corrupted file system (no idea how exactly those two are related)

Expected results:
Rebooting works fine and doesn't eventually make me run into drive errors or a corrupted file system

Additional info:
Please note I tested it only twice and more or less involuntarily. Since this is a machine I use daily, I was not exactly keen on totally ruining the file system by triggering this explicitly.

Comment 1 Chuck Ebbert 2011-08-29 22:18:54 UTC
(In reply to comment #1)
> [ 1418.029152] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
> [ 1418.032714] ata1.00: BMDMA stat 0x25
> [ 1418.036078] ata1.00: failed command: READ DMA EXT
> [ 1418.039440] ata1.00: cmd 25/00:06:56:03:2d/00:00:12:00:00/e0 tag 0 dma 3072 in
> [ 1418.039443] res 51/40:00:56:03:2d/40:00:12:00:00/e0 Emask 0x9 (media error)
> [ 1418.046305] ata1.00: status: { DRDY ERR }
> [ 1418.049716] ata1.00: error { UNC }
> [ 1418.062550] end_request: I/O error, dev sda, sector 304939862
> [ 1418.066023] Buffer I/O error on device dm-2, logical block 122683401
> [ 1418.069467] Buffer I/O error on device dm-2, logical block 122683402
> [ 1418.072831] Buffer I/O error on device dm-2, logical block 122683403
> 

It is reporting an uncorrectable read error. Does the error always happen on the same sector? The smartctl output is mostly good, except that the drive was operated over maximum temperature sometime in the past and the seek error rate is marginal (it's within the limits, but healthy drives usually have rates much lower than that.)
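
For readers who want to check the "uncorrectable read error" reading against the raw registers: the res line starts with the status and error bytes (51/40). The sketch below decodes them using the commonly documented ATA bit assignments. It names only the bits I am sure of, and the helper flag_names is hypothetical.

```python
# Commonly documented ATA status and error register bits (subset only).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def flag_names(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


# First two fields of the "res" line quoted above: status/error.
status, error = (int(x, 16) for x in "51/40".split("/"))
print(flag_names(status, STATUS_BITS))  # ['DRDY', 'ERR']
print(flag_names(error, ERROR_BITS))    # ['UNC']
```

This agrees with the "status: { DRDY ERR }" and "error { UNC }" lines the kernel printed: the drive itself reported that it could not read the data.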

Comment 2 ell1e 2011-08-30 06:36:12 UTC
I actually did not check back then, sorry. I would have checked now, but I already rebooted twice _without_ those read errors or any file system corruption turning up again. So strangely enough, it seems to have disappeared around the time I wrote this bug report.

That is my current kernel (maybe a kernel update fixed this issue?):
bash-4.2$ uname -a
Linux jth 2.6.40.3-0.fc15.i686 #1 SMP Tue Aug 16 04:24:09 UTC 2011 i686 i686 i386 GNU/Linux

I really hope it wasn't caused by my hard disk struggling under heat or something like that. Would hate to see it die (and cannot do incredibly much about the high temperature either, it's a netbook and not a usual easily moddable desktop computer).

I have now given up the practice of suspending instead of rebooting, so that I can test more thoroughly whether this is really gone. For now it seems to be.

Comment 3 ell1e 2011-10-25 22:32:14 UTC
It came back on me now with kernel 3.1.0 on Fedora 16 beta.

The sector was now 309667757, so apparently it's changing. Also I haven't run into this issue for some months as stated above, no idea why it worked fine for such a long time.
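
To compare occurrences, a small script can pull the failing sectors out of saved dmesg output. This is only a sketch: failing_sectors is a hypothetical helper, the regex matches only the end_request message format quoted in this report, and 512-byte sectors are assumed.

```python
import re

# Matches lines like: "end_request: I/O error, dev sda, sector 304939862"
SECTOR_RE = re.compile(r"end_request: I/O error, dev (\w+), sector (\d+)")


def failing_sectors(dmesg_text):
    """Return sorted unique (device, sector) pairs found in dmesg output."""
    return sorted({(dev, int(sec)) for dev, sec in SECTOR_RE.findall(dmesg_text)})


# Line taken from the original description; the second sector is the one
# reported in this comment.
sample = "[ 1418.062550] end_request: I/O error, dev sda, sector 304939862\n"
first = failing_sectors(sample)[0][1]
second = 309667757

gap = second - first
print(gap, round(gap * 512 / 2**30, 2))  # 4727895 2.25
```

Under those assumptions the two failing sectors are roughly 2.25 GiB apart, so they are not the same spot on the disk.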

Comment 4 ell1e 2011-10-25 22:38:47 UTC
The moment I usually notice this bug has hit me again is if the system suddenly stalls/hangs completely for 10 seconds or so. I will then proceed to check dmesg to see those read errors again, and I assume now on next reboot and filesystem check I might also see some new corruption that has popped up. I will report later whether that was actually the case this time or not.

Comment 5 Josh Boyer 2012-06-06 19:03:54 UTC
Did this ever resolve itself?  Still seeing it with 2.6.43/3.3?

Comment 6 ell1e 2012-06-06 21:32:35 UTC
For me, it has again disappeared for many months. Also, the hard drive had a rather gross seek error count in SMART (although not many read errors), so I cannot rule out hardware issues.

Comment 7 Josh Boyer 2012-06-06 22:48:12 UTC
OK.  We'll close this out for now.  If you see it again, please reopen.