Bug 496675

Summary: Regression: SATA link freeze/reset
Product: [Fedora] Fedora
Reporter: Karl Pickett <karl.pickett>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Priority: low
Version: 11
CC: itamar, john.mora, johnparmitage, kernel-maint
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Last Closed: 2010-06-28 12:07:19 UTC Type: ---

Description Karl Pickett 2009-04-20 16:19:53 UTC
Description of problem:
Fedora 11 can't go a day without a SATA link freeze, after which / gets remounted read-only.  This machine ran for a year on Fedora 9/10 and never had this issue.

Version-Release number of selected component (if applicable):
2.6.29.1-70.fc11.x86_64


How reproducible:
Varies.  It happened once in the middle of a giant yum transaction, but it also seems to happen for no apparent reason.


Steps to Reproduce:
1. Boot F11 and do ordinary tasks (a yum upgrade, for example).

Actual results:
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: configured for UDMA/133
end_request: I/O error, dev sda, sector 212089703
Aborting journal on device sda4:8.
ata1: EH complete
sd 0:0:0:0: [sda] 312500000 512-byte hardware sectors: (160 GB/149 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ext4_abort called.
EXT4-fs error (device sda4): ext4_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
EXT4-fs error (device sda4) in ext4_reserve_inode_write: Journal has aborted
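
Note on reading the log above: in libata's "cmd" line, the first byte is the ATA command opcode, and ea is FLUSH CACHE EXT, so the command that timed out was a write-cache flush. A minimal sketch of that decoding follows. The function name and the opcode table are illustrative, and the table is deliberately partial rather than a full list of ATA commands.

```python
import re

# Partial, illustrative table of ATA command opcodes.
ATA_OPCODES = {
    0x25: "READ DMA EXT",
    0x35: "WRITE DMA EXT",
    0x60: "READ FPDMA QUEUED",
    0x61: "WRITE FPDMA QUEUED",
    0xE7: "FLUSH CACHE",
    0xEA: "FLUSH CACHE EXT",
    0xEC: "IDENTIFY DEVICE",
}

# Matches the device name and the opcode byte in a libata "cmd" line.
CMD_RE = re.compile(r"(ata\d+\.\d+): cmd ([0-9a-f]{2})/")


def failed_command(line):
    """Return (device, opcode, name) for a libata cmd line, or None."""
    match = CMD_RE.search(line)
    if match is None:
        return None
    opcode = int(match.group(2), 16)
    return match.group(1), opcode, ATA_OPCODES.get(opcode, "unknown")


line = "ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0"
print(failed_command(line))
```

A timed-out cache flush fits the barrier discussion in the comments, since write barriers are implemented with cache flushes.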



Expected results:
working disks should stay working.

Additional info:
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)

ata1.00: ATA-7: SAMSUNG HD160JJ/P, ZM100-34, max UDMA7
scsi 0:0:0:0: Direct-Access     ATA      SAMSUNG HD160JJ/ ZM10 PQ: 0 ANSI: 5

[karl@karl ~]$ free
             total       used       free     shared    buffers     cached
Mem:       3951944    3403136     548808          0     137340    2700088
-/+ buffers/cache:     565708    3386236
Swap:      2040244       2884    2037360

My / is on ext4; my /home is on ext3 and, thank god, it hasn't been remounted ro yet.

Comment 1 Karl Pickett 2009-04-20 16:21:16 UTC
By the way, this was a clean install of the F11 beta (x86-64) from the live CD; the /home partition was kept.  I also ran a yum upgrade last week, and that did not fix the problem.

Comment 2 Chuck Ebbert 2009-04-21 19:42:06 UTC
There are two possible workarounds for this:

1. Add 'libata.force=1.5Gbps' to the kernel boot options
2. Use the 'nobarrier' option when mounting the filesystem (but this will leave
   it more vulnerable to corruption on I/O errors).
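
To confirm that either workaround is actually in effect after a reboot, something like the following sketch could be used. It assumes the text of /proc/cmdline and /proc/mounts has already been read into strings; the function names and the sample data are hypothetical. It only reports options that are listed explicitly, so a missing barrier option means the filesystem's default applies.

```python
def libata_force_values(cmdline):
    """Return the values of any libata.force= options on a kernel command line."""
    prefix = "libata.force="
    return [tok[len(prefix):] for tok in cmdline.split() if tok.startswith(prefix)]


def barriers_disabled(mounts_text, mount_point):
    """Report whether the last mount on mount_point lists nobarrier or barrier=0."""
    disabled = False
    for entry in mounts_text.splitlines():
        fields = entry.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
            # The last matching entry wins, because it is the mount that is visible.
            disabled = "nobarrier" in options or "barrier=0" in options
    return disabled


# Hypothetical sample data; on a live system read /proc/cmdline and /proc/mounts.
sample_cmdline = "ro root=/dev/sda4 rhgb quiet libata.force=1.5Gbps"
sample_mounts = (
    "rootfs / rootfs rw 0 0\n"
    "/dev/sda4 / ext4 rw,barrier=0 0 0\n"
    "/dev/sda2 /home ext3 rw,data=ordered 0 0\n"
)

print(libata_force_values(sample_cmdline))
print(barriers_disabled(sample_mounts, "/"))
print(barriers_disabled(sample_mounts, "/home"))
```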

Also, what was the exact version of the last working F10/F9 kernel? Some people had this problem with Samsung disks on kernels after 2.6.27.5.

Comment 3 Karl Pickett 2009-04-21 20:19:31 UTC
Correction: the machine never ran F10, it was F9.  Unfortunately I do not know the kernel version; all I can say is that it was fairly up to date, as I wanted the latest stable kernel for KVM support.

This poor machine can't get a break.  I couldn't even install F9 on it directly due to the infamous Samsung bug.  I had to install F9 to a virtual image and then untar it over a different partition.

Comment 4 Karl Pickett 2009-04-21 20:22:13 UTC
"Fairly up to date" = the F9 i686 kernel from the updates repo.

Comment 5 Chuck Ebbert 2009-04-23 22:11:18 UTC
Do the workarounds fix the problem?

Comment 6 Karl Pickett 2009-04-29 13:30:02 UTC
1.5gbps does NOT fix the problem.

ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: hard resetting link
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
ata1.00: configured for UDMA/133
end_request: I/O error, dev sda, sector 212066543
ata1: EH complete
Aborting journal on device sda4:8.
sd 0:0:0:0: [sda] 312500000 512-byte hardware sectors: (160 GB/149 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ext4_abort called.
EXT4-fs error (device sda4): ext4_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
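
Note on the "SATA link up" lines: the SStatus and SControl values are printed in hex, and in both registers bits 7:4 hold the speed field. The sketch below decodes them under that layout. The helper names are hypothetical, and only the speed field is interpreted.

```python
# Speed field values in SStatus (negotiated) and SControl (highest allowed).
SPEEDS = {1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_fields(hex_text):
    """Split a SATA status/control register value into its low three nibbles."""
    value = int(hex_text, 16)
    return {
        "det": value & 0xF,         # bits 3:0, device detection
        "spd": (value >> 4) & 0xF,  # bits 7:4, speed
        "ipm": (value >> 8) & 0xF,  # bits 11:8, interface power management
    }


def negotiated_speed(sstatus_hex):
    """Link speed reported by SStatus."""
    return SPEEDS.get(decode_fields(sstatus_hex)["spd"], "unknown")


def speed_limit(scontrol_hex):
    """Highest speed allowed by SControl, or 'no limit' when the field is 0."""
    spd = decode_fields(scontrol_hex)["spd"]
    return "no limit" if spd == 0 else SPEEDS.get(spd, "unknown")


# Values from the original report and from this comment.
print(negotiated_speed("123"), speed_limit("300"))
print(negotiated_speed("113"), speed_limit("310"))
```

Read this way, SControl 310 shows that the libata.force=1.5Gbps limit was applied and the link came up at 1.5 Gbps, yet the same command still timed out.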


I think this is an ext4 issue.  This never happens on my ext3 home which houses all of my virtual machines, and I've been running those just as heavily as /.

Comment 7 Karl Pickett 2009-04-29 13:33:44 UTC
The kernel that last dump was from is
 2.6.29.1-70.fc11.x86_64 #1 SMP Mon Apr 13 14:16:25 EDT

Comment 8 Karl Pickett 2009-04-29 13:45:06 UTC
One more thing about that last dump.  As you can see, it remounted sda4, which is my /.  However, I wasn't even running a workload in /; I was running an I/O workload in a virtual machine on sda2, which is my ext3 home partition.  All of a sudden things in the VM "froze" for about 5 seconds.  I then got out of the VM and ran dmesg on the host, which also froze for a while.  Eventually things unfroze and I could get the dmesg error from the host.

By the way, my host /var/log/messages goes up to this line which I thought was interesting:
Apr 29 09:24:55 karl kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 29 09:24:55 karl kernel: ata1.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Apr 29 09:24:55 karl kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 29 09:24:55 karl kernel: ata1.00: status: { DRDY }
Apr 29 09:24:55 karl kernel: ata1: hard resetting link
Apr 29 09:24:55 karl kernel: ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
< no more messages, filesystem is read only >

Comment 9 Karl Pickett 2009-05-04 20:29:00 UTC
Since I am tired of this happening I downgraded my / to ext3 last week.  I will report back in a week on the status.  As a reminder, no ext3 partition has triggered this.  I also upgraded the kernel to 2.6.29.1-111.

Comment 10 Chuck Ebbert 2009-05-05 05:43:17 UTC
(In reply to comment #6)
> 1.5gbps does NOT fix the problem.
> 

What about mounting the ext4 partition with the nobarrier option?

Comment 11 Karl Pickett 2009-05-05 13:19:10 UTC
If nobarrier eats my data, it's a non-starter.

Comment 12 Chuck Ebbert 2009-05-07 03:54:18 UTC
(In reply to comment #9)
> Since I am tired of this happening I downgraded my / to ext3 last week.  I will
> report back in a week on the status.  As a reminder, no ext3 partition has
> triggered this.  I also upgraded the kernel to 2.6.29.1-111.  

ext3 is probably working because it doesn't use barriers by default.

Comment 13 Karl Pickett 2009-05-22 18:45:14 UTC
update: no problems running / on ext3.

Comment 14 Bug Zapper 2009-06-09 14:13:26 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 11 development cycle.
Changing version to '11'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 15 Bug Zapper 2010-04-27 13:49:06 UTC
This message is a reminder that Fedora 11 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 11.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '11'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 11's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 11 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 16 Bug Zapper 2010-06-28 12:07:19 UTC
Fedora 11 changed to end-of-life (EOL) status on 2010-06-25. Fedora 11 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.