Bug 1130737 - intermittent SATA disconnects when Samsung XP941 PCI-express SSD is installed [NEEDINFO]
Summary: intermittent SATA disconnects when Samsung XP941 PCI-express SSD is installed
Keywords:
Status: CLOSED EOL
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 21
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2014-08-17 08:59 UTC by Dan Callaghan
Modified: 2015-12-02 16:12 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-12-02 03:24:40 UTC
kernel-team: needinfo?


Attachments
kernel messages with 3.16.1 (112.36 KB, text/plain)
2014-08-17 09:05 UTC, Dan Callaghan
kernel messages with 3.15.8 (111.36 KB, text/plain)
2014-08-17 09:06 UTC, Dan Callaghan

Description Dan Callaghan 2014-08-17 08:59:08 UTC
Description of problem:
I am seeing SATA disconnect kernel messages like these intermittently (on average a few times per hour, sometimes just a few minutes apart):

Aug 17 10:22:54.157169 cinnamon.djc.id.au kernel: ata2: exception Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen
Aug 17 10:22:54.918160 cinnamon.djc.id.au kernel: ata2: irq_stat 0x00400040, connection status changed
Aug 17 10:22:54.918180 cinnamon.djc.id.au kernel: ata2: SError: { HostInt PHYRdyChg 10B8B DevExch }
Aug 17 10:22:54.918204 cinnamon.djc.id.au kernel: ata2: hard resetting link
Aug 17 10:22:54.918225 cinnamon.djc.id.au kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Aug 17 10:22:54.918244 cinnamon.djc.id.au kernel: ata2.00: configured for UDMA/100
Aug 17 10:22:54.918261 cinnamon.djc.id.au kernel: ata2: EH complete

There are no other symptoms when the disconnect happens. Even though there are mounted filesystems on the disk, the kernel seems to recover fine.
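For reference, the SErr value in the messages above can be decoded by hand. This is a minimal Python sketch, not part of the original report. The bit positions are an assumption based on the SATA SError register layout as libata appears to use it (worth checking against include/linux/libata.h), and only the four flags the kernel actually printed here are named. decode_serror is a hypothetical helper name.

```python
# Bit positions are an assumption (SATA SError register layout as used by
# libata); the names are the ones printed in the "SError: { ... }" log line.
SERROR_BITS = {
    11: "HostInt",     # host bus adapter internal error
    16: "PHYRdyChg",   # PHY ready state changed
    19: "10B8B",       # 10b-to-8b decode error on the link
    26: "DevExch",     # device exchanged (presence changed)
}


def decode_serror(value):
    """Return (known flag names, mask of set bits this table does not name)."""
    names = []
    unknown = 0
    for bit in range(32):
        if value & (1 << bit):
            if bit in SERROR_BITS:
                names.append(SERROR_BITS[bit])
            else:
                unknown |= 1 << bit
    return names, unknown


names, unknown = decode_serror(0x4090800)
print(names, hex(unknown))
# prints: ['HostInt', 'PHYRdyChg', '10B8B', 'DevExch'] 0x0
```

The decoded flags match the kernel's own SError line, which suggests the controller saw the link itself drop and come back rather than a failed command.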

I did some reading. There is a lot of info online suggesting that this is usually not a kernel problem, but rather the disk itself dropping the link intermittently due to a loose connection or insufficient power supply.

However, in my case I think this is actually a kernel bug, because:
* it happens reliably on 3.16 kernels but when I boot 3.15.8 I do not see any disconnects (I have run 3.15.8 for several days now)
* I have triple-checked the SATA link and power connections for the disk
* the hardware is all near-new
* it's a Thinkstation E32 in (almost) stock configuration, so there's nothing dodgy involved like old disks, cheap cables, under-specced PSU, or anything like that

Actually there is one non-standard piece of hardware in the system, a Samsung XP941 PCI-express SSD which appears as a SATA disk. However, that's probably not related to this problem (the SSD does not exhibit any SATA disconnects).

Version-Release number of selected component (if applicable):
observed on:
kernel-3.16.1-300.fc21.x86_64
kernel-3.16.0-1.fc21.x86_64
kernel-3.16.0-0.rc7.git4.1.fc21.x86_64
cannot reproduce on:
kernel-3.15.8-200.fc20.x86_64

How reproducible:
On my system, reliably reproducible given a few hours.

Steps to Reproduce:
1. Boot my system and let it run for a while.

Actual results:
SATA disconnect messages appear.

Expected results:
No SATA disconnects.

Additional info:
The motherboard chipset is Intel C226.

00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)

The disk is a Seagate Barracuda 2TB SATA3, model ST2000DM001.

Comment 1 Dan Callaghan 2014-08-17 09:05:41 UTC
Created attachment 927423
kernel messages with 3.16.1

Attaching complete kernel messages from 3.16.1. I only left it running for around an hour and there were 5 SATA disconnects.

Comment 2 Dan Callaghan 2014-08-17 09:06:44 UTC
Created attachment 927424
kernel messages with 3.15.8

For comparison, also attaching complete kernel messages from 3.15.8, which does not exhibit the SATA disconnects.

Comment 3 Dan Callaghan 2014-08-19 11:16:57 UTC
Actually I have seen a few disconnects on 3.15.8 now as well, although they seem to be a bit less frequent.

Comment 4 Daniel Rindt 2015-01-08 06:59:33 UTC
I have exactly the same on this hardware:
00:1f.2 RAID bus controller: Intel Corporation 82801 Mobile SATA Controller [RAID mode] (rev 04)

But it looks like this has no side effects. I have no damaged filesystems.

Comment 5 Dan Callaghan 2015-01-09 00:14:47 UTC
I discovered something interesting about this problem over the holidays...

I have seen the SATA disconnects with all kernel versions I've tried (3.15-3.17). In an effort to rule out hardware problems I swapped out the disk itself, the SATA cable, I swapped to a different SATA port on the motherboard, and I even ran the disk connected to a separate independent power supply to rule out power supply problems. In all cases the disconnects were still occurring. So if it was a hardware problem it must be the SATA controller on the motherboard itself.

In preparation for making a warranty claim I removed the Samsung XP941 PCI-express SSD which I had installed in the system after-market. But with the SSD removed, the disconnects are no longer occurring! The system has now been running for several weeks in that configuration.

So it seems that the presence of the PCI-express SSD, which appears as a regular SATA device, is somehow causing the *other* SATA controller on the motherboard to intermittently disconnect?

The disk is on ata2 (the onboard SATA controller), the SSD is ata6. Only ata2 experiences the disconnects, not ata6:

ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata2.00: ATA-9: ST2000DM001-1CH164, CC77, max UDMA/100
ata2.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
ata2.00: configured for UDMA/100
scsi 1:0:0:0: Direct-Access     ATA      ST2000DM001-1CH1 CC77 PQ: 0 ANSI: 5
sd 1:0:0:0: [sda] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
sd 1:0:0:0: [sda] 4096-byte physical blocks
sd 1:0:0:0: Attached scsi generic sg0 type 0
sd 1:0:0:0: [sda] Write Protect is off
sd 1:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 1:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata6.00: ATA-9: SAMSUNG MZHPU256HCGL-00004, UXM6501Q, max UDMA/133
ata6.00: 500118192 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
ata6.00: configured for UDMA/133
usb 1-1: new high-speed USB device number 2 using ehci-pci
 sda: sda1 sda2 sda3
sd 1:0:0:0: [sda] Attached SCSI disk
usb 1-1: New USB device found, idVendor=8087, idProduct=8008
usb 1-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
hub 1-1:1.0: USB hub found
hub 1-1:1.0: 6 ports detected
usb 2-1: new high-speed USB device number 2 using ehci-pci
ata4: SATA link down (SStatus 0 SControl 300)
scsi 5:0:0:0: Direct-Access     ATA      SAMSUNG MZHPU256 501Q PQ: 0 ANSI: 5
sd 5:0:0:0: [sdb] 500118192 512-byte logical blocks: (256 GB/238 GiB)
sd 5:0:0:0: [sdb] Write Protect is off
sd 5:0:0:0: [sdb] Mode Sense: 00 3a 00 00
sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 5:0:0:0: Attached scsi generic sg1 type 0
 sdb: sdb2 sdb3 sdb4
sd 5:0:0:0: [sdb] Attached SCSI disk
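
To check which ports are affected over a longer run, the exception messages can be tallied per port from saved kernel messages. This is a rough Python sketch, not something from the original report. It assumes the "ataN: exception Emask" line format seen in the messages in this bug, and count_exceptions is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the first line printed for a port error, e.g.
# "ata2: exception Emask 0x50 ..." (format assumed from the messages in this
# report). A device suffix such as "ata2.00" is folded into the port name.
EXCEPTION_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?: exception Emask")


def count_exceptions(lines):
    """Count exception events per ata port in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = EXCEPTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from this bug report.
sample = [
    "Aug 17 10:22:54.157169 cinnamon.djc.id.au kernel: ata2: exception "
    "Emask 0x50 SAct 0x0 SErr 0x4090800 action 0xe frozen",
    "Aug 17 10:22:54.918204 cinnamon.djc.id.au kernel: ata2: hard resetting link",
    "ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)",
]
print(dict(count_exceptions(sample)))
# prints: {'ata2': 1}
```

To run it against a real log, pass an open file object (for example, the saved output of journalctl -k) to count_exceptions in place of the sample list.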

Comment 6 Dan Callaghan 2015-01-09 00:16:11 UTC
(In reply to Daniel Rindt from comment #4)
> I have exactly the same on this hardware:
> 00:1f.2 RAID bus controller: Intel Corporation 82801 Mobile SATA Controller
> [RAID mode] (rev 04)

Daniel, do you also have a Samsung XP941, or some other PCI-express SSD in the system? Do you have multiple SATA controllers? If not, then I guess your problem might be unrelated.

Comment 7 Daniel Rindt 2015-01-09 08:06:32 UTC
(In reply to Dan Callaghan from comment #6)
> Daniel, do you also have a Samsung XP941, or some other PCI-express SSD in
> the system? Do you have multiple SATA controllers? If not, then I guess your
> problem might be unrelated.

I am not sure if it's unrelated; I just found this bug with the same symptoms. There are 2x SanDisk SD6SP1M128G1102 SSDs connected to the above-mentioned controller.

Comment 8 Justin M. Forbes 2015-01-27 14:58:33 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 21 kernel bugs.

Fedora 21 has now been rebased to 3.18.3-201.fc21.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 9 Dan Callaghan 2015-02-21 21:55:51 UTC
Still occurs with kernel-3.18.7-200.fc21.x86_64.

Comment 10 Dan Callaghan 2015-03-07 04:30:45 UTC
Are there any debug options I could try that might give some hints about what is going wrong here? Or a debug kernel maybe?

Comment 11 Fedora Kernel Team 2015-04-28 18:28:59 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 21 kernel bugs.

Fedora 21 has now been rebased to 3.19.5-200.fc21.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 22, and are still experiencing this issue, please change the version to Fedora 22.

If you experience different issues, please open a new bug report for those.

Comment 12 Fedora End Of Life 2015-11-04 12:28:00 UTC
This message is a reminder that Fedora 21 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 21. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '21'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 21 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 13 Fedora End Of Life 2015-12-02 03:24:45 UTC
Fedora 21 changed to end-of-life (EOL) status on 2015-12-01. Fedora 21 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

