Created attachment 1015925 [details]
Journal with full boot info showing recoverable timeouts

Running kernel-3.19.4-200.fc21.x86_64 on a machine with four 8TB Seagate archive drives (the new shingled ones), I find that I have occasional drive timeouts which cause disks to fail out of the configured arrays. Last night I found that all four disks in the machine failed out at essentially the same time. The arrays were fine on a reboot.

Under kernel-3.18.9-200.fc21.x86_64 the machine survives what is basically three continuous days of array resyncing concurrent with heavy writes (though performance for anything other than streaming writes is poor, as expected for these drives).

I found a similar report of this at http://serverfault.com/questions/682061/very-irregular-disk-write-performance-and-repeated-sata-timeouts

Looking in git I see there have been some changes in libata for shingled drives, but as far as I can tell this is just reporting their existence up the stack. There doesn't appear to be any significant change in the driver for the disk interface (isci).

lspci has:

  06:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit (rev 06)

dmesg shows:

  [ 2.004485] isci: Intel(R) C600 SAS Controller Driver - version 1.2.0
  [ 2.005093] isci 0000:06:00.0: driver configured for rev: 6 silicon
  [ 2.005691] isci 0000:06:00.0: OEM parameter table found in OROM
  [ 2.006386] isci 0000:06:00.0: OEM SAS parameters (version: 1.0) loaded (platform)
  [ 2.007691] isci 0000:06:00.0: SCU controller 0: phy 3-0 cables: {short, short, short, short}
  [ 2.011167] scsi host6: isci
  [ 2.011835] isci 0000:06:00.0: irq 29 for MSI/MSI-X
  [ 2.011842] isci 0000:06:00.0: irq 30 for MSI/MSI-X

The drives themselves are SATA:

  [ 4.559464] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
  [ 4.559484] sas: ata7: end_device-6:0: dev error handler
  [ 5.499472] ata7.00: ATA-9: ST8000AS0002-1NA17Z, AR13, max UDMA/133
  [ 5.500054] ata7.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32)
  [ 5.502113] ata7.00: configured for UDMA/133
  [ 5.502726] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
  [ 5.513315] scsi 6:0:0:0: Direct-Access ATA ST8000AS0002-1NA AR13 PQ: 0 ANSI: 5

I've attached journalctl output for a boot with 3.19.4. There are a couple of sets of timeouts, none of which actually kicked a disk out of the array (fortunately). The last timeout happened while I was installing the 3.18 kernel to try to get the machine stable again. I can provide a much larger journal from the boot where the disks all disappeared, but of course that journal isn't complete.

Note that I'm not able to run any 3.19 kernel earlier than .4 because of the issue where gssproxy won't start. Note also that I'm not certain that 3.18 avoids this problem, only that so far I've not seen it crop up and I stressed the drives pretty hard. (I swapped out four existing 4TB drives and grew the array, and then had to replace a drive which went bad, which meant five complete full-write resyncs that took about four continuous days to complete, plus the weekly full array check.)

Please let me know if there's any other information I can provide.
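For anyone hitting the same symptom, here is a rough sketch of a stop-gap I'd consider while the root cause is investigated; it is not a fix, and the device names, md device, and the 180-second value below are placeholders for illustration:

  # Drive-managed SMR disks can stall for a long time while rewriting shingled
  # zones, so raising the per-device SCSI command timeout may keep the error
  # handler from firing and kicking the disk out of the md array.
  # (sdb..sde are hypothetical names for the four archive drives.)
  for d in sdb sdc sdd sde; do
      cat /sys/block/$d/device/timeout            # default is usually 30 seconds
      echo 180 > /sys/block/$d/device/timeout     # 180s is an arbitrary, generous value
  done

  # If a member has already been failed out but the drive itself is healthy,
  # it can usually be re-added and resynced (md0/sdb1 are placeholders):
  mdadm --manage /dev/md0 --re-add /dev/sdb1
  cat /proc/mdstat                                # watch the recovery progress

The sysfs change does not persist across reboots, so it would need a udev rule or boot script to stick.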
I'm not aware of any in-kernel SMR handling yet. Most of the drives available, as far as I know, handle all the complications in their firmware. Let's see if the block gurus know of any major differences in how the kernel treats these drives.
All I could find was this, from the libata pull request from Tejun Heo:

  "The only interesting piece is the support for shingled drives. The changes in libata layer are minimal. All it does is identifying the new class of device and report upwards accordingly"

Which sure doesn't seem like it would have any effect. I have another drive on order which I can torture outside of a machine I'm trying to use.
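For reference, that claim can be checked directly against the kernel tree; something along these lines (the exact grep pattern is just a guess at the relevant identifiers) would confirm whether anything touched the controller driver between the two releases:

  # In a clone of the mainline kernel tree, list everything that touched the
  # isci driver and the libata core between v3.18 and v3.19.
  git log --oneline v3.18..v3.19 -- drivers/scsi/isci
  git log --oneline v3.18..v3.19 -- drivers/ata/libata-core.c drivers/ata/libata-scsi.c

  # The shingled-drive support should show up in drivers/ata only; filter the
  # diffs for likely keywords to find the patch adding the new device class.
  git log -p v3.18..v3.19 -- drivers/ata | grep -inE 'zac|shingle' | head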
Hi Jason, your best bet would be to post your dmesg with the failing disks to linux-ide.org.
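Something like the following would pull out the kernel messages from the boot where the disks dropped; the -b offset, filter pattern, and file names are only examples:

  # Kernel messages from the previous boot (adjust -b -1 to the boot in question),
  # filtered down to the ATA/SAS error handling and md lines of interest.
  journalctl -k -b -1 | grep -iE 'ata[0-9]|sas:|timeout|md/raid|isci' > dmesg-failure.txt

  # Current boot's raw kernel ring buffer, for comparison.
  dmesg > dmesg-current.txt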
This message is a reminder that Fedora 21 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 21. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '21'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 21 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 21 changed to end-of-life (EOL) status on 2015-12-01. Fedora 21 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.