Created attachment 1015925 [details]
Journal with full boot info showing recoverable timeouts

Running kernel-3.19.4-200.fc21.x86_64 on a machine with four 8TB Seagate archive drives (the new shingled ones), I find that I have occasional drive timeouts which cause disks to fail out of the configured arrays. Last night I found that all four disks in the machine failed out at essentially the same time. The arrays were fine on a reboot.

Under kernel-3.18.9-200.fc21.x86_64 the machine survives what is basically three continuous days of array resyncing concurrent with heavy writes (though performance for anything other than streaming writes is poor, as expected for these drives).

I found a similar report of this at http://serverfault.com/questions/682061/very-irregular-disk-write-performance-and-repeated-sata-timeouts

Looking in git I see there have been some changes in libata for shingled drives, but as far as I can tell this is just reporting their existence up the stack. There doesn't appear to be any significant change in the driver for the disk interface (isci).

lspci has:

  06:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit (rev 06)

dmesg shows:

  [ 2.004485] isci: Intel(R) C600 SAS Controller Driver - version 1.2.0
  [ 2.005093] isci 0000:06:00.0: driver configured for rev: 6 silicon
  [ 2.005691] isci 0000:06:00.0: OEM parameter table found in OROM
  [ 2.006386] isci 0000:06:00.0: OEM SAS parameters (version: 1.0) loaded (platform)
  [ 2.007691] isci 0000:06:00.0: SCU controller 0: phy 3-0 cables: {short, short, short, short}
  [ 2.011167] scsi host6: isci
  [ 2.011835] isci 0000:06:00.0: irq 29 for MSI/MSI-X
  [ 2.011842] isci 0000:06:00.0: irq 30 for MSI/MSI-X

The drives themselves are SATA:

  [ 4.559464] sas: Enter sas_scsi_recover_host busy: 0 failed: 0
  [ 4.559484] sas: ata7: end_device-6:0: dev error handler
  [ 5.499472] ata7.00: ATA-9: ST8000AS0002-1NA17Z, AR13, max UDMA/133
  [ 5.500054] ata7.00: 15628053168 sectors, multi 0: LBA48 NCQ (depth 31/32)
  [ 5.502113] ata7.00: configured for UDMA/133
  [ 5.502726] sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
  [ 5.513315] scsi 6:0:0:0: Direct-Access ATA ST8000AS0002-1NA AR13 PQ: 0 ANSI: 5

I've attached journalctl output for a boot with 3.19.4. There are a couple of sets of timeouts, none of which actually kicked a disk out of the array (fortunately). The last timeout happened while I was installing the 3.18 kernel to try to get the machine stable again. I can provide a much larger journal from the boot where the disks all disappeared, but of course that journal isn't complete.

Note that I'm not able to run any 3.19 kernel earlier than .4 because of the issue where gssproxy won't start. Note also that I'm not certain that 3.18 avoids this problem, only that so far I've not seen it crop up and I stressed the drives pretty hard. (I swapped out four existing 4TB drives and grew the array, and then had to replace a drive which went bad, which meant five complete full-write resyncs that took about four continuous days to complete, plus the weekly full array check.)

Please let me know if there's any other information I can provide.
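For anyone hitting the same symptom, here is a rough sketch of a stop-gap I'd consider while the root cause is investigated; it is not a fix, and the device names, md device, and the 180-second value below are placeholders for illustration:

  # Drive-managed SMR disks can stall for a long time while rewriting shingled
  # zones, so raising the per-device SCSI command timeout may keep the error
  # handler from firing and kicking the disk out of the md array.
  # (sdb..sde are hypothetical names for the four archive drives.)
  for d in sdb sdc sdd sde; do
      cat /sys/block/$d/device/timeout            # default is usually 30 seconds
      echo 180 > /sys/block/$d/device/timeout     # 180s is an arbitrary, generous value
  done

  # If a member has already been failed out but the drive itself is healthy,
  # it can usually be re-added and resynced (md0/sdb1 are placeholders):
  mdadm --manage /dev/md0 --re-add /dev/sdb1
  cat /proc/mdstat                                # watch the recovery progress

The sysfs change does not persist across reboots, so it would need a udev rule or boot script to stick.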
I'm not aware of any in-kernel SMR handling yet. Most of the drives available, as far as I know, handle all the complications in their firmware. Let's see if the block gurus know of any major differences in how the kernel treats these drives.
All I could find was this, from the libata pull request from Tejun Heo:

  "The only interesting piece is the support for shingled drives. The changes in libata layer are minimal. All it does is identifying the new class of device and report upwards accordingly"

Which sure doesn't seem like it would have any effect. I have another drive on order which I can torture outside of a machine I'm trying to use.
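For reference, that claim can be checked directly against the kernel tree; something along these lines (the exact grep pattern is just a guess at the relevant identifiers) would confirm whether anything touched the controller driver between the two releases:

  # In a clone of the mainline kernel tree, list everything that touched the
  # isci driver and the libata core between v3.18 and v3.19.
  git log --oneline v3.18..v3.19 -- drivers/scsi/isci
  git log --oneline v3.18..v3.19 -- drivers/ata/libata-core.c drivers/ata/libata-scsi.c

  # The shingled-drive support should show up in drivers/ata only; filter the
  # diffs for likely keywords to find the patch adding the new device class.
  git log -p v3.18..v3.19 -- drivers/ata | grep -inE 'zac|shingle' | head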
Hi Jason, your best bet would be to post your dmesg with the failing disks to linux-ide.org.
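Something like the following would pull out the kernel messages from the boot where the disks dropped; the -b offset, filter pattern, and file names are only examples:

  # Kernel messages from the previous boot (adjust -b -1 to the boot in question),
  # filtered down to the ATA/SAS error handling and md lines of interest.
  journalctl -k -b -1 | grep -iE 'ata[0-9]|sas:|timeout|md/raid|isci' > dmesg-failure.txt

  # Current boot's raw kernel ring buffer, for comparison.
  dmesg > dmesg-current.txt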
This message is a reminder that Fedora 21 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 21. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '21'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 21 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed, as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 21 changed to end-of-life (EOL) status on 2015-12-01. Fedora 21 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.