Created attachment 1436072 [details]
dmesg output for one instance of the problem

Description of problem:
I have a TerraMaster D5-300 array with a JMicron JMS567 controller and 5x3TB disks. One of the disks went bad, and when this occurred, I observed I/O errors on all disks, not just the one that went bad. I suspect (without having proven it) that the reset of the bad disk causes some in-flight I/Os for other disks to be lost or corrupted. Attachment 180512-dmesg.txt shows a relatively typical occurrence; my analysis is under "Additional info" below.

The same problem was seen initially on a BTRFS filesystem on hardware RAID, then partially reproduced on an ext4 filesystem on mdadm RAID (same kernel messages), and then on a freshly created ZFS filesystem (zpool status shown below).

Version-Release number of selected component (if applicable):
BOOT_IMAGE=//vmlinuz-4.16.6-302.fc28.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap nomodeset intel_iommu=on iommu=pt modprobe.blacklist=nouveau pci-stub.ids=01de:1430,10de:0fba LANG=en_US.UTF-8
zfs.x86_64 0.7.9-1.fc28 @zfs
Bus 003 Device 003: ID 152d:0567 JMicron Technology Corp. / JMicron USA Technology Corp. JMS567 SATA 6Gb/s bridge

How reproducible:
Always (under high I/O load)

Steps to Reproduce:
1. Configure a RAID filesystem across 5 disks behind the same JMicron controller, with one faulty disk
2. Put a relatively heavy I/O load on the system, in order to trigger an I/O error on the faulty disk
3. Look for dmesg messages containing "uas_eh_device_reset_handler start"
4. Check whether other disks under the same controller exhibit I/O errors

Actual results:
Disks under the same controller start exhibiting random I/O errors

Expected results:
I/O errors should be contained to the faulty disk

Additional info:

A. Analysis of events in 180512-dmesg.txt:

A.1.
An I/O error occurs on /dev/sdd (it was always the same disk that started the chain of events):

[88904.779385] sd 10:0:0:2: [sdd] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[88904.779824] sd 10:0:0:2: [sdd] tag#3 Sense Key : Medium Error [current]
[88904.780010] sd 10:0:0:2: [sdd] tag#3 Add. Sense: Unrecovered read error
[88904.780198] sd 10:0:0:2: [sdd] tag#3 CDB: Read(16) 88 00 00 00 00 00 36 bc b8 18 00 00 00 40 00 00
[88904.780536] print_req_error: critical medium error, dev sdd, sector 918337560

A.2. Relatively soon afterwards, but not immediately (7 seconds in that case), uas_eh_abort_handler fires, and then something that looks like a device reset occurs. That may be intentional (attempting to flush pending I/Os to other disks before the reset?):

[95483.699567] sd 10:0:0:3: [sde] tag#4 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD IN
[95483.699977] sd 10:0:0:3: [sde] tag#4 CDB: Read(16) 88 00 00 00 00 00 06 e8 27 58 00 00 00 80 00 00
[95483.700340] sd 10:0:0:0: [sdb] tag#8 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD IN
[95483.700650] sd 10:0:0:0: [sdb] tag#8 CDB: Read(16) 88 00 00 00 00 01 12 28 bc 28 00 00 00 40 00 00

A.3.
Then there is a sequence of resets:

[95514.951410] scsi host10: uas_eh_device_reset_handler start
[95515.066668] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.080167] scsi host10: uas_eh_device_reset_handler success
[95515.080536] scsi host10: uas_eh_device_reset_handler start
[95515.194651] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.208118] scsi host10: uas_eh_device_reset_handler success
[95515.208584] scsi host10: uas_eh_device_reset_handler start
[95515.322656] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.336089] scsi host10: uas_eh_device_reset_handler success
[95515.339201] scsi host10: uas_eh_device_reset_handler start
[95515.453667] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.467145] scsi host10: uas_eh_device_reset_handler success
[95515.467635] scsi host10: uas_eh_device_reset_handler start
[95515.581664] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.595155] scsi host10: uas_eh_device_reset_handler success

A.4. Other things happen and /dev/sdd keeps issuing errors, but what is annoying is that after a while, other disks exhibit faults as well:

[100903.405446] sd 10:0:0:4: [sdf] tag#6 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[100903.405741] print_req_error: I/O error, dev sdf, sector 0
[100903.405958] sd 10:0:0:4: [sdf] tag#7 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[100903.406244] sd 10:0:0:4: [sdf] tag#7 CDB: Write(16) 8a 00 00 00 00 00 d6 66 af 50 00 00 00 40 00 00
[100903.406507] print_req_error: I/O error, dev sdf, sector 3597053776

B. Analysis of zpool output

From the outside, here is what is observed using ZFS zpool status (it's not the exact same occurrence, if memory serves me right):

B.1. Initially, ZFS sees only one read error on /dev/sdd, but reports checksum errors on other devices:

  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.
        An attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     2
            sdd     ONLINE       1     0     1
            sde     ONLINE       0     0     1
            sdf     ONLINE       0     0     1

errors: No known data errors

B.2. Relatively quickly, things degrade badly, and all drives start exhibiting numerous I/O errors, both read and write:

  pool: data
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE      12    32     0
          raidz1-0  ONLINE      32    65     0
            sdb     ONLINE      10    49     0
            sdc     ONLINE      39    45     2
            sdd     ONLINE      13    31     1
            sde     ONLINE      36    15     1
            sdf     ONLINE      32    35     1

errors: List of errors unavailable: pool I/O is currently suspended
errors: 30 data errors, use '-v' for a list

C. Other comments

C.1. The faulty disk was replaced, and the system ran multiple VMs and data-restoration tasks over the weekend without any I/O error. That, coupled with the fact that the sequence always started with the same disk (although sometimes named /dev/sdd, sometimes /dev/sde, etc.), is what leads me to believe that the subsequent I/O errors are spurious, possibly a consequence of the UAS abort or reset handler, and that there is possibly a kernel bug introducing these I/O errors.

C.2. The system was initially configured using hardware RAID, and the faulty disk showed up "red" on the enclosure front panel.

C.3. Some additional data and context may be found at http://blackbox.dinechin.org/180509.html, http://blackbox.dinechin.org/180511.html and http://blackbox.dinechin.org/180512.html. In particular, 180509.html contains excerpts of the smartctl output for the bad disk.
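Steps 3 and 4 of the reproduction can be checked mechanically. Below is a minimal sketch (the helper names and the sampling approach are mine, not part of any tool mentioned above) that scans dmesg-style text for the reset-handler message and tallies print_req_error lines per block device; the sample lines are copied verbatim from the excerpts quoted above:

```python
import re
from collections import Counter

# Sample lines copied verbatim from the dmesg excerpts quoted above.
SAMPLE = """\
[88904.780536] print_req_error: critical medium error, dev sdd, sector 918337560
[95514.951410] scsi host10: uas_eh_device_reset_handler start
[100903.405741] print_req_error: I/O error, dev sdf, sector 0
[100903.406507] print_req_error: I/O error, dev sdf, sector 3597053776
"""

def reset_count(log):
    # Step 3: count "uas_eh_device_reset_handler start" messages.
    return log.count("uas_eh_device_reset_handler start")

def errors_by_device(log):
    # Step 4: tally print_req_error lines per block device.
    return Counter(m.group(1) for m in
                   re.finditer(r"print_req_error: .*, dev (sd\w+), sector", log))

if __name__ == "__main__":
    print(reset_count(SAMPLE))             # 1
    print(dict(errors_by_device(SAMPLE)))  # {'sdd': 1, 'sdf': 2}
```

On a live system one would feed it the full output of dmesg instead of the sample string; errors spread across more than one device after a reset is exactly the symptom reported here.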
Re-reading, I noticed a typo: it's 7000 seconds, not 7 seconds (numlock off, I guess).
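The corrected figure can be read directly off the kernel log timestamps (seconds since boot) quoted in A.1 and A.2, which is consistent with "roughly 7000 seconds":

```python
# Kernel log timestamps (seconds since boot) from the excerpts in A.1 and A.2.
medium_error = 88904.779385   # first Medium Error on /dev/sdd
abort_handler = 95483.699567  # first uas_eh_abort_handler message

delta = abort_handler - medium_error
print(f"{delta:.1f} s (~{delta / 3600:.1f} h)")  # 6578.9 s (~1.8 h)
```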
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.17.7-200.fc28. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE **************

This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.