Bug 1577836

Summary: JMicron JMS567 - I/O error on one disk causes I/O errors on other disks in array (possible data corruption)
Product: Fedora
Reporter: Christophe de Dinechin <dinechin>
Component: kernel
Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: 28
CC: airlied, bskeggs, ewk, hdegoede, ichavero, itamar, jarodwilson, jglisse, john.j5live, jonathan, josef, kernel-maint, linville, mchehab, mjg59, steved
Target Milestone: ---
Flags: jforbes: needinfo?
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Story Points: ---
Last Closed: 2018-08-29 15:14:44 UTC
Type: Bug
Attachments: dmesg output for one instance of the problem

Description Christophe de Dinechin 2018-05-14 08:46:34 UTC
Created attachment 1436072 [details]
dmesg output for one instance of the problem

Description of problem:

I have a TerraMaster D5-300 array with a JMicron JMS567 controller and 5x3TB disks. One of the disks went bad, and when this occurred, I observed I/O errors on all disks, not just the one that failed. I suspect (without having proven it) that the reset of the bad disk causes some in-flight I/Os for other disks to be lost or corrupted.

Attachment 180512-dmesg.txt shows a relatively typical occurrence. The same problem was seen initially on a BTRFS filesystem on hardware RAID, then partially reproduced on an ext4 filesystem with mdadm RAID (the same kernel messages appeared), and then on a freshly created ZFS filesystem (zpool status shown below). My analysis is under "Additional info" below.


Version-Release number of selected component (if applicable):

BOOT_IMAGE=//vmlinuz-4.16.6-302.fc28.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap nomodeset intel_iommu=on iommu=pt modprobe.blacklist=nouveau pci-stub.ids=01de:1430,10de:0fba LANG=en_US.UTF-8

zfs.x86_64                                            0.7.9-1.fc28                                            @zfs

Bus 003 Device 003: ID 152d:0567 JMicron Technology Corp. / JMicron USA Technology Corp. JMS567 SATA 6Gb/s bridge


How reproducible: Always (under high I/O load)


Steps to Reproduce:
1. Configure a RAID filesystem across 5 disks behind the same JMicron controller, with one faulty disk
2. Put some relatively heavy I/O load on the system, in order to trigger an I/O error on the faulty disk
3. Look for dmesg messages containing "uas_eh_device_reset_handler start"
4. Check if other disks under the same controller exhibit I/O errors
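
Steps 3 and 4 can be scripted. The fragment below is a minimal sketch that runs the same filters over a canned log excerpt (a stand-in for a live `dmesg` capture), counting reset-handler invocations and listing every disk that reported an error:

```shell
# Stand-in for a live capture; on a real system, use: dmesg > log.txt
log='[95514.951410] scsi host10: uas_eh_device_reset_handler start
[95515.080167] scsi host10: uas_eh_device_reset_handler success
[88904.780536] print_req_error: critical medium error, dev sdd, sector 918337560
[100903.405741] print_req_error: I/O error, dev sdf, sector 0
[100903.406507] print_req_error: I/O error, dev sdf, sector 3597053776'

# Step 3: how many times did the UAS device reset handler start?
resets=$(printf '%s\n' "$log" | grep -c 'uas_eh_device_reset_handler start')

# Step 4: which devices reported I/O or medium errors?
disks=$(printf '%s\n' "$log" | sed -n 's/.*error, dev \(sd[a-z]*\),.*/\1/p' | sort -u | xargs)

echo "resets=$resets disks=$disks"
```

If the bug is present, the device list will contain more than just the faulty disk.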

Actual results:

Disks placed under the same controller start exhibiting random I/O errors


Expected results:

I/O errors should be contained to the faulty disk


Additional info:

A. Analysis of events in 180512-dmesg.txt:

A.1. An I/O error occurs on /dev/sdd (it was always the same disk that started the chain of events).

[88904.779385] sd 10:0:0:2: [sdd] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[88904.779824] sd 10:0:0:2: [sdd] tag#3 Sense Key : Medium Error [current]
[88904.780010] sd 10:0:0:2: [sdd] tag#3 Add. Sense: Unrecovered read error
[88904.780198] sd 10:0:0:2: [sdd] tag#3 CDB: Read(16) 88 00 00 00 00 00 36 bc b8 18 00 00 00 40 00 00
[88904.780536] print_req_error: critical medium error, dev sdd, sector 918337560

A.2. Relatively shortly afterwards, but not immediately (7 seconds in that case), uas_eh_abort_handler fires, and then something that looks like a device reset occurs. That may be intentional (an attempt to flush pending I/Os to other disks before the reset?)

[95483.699567] sd 10:0:0:3: [sde] tag#4 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD IN
[95483.699977] sd 10:0:0:3: [sde] tag#4 CDB: Read(16) 88 00 00 00 00 00 06 e8 27 58 00 00 00 80 00 00
[95483.700340] sd 10:0:0:0: [sdb] tag#8 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD IN
[95483.700650] sd 10:0:0:0: [sdb] tag#8 CDB: Read(16) 88 00 00 00 00 01 12 28 bc 28 00 00 00 40 00 00

A.3. Then there is a sequence of resets:

[95514.951410] scsi host10: uas_eh_device_reset_handler start
[95515.066668] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.080167] scsi host10: uas_eh_device_reset_handler success
[95515.080536] scsi host10: uas_eh_device_reset_handler start
[95515.194651] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.208118] scsi host10: uas_eh_device_reset_handler success
[95515.208584] scsi host10: uas_eh_device_reset_handler start
[95515.322656] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.336089] scsi host10: uas_eh_device_reset_handler success
[95515.339201] scsi host10: uas_eh_device_reset_handler start
[95515.453667] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.467145] scsi host10: uas_eh_device_reset_handler success
[95515.467635] scsi host10: uas_eh_device_reset_handler start
[95515.581664] usb 4-6: reset SuperSpeed USB device number 2 using xhci_hcd
[95515.595155] scsi host10: uas_eh_device_reset_handler success

A.4. Other activity follows and /dev/sdd keeps reporting errors, but what is troubling is that after a while, other disks exhibit faults as well:

[100903.405446] sd 10:0:0:4: [sdf] tag#6 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[100903.405741] print_req_error: I/O error, dev sdf, sector 0
[100903.405958] sd 10:0:0:4: [sdf] tag#7 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[100903.406244] sd 10:0:0:4: [sdf] tag#7 CDB: Write(16) 8a 00 00 00 00 00 d6 66 af 50 00 00 00 40 00 00
[100903.406507] print_req_error: I/O error, dev sdf, sector 3597053776


B. Analysis of zpool output

From the outside, here is what is observed using ZFS zpool status (it is not the exact same occurrence, if memory serves):

B.1. Initially, ZFS sees only one read error on /dev/sdd, but reports checksum errors on other devices:

  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

	NAME        STATE     READ WRITE CKSUM
	data        ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    sdb     ONLINE       0     0     0
	    sdc     ONLINE       0     0     2
	    sdd     ONLINE       1     0     1
	    sde     ONLINE       0     0     1
	    sdf     ONLINE       0     0     1

errors: No known data errors
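
As an aside, the per-device CKSUM column from a status table like the one above can be extracted mechanically; this is a small illustrative awk filter over the device rows from B.1 (copied by hand, not parsed from a live pool):

```shell
# Device rows copied from the zpool status output in B.1.
status='sdb ONLINE 0 0 0
sdc ONLINE 0 0 2
sdd ONLINE 1 0 1
sde ONLINE 0 0 1
sdf ONLINE 0 0 1'

# Print devices whose CKSUM column (field 5) is nonzero.
bad=$(printf '%s\n' "$status" | awk '$5 > 0 {print $1}' | xargs)
echo "checksum errors on: $bad"
```

Every device except sdb shows checksum errors, even though only sdd had a read error.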


B.2. Relatively quickly, things degrade badly, and all drives start exhibiting numerous I/O errors, both read and write:

  pool: data
 state: ONLINE
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: none requested
config:


	NAME        STATE     READ WRITE CKSUM
	data        ONLINE      12    32     0
	  raidz1-0  ONLINE      32    65     0
	    sdb     ONLINE      10    49     0
	    sdc     ONLINE      39    45     2
	    sdd     ONLINE      13    31     1
	    sde     ONLINE      36    15     1
	    sdf     ONLINE      32    35     1
errors: List of errors unavailable: pool I/O is currently suspended



errors: 30 data errors, use '-v' for a list


C. Other comments

C.1. The faulty disk was replaced, and the system ran multiple VMs and data restoration tasks over the weekend without any I/O error. That, coupled with the fact that the sequence always started with the same physical disk (although sometimes named /dev/sdd, sometimes /dev/sde, etc.), is what leads me to believe that the subsequent I/O errors are spurious, possibly a consequence of the UAS abort or reset handler, and that a kernel bug may be introducing these I/O errors.

C.2. The system was initially configured using hardware RAID, and the faulty disk showed up "red" on the enclosure front panel.

C.3. Some additional data and context may be found on http://blackbox.dinechin.org/180509.html, http://blackbox.dinechin.org/180511.html and http://blackbox.dinechin.org/180512.html. In particular, 180509.html contains excerpts of the smartctl output for the bad disk.
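
C.4. If the hypothesis in C.1 is right and the uas abort/reset path is implicated, one way to test it (a sketch, not something I have run on this enclosure) would be to force the bridge onto the older usb-storage driver via the kernel's quirks mechanism; 152d:0567 is the VID:PID from the lsusb line above, and the "u" flag means "ignore UAS for this device":

```shell
# Persistent: blacklist UAS for the JMS567 bridge (152d:0567) so the
# usb-storage driver claims it instead, then rebuild the initramfs.
echo 'options usb-storage quirks=152d:0567:u' > /etc/modprobe.d/jms567-no-uas.conf
dracut -f

# One-off alternative: append this to the kernel command line instead:
#   usb-storage.quirks=152d:0567:u
```

If the spurious errors disappear under usb-storage (at some cost in throughput), that would point squarely at the uas error-handling path.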

Comment 1 Christophe de Dinechin 2018-05-14 10:00:33 UTC
Re-reading, I noticed a typo: it's 7000 seconds, not 7 seconds (NumLock off, I guess).

Comment 2 Justin M. Forbes 2018-07-23 15:09:53 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.17.7-200.fc28. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 3 Justin M. Forbes 2018-08-29 15:14:44 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.