Red Hat Bugzilla – Bug 977899
dmraid failures with kernels newer than 3.9.2-200
Last modified: 2013-07-24 14:53:47 EDT
Description of problem:
Kernels newer than 3.9.2-200 (the latest officially released kernel that doesn't have the problem) trigger RAID failures and effectively crash my system.
The system is configured with a single RAID10 dmraid across two disks. Whenever a newer kernel is booted, my filesystem fails shortly after I log in to the desktop (it seems to be triggered by heavy disk activity), causing random problems (desktop environment crashes, failures to read or write files).
Unfortunately, since all my filesystems are on this single RAID10, I can't even dump the dmesg output.
In this particular case, I was running kernel-3.9.6-200.fc18.x86_64. The machine rebooted over the weekend due to a power outage, and the crash didn't occur until I logged in after coming back to work.
The dmesg output shows:
md/raid10:md1: Disk failure on sdb3, disabling device.
md/raid10:md1: Operation continuing on 1 devices.
md/raid10:md1: Disk failure on sdc3, disabling device.
md/raid10:md1: Operation continuing on 0 devices.
dm-0: WRITE SAME failed. Manually zeroing.
This is followed by many EXT4-fs I/O errors, which makes sense if the RAID fell out from under the filesystem.
There is no evidence in the dmesg output of the disks actually failing: usually I'd see SCSI/SATA errors when that's the case. Reverting to kernel-3.9.2-200 prevents the issue from occurring again.
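For anyone comparing logs from different kernels, the failure sequence above can be summarized mechanically. The following is only a sketch: the regular expressions are written against the exact message wording quoted in this report, and the function name `summarize_md_failures` and the `SAMPLE` list are made up for illustration. Other kernel versions may word these messages differently.

```python
import re

# Patterns modelled on the md/raid10 lines quoted in this report (an assumption:
# other kernels may format these messages differently).
FAILURE_RE = re.compile(
    r"md/raid\d+:(?P<array>\w+): Disk failure on (?P<member>\w+), disabling device\."
)
REMAINING_RE = re.compile(
    r"md/raid\d+:(?P<array>\w+): Operation continuing on (?P<count>\d+) devices\."
)


def summarize_md_failures(lines):
    """Return {array: {"failed": [members], "remaining": last reported count}}."""
    summary = {}
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            entry = summary.setdefault(
                match.group("array"), {"failed": [], "remaining": None}
            )
            entry["failed"].append(match.group("member"))
            continue
        match = REMAINING_RE.search(line)
        if match:
            entry = summary.setdefault(
                match.group("array"), {"failed": [], "remaining": None}
            )
            # Later messages overwrite earlier ones, so this ends up holding
            # the most recent device count the kernel reported.
            entry["remaining"] = int(match.group("count"))
    return summary


# The lines from this report, copied verbatim.
SAMPLE = [
    "md/raid10:md1: Disk failure on sdb3, disabling device.",
    "md/raid10:md1: Operation continuing on 1 devices.",
    "md/raid10:md1: Disk failure on sdc3, disabling device.",
    "md/raid10:md1: Operation continuing on 0 devices.",
    "dm-0: WRITE SAME failed. Manually zeroing.",
]

if __name__ == "__main__":
    for array, info in summarize_md_failures(SAMPLE).items():
        print(array, "failed:", info["failed"], "remaining:", info["remaining"])
```

Both members being dropped within the same burst of messages, with no preceding SCSI/SATA errors, is the pattern that distinguishes this from an ordinary single-disk failure.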
Created attachment 765135
picture of dmesg output taken
Since I can't pipe dmesg anywhere persistent (or mount anything, since the filesystem no longer exists), I took a picture with my phone.
The problem seems specific to the LSI 1068e SAS/SATA controller on a Dell Precision T7400 workstation.
If I move the drives to the onboard Intel SATA controller, the issue goes away.
There were some commits missing for WRITE SAME handling in the earlier 3.9.y kernels. Does 3.9.8 still show this issue?
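When retesting, it may help to record what each block device advertises for WRITE SAME on both the working and the failing kernel. The sketch below assumes the kernel exposes a `queue/write_same_max_bytes` attribute under `/sys/block/<device>/`, which as far as I know is where the block layer reports this limit. A value of 0 would mean the device does not advertise WRITE SAME. The function name `write_same_limits` is hypothetical, and devices without the attribute are simply skipped.

```python
from pathlib import Path


def write_same_limits(sysfs_block="/sys/block"):
    """Map device name -> write_same_max_bytes, or None if it cannot be read.

    Assumes the sysfs layout <sysfs_block>/<device>/queue/write_same_max_bytes.
    Devices that lack the attribute are left out of the result.
    """
    root = Path(sysfs_block)
    if not root.is_dir():
        return {}
    limits = {}
    for device_dir in sorted(root.iterdir()):
        attr = device_dir / "queue" / "write_same_max_bytes"
        if not attr.is_file():
            continue
        try:
            limits[device_dir.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Unreadable or non-numeric content: record that we could not tell.
            limits[device_dir.name] = None
    return limits


if __name__ == "__main__":
    for device, limit in write_same_limits().items():
        print(f"{device}: write_same_max_bytes={limit}")
```

Comparing this output for the member disks behind the LSI 1068e and behind the onboard Intel controller would show whether the two paths advertise different WRITE SAME limits.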
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.