Bug 977899 - dmraid failures with kernels newer than 3.9.2-200
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 18
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: fedora-kernel-raid
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2013-06-25 14:39 UTC by Andy Wang
Modified: 2013-07-24 18:53 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2013-07-24 18:53:47 UTC
Type: Bug
Embargoed:


Attachments
picture of dmesg output taken (3.63 MB, image/jpeg)
2013-06-25 15:18 UTC, Andy Wang
no flags

Description Andy Wang 2013-06-25 14:39:44 UTC
Description of problem:
Kernels newer than 3.9.2-200 (the latest officially released kernel that doesn't have the problem) trigger RAID failures and effectively crash my system.

The system is configured with a single RAID10 dmraid across two disks. Whenever a newer kernel is booted, my filesystem fails shortly after I log in to the desktop (it seems to be triggered by heavy disk activity), causing random problems (desktop environment crashes, failures to read and write files).

Unfortunately, since all my filesystems are on this single RAID10, I can't even dump the dmesg output.

In this particular case, I was running kernel-3.9.6-200.fc18.x86_64. The machine rebooted over the weekend due to a power outage, and the crash didn't occur until I logged in after coming back to work.

The dmesg output shows:
md/raid10:md1: Disk failure on sdb3, disabling device.
md/raid10:md1: Operation continuing on 1 devices.
md/raid10:md1: Disk failure on sdc3, disabling device.
md/raid10:md1: Operation continuing on 0 devices.
dm-0: WRITE SAME failed. Manually zeroing.

This is followed by a large number of EXT4-fs I/O errors, which makes sense if the RAID fell out from under the filesystem.

There is no evidence in the dmesg output of the disks actually failing; usually I'd see SCSI/SATA errors when that's the case. Reverting to kernel-3.9.2-200 prevents the issue from occurring again.
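
When dmesg can only be captured as a photo or a retyped fragment, the md lines quoted above can still be summarized mechanically. The following is a minimal Python sketch, assuming the message wording matches the lines quoted in this report. The function name `summarize_md_failures` and the `SAMPLE` text are illustrative and not part of any kernel or mdadm tooling.

```python
import re

# Patterns mirror the md/raid10 messages quoted in this report (an assumption:
# other kernel versions may word these messages differently).
FAIL_RE = re.compile(r"md/(raid\d+):(md\d+): Disk failure on (\S+), disabling device\.")
CONT_RE = re.compile(r"md/(raid\d+):(md\d+): Operation continuing on (\d+) devices\.")


def summarize_md_failures(log_text):
    """Return {array: {"failed": [members], "remaining": int or None}}."""
    arrays = {}
    for line in log_text.splitlines():
        m = FAIL_RE.search(line)
        if m:
            entry = arrays.setdefault(m.group(2), {"failed": [], "remaining": None})
            entry["failed"].append(m.group(3))
            continue
        m = CONT_RE.search(line)
        if m:
            entry = arrays.setdefault(m.group(2), {"failed": [], "remaining": None})
            # Keep the most recent count reported for this array.
            entry["remaining"] = int(m.group(3))
    return arrays


# Retyped from the dmesg lines in the description.
SAMPLE = """\
md/raid10:md1: Disk failure on sdb3, disabling device.
md/raid10:md1: Operation continuing on 1 devices.
md/raid10:md1: Disk failure on sdc3, disabling device.
md/raid10:md1: Operation continuing on 0 devices.
dm-0: WRITE SAME failed. Manually zeroing.
"""

if __name__ == "__main__":
    for name, info in summarize_md_failures(SAMPLE).items():
        print(name, "failed:", ", ".join(info["failed"]), "remaining:", info["remaining"])
```

An array whose remaining count reaches 0 with no accompanying SCSI/SATA errors in the same log is the pattern described in this report.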

Comment 1 Andy Wang 2013-06-25 15:18:49 UTC
Created attachment 765135
picture of dmesg output taken

Since I can't actually pipe dmesg anywhere persistent (or mount anything, since the filesystem no longer exists), I took a picture with my phone.

Comment 2 Andy Wang 2013-06-28 00:43:35 UTC
The problem seems specific to the LSI 1068e SAS/SATA controller on a Dell Precision T7400 workstation.

If I move the drives to the onboard Intel SATA controller, the issue goes away.

Comment 3 Josh Boyer 2013-07-01 17:25:38 UTC
There were some commits missing for WRITE SAME handling in the earlier 3.9.y kernels.  Does 3.9.8 still show this issue?
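
To see whether a given block device advertises WRITE SAME at all, the block layer's `write_same_max_bytes` queue attribute in sysfs can be read; a value of 0 means the device does not support it. Below is a minimal Python sketch, assuming the attribute is exposed under /sys/block/<dev>/queue/ on the running kernel. The helper name `write_same_limits` is hypothetical.

```python
import glob
import os


def write_same_limits(sys_block="/sys/block"):
    """Map block device names to their write_same_max_bytes value.

    Devices that lack the attribute, or whose value cannot be read or
    parsed, are skipped.
    """
    limits = {}
    pattern = os.path.join(sys_block, "*", "queue", "write_same_max_bytes")
    for path in sorted(glob.glob(pattern)):
        # Path layout: <sys_block>/<dev>/queue/write_same_max_bytes
        dev = path.split(os.sep)[-3]
        try:
            with open(path) as fh:
                limits[dev] = int(fh.read().strip())
        except (OSError, ValueError):
            continue
    return limits


if __name__ == "__main__":
    for dev, limit in write_same_limits().items():
        state = "advertised" if limit > 0 else "not advertised"
        print(f"{dev}: WRITE SAME {state} (write_same_max_bytes={limit})")
```

Comparing the output between a working and a failing kernel, for both the member disks and the md and dm devices stacked on top of them, would show whether WRITE SAME is being advertised where the controller cannot honor it.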

Comment 4 Josh Boyer 2013-07-24 18:53:47 UTC
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

