Bug 977899 - dmraid failures with kernels newer than 3.9.2-200
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 18
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: fedora-kernel-raid
QA Contact: Fedora Extras Quality Assurance

Reported: 2013-06-25 10:39 EDT by Andy Wang
Modified: 2013-07-24 14:53 EDT

Doc Type: Bug Fix
Last Closed: 2013-07-24 14:53:47 EDT
Type: Bug


Attachments
picture of dmesg output taken (3.63 MB, image/jpeg)
2013-06-25 11:18 EDT, Andy Wang
Description Andy Wang 2013-06-25 10:39:44 EDT
Description of problem:
Kernels newer than 3.9.2-200 (the latest officially released kernel that doesn't have the problem) trigger RAID failures and effectively crash my system.

The system is configured with a single RAID10 dmraid across two disks. Whenever a newer kernel is booted, my filesystem fails shortly after I log in to the desktop (it seems to be triggered by heavy disk activity), causing random problems: desktop environment crashes and failures to read or write files.

Unfortunately, since all my filesystems are on this single RAID10, I can't even dump the dmesg output.

In this particular case, I was running kernel-3.9.6-200.fc18.x86_64. The machine rebooted over the weekend due to a power outage, and the crash didn't occur until I logged in after coming back to work.

The dmesg output shows:
md/raid10:md1: Disk failure on sdb3, disabling device.
md/raid10:md1: Operation continuing on 1 devices.
md/raid10:md1: Disk failure on sdc3, disabling device.
md/raid10:md1: Operation continuing on 0 devices.
dm-0: WRITE SAME failed. Manually zeroing.

This is followed by a large number of EXT4-fs I/O errors, which makes sense if the RAID fell out from under the filesystem.

There is no evidence in the dmesg output of the disks actually failing; usually I'd see SCSI/SATA errors when that's the case. Reverting to kernel-3.9.2-200 prevents the issue from occurring again.
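
The reasoning above (md disabling both members without any preceding low-level disk errors) can be checked mechanically once a kernel log is available as plain text. The following is only a minimal Python sketch under that assumption. The function name `summarize_md_failures` is hypothetical, and the patterns used to recognise low-level disk errors are an illustrative, non-exhaustive guess at typical messages, not a complete list.

```python
import re

# Matches the md messages quoted in this report, e.g.
# "md/raid10:md1: Disk failure on sdb3, disabling device."
MD_FAILURE = re.compile(
    r"md/(raid\d+):(md\d+): Disk failure on (\w+), disabling device"
)

# Assumed, non-exhaustive patterns for low-level SCSI/ATA errors.
LOW_LEVEL_ERROR = re.compile(
    r"end_request: I/O error|Medium Error|ata\d+(\.\d+)?: (exception|failed command)"
)

# The log lines quoted in the description above.
SAMPLE = """\
md/raid10:md1: Disk failure on sdb3, disabling device.
md/raid10:md1: Operation continuing on 1 devices.
md/raid10:md1: Disk failure on sdc3, disabling device.
md/raid10:md1: Operation continuing on 0 devices.
dm-0: WRITE SAME failed. Manually zeroing.
"""


def summarize_md_failures(log_text):
    """Return (failed_members, low_level_error_lines) found in log_text.

    failed_members is a list of (array, member) tuples in log order.
    """
    failed = []
    low_level = []
    for line in log_text.splitlines():
        match = MD_FAILURE.search(line)
        if match:
            failed.append((match.group(2), match.group(3)))
        elif LOW_LEVEL_ERROR.search(line):
            low_level.append(line.strip())
    return failed, low_level


if __name__ == "__main__":
    members, errors = summarize_md_failures(SAMPLE)
    print("members marked failed:", members)
    print("low-level error lines:", errors)
```

If members are reported as failed while the list of low-level error lines is empty, that is consistent with the observation that the disks themselves do not appear to be failing.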
Comment 1 Andy Wang 2013-06-25 11:18:49 EDT
Created attachment 765135 [details]
picture of dmesg output taken

Since I can't actually pipe dmesg anywhere persistent (or mount anything, since the filesystem no longer exists), I took a picture with my phone.
Comment 2 Andy Wang 2013-06-27 20:43:35 EDT
The problem seems specific to the LSI 1068e SAS/SATA controller on a Dell Precision T7400 workstation.

If I move the drives to the onboard Intel SATA controller, the issue goes away.
Comment 3 Josh Boyer 2013-07-01 13:25:38 EDT
There were some commits missing for WRITE SAME handling in the earlier 3.9.y kernels.  Does 3.9.8 still show this issue?
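
When retesting, it may help to record what WRITE SAME limit the block layer reports for each device on both controllers. The sketch below is Python and rests on an assumption: that the running kernel exposes the `queue/write_same_max_bytes` sysfs attribute, as kernels of this era generally do, where a value of 0 means WRITE SAME is not used for that device. The helper name `write_same_limits` and its `sysfs_root` parameter are hypothetical; the parameter exists only so the function can be pointed at a test directory.

```python
from pathlib import Path


def write_same_limits(sysfs_root="/sys/block"):
    """Map block device name -> write_same_max_bytes as int.

    The value is None when the attribute is missing or unreadable.
    """
    limits = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return limits
    for dev in sorted(root.iterdir()):
        attr = dev / "queue" / "write_same_max_bytes"
        try:
            limits[dev.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Attribute absent on this kernel/device, or not an integer.
            limits[dev.name] = None
    return limits


if __name__ == "__main__":
    for name, limit in write_same_limits().items():
        print(name, limit)
```

Comparing the output for the drives behind the LSI controller with the output after moving them to the onboard controller would show whether the two paths advertise different WRITE SAME support.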
Comment 4 Josh Boyer 2013-07-24 14:53:47 EDT
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
