977899 – dmraid failures with kernels newer than 3.9.2-200

Summary: dmraid failures with kernels newer than 3.9.2-200

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	18
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	fedora-kernel-raid
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-06-25 14:39 UTC by Andy Wang
Modified:	2013-07-24 18:53 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-07-24 18:53:47 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
picture of dmesg output taken (3.63 MB, image/jpeg) 2013-06-25 15:18 UTC, Andy Wang

Description Andy Wang 2013-06-25 14:39:44 UTC

Description of problem:
Kernels newer than 3.9.2-200 (the latest officially released kernel that doesn't have the problem) trigger RAID failures and effectively crash my system.

The system is configured with a single RAID10 dmraid across two disks. Whenever a newer kernel is booted, my filesystem fails shortly after I log in on the desktop (it seems to be triggered by heavy disk activity), causing random problems (desktop environment crashes, failures to read or write files).

Unfortunately, since all my filesystems are on this single RAID10, I can't even dump the dmesg output.

In this particular case, I was running kernel-3.9.6-200.fc18.x86_64. The machine rebooted over the weekend due to a power outage, and the crash didn't occur until I logged in after coming back to work.

The dmesg output shows:
md/raid10:md1: Disk failure on sdb3, disabling device.
md/raid10:md1: Operation continuing on 1 devices.
md/raid10:md1: Disk failure on sdc3, disabling device.
md/raid10:md1: Operation continuing on 0 devices.
dm-0: WRITE SAME failed. Manually zeroing.

This is followed by many EXT4-fs I/O errors, which makes sense if the RAID fell out from under the filesystem.

There is no evidence in the dmesg output of the disks actually failing. Usually I'd see SCSI/SATA errors when that's the case, and reverting to kernel-3.9.2-200 prevents the issue from occurring again.
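
For anyone triaging a similar log, the md/raid10 messages quoted above can be summarized mechanically. The following is only a minimal sketch: the function name `summarize_md_failures` and the `SAMPLE` constant are hypothetical, and the regular expressions are written against the exact wording of the lines quoted in this report, so other kernels may format these messages differently.

```python
import re

# Patterns follow the wording of the md/raid10 lines quoted in this report.
# Assumption: other kernel versions may phrase these messages differently.
FAILURE_RE = re.compile(r"md/raid10:(\w+): Disk failure on (\w+), disabling device\.")
REMAINING_RE = re.compile(r"md/raid10:(\w+): Operation continuing on (\d+) devices\.")


def summarize_md_failures(dmesg_text):
    """Return {array: {"failed": [members], "remaining": int or None}}."""
    summary = {}
    for line in dmesg_text.splitlines():
        failure = FAILURE_RE.search(line)
        if failure:
            array, member = failure.groups()
            entry = summary.setdefault(array, {"failed": [], "remaining": None})
            entry["failed"].append(member)
            continue
        remaining = REMAINING_RE.search(line)
        if remaining:
            array, count = remaining.groups()
            entry = summary.setdefault(array, {"failed": [], "remaining": None})
            # Keep the most recent count reported for this array.
            entry["remaining"] = int(count)
    return summary


# The five lines quoted in the description above.
SAMPLE = """\
md/raid10:md1: Disk failure on sdb3, disabling device.
md/raid10:md1: Operation continuing on 1 devices.
md/raid10:md1: Disk failure on sdc3, disabling device.
md/raid10:md1: Operation continuing on 0 devices.
dm-0: WRITE SAME failed. Manually zeroing.
"""

if __name__ == "__main__":
    print(summarize_md_failures(SAMPLE))
```

A `remaining` value of 0 for an array, with no accompanying SCSI/SATA errors in the same log, matches the pattern described here: both members were dropped without the disks themselves reporting a fault.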

Comment 1 Andy Wang 2013-06-25 15:18:49 UTC

Created attachment 765135
picture of dmesg output taken

Since I can't actually pipe dmesg anywhere that is persistent (or mount anything, since the filesystem no longer exists), I took a picture with my phone.

Comment 2 Andy Wang 2013-06-28 00:43:35 UTC

The problem seems specific to the LSI 1068e SAS/SATA controller on a Dell Precision T7400 workstation.

If I move the drives to the onboard Intel SATA controller, the issue goes away.

Comment 3 Josh Boyer 2013-07-01 17:25:38 UTC

There were some commits missing for WRITE SAME handling in the earlier 3.9.y kernels.  Does 3.9.8 still show this issue?
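
When comparing kernels or controllers for WRITE SAME behavior, it may help to record what each block device advertises. The sketch below assumes the running kernel exposes the `write_same_max_bytes` queue attribute under `/sys/block`; that is an assumption about the kernel in use, not something confirmed in this report. The function name `write_same_limits` and its `sysfs_block` parameter are hypothetical.

```python
from pathlib import Path


def write_same_limits(sysfs_block="/sys/block"):
    """Map each block device name to its write_same_max_bytes value.

    Assumption: the kernel exposes queue/write_same_max_bytes per device.
    Devices without the attribute, or with an unreadable or non-numeric
    value, are skipped.
    """
    limits = {}
    root = Path(sysfs_block)
    if not root.is_dir():
        return limits
    for dev in sorted(root.iterdir()):
        attr = dev / "queue" / "write_same_max_bytes"
        try:
            limits[dev.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            continue
    return limits


if __name__ == "__main__":
    for name, max_bytes in write_same_limits().items():
        # A value of 0 means the device does not advertise WRITE SAME.
        state = "not advertised" if max_bytes == 0 else f"up to {max_bytes} bytes"
        print(f"{name}: WRITE SAME {state}")
```

Running this on both the working and the failing kernel, and with the drives on each controller, would show whether the advertised limit differs between the two setups.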

Comment 4 Josh Boyer 2013-07-24 18:53:47 UTC

This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 2 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
