Bug 861160

Summary: Meta device failing on random drive after more than half space used (RAID5)
Product: Fedora
Component: mdadm
Version: 17
Hardware: x86_64
OS: Linux
Status: CLOSED WONTFIX
Severity: high
Priority: unspecified
Reporter: Ryan <ryan.redhat>
Assignee: Jes Sorensen <Jes.Sorensen>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: agk, dledford, Jes.Sorensen, ryan.redhat
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-08-01 01:08:00 UTC

Attachments:
Output of my lshw
Output of my lshw (02-28-13)

Description Ryan 2012-09-27 16:33:03 UTC
Created attachment 618202 [details]
Output of my lshw

Description of problem:

It appears that once roughly one disk's worth of data (~3TB, about a third of the raw 3 x 3TB RAID 5 MD array) has been written, the array starts to fail during high-rate disk transfers.


Version-Release number of selected component (if applicable):

[ryan@sherwood ~]$ mdadm --version
mdadm - v3.2.5 - 18th May 2012
[root@sherwood ~]# uname -a
Linux sherwood 3.5.3-1.fc17.x86_64 #1 SMP Wed Aug 29 18:46:34 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@sherwood ~]# cat /etc/redhat-release
Fedora release 17 (Beefy Miracle)


How reproducible:

Any high-volume transfer seems to cause the problem after about one disk's worth of data (~3TB in my case) has been written. Recently I was extracting a 27GB tar.bz2 to the MD device; about 10 minutes in, one disk was reported as failing.
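For reference, the kind of workload that triggers it is roughly the following (the mount point /mnt/raid is just a placeholder for wherever the md0 filesystem is mounted):

# extract a large archive onto the md device while watching the kernel log
tar -xjf archive.tar.bz2 -C /mnt/raid &
tail -f /var/log/messages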



Actual results:

Sep 27 08:44:33 sherwood kernel: [1767194.995739] ata3.01: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x0
Sep 27 08:44:33 sherwood kernel: [1767194.995745] ata3.01: SError: { RecovData UnrecovData Handshk }
Sep 27 08:44:33 sherwood kernel: [1767194.995748] ata3.01: failed command: WRITE DMA EXT
Sep 27 08:44:33 sherwood kernel: [1767194.995754] ata3.01: cmd 35/00:00:c0:f9:86/00:02:c6:00:00/f0 tag 0 dma 262144 out
Sep 27 08:44:33 sherwood kernel: [1767194.995754]          res 51/84:f0:d0:fa:86/84:00:c6:00:00/16 Emask 0x30 (host bus error)
Sep 27 08:44:33 sherwood kernel: [1767194.995757] ata3.01: status: { DRDY ERR }
Sep 27 08:44:33 sherwood kernel: [1767194.995759] ata3.01: error: { ICRC ABRT }
Sep 27 08:44:33 sherwood kernel: [1767194.995766] ata3.00: hard resetting link
Sep 27 08:44:34 sherwood kernel: [1767195.300021] ata3.01: hard resetting link
Sep 27 08:44:34 sherwood kernel: [1767195.756079] ata3.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 27 08:44:34 sherwood kernel: [1767195.756094] ata3.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 27 08:44:34 sherwood kernel: [1767195.766259] ata3.00: configured for UDMA/133
Sep 27 08:44:34 sherwood kernel: [1767195.775295] ata3.01: configured for UDMA/133
Sep 27 08:44:34 sherwood kernel: [1767195.775341] sd 2:0:1:0: [sdb]
Sep 27 08:44:34 sherwood kernel: [1767195.775343] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 27 08:44:34 sherwood kernel: [1767195.775345] sd 2:0:1:0: [sdb]
Sep 27 08:44:34 sherwood kernel: [1767195.775347] Sense Key : Aborted Command [current] [descriptor]
Sep 27 08:44:34 sherwood kernel: [1767195.775350] Descriptor sense data with sense descriptors (in hex):
Sep 27 08:44:34 sherwood kernel: [1767195.775352]         72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 27 08:44:34 sherwood kernel: [1767195.775360]         c6 86 fa d0
Sep 27 08:44:34 sherwood kernel: [1767195.775365] sd 2:0:1:0: [sdb]
Sep 27 08:44:34 sherwood kernel: [1767195.775368] Add. Sense: Scsi parity error
Sep 27 08:44:34 sherwood kernel: [1767195.775371] sd 2:0:1:0: [sdb] CDB:
Sep 27 08:44:34 sherwood kernel: [1767195.775372] Write(10): 2a 00 c6 86 f9 c0 00 02 00 00
Sep 27 08:44:34 sherwood kernel: [1767195.775380] end_request: I/O error, dev sdb, sector 3330734528
Sep 27 08:44:34 sherwood kernel: [1767195.775425] ata3: EH complete
Sep 27 08:44:34 sherwood kernel: [1767195.775446] md/raid:md0: Disk failure on sdb, disabling device.
Sep 27 08:44:34 sherwood kernel: [1767195.775446] md/raid:md0: Operation continuing on 2 devices.

Expected results:

For the array not to fail.

Additional info:

I find it odd that this has now happened on all three devices independently.  The first time it happened I ran SMART short and long tests on the device; it passed.  I've zeroed the drive with dd and then attempted again; it still failed under high transfer rates.  I've replaced the SATA cable with a new one, no difference.  I'm not sure why this is happening, or whether it's an mdadm problem, a kernel problem, or a hardware problem.  The log does seem to indicate the drive is failing, or at least that the data link to the drive had to be reset, which caused the MD to fail.  Any help in clearing this up would be appreciated.
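For completeness, the checks I ran were along these lines (smartmontools installed; /dev/sdb as the example device):

# SMART short and long self-tests, then review the results
smartctl -t short /dev/sdb
smartctl -t long /dev/sdb
smartctl -a /dev/sdb

# zero the whole drive before re-adding it to the array (destroys its contents)
dd if=/dev/zero of=/dev/sdb bs=1M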

Comment 1 Ryan 2012-09-27 20:20:01 UTC
FYI: I've been noting these events on my wiki, attempting to collect information and determine the solution.

https://www.rnelnet.com/trac/rnelnet/wiki/issues/ibm3tb/onediskfull

Comment 2 Jes Sorensen 2012-09-28 07:08:32 UTC
Ryan,

What you are seeing there are drive failures from the physical drive.
Basically your sdb is rolling over with write errors.

In your wiki you state that sde is the failed disk, but the error messages
there come from sdd, and in the above example they are from sdb. If you are
sure about sde being bad, then it is likely you are sitting with multiple
bad drives, which again would explain why your raid ends up failing :(
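
To double-check which member md actually kicked out, something along these lines will show it (assuming the array is /dev/md0, as in your log):

cat /proc/mdstat            # failed members show up marked (F)
mdadm --detail /dev/md0     # per-device state: active, faulty, removed, spare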

Alternatively, you could have the drives connected with bad cables, or
possibly SATA III drives in a SATA II enclosure?
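
A quick way to check for that kind of link problem (cabling or a speed mismatch) is to look at the negotiated link speed and the drive's CRC error counter, roughly:

dmesg | grep -i 'SATA link up'        # negotiated speed per port
smartctl -A /dev/sdb | grep -i crc    # a climbing UDMA_CRC_Error_Count usually means a bad cable or link

(/dev/sdb here only because that is the drive in your log excerpt.)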

Doesn't look very good, I am sorry to say.

Jes

Comment 3 Jes Sorensen 2012-10-08 11:38:54 UTC
Ryan,

Did you make any progress in isolating the problem here?

Jes

Comment 4 Ryan 2012-10-11 18:41:41 UTC
No, not yet.  I think my next step is to purchase another drive, replace one of the existing ones, and then RMA another.  That way I can narrow down whether it is one particular drive.

Multiple drives have been failing.  The first instance was /dev/sde, the second failure occurred on /dev/sdd, and the third failure is from /dev/sdb.

My wiki isn't organized the best; the main page covers the second failure.  For subsequent failures I'm creating a page for each:

https://www.rnelnet.com/trac/rnelnet/wiki/issues/ibm3tb/onediskfull
https://www.rnelnet.com/trac/rnelnet/wiki/issues/ibm3tb/onediskfull/092712-00

If it truly is bad hardware then I have two to three drives failing (all of them).

Comment 5 Jes Sorensen 2012-10-12 07:14:52 UTC
Ryan,

I see. If it's multiple drives failing like this, then the problem could be
elsewhere in the hardware chain too: the controller(s), cables, even memory.

The error messages you are seeing are definitely from the hardware, and below
the raid layers. The fact that you have multiple drives dropping out like this
is likely to trip the raid above :(

Cheers,
Jes

Comment 6 Ryan 2012-10-16 04:44:56 UTC
Jes,

Yeah, I understand that and have been looking at the hardware.  I've replaced the cables; still errors.  I've run SMART short and long tests on the drives; no errors.  I hadn't thought about the system memory until you pointed it out, so I tested that as well: no failures.  I've had memory be the problem in the past (don't get memory with blinking lights).

Aside from the controller on the motherboard, I'm not sure what else I can test.

I have ordered a new drive, which will arrive on Thursday (10/18/12).  I should have time on the weekend to test things out and see how it behaves.  I will update the ticket then.

Comment 7 Ryan 2012-10-20 14:56:58 UTC
I installed the new drive.  While installing it I noticed the fan for the drive bay was being blocked by some wires, so I thought maybe the drives were overheating.  The rated operating maximum for these drives is 60°C, which I hope they are not reaching, but hot drives mean a shorter life, so I tidied that up.  No such luck, though.

I attempted to re-add one of the failed drives first; it failed to re-add.
I then attempted to add the new drive to the array; it failed in a similar manner.
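
The commands were roughly as follows (/dev/sdb standing in for the previously failed member and /dev/sdX for the new drive):

mdadm /dev/md0 --remove /dev/sdb    # clear the failed member first
mdadm /dev/md0 --re-add /dev/sdb    # this re-add failed
mdadm /dev/md0 --add /dev/sdX       # adding the brand-new drive failed in a similar way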

I have a feeling this is a hardware issue, although I'm not sure what I can test next.  Anyone have any ideas?

Comment 8 Jes Sorensen 2012-10-24 17:48:39 UTC
Bummer :(

I am pretty sure it is hardware too; otherwise I would have my inbox full
of reports like these, and you are the only one seeing this.

I would start by running memtest86 on the system to see if any memory
errors show up. You did say you ran a memory test, so I don't know if you
already ran something like this.

Second, I would suspect either cables or the controller.

The only case I have seen myself that looks similar was trying to run SATA
III drives at 600 in an enclosure that was only rated for SATA II 300.

Otherwise I am fairly out of ideas too unfortunately. It could be the
motherboard too, but that is hard to pin down.

Cheers,
Jes

Comment 9 Ryan 2013-02-28 08:32:01 UTC
An update on this: I pulled two PCI SATA cards I had lying around and plugged all 4 x 3TB drives into those external cards.

I first re-added all the drives and let the three 3TB drives re-sync, which would typically fail within 10 minutes when they were connected to the motherboard's SATA connectors.  The sync finished after a little more than a day.

I then added the new fourth drive I had purchased for troubleshooting; it again took over a day to sync and again finished without an issue.

The filesystem was then grown to the new size and I've been using the drives without issue on the new controllers for over a week now.
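
The grow was done roughly like this (device names are placeholders; the resize2fs step assumes an ext4 filesystem on /dev/md0, and a different filesystem would need its own resize tool):

mdadm /dev/md0 --add /dev/sdX              # the new fourth drive
mdadm --grow /dev/md0 --raid-devices=4     # reshape the RAID5 from 3 to 4 members
resize2fs /dev/md0                         # grow the filesystem to the new array size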

I would now rule out the drives completely.

That leaves only my motherboard's SATA controller, some software component, or a mixture of the two.  I tend not to think it's the controller's problem, as the other attached drives do not have any issues, only the 3TB drives.  I have two 2TB drives configured as an md stripe for backup (which is now grossly short of space to perform the backups).

I guess to rule out the md devices themselves I could take a couple of 500GB drives I have and test them to see if the same problem appears, but I don't know if I have the time right now.  I'm just happy that my RAID is currently working.  Of note, the problem only started to happen after I updated my system to Fedora 16; it's currently on Fedora 17.

I've attached a new lshw output.

Comment 10 Ryan 2013-02-28 08:32:45 UTC
Created attachment 703841 [details]
Output of my lshw (02-28-13)

Comment 11 Jes Sorensen 2013-04-29 11:11:03 UTC
Ryan,

Sorry for not getting back to you earlier - have you been running successfully
with the PCI card since February?

Cheers,
Jes

Comment 12 Ryan 2013-05-01 08:38:52 UTC
Not a problem on getting back.  Yeah, I've run for long durations and through heavy data transfers without issue since the PCI controllers have been in use for the 3TB drives.

Ryan

Comment 13 Jes Sorensen 2013-05-01 12:40:28 UTC
I am glad to hear you are able to run the system reliably now and that your
data sounds safe.

It does sound like what you were seeing was a hardware or possibly a driver
problem - though if it were the driver I would expect to have seen more similar
reports.

If you're ok with it, I suggest we close this bug out, since it doesn't seem
to be a raid-related issue. If you hit more problems with md raid, however,
I do want to hear about it.

Cheers,
Jes

Comment 14 Ryan 2013-05-01 16:32:14 UTC
I would like to confirm the actual cause of the problem.  How could I go about testing the driver to confirm or rule it out as the cause?  If you know of any pointers to documentation that I could follow, I would like to understand exactly why this started happening.

Thanks,
Ryan

Comment 15 Jes Sorensen 2013-05-02 08:09:02 UTC
One thing I just noticed from the lshw output is that you had a JMB micro SATA
card in there and an onboard Intel controller which was running in ATA legacy
mode.

If the 3TB drives were on the onboard controller in ATA mode, it might be worth
flipping it over to AHCI mode in the BIOS to see if that makes the problem go
away. If it does, that would be an indication that the piix legacy driver is
having problems.
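
You can see which driver is actually bound to the controllers with something like:

lspci -nnk | grep -iA3 'sata\|ide'    # "Kernel driver in use" will say ata_piix or ahci
dmesg | grep -iE 'ahci|ata_piix'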

Comment 16 Fedora End Of Life 2013-07-03 22:55:09 UTC
This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 17 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to change the
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 17 Fedora End Of Life 2013-08-01 01:08:07 UTC
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.