Created attachment 618202 [details]
Output of my lshw

Description of problem:
It appears that after one disk's worth of data has been written (1/3 of a 3 x 3TB RAID 5 MD array), the RAID starts to fail under high disk transfer rates.

Version-Release number of selected component (if applicable):
[ryan@sherwood ~]$ mdadm --version
mdadm - v3.2.5 - 18th May 2012
[root@sherwood ~]# uname -a
Linux sherwood 3.5.3-1.fc17.x86_64 #1 SMP Wed Aug 29 18:46:34 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
[root@sherwood ~]# cat /etc/redhat-release
Fedora release 17 (Beefy Miracle)

How reproducible:
Any high-volume transfer seems to cause the problem after one disk's worth of data (~3TB in my case) has been written. Recently I was uncompressing a 27GB tar.bz2 onto the MD device; about 10 minutes in, one disk was reported as failing.

Actual results:
Sep 27 08:44:33 sherwood kernel: [1767194.995739] ata3.01: exception Emask 0x10 SAct 0x0 SErr 0x400101 action 0x0
Sep 27 08:44:33 sherwood kernel: [1767194.995745] ata3.01: SError: { RecovData UnrecovData Handshk }
Sep 27 08:44:33 sherwood kernel: [1767194.995748] ata3.01: failed command: WRITE DMA EXT
Sep 27 08:44:33 sherwood kernel: [1767194.995754] ata3.01: cmd 35/00:00:c0:f9:86/00:02:c6:00:00/f0 tag 0 dma 262144 out
Sep 27 08:44:33 sherwood kernel: [1767194.995754]          res 51/84:f0:d0:fa:86/84:00:c6:00:00/16 Emask 0x30 (host bus error)
Sep 27 08:44:33 sherwood kernel: [1767194.995757] ata3.01: status: { DRDY ERR }
Sep 27 08:44:33 sherwood kernel: [1767194.995759] ata3.01: error: { ICRC ABRT }
Sep 27 08:44:33 sherwood kernel: [1767194.995766] ata3.00: hard resetting link
Sep 27 08:44:34 sherwood kernel: [1767195.300021] ata3.01: hard resetting link
Sep 27 08:44:34 sherwood kernel: [1767195.756079] ata3.00: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 27 08:44:34 sherwood kernel: [1767195.756094] ata3.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 27 08:44:34 sherwood kernel: [1767195.766259] ata3.00: configured for UDMA/133
Sep 27 08:44:34 sherwood kernel: [1767195.775295] ata3.01: configured for UDMA/133
Sep 27 08:44:34 sherwood kernel: [1767195.775341] sd 2:0:1:0: [sdb]
Sep 27 08:44:34 sherwood kernel: [1767195.775343] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Sep 27 08:44:34 sherwood kernel: [1767195.775345] sd 2:0:1:0: [sdb]
Sep 27 08:44:34 sherwood kernel: [1767195.775347] Sense Key : Aborted Command [current] [descriptor]
Sep 27 08:44:34 sherwood kernel: [1767195.775350] Descriptor sense data with sense descriptors (in hex):
Sep 27 08:44:34 sherwood kernel: [1767195.775352] 72 0b 47 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 27 08:44:34 sherwood kernel: [1767195.775360] c6 86 fa d0
Sep 27 08:44:34 sherwood kernel: [1767195.775365] sd 2:0:1:0: [sdb]
Sep 27 08:44:34 sherwood kernel: [1767195.775368] Add. Sense: Scsi parity error
Sep 27 08:44:34 sherwood kernel: [1767195.775371] sd 2:0:1:0: [sdb] CDB:
Sep 27 08:44:34 sherwood kernel: [1767195.775372] Write(10): 2a 00 c6 86 f9 c0 00 02 00 00
Sep 27 08:44:34 sherwood kernel: [1767195.775380] end_request: I/O error, dev sdb, sector 3330734528
Sep 27 08:44:34 sherwood kernel: [1767195.775425] ata3: EH complete
Sep 27 08:44:34 sherwood kernel: [1767195.775446] md/raid:md0: Disk failure on sdb, disabling device.
Sep 27 08:44:34 sherwood kernel: [1767195.775446] md/raid:md0: Operation continuing on 2 devices.

Expected results:
For the array not to fail.

Additional info:
I find it odd that this has now happened on all three devices independently. The first time it happened I ran SMART short and long tests on the device; they passed. I've dd'd the drive with zeros and then attempted again; it still failed during a high-volume transfer. I've replaced the SATA cable with a new one; no difference. I'm not sure why this is happening, or whether it's an mdadm problem, a kernel problem, or a hardware problem. The log does seem to indicate the drive is failing, or at least that the data link to the drive had to be reset, which caused the MD to fail. Any help in clearing this up would be appreciated.
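A quick way to see at a glance which device and sector the kernel blamed in traces like the one above is to filter the log for the end_request lines. A minimal sketch (the sample line is copied from the log in this report; the exact message format can vary between kernel versions, so adjust the pattern if needed):

```shell
# Pull "device sector" pairs out of kernel I/O error lines such as
# "end_request: I/O error, dev sdb, sector 3330734528".
extract_io_errors() {
    grep -o 'I/O error, dev [a-z]*, sector [0-9]*' \
        | sed 's/I\/O error, dev \([a-z]*\), sector \([0-9]*\)/\1 \2/'
}

# Example: feed one captured log line through the filter.
echo 'Sep 27 08:44:34 sherwood kernel: [1767195.775380] end_request: I/O error, dev sdb, sector 3330734528' \
    | extract_io_errors   # -> sdb 3330734528
```

On a live system you could pipe `dmesg` (or the relevant journal/syslog file) through the same filter to count how many distinct devices are throwing errors.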
FYI: I've been noting these events on my wiki, attempting to collect info and determine the solution. https://www.rnelnet.com/trac/rnelnet/wiki/issues/ibm3tb/onediskfull
Ryan,

What you are seeing there are drive failures from the physical drive; basically your sdb is rolling over with write errors. In your wiki you state that sde is the failed disk, but the error messages there come from sdd, and in the above example they are from sdb. If you are sure about sde being bad, then it is likely you are sitting with multiple bad drives, which again would explain why your raid ends up failing :( Alternatively you could have the drives connected using bad cables, or possibly SATA III drives in a SATA II enclosure?

Doesn't look very good, I am sorry to say.

Jes
Ryan, Did you make any progress in isolating the problem here? Jes
No, not yet. I think my next step is to purchase another drive, replace one of them, and then RMA another. This way I can control whether it is one drive or another.

Multiple drives have been failing. The first instance was /dev/sde, the second failure occurred on /dev/sdd, and the third failure is from /dev/sdb. On my wiki I haven't organized it the best; the main page is the second failure, and for subsequent ones I'm creating a page each:
https://www.rnelnet.com/trac/rnelnet/wiki/issues/ibm3tb/onediskfull
https://www.rnelnet.com/trac/rnelnet/wiki/issues/ibm3tb/onediskfull/092712-00

If it truly is bad hardware then I have two to three drives failing (all of them).
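Since device names like /dev/sdb can be re-ordered across reboots, one way to be certain whether the same physical drive failed each time is to track drives by serial number (for example via `lsblk -dno NAME,SERIAL` or `smartctl -i`). A small sketch of looking a device name up in such a snapshot; the serial numbers here are invented for illustration:

```shell
# Hypothetical snapshot of "device serial" pairs, as produced by
# something like: lsblk -dno NAME,SERIAL  (serials are made up here)
snapshot='sdb JK1101B8
sdd JK1101C2
sde JK1101D7'

# Print the serial recorded for a given device name.
serial_of() { awk -v dev="$1" '$1 == dev { print $2 }'; }

printf '%s\n' "$snapshot" | serial_of sdd   # -> JK1101C2
```

Saving one snapshot per boot and comparing serials across the three failure events would show whether one drive, or several, is actually at fault.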
Ryan,

I see; if it's multiple drives failing like this, then the problem could be elsewhere in the hardware chain too: the controller(s), cables, even memory. The error messages you are seeing are definitely from the hardware, below the raid layers. The fact that you have multiple drives dropping out like this is likely what trips the raid above :(

Cheers,
Jes
Jes,

Yeah, I understand that and have been looking at the hardware. I've replaced the cables; still errors. I've run SMART short and long tests on the drives; no errors. I didn't think about the system memory as you pointed out, so I tested that too: no failures. I've had memory be a problem in the past (don't get memory with blinking lights). Aside from the controller on the motherboard I'm not sure what else I can test.

I have ordered a new drive, which will arrive on Thursday (10/18/12). I should have time on the weekend to test things out and see how it behaves. Will update the ticket then.
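For reference, the SMART runs described above can be scripted across all the array members. A minimal sketch; the device names are the ones from this thread, and `RUN=echo` keeps it a dry run so nothing touches real disks until you clear it and run as root:

```shell
RUN=echo   # dry run: print the smartctl commands instead of executing them
smart_sweep() {
    for dev in "$@"; do
        $RUN smartctl -t long "$dev"      # kick off the long self-test
        $RUN smartctl -l selftest "$dev"  # later, read back the self-test log
    done
}
smart_sweep /dev/sdb /dev/sdd /dev/sde
```

Note that `smartctl -t long` only queues the test; the drive runs it in the background, so the log has to be read once the estimated completion time has passed.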
I installed the new drive. While installing it I noticed the fan for the drive bay was being blocked by some wires, so I thought maybe the drives were overheating. The operational top temperature is 60°C for these drives, which I hope they are not reaching, but hot drives mean a shorter life, so I fixed this up. No such luck though.

I attempted to re-add a failed drive first; it failed to re-add. I then attempted to add the new drive to the array; it failed in a similar manner. I have a feeling this is a hardware issue, although I'm not sure what I can test next. Anyone have any ideas?
Bummer :( I am pretty sure it is hardware too; otherwise I would have my inbox full of reports like these, and you are the only one seeing this.

I would start by running memtest86 on the system to see if any memory errors show up. You did say you ran a memory test, so I don't know if you ran something like this. Second, I would suspect either the cables or the controller. The only case I have seen myself that looks similar was trying to run SATA III drives at 600 in an enclosure that was only rated for SATA II 300. Otherwise I am fairly out of ideas too, unfortunately. It could be the motherboard, but that is hard to pin down.

Cheers,
Jes
An update on this: I pulled two PCI SATA cards I had lying around and plugged all 4 x 3TB drives into the external cards. I first re-added all the drives and let the three 3TB drives re-sync; this would typically fail within 10 minutes when they were connected to the motherboard's SATA connectors, but the sync finished after a little more than a day. I then added the fourth drive I had purchased for troubleshooting; again it took over a day to sync, and again it finished without an issue. The filesystem was then grown to the new size, and I've been using the drives without issue on the new controllers for over a week now.

I would now rule out the drives completely. This leaves only my motherboard's SATA controller, some software component, or a mixture of the two. I tend to not think it's the controller's problem, as the other drives attached to it do not have a problem, only the 3TB drives. I have two 2TB drives configured as an md stripe for backup (which is now grossly out of space to perform the backups). I guess to rule out the md devices I could take a couple of 500GB drives I have and test whether the same problem appears, but I don't know if I have the time right now. I'm just happy that my RAID is currently working.

Of note, the problem only started to happen after I updated my system to Fedora 16; it's currently on Fedora 17. I've attached a new lshw output.
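For anyone following along, the grow step mentioned above (adding a fourth member and resizing) goes roughly like this. A sketch with `RUN=echo` as a dry run; `/dev/sdf` as the new disk and an ext4 filesystem directly on /dev/md0 are assumptions for illustration, not details confirmed in this report:

```shell
RUN=echo   # dry run: print the commands instead of running them as root
grow_raid5() {   # usage: grow_raid5 <md-device> <new-disk> <new-member-count>
    $RUN mdadm "$1" --add "$2"                  # add the new disk as a spare
    $RUN mdadm --grow "$1" --raid-devices="$3"  # reshape the array onto it
    $RUN resize2fs "$1"                         # after the reshape: grow the fs
}
grow_raid5 /dev/md0 /dev/sdf 4
```

The reshape runs in the background (progress is visible in /proc/mdstat), and the filesystem resize should only be done once the reshape has completed.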
Created attachment 703841 [details]
Output of my lshw (02-28-13)
Ryan, Sorry for not getting back to you earlier - have you been running successfully with the PCI card since February? Cheers, Jes
Not a problem on getting back. Yeah, I've run for long durations and with heavy data transfers without issue since the PCI controllers have been used for the 3TB drives.

Ryan
I am glad to hear you are able to run the system reliably now, and your data sounds safe. This does sound like what you were seeing was a hardware or possibly a driver problem - though if it was the driver, I would expect to have seen more similar reports. If you're ok with it, I suggest we close this bug out, since it doesn't seem to be a raid related issue. If you hit more problems with md raid, however, I do want to hear about it.

Cheers,
Jes
I would like to confirm the actual cause of the problem. How could I go about testing the driver to confirm or deny it as the cause? If you know of any pointers to documentation that I could follow, I would like to know exactly why this started happening.

Thanks,
Ryan
One thing I just noticed from the lshw output is that you have a JMB micro SATA card in there and an onboard Intel controller which was running in ATA legacy mode. If the 3TB drives were on the onboard controller in ATA mode, it might be worth flipping it over to AHCI mode in the BIOS and seeing if this makes the problem go away. If it does, that would be an indication that the piix legacy driver is having problems.
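One way to check which kernel driver the onboard controller is actually bound to is `lspci -k`. A sketch of reading its output; the sample text below imitates `lspci -k` output rather than coming from this machine:

```shell
# Imitation of a couple of "lspci -k" lines; on a real system run
# lspci -k directly. "ata_piix" indicates legacy IDE mode, "ahci" AHCI.
sample='00:1f.2 IDE interface: Intel Corporation 82801 IDE Controller
	Kernel driver in use: ata_piix'

# Extract the driver name from the "Kernel driver in use" line.
driver_in_use() { awk -F': ' '/Kernel driver in use/ { print $2 }'; }

printf '%s\n' "$sample" | driver_in_use   # -> ata_piix
```

If this reports ata_piix for the controller hosting the 3TB drives, switching the BIOS to AHCI (so the ahci driver takes over) would test Jes's theory directly.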
This message is a reminder that Fedora 17 is nearing its end of life. Approximately four weeks from now, Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue, and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.