Bug 861160

| Field | Value |
|---|---|
| Summary | Meta device failing on random drive after more than half space used (RAID5) |
| Product | Fedora |
| Component | mdadm |
| Version | 17 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED WONTFIX |
| Severity | high |
| Priority | unspecified |
| Reporter | Ryan <ryan.redhat> |
| Assignee | Jes Sorensen <Jes.Sorensen> |
| QA Contact | Fedora Extras Quality Assurance <extras-qa> |
| CC | agk, dledford, Jes.Sorensen, ryan.redhat |
| Type | Bug |
| Doc Type | Bug Fix |
| Last Closed | 2013-08-01 01:08:00 UTC |
| Attachments | Output of my lshw (02-28-13), attachment 703841 |
Description
Ryan
2012-09-27 16:33:03 UTC
FYI: I've been noting these events on my wiki, attempting to collect info and determine the solution: https://www.rnelnet.com/trac/rnelnet/wiki/issues/ibm3tb/onediskfull

Comment (Jes Sorensen):
Ryan, what you are seeing there are drive failures from the physical drive. Basically your sdb is rolling over with write errors. In your wiki you state that sde is the failed disk, but the error messages there come from sdd, and in the example above they are from sdb. If you are sure about sde being bad, then it is likely you are sitting with multiple bad drives, which again would explain why your RAID ends up failing :( Alternatively you could have the drives connected with bad cables, or possibly SATA III drives in a SATA II enclosure? Doesn't look very good, I am sorry to say. Jes

Comment (Jes Sorensen):
Ryan, did you make any progress in isolating the problem here? Jes

Comment (Ryan):
No, not yet. I think my next step is to purchase another drive, replace one of them, and then RMA another. This way I can control whether it is one drive or another. Multiple drives have been failing: the first instance was /dev/sde, the second failure occurred on /dev/sdd, and the third failure is from /dev/sdb. My wiki isn't organized the best; the main page covers the second failure, and I'm creating a page for each subsequent one:
https://www.rnelnet.com/trac/rnelnet/wiki/issues/ibm3tb/onediskfull
https://www.rnelnet.com/trac/rnelnet/wiki/issues/ibm3tb/onediskfull/092712-00
If it truly is bad hardware, then I have two to three drives failing (all of them).

Comment (Jes Sorensen):
Ryan, I see. If it's multiple drives failing like this, then the problem could be elsewhere in the hardware chain too: the controller(s), cables, even memory. The error messages you are seeing are definitely from the hardware, below the RAID layers. The fact that you have multiple drives dropping out like this is likely to trip the RAID above :( Cheers, Jes

Comment (Ryan):
Jes, yeah, I understand that and have been looking at the hardware. I've replaced the cables; still errors. I've run SMART short and long tests on the drives; no errors. I hadn't thought about the system memory as you pointed out, so I tested that as well: no failure. (I've had that be a problem in the past; don't get memory with blinking lights.) Aside from the controller on the motherboard, I'm not sure what else I can test. I have ordered a new drive that will arrive on Thursday (10/18/12), and I should have time on the weekend to test things out and see how it behaves. Will update the ticket then.

Comment (Ryan):
I installed the new drive. While installing it I noticed the fan for the drive bay was being blocked by some wires, so I thought maybe the drives were overheating. The top operational temperature for these is 60 C, which I hope they are not reaching, but hot drives = shorter life, so I fixed this up. No such luck, though. I first attempted to re-add a failed drive; it failed to re-add. I then attempted to add the new drive to the array; it failed in a similar manner. I have a feeling this is a hardware issue, although I'm not sure what I can test next. Anyone have any ideas?

Comment (Jes Sorensen):
Bummer :( I am pretty sure it is hardware too; otherwise my inbox would be full of reports like these, and you are the only one seeing this. I would start by running memtest86 on the system to see if any memory errors show up. You did say you ran a memory test, so I don't know if you ran something like this. Second, I would suspect either the cables or the controller. The only similar case I have seen myself was trying to run SATA III drives at 600 in an enclosure rated only for SATA II 300. Otherwise I am fairly out of ideas too, unfortunately. It could be the motherboard as well, but that is hard to pin down. Cheers, Jes

Comment (Ryan):
An update on this: I pulled two PCI SATA cards I had lying around and plugged all 4 x 3TB drives into the external cards. I then re-added all the drives and let the three 3TB drives re-sync, a process that would typically fail within 10 minutes when connected to the motherboard's SATA connectors. The sync finished after a little more than a day. I then added the new fourth drive I had purchased for troubleshooting; again it took over a day to sync, and again it finished without an issue. The filesystem was then grown to the new size, and I've been using the drives without issue on the new controllers for over a week now. I would now rule out the drives completely. This leaves only my motherboard's SATA controller, some software component, or a mixture of the two. I tend not to think it's the controller's problem, as the other attached drives do not have a problem, only the 3TB drives. (I have two 2TB drives configured as an md stripe for backup, which is now grossly out of space to perform the backups.) I guess that to rule out the meta devices I could take a couple of 500GB drives I have and test whether the same problem appears, but I don't know if I have the time right now. I'm just happy that my RAID is currently working. Of note: the problem only started to happen after I updated my system to Fedora 16; it's currently Fedora 17. I've attached a new lshw output.

Created attachment 703841 [details]
Output of my lshw (02-28-13)
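The recovery sequence described in the thread (SMART self-tests, re-adding the dropped member, adding the replacement drive, then growing the array) can be sketched with stock mdadm and smartctl invocations. This is a hedged sketch only: the array and partition names (`/dev/md0`, `/dev/sdb1`, `/dev/sde1`) are placeholders, not taken from the report, and the `run` wrapper prints each command instead of executing it, since these operations modify a live array.

```shell
#!/bin/sh
# Dry-run sketch of the md recovery steps discussed above.
# Device names are assumptions; substitute your own before running for real.

ARRAY=/dev/md0
FAILED=/dev/sdb1   # member that dropped out of the array
NEW=/dev/sde1      # replacement drive, partitioned to match the others

run() { echo "+ $*"; }  # prints instead of executing; remove for real use

# 1. Long SMART self-test on the suspect drive (inspect later with smartctl -a)
run smartctl -t long "$FAILED"

# 2. Try to re-add the old member; md can often resume from its bitmap
run mdadm "$ARRAY" --re-add "$FAILED"

# 3. If re-add is rejected, add the replacement as a fresh member
run mdadm "$ARRAY" --add "$NEW"

# 4. Once all members are active, grow the array across the extra device
run mdadm --grow "$ARRAY" --raid-devices=4

# 5. Watch resync/reshape progress
run cat /proc/mdstat
```

After the reshape completes, the filesystem on top still has to be grown separately (e.g. with the filesystem's own resize tool), which matches the "filesystem was then grown" step in the update above.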
Comment (Jes Sorensen):
Ryan, sorry for not getting back to you earlier. Have you been running successfully with the PCI card since February? Cheers, Jes

Comment (Ryan):
Not a problem on getting back. Yeah, I've run with no problems, over long durations and with heavy data transfers, since the PCI controllers have been used for the 3TB drives. Ryan

Comment (Jes Sorensen):
I am glad to hear you are able to run the system reliably now, and your data sounds to be safe. It does sound like what you were seeing was a hardware or possibly a driver problem, though if it were the driver I would expect to have seen more similar reports. If you're OK with it, I suggest we close this bug out, since it doesn't seem to be a RAID-related issue. If you hit more problems with md RAID, however, I do want to hear about it. Cheers, Jes

Comment (Ryan):
I would like to confirm the actual cause of the problem. How could I go about testing the driver to confirm or deny it as the cause? If you know of any pointers to documentation I could follow, I would like to know exactly why this started happening. Thanks, Ryan

Comment:
One thing I just noticed from the lshw output is that you have a JMB micro SATA card in there and an onboard Intel controller running in ATA legacy mode. If the 3TB drives were on the onboard controller in ATA mode, it might be worth flipping it over to AHCI mode in the BIOS to see if that makes the problem go away. That would be an indication that the piix legacy driver is having problems.

Comment (Fedora End Of Life):
This message is a reminder that Fedora 17 is nearing its end of life. Approximately four weeks from now, Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue, and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 17's end of life. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Comment (Fedora End Of Life):
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result, we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen it against that version. Thank you for reporting this bug, and we are sorry it could not be fixed.
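As a postscript to the legacy-mode suggestion in the thread: the reporter never posted the result, but checking whether the onboard controller is claimed by the legacy piix driver rather than AHCI might look like the sketch below. The commands are real Linux diagnostics, but their availability and exact output vary by system; the `run` wrapper prints each pipeline instead of executing it, so nothing here is taken from the original report's machine.

```shell
#!/bin/sh
# Dry-run sketch: is the SATA controller in AHCI mode or ATA legacy mode?

run() { echo "+ $*"; }  # prints instead of executing; remove for real use

# Which kernel driver is bound to each SATA/IDE controller?
# "Kernel driver in use: ata_piix" indicates legacy mode; "ahci" is AHCI.
run "lspci -k | grep -i -A 3 'sata\|ide'"

# Negotiated link speed per port: 1.5/3.0/6.0 Gbps = SATA I/II/III.
# A SATA III drive stuck at 1.5 Gbps can hint at a controller/cable mismatch.
run "dmesg | grep -i 'sata link up'"

# After switching the BIOS to AHCI, confirm the ahci module claimed the ports.
run "lsmod | grep ahci"
```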