Bug 223216
| Summary: | Files on RAID experiencing corruption | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Stuart MacDonald <stuartm> |
| Component: | mdadm | Assignee: | Doug Ledford <dledford> |
| Status: | CLOSED NOTABUG | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2007-01-27 08:20:55 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Stuart MacDonald
2007-01-18 14:37:43 UTC
*** Bug 223217 has been marked as a duplicate of this bug. ***

*** Bug 223218 has been marked as a duplicate of this bug. ***

Sorry about that; both Opera and IE showed an {Invalid product '} error after I pressed submit, so I thought it hadn't gone through. IE was slower, though, and I could see the Bug Processed page before the error message came up. It's the same problem as bug 181015.

Ha, forgot this. **diffed. Not actually diff(1), but a Perl script I wrote to be diff++: it checks permissions and file types as well, but is essentially diff.

This is almost always a CPU or memory problem. Single-bit errors don't happen because of the RAID system, although 512-byte block errors (i.e. an entire block on the hard disk) could be caused by the RAID system. Please check your CPU, CPU fan, power supply, and RAM. Any of those items not functioning properly can cause single-byte errors. As for the errors only showing up on the /dev/sd* devices implying that it's the RAID subsystem, this would be incorrect. What it more likely implies is that the disks on the RAID subsystem are faster than the disks on the IDE subsystem, so the combined DMA and processor load on RAM is higher when talking to the SCSI disks than the IDE disks, causing errors that didn't show up before to show up now. On my personal web site I have a memory-checking script that detects these sorts of memory errors. You can download that script and see if your machine passes those tests. It's at http://people.redhat.com/dledford

New work machine from May 2006. Case open, no dust build-up. CPU fan is running fine. Power supply is an Antec 450W, which is new. The old one started dying and was RMAed in December, just before the copy operations but after the old HD started having bad blocks. RAM is Kingston HyperX, 2x 1GB sticks. These have an anodized-blue heat spreader on them and are warm to the touch. I memtest86-ed them when I got the machine and they passed.
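The reporter's "diff++" script is not attached to the bug; a minimal sketch of the same idea (recursive diff that also checks file types and permission bits) might look like this in Python rather than the original Perl — all names here are hypothetical:

```python
import filecmp
import os
import stat

def diff_plus_plus(a, b):
    """Yield mismatch reports for the tree at `a` versus the tree at `b`.

    Like diff -r, but also flags file-type and permission differences
    (the extras the reporter's Perl script added on top of diff).
    """
    for root, dirs, files in os.walk(a):
        rel = os.path.relpath(root, a)
        for name in dirs + files:
            pa = os.path.join(root, name)
            pb = os.path.join(b, rel, name)
            if not os.path.lexists(pb):
                yield f"missing: {pb}"
                continue
            sa, sb = os.lstat(pa), os.lstat(pb)
            if stat.S_IFMT(sa.st_mode) != stat.S_IFMT(sb.st_mode):
                yield f"type differs: {pa}"
            elif stat.S_IMODE(sa.st_mode) != stat.S_IMODE(sb.st_mode):
                yield f"permissions differ: {pa}"
            elif stat.S_ISREG(sa.st_mode) and not filecmp.cmp(pa, pb, shallow=False):
                yield f"content differs: {pa}"
```

Running it over the hda3 source tree and the md0 copy would surface exactly the kind of content mismatches reported here.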
Case fan is also running well, although probably not doing much. Wow! The heat sink (no fan) on the Via K8T890 northbridge is very hot. I don't suppose there's an easy way to have # cp -r <hda2> <md0> slowed down? A slower copy would generate less heat, and may avoid errors. This would also explain why I haven't seen any random corruption while working with the files on md0; I had been putting that down to being lucky so far. I'll give memtest86 another run, but not sure when. No idea how to check the CPU, though. I'll run your script as well.

Results so far: memtest86 ran for 24 hours with no errors. Your memory script ran once with no errors, working strictly from hda3. Your memory script is not done the first run-through on md0, but is reporting errors. Wouldn't the fact that the IDE drive doesn't show errors but the SATA drives do mean that the problem is the RAID, and not the RAM? (The IDE drive is using DMA.) Also, it seems that in every error, it is the LSB that's wrong, and it's always flipped to 1. I get 'u's instead of 't's, '!'s instead of ' 's, '3's instead of '2's. This implies that it is the RAM, and one particular faulty bit. So I'm confused. :-) Things I'm going to try:
- turn off irqbalance and try the script again
- try each memory stick individually
Any other suggestions?

memtest86 will oftentimes miss the errors that my script catches, so it passing doesn't surprise me; it just means that your memory is very borderline OK. The memory script reporting errors is a sure sign of either RAM or CPU problems. The fact that the single IDE drive doesn't show problems but dual SATA drives do does not mean the RAID is the culprit. In any given machine, the architecture for memory access is like this:

    CPU <->  Memory   <-> PCI bus
            Controller
                ^
                |
                v
               RAM

The CPU and the PCI bus can both initiate memory access transactions. The memory controller then queues them up and issues them to RAM as soon as it can.
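The reporter's observation — 't' becoming 'u', ' ' becoming '!', '2' becoming '3' — can be checked with a quick XOR: each pair differs only in bit 0, and the corrupt byte always has that bit set, consistent with a single stuck-at-one low bit somewhere in the memory path:

```python
# Each good/bad byte pair from the report, XORed to isolate the flipped bit.
pairs = [("t", "u"), (" ", "!"), ("2", "3")]
for good, bad in pairs:
    delta = ord(good) ^ ord(bad)
    assert delta == 0b1          # only the least significant bit differs
    assert ord(bad) & 1 == 1     # and in the corrupt byte it is always 1
    print(f"{good!r} -> {bad!r}: XOR mask {delta:#010b}")
```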
When the CPU is the only thing accessing RAM (i.e., when running memtest86), it only produces a certain amount and pattern of loading on the memory. When both the CPU and PCI devices are accessing RAM, the loading is heavier and the pattern different. When you have two PCI devices and the CPU accessing RAM, especially when the two PCI devices are individually faster than the one other PCI device you tried, it can expose problems that didn't show up with just the one slower device and the CPU trying to access RAM.

Furthermore, in the simpler test of just copying data from the IDE drive to the SATA drives, the copy program never attempts to access the data. It basically says to the operating system "here's a 64k chunk of RAM, have hard drive hda copy the first 64k of data from this file into it", and that happens via DMA. Then it takes that same 64k chunk of data and tells the operating system to write it to the SATA drives (the RAID subsystem simply issues the same write twice, once to each drive, in the case of RAID1, which is what I assume you are using). The RAID subsystem never "inspects" the data to see what it is, so it never has the chance to introduce single-bit errors like you are seeing.

Now, with my memory test, it's possible the CPU could be causing the problem, because the CPU has to decompress the compressed tarball when writing the files out. The corruption could happen during that decompression. In the case of just plain copying files from hda to the SATA drives, the CPU doesn't make any attempt to modify the data, so it's less likely to cause errors there (it's still possible, because the CPU sometimes copies the data from DMA buffers to the cp program's internal buffers, depending on the type of hard drive controller you have and whether or not it supports DMA directly to user space).
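Doug's script itself isn't reproduced in the bug; a crude sketch of the same principle — push known data through the RAM/DMA path to the suspect filesystem and verify it reads back intact — could look like this (a hypothetical stand-in; his real script reportedly decompressed a tarball instead):

```python
import hashlib
import os

def stress_pass(target_dir, size=8 * 1024 * 1024, chunk=1 << 20):
    """One write/read-back pass over `target_dir`; True if data survived.

    Writes pseudo-random data, fsyncs it to the device, re-reads it, and
    compares SHA-256 checksums. Any mismatch means corruption somewhere
    in the CPU/RAM/DMA/disk path. Loop this to reproduce rare flips.
    """
    path = os.path.join(target_dir, "stress.bin")
    want = hashlib.sha256()
    with open(path, "wb") as f:
        left = size
        while left > 0:
            block = os.urandom(min(chunk, left))
            want.update(block)
            f.write(block)
            left -= len(block)
        f.flush()
        os.fsync(f.fileno())  # force the data out through the page cache
    got = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            got.update(block)
    os.unlink(path)
    return want.hexdigest() == got.hexdigest()
```

Note that a short re-read may be served from the page cache (i.e., RAM), which still exercises the memory path the comment describes, but not necessarily the disks themselves.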
So, between those two facts, and the fact that you got corruption on both a plain copy operation and the stress test, I would say you have memory problems with about 95% certainty. Now, you can test the memory sticks independently, and that may identify the bad DIMM, but there's also a chance that both will pass by themselves. In that case, you may have a motherboard that, due to the placement of the memory DIMM slots, cannot support full-speed memory operation with multiple DIMMs. For instance, I have a motherboard with 3 DIMM slots. With any two DIMMs inserted side by side, I can run at 400MHz DDR. As soon as I add a third DIMM, I have to slow it down to 333MHz DDR or else it develops memory errors. The manual for the motherboard spells this limitation out in my case. It may also be that the memory controller on the motherboard is bad, in which case replacing the motherboard itself is the only option.

> (the raid subsystem simply issues the same write twice, once to each drive,
> in the case of RAID1 which is what I assume you are using)

Damn it. I even said RAID-1 in my initial problem report. It's actually RAID-0. I found out the hard way that the "hardware" RAID support on this motherboard doesn't work very well, at least with Linux, so I've got it set up as a software RAID, /dev/md0 -> /dev/sda + /dev/sdb.

Well, irqbalance off made no difference; an error popped up pretty soon after starting the script. This is my memory: http://www.ec.kingston.com/ecom/configurator_new/PartsInfo.asp?root=us&LinkBack=http://www.kingston.com&ktcpartno=KHX3200AK2/2G

One thing I haven't mentioned: the BIOS on this machine does not autodetect the Kingston-specified timings, so I had set them manually. Before the shell script test, though, I changed it back to the autodetected timings. It didn't make a difference AFAICS.
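Since cp never inspects the data, the flips can only be caught after the fact by re-comparing the hda source against the md0 copy. A hypothetical helper that reports the exact offset and XOR mask of each mismatch (a mask of 0b1 on every hit would confirm the stuck-LSB pattern seen above):

```python
def bit_flips(path_a, path_b, chunk=64 * 1024):
    """Return (byte_offset, xor_mask) for every byte where the two
    files differ. Assumes equal-length files, as with a completed copy."""
    flips = []
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            ca, cb = fa.read(chunk), fb.read(chunk)
            if not ca and not cb:
                break
            for i, (x, y) in enumerate(zip(ca, cb)):
                if x != y:
                    flips.append((offset + i, x ^ y))
            offset += max(len(ca), len(cb))
    return flips
```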
My motherboard manual makes no mention of any memory limitation, but I've already run into one that Asus will not confirm: http://www.ec.kingston.com/ecom/configurator_new/modelsinfo.asp?SysID=24573&mfr=ASUS&model=A8V&root=us&LinkBack=http%3A%2F%2Fwww.kingston.com&Sys=24573-ASUS-A8V-E+SE+Motherboard&distributor=0&submit1=Search (only supports 6 ranks). I'm going to scale back the speed to 333 and try again.

The BIOS not autodetecting the memory turned out to be a feature: Kingston programs the SPD with lesser default timings. I scaled the memory back to 333, 266, and 200, and the md0 corruption still occurs. I found a 512 MB stick of DDR 400 and tried that instead of my regular memory, and the corruption still occurs, so I think that rules out my memory being bad. I can't test the regular memory in another machine, though; I'm pretty sure this is the only one we have that has a RAID, and it's almost certainly our only AMD system. I think this means that it's a motherboard issue.

I located another machine with SATA connectors, got Linux running, and brought the RAID-0 up. The stress test passed. So it's not the drives. The motherboard has been RMAed. Thanks for all your help, especially the stress test script. Very handy.