Bug 810941
| Summary: | IO to RAID 5 hangs after large amounts of data have been written | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Mark Huth <mhuth1776> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 16 | CC: | frollic, gansalmon, itamar, jforbes, jonathan, kernel-maint, liam, madhu.chinakonda, netllama |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-11-14 16:32:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Mark Huth
2012-04-09 17:16:48 UTC
I forgot to include the kernel versions: Fedora (3.3.0-4.fc16.x86_64) and Fedora (3.2.7-1.fc16.x86_64) both exhibit the same behavior.

Poking around a bit more this morning, I was finally able to kill the rsync thread. I don't know if it was just a timeout or what changed. After that, IO to the RAID still hangs, but this time in get_active_stripe. I have:

- flush-9:127 uninterruptible in get_active_stripe
- md127_resync uninterruptible in get_active_stripe
- shotwell uninterruptible in get_active_stripe

shotwell is not killable. I'll try a system reboot now and see if it progresses.

I am also able to reproduce this with bonnie++ running locally. No access to the NFS mount is involved.

Why is this bug filed against mm? Does this library play any part in your problem?

(In reply to comment #4)
> Why is this bug filed against mm? Does this library play any part in your problem?

Because that is the routine the rsync or copy process appears to be stuck in according to the process monitor (balance_dirty_pages_ratelimited_nr). It may also be a problem with the SATA device drivers or with the RAID drivers. I don't know what is going on for sure.

I have also reproduced this while reading from the RAID, although that resulted in the entire system hanging, and I could not get any additional information. The problem did not occur while transferring 3 TB to a USB3 drive. I took out the RAID, added a plain drive, and started a copy to that overnight. I'll check that later today. So the mm library may just be an innocent victim when something else goes wrong.

(In reply to comment #5)
> (In reply to comment #4)
> > Why is this bug filed against mm? Does this library play any part in your problem?
>
> Because that is the routine the rsync or copy process appears to be stuck in according to the process monitor (balance_dirty_pages_ratelimited_nr).

But what does that have to do with libmm? rsync does not link against it...
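For anyone hitting similar D-state hangs, the kernel stacks of the blocked tasks are usually the most useful data to attach to a report like this. A minimal sketch of how to capture them using the standard procfs and SysRq interfaces (the stack-dump steps need root, and the pid shown is a placeholder, not one from this bug):

```shell
# List processes in uninterruptible sleep (state D), together with the
# kernel function each one is waiting in (wchan) -- e.g. get_active_stripe.
ps -eo pid,stat,wchan:30,comm | awk 'NR == 1 || $2 ~ /^D/'

# Dump the full kernel stack of one stuck task (pid 1234 is a placeholder):
#   cat /proc/1234/stack

# Ask the kernel to log the stacks of ALL blocked tasks (SysRq "w"),
# then read them back from the kernel log:
#   echo w > /proc/sysrq-trigger
#   dmesg | tail -n 100
```

The `ps` listing narrows things down quickly; the SysRq "w" dump is what gives maintainers the full call chains (e.g. whether the task is parked in md's stripe cache or in the writeback throttling path).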
I'd say you're filing against the wrong package and you should go for kernel...

(In reply to comment #4)
> Why is this bug filed against mm? Does this library play any part in your problem?

Because that is the routine the rsync or copy process appears to be stuck in according to the process monitor (balance_dirty_pages_ratelimited_nr). It may also be a problem with the SATA device drivers or with the RAID drivers. I don't know what is going on for sure. I have also reproduced this while reading from the RAID, although that resulted in the entire system hanging, and I could not get any additional information. The problem did not occur while transferring 3 TB to a USB3 drive. I took out the RAID, added a plain drive, and started a copy to that overnight. I'll check that later today. So the mm library may just be an innocent victim when something else goes wrong.

Sorry about the double comment - mid-air collision.

Okay, my mistake. It doesn't say libmm, just mm, and I thought it was the mm component of the kernel. Changing component to kernel.

I tried it with a simple disk hooked to the Marvell SATA controller and saw no problems after transferring 3 TB of data. I can do one more experiment by hooking a plain drive to the Intel Z79 SATA channel to eliminate the SATA driver/chipset combo.

One other piece of info: sometimes I see do_IRQ "no irq handler for vector (-1)" messages in dmesg, but there doesn't seem to be a time correlation with the hangup. I suppose I can try to run kgdb on the kernel to see what the stacks look like.

Direct use of a disk attached to both the Marvell and the Intel controllers works fine to the limit of the 3 TB disk size. So this is probably a RAID issue.

Can https://bugzilla.redhat.com/show_bug.cgi?id=721127 be the same bug?

As to whether this is the same as 721127, that's hard to say, since 721127 seems to be more than one bug. There are problems with RAID on USB, USB drives alone, etc., with the GUI freezing.
This bug is specific to RAID 5 on local SATA disks. (I will try some other RAID modes this week, since the RAID is useless as it stands.) I updated to the 3.3.4-3.fc16 kernel this weekend and the problem persists.

In the meantime I have checked the SATA controllers as best I can by transferring 3 TB to drives on each of the SATA controllers on the motherboard, as well as to several USB3 3 TB drives (backing up the old RAID that I want to replace with this one). In all cases not involving the RAID 5, all IO completes with no hangs. As soon as I start to transfer to the RAID 5, whether from NFS, USB3 or a local disk, the problem appears after 300 GB to a couple of TB into the rsync or copy operation. I can also cause the problem while reading from the RAID and transferring to a local disk. I also reproduced the problem with the bonnie++ test suite running on the RAID.

The symptom is that the transfer hangs, with the process stuck either in get_active_stripe or balance_dirty_pages_ratelimited_nr. Subsequently, the filesystem on the RAID cannot be unmounted, and a reboot attempt hangs. After a reboot, the array is not clean, resyncs (briefly), and becomes operational again until some time later during the next transfer.

If you can suggest something else for me to do to help debug this, I will. But the current situation is that the RAID is useless on all of the recent kernels from fc16.

I suspect that I'm seeing this issue as well, although my symptoms are slightly different. I was running Fedora 15 with its stock 2.6.40.3-0 x86_64 kernel since Sept 1 of last year on a PostgreSQL database server (128 GB RAM, 16-core Xeon CPUs) and a 1.2 TB RAID 5 array holding the database. It ran rock solid and stable the entire time. On May 18, I upgraded to Fedora 16, and ever since then I've been seeing seemingly random uninterruptible IO hangs while the system is under heavy load (from database queries and transactions). The behavior is the same every time: queries to the database stop returning and start stacking up, ultimately making the entire database unresponsive. Attempts to stop the database, or even strace its process, all hang and are uninterruptible as well.
As a result, it's impossible to cleanly reboot the system, or even unmount the RAID array, since postgresql completely stops responding. However, it's not just postgresql. Attempts to simply read (using less or more, etc.) a log file on the same RAID array also fail and end up stuck in an uninterruptible hang.

I've now run into this problem 4 times in the 10 days since the OS upgrade, and it's pretty catastrophic, as this is a large production server. Each time I basically have to power cycle the server, since it wedges trying to gracefully reboot, and I then end up having to wait around 4 hours for the RAID array to resync and become consistent before I can get anything approaching decent performance.

Oh, and it's not just 1 server, but a 3-server cluster (all identical hardware), and this has been happening on all 3 servers, although it's been far more frequent on the 'master', which handles all database write queries, while the slaves only process read queries. I'm growing increasingly desperate, and my next attempt to stab at this is to actually reinstall that old 2.6.40.3-0 F15 kernel (in Fedora 16) in the hope that it eliminates the instability. If there's some specific information that I can capture that would aid in debugging this, please let me know.

Lonni, thank you. I'm now very sure that we have the same issue. Either reads or writes will trigger the issue. I've used rsync, bonnie++, and just a simple archive copy. The only place there is a failure is on a RAID 5. I've tried 2-disk mirrors and 2-disk stripes, both of which allowed transfers to the limit of the volume size.

I've recently moved the RAID to a different set of SATA ports; these are supported Marvell 2-port 6 Gbps devices, two on a 4x PCI-Express board. So the very new Intel SATA ports in the Z79 chipset with a Sandy Bridge Extreme are not the culprit. Probably a race somewhere with a small window.
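The RAID-level comparison described above (RAID 5 hangs, while 2-disk mirrors and stripes survive) can be scripted against loop devices instead of real disks, which makes it easier for others to retest across kernels. A sketch, assuming mdadm and bonnie++ are installed; all device names, file paths, and sizes are illustrative, and everything past the first line needs root:

```shell
# Create three 2 GB sparse backing files (safe to run anywhere):
for i in 0 1 2; do truncate -s 2G /tmp/md-test-$i.img; done

# The remaining steps need root, so they are shown commented out:
# loops=$(for i in 0 1 2; do losetup --show -f /tmp/md-test-$i.img; done)
# mdadm --create /dev/md127 --level=5 --raid-devices=3 $loops
# mkfs.ext4 /dev/md127
# mkdir -p /mnt/md-test && mount /dev/md127 /mnt/md-test
# bonnie++ -d /mnt/md-test -u root    # sustained read/write load
# Repeat with --level=1 (mirror) or --level=0 (stripe) to compare behavior.
```

A loop-device array obviously cannot prove the SATA hardware innocent the way the reporter's controller-swapping did, but if the hang reproduces there too, it isolates the problem to the md layer.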
I usually see the hang non-interruptible in balance_dirty_pages_ratelimited_nr, but occasionally see it in get_active_stripe. I must reboot the system but cannot: a hardware reset (or power cycle) is required, after which the RAID needs to rebuild for a while. I'm not sure whether that was required when only reads were being done.

So I can reproduce this at will over the course of a few hours, or overnight at worst. What can I do to help debug this? The current state makes the latest kernels unusable for enterprise systems or serious servers. Unfortunately, the bleeding-edge nature of my hardware does not allow me to go back to a 2.6 kernel.

What size and number of disks do you have, and what is the configuration (RAID mode)?

I'm using RAID 5 with 3 x 1 TB disks.

# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates. This update is a significant rebase from the previous version. Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported is still present, please change the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.

If you are not the original bug reporter and you still experience this bug, please file a new report, as it is possible that you may be seeing a different problem. (Please don't clone this bug; a fresh bug referencing this bug in the comment is sufficient.)

With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.
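The post-reboot rebuilds mentioned above are visible in /proc/mdstat, which is worth watching (and attaching) when retesting. A small sketch of pulling the resync percentage out of it in shell; the mdstat text below is a made-up sample for a 3-disk RAID 5, not output captured from the reporter's array:

```shell
# Illustrative /proc/mdstat contents mid-resync; on a live system use:
#   mdstat=$(cat /proc/mdstat)
mdstat='md127 : active raid5 sdd1[2] sdc1[1] sdb1[0]
      1953260544 blocks level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      [=>...................]  resync =  7.4% (72601600/976630272) finish=81.9min'

# Extract the resync percentage:
pct=$(printf '%s\n' "$mdstat" | grep -o 'resync = *[0-9.]*%' | grep -o '[0-9.]*%')
echo "$pct"    # 7.4%
```

`mdadm --detail /dev/md127` (run as root) reports the same progress plus the array state ("clean" vs "active, resyncing"), which is the quickest way to tell whether the previous shutdown left the array dirty.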
Since only the bug reporter can re-open bugs, and Red Hat loves to close duplicates, and apparently no one at Red Hat pays attention when other people report the same problem in the same bug, this one has been closed, with no means of re-opening it by anyone other than the person who filed it. This isn't fixed. Not remotely so.