Bug 810941 - IO to RAID 5 hangs after large amounts of data have been written
Summary: IO to RAID 5 hangs after large amounts of data have been written
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 16
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2012-04-09 17:16 UTC by Mark Huth
Modified: 2012-11-14 20:02 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-11-14 16:32:43 UTC
Type: Bug
Embargoed:



Description Mark Huth 2012-04-09 17:16:48 UTC
Description of problem:

While writing large amounts of data (> 1TB), cp -a or rsync eventually hangs before completing the full transfer.  Examination of the process state has usually shown the copying process (cp or rsync) in status "Uninterruptible" with a waiting channel of balance_dirty_pages_ratelimited_nr.

At that point, the process is unkillable, and further reads or writes to the file system mounted on the RAID array hang, although new processes attempting such IO are killable.

Attempts to unmount the file system (ext3) fail with a status of busy.  Reboots also fail, as they are unable to unmount the file system.  A reset or power cycle is then necessary to return the system to full function.

When the hang occurs, the remainder of the system remains functional - in fact, I'm writing this bug report from the system with a stalled rsync process.
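The process state and waiting channel described above can also be read straight from /proc, without a GUI process monitor. The following is a minimal sketch, assuming a Linux /proc layout; the function names are made up for illustration and this is not the tool used in this report. It lists processes in state D (uninterruptible sleep) together with the contents of /proc/<pid>/wchan.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = stat_line.index("(")
    right = stat_line.rindex(")")
    pid = int(stat_line[:left].strip())
    comm = stat_line[left + 1:right]
    state = stat_line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm, wchan) for every process currently in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
            if state != "D":
                continue
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except (OSError, ValueError):
            continue  # process exited, or the file was unreadable/malformed
        found.append((pid, comm, wchan))
    return found


if __name__ == "__main__":
    for pid, comm, wchan in blocked_tasks():
        print(f"{pid:>7}  {comm:<20} {wchan}")
```

This only walks the top-level process entries, not individual threads under /proc/<pid>/task, and wchan may be empty or hidden depending on kernel configuration and privileges.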

This hang occurs frequently (well, as frequently as I can transfer TB+ data over a network, say once a day).  The source for the data is an older system running a software RAID 6.  That file system is NFS mounted on the target system and connected via a 1Gbps network link.  The transfer rate for large files often hits and remains at around 110 Mbytes per second.  There is little other CPU load on the system, just Gnome and the other background stuff that runs.  The processor is 12-way SMP (6 cores with hyperthreading).

Configuration:

Intel i7-3960x with 32 GB ram (all four channels populated with 2 DIMMs each)
ASUS P9X79 Deluxe.  No overclocking (other than automatic speed step in processor)

System disk is a partitioned 2TB Samsung on a Marvell SATA3 (6Gbps) controller.

The RAID_5 is on 4 of the Intel SATA channels in the X79 chipset, with two of the drives on the 6Gbps channels, while two others are on the 3Gbps channels.  Each of these four drives is a Seagate 3TB XT (ATA ST33000651AS).  A fifth drive is in the RAID array, and located on the other channel of the 6Gbps Marvel controller.  The fifth drive is a Seagate 3TB Barracuda (ATA ST3000DM001-9YN166)

The other Intel SATA channels have optical drives (BluRay ROM and BluRay re-writer from LG).  Neither is active during the data transfers that cause this hang.

The RAID is 12 TB, /dev/md127 with a single partition.  The file system is ext3 with an integral journal.  The FS mounts /dev/md127p1 on /media/SVID_RAID.

Here follows the output from mdadm.  I just noticed that it shows status active, resyncing(DELAYED).  This is after the system sat for 36 hours since the hang occurred.  I'm not sure that that was the status earlier, but I did not investigate.  The raid is in sync upon reboot and reactivation (or at least the completion of the resync takes minimal time)

mdadm --detail /dev/md127
/dev/md127:
        Version : 1.2
  Creation Time : Sat Mar 31 00:33:41 2012
     Raid Level : raid5
     Array Size : 11721060352 (11178.07 GiB 12002.37 GB)
  Used Dev Size : 2930265088 (2794.52 GiB 3000.59 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat Apr  7 10:57:55 2012
          State : active, resyncing (DELAYED) 
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : BigRAID
           UUID : e417f2da:829002c8:136f0887:c132ad6f
         Events : 19810

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       1       8       33        1      active sync   /dev/sdc1
       2       8       17        2      active sync   /dev/sdb1
       3       8        1        3      active sync   /dev/sda1
       5       8       65        4      active sync   /dev/sde1
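
As a sanity check on the output above: mdadm reports sizes in KiB, and for a 5-device RAID 5 the array size should be four times the per-device size (one device's worth of space goes to parity). Here 4 x 2930265088 = 11721060352, which matches. A minimal Python sketch of that check follows; the helper names are hypothetical, and SAMPLE is just an excerpt of the output pasted above.

```python
SAMPLE = """\
/dev/md127:
        Version : 1.2
     Raid Level : raid5
     Array Size : 11721060352 (11178.07 GiB 12002.37 GB)
  Used Dev Size : 2930265088 (2794.52 GiB 3000.59 GB)
   Raid Devices : 5
          State : active, resyncing (DELAYED)
"""


def parse_mdadm_detail(text):
    """Collect the 'Key : value' lines of `mdadm --detail` into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep and key.strip() not in fields:
            fields[key.strip()] = value.strip()
    return fields


def leading_int(value):
    """Return the first whitespace-separated token of a value as an int."""
    return int(value.split()[0])


def raid5_capacity_matches(fields):
    """True if Array Size == (Raid Devices - 1) * Used Dev Size."""
    devices = int(fields["Raid Devices"])
    per_device = leading_int(fields["Used Dev Size"])
    return leading_int(fields["Array Size"]) == (devices - 1) * per_device


if __name__ == "__main__":
    info = parse_mdadm_detail(SAMPLE)
    print("state:", info["State"])
    print("capacity consistent:", raid5_capacity_matches(info))
```

The same parser can be fed fresh `mdadm --detail` output captured at the time of a hang, to record whether the State line already shows a pending resync.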

Version-Release number of selected component (if applicable):


How reproducible:

Write large amounts of data to the RAID5


Steps to Reproduce:
1. Mount the RAID ext3 fs, mount the source NFS filesystem.
2. sudo su
3. rsync -v -a <nfs_source_directory> <RAID5_target_directory>

Note that I have neither confirmed nor ruled out the requirement that the source be nfs.  That's the only place I have enough data to make this interesting.
  
Actual results:
After 1.6 - 1.9 TB of data has been transferred, rsync hangs.  Further writes or reads to the RAID5 FS also hang.  There are 3 rsync threads.  Ctrl-C kills two of them with a broken pipe and an "unbuffered write failed" error message.  The third rsync thread is hung in uninterruptible status on wait channel balance_dirty_pages_ratelimited_nr.


Expected results:
File transfers should complete and the rsync or cp should exit normally.



Additional info:

This might be related to a couple of other bug reports that report hanging on RAID IO, but those bugs seemed to indicate that the file system could be unmounted administratively and a reboot completed.  I have not been able to do so (I've tried the usual umount administratively without success - file system busy) and have needed a reset or power cycle to regain use of the RAID5.

Comment 1 Mark Huth 2012-04-09 17:20:14 UTC
I forgot to include the kernel versions:

Fedora (3.3.0-4.fc16.x86_64)

and

Fedora (3.2.7-1.fc16.x86_64)

both exhibit the same behavior.

Comment 2 Mark Huth 2012-04-09 17:44:40 UTC
Poking around a bit more this morning - I was finally able to kill the rsync thread.  Don't know if it is just a timeout or what changed.

After that, IO to the RAID still hangs, but this time in get_active_stripe.  I have:

flush-9:127 uninterruptible on get_active_stripe
md127_resync uninterruptible on get_active_stripe
shotwell uninterruptible on get_active_stripe

shotwell is not killable.

I'll try a system reboot now and see if it progresses
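
A possibly useful data point when tasks are stuck in get_active_stripe: as far as I understand the raid5 driver, that is where a request waits for an entry in the stripe cache, and the md sysfs interface exposes stripe_cache_size and stripe_cache_active for RAID 5 arrays. A minimal sketch for recording both values is below. The function names are hypothetical, the device name md127 is taken from this report, and whether an exhausted cache is actually related to this hang is an assumption, not something established here.

```python
import os


def read_stripe_cache(md_name="md127", sys_root="/sys/block"):
    """Read stripe_cache_size and stripe_cache_active for one md array.

    Missing or unreadable attributes are reported as None.
    """
    md_dir = os.path.join(sys_root, md_name, "md")
    values = {}
    for attr in ("stripe_cache_size", "stripe_cache_active"):
        try:
            with open(os.path.join(md_dir, attr)) as f:
                values[attr] = int(f.read().strip())
        except (OSError, ValueError):
            values[attr] = None
    return values


def cache_exhausted(values):
    """True if every stripe cache entry is in use (active >= size)."""
    size = values.get("stripe_cache_size")
    active = values.get("stripe_cache_active")
    if size is None or active is None:
        return False
    return active >= size


if __name__ == "__main__":
    stats = read_stripe_cache()
    print(stats, "exhausted:", cache_exhausted(stats))
```

Sampling these values periodically during a transfer, and once more after the hang, would show whether the active count stays pinned at the cache size.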

Comment 3 Mark Huth 2012-04-13 21:53:06 UTC
I am also able to reproduce this with bonnie++ running locally.  No access to the nfs mount is involved.
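
For a local load that does not need bonnie++ installed, a plain sequential writer may be enough to exercise the array. The sketch below is an assumption about what a minimal stand-in could look like, not the test used in this report; the function name and the optional progress callback are hypothetical, and the target path and total size are placeholders to be chosen by whoever runs it.

```python
import os
import time


def write_sequential(path, total_bytes, chunk_bytes=1 << 20, report=None):
    """Write total_bytes of a fixed pattern to path in chunk_bytes pieces.

    If given, report(written_bytes, elapsed_seconds) is called per chunk.
    Returns the number of bytes written.
    """
    chunk = b"\xa5" * chunk_bytes
    written = 0
    start = time.monotonic()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
            if report is not None:
                report(written, time.monotonic() - start)
        # Push the data out of the page cache before returning.
        f.flush()
        os.fsync(f.fileno())
    return written


# Example call (placeholder path and size, not executed here):
#   write_sequential("/media/SVID_RAID/load.bin", 2 * 10**12)
```

If the writer stops making progress, the last value passed to the report callback gives a rough figure for how much data was written before the hang.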

Comment 4 Andreas Thienemann 2012-04-15 08:59:46 UTC
Why is this bug filed against mm? Does this library play any part in your problem?

Comment 5 Mark Huth 2012-04-17 14:39:57 UTC
(In reply to comment #4)
> Why is this bug filed against mm? Does this library play any part in your
> problem?

Because that is the routine the rsync or copy process appears to be stuck in according to the process monitor. (balance_dirty_pages_ratelimited_nr)

It may also be a problem with the sata device drivers or with the RAID drivers.  I don't know what is going on for sure.

I have also reproduced this while reading from the RAID, although that resulted in the entire system hanging up and I could not get any additional information.
The problem did not occur while transferring 3TB to a USB3 drive.  I took out the RAID and added a plain drive and started a copy to that overnight.  I'll check that later today.

So the mm library may just be an innocent victim when something else goes wrong.

Comment 6 Andreas Thienemann 2012-04-17 21:56:29 UTC
(In reply to comment #5)
> (In reply to comment #4)
> > Why is this bug filed against mm? Does this library play any part in your
> > problem?
> 
> Because that is the routine the rsync or copy process appears to be stuck in
> according to the process monitor. (balance_dirty_pages_ratelimited_nr)

But what does that have to do with the libmm?

rsync does not link against it...

I'd say you're filing against the wrong package and you should go for kernel...

Comment 7 Mark Huth 2012-04-18 04:48:52 UTC
(In reply to comment #4)
> Why is this bug filed against mm? Does this library play any part in your
> problem?

Because that is the routine the rsync or copy process appears to be stuck in according to the process monitor. (balance_dirty_pages_ratelimited_nr)

It may also be a problem with the sata device drivers or with the RAID drivers.  I don't know what is going on for sure.

I have also reproduced this while reading from the RAID, although that resulted in the entire system hanging up and I could not get any additional information.
The problem did not occur while transferring 3TB to a USB3 drive.  I took out the RAID and added a plain drive and started a copy to that overnight.  I'll check that later today.

So the mm library may just be an innocent victim when something else goes wrong.

Comment 8 Mark Huth 2012-04-18 04:55:25 UTC
Sorry about the double comment - mid-air collision.

Okay, my mistake.  It doesn't say libmm, just mm, and I thought it was the mm component of the kernel. Changing component to kernel.

Comment 9 Mark Huth 2012-04-18 05:00:30 UTC
I tried it with a simple disk hooked to the Marvell SATA controller, and saw no problems after transferring 3TB of data.  I can do one more experiment by hooking a plain drive to the Intel X79 SATA channel to eliminate the SATA driver/chipset combo.

One other piece of info:  sometimes I see do_irq - no irq handler for vector   (-1) messages in dmesg, but there doesn't seem to be a time correlation with the hangup.  I suppose I can try to run kgdb on the kernel to see what the stacks look like.
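
To make the time-correlation check less of an eyeball exercise, the timestamps of those IRQ messages can be pulled out of saved dmesg output and compared with the approximate time of the hang. This is a minimal sketch that assumes dmesg lines carry the usual "[seconds.micros]" prefix. The function names are hypothetical, and the hang time (in seconds since boot) has to be supplied by hand.

```python
import re

# Matches a dmesg line with a "[   123.456789] message" style timestamp.
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def irq_warning_times(dmesg_text, needle="no irq handler for vector"):
    """Return the timestamps (seconds since boot) of matching lines."""
    times = []
    for line in dmesg_text.splitlines():
        m = LINE.match(line)
        if m and needle in m.group(2).lower():
            times.append(float(m.group(1)))
    return times


def nearest_gap(times, hang_time):
    """Smallest distance in seconds between hang_time and any timestamp.

    Returns None when there are no timestamps to compare against.
    """
    return min((abs(t - hang_time) for t in times), default=None)
```

A gap of many minutes on every occurrence would support the impression that the messages are unrelated to the hang; a gap of a few seconds would be worth a closer look.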

Comment 10 Mark Huth 2012-04-24 17:26:12 UTC
Direct use of a disk attached to either the Marvell or the Intel controller works fine up to the 3TB limit of the disk size.  So this is probably a RAID issue.

Comment 11 frollic nilsson 2012-05-09 12:45:49 UTC
Can https://bugzilla.redhat.com/show_bug.cgi?id=721127 be the same bug?

Comment 12 Mark Huth 2012-05-14 18:58:11 UTC
As to whether this is the same as 721127, that's hard to say, since 721127 seems to be more than one bug.  There are problems with raid on usb, usb drives alone, etc, with the gui freezing.

This bug is specific to RAID 5 on local SATA disks. (I will try some other raid modes this week, since the RAID is useless as it stands.)

I updated to 3.3.4-3.fc16 this weekend and the problem persists.

In the meantime I have checked the SATA controllers as best I can by transferring 3 TB to drives on each of the SATA controllers on the motherboard, as well as to several USB3 3TB drives (backing up the old RAID that I want to replace with this one).  In all cases not involving the RAID 5, all IO completes with no hangs.  As soon as I start to transfer to the RAID 5, whether from nfs, USB3 or a local disk, the problem appears anywhere from 300GB to a couple of TB into the rsync or copy operation.  I can also cause the problem while reading from the RAID and transferring to a local disk.  I also reproduced the problem with the bonnie++ test suite running on the RAID.

The symptom is that the transfer hangs, with the process stuck either in get_active_stripe or balance_dirty_pages_ratelimited_nr.  Subsequently, the filesystem on the RAID cannot be unmounted, and a reboot attempt hangs.  After a reset, the array is not clean, resyncs (briefly), and becomes operational again until some time later during the next transfer.

If you can explain something else for me to do to help debug this, I will.  But the current situation is that the RAID is useless on all of the recent kernels from fc16.
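
One piece of information that is commonly captured for hangs like this is the kernel stack of every blocked task, which the kernel writes to its log when the character "w" is written to /proc/sysrq-trigger (this needs root). A minimal sketch follows; the function name is hypothetical and nobody in this report has confirmed that this was tried.

```python
def request_blocked_task_dump(trigger_path="/proc/sysrq-trigger"):
    """Ask the kernel to log the stacks of tasks in uninterruptible sleep.

    Writes the SysRq command character 'w' to trigger_path. The resulting
    stack traces go to the kernel log (dmesg), not to this process.
    """
    with open(trigger_path, "w") as f:
        f.write("w")


# Example (as root, while the hang is in progress; not executed here):
#   request_blocked_task_dump()
#   ...then save the output of dmesg and attach it to the report.
```

The traces for the stuck rsync, the flush thread and md127_resync would show what each one is waiting on beneath get_active_stripe or balance_dirty_pages_ratelimited_nr.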

Comment 14 Lonni J Friedman 2012-05-28 20:07:24 UTC
I suspect that I'm seeing this issue as well, although my symptoms are slightly different.  I was running Fedora15 with its stock 2.6.40.3-0 x86_64 kernel since Sept 1 of last year on a postgresql database server (128GB RAM, 16 core Xeon CPUs) with a 1.2TB RAID5 array holding the database.  It ran rock solid & stable the entire time.  On May 18, I upgraded to Fedora16, and ever since then I've been seeing seemingly random uninterruptible IO hangs while the system is under heavy load (from database queries & transactions).

The behavior is the same every time: queries to the database never return and start stacking up, ultimately making the entire database unresponsive.  Attempts to stop the database, or even strace its process, all hang and are uninterruptible as well.  As a result, it's impossible to cleanly reboot the system, or even unmount the RAID array, since postgresql completely stops responding.  However, it's not just postgresql.  Attempts to simply read a log file on the same RAID array (using less or more, etc) also fail and end up stuck in an uninterruptible hang.

I've now run into this problem 4 times in the 10 days since the OS upgrade, and it's pretty catastrophic, as this is a large production server.  Each time I basically have to power cycle the server, since it wedges trying to gracefully reboot, and I then end up having to wait around 4 hours for the RAID array to resync & become consistent before I can get anything approaching decent performance.

Oh, and it's not just 1 server, but a 3 server cluster (all identical hardware), and this has been happening on all 3 servers, although it's been far more frequent on the 'master', which handles all database write queries, while the slaves only process read queries.

I'm growing increasingly desperate, and my next attempt to stab at this is to actually reinstall that old 2.6.40.3-0 F15 kernel (in Fedora16) in the hope that it eliminates the instability.

If there's some specific information that I can capture that would aid in debugging this, please let me know.

Comment 15 Mark Huth 2012-05-30 05:31:56 UTC
Lonni,

Thank you.  I'm now very sure that we have the same issue.  Either reads or writes will trigger it.  I've used rsync, bonnie++, and a simple archive copy.  The only place there is a failure is on a RAID 5.  I've tried 2-disk mirrors and 2-disk stripes, both of which allowed transfers to the limit of the volume size.  I've recently moved the RAID to a different set of SATA ports; these are supported Marvell 2-port 6Gbps devices, two on a 4x PCI-Express board.  So the very new Intel SATA ports in the X79 chipset with a Sandy Bridge Extreme are not the culprit.  This is probably a race somewhere with a small window.  I usually see the hang uninterruptible in balance_dirty_pages_ratelimited_nr, but occasionally see it in get_active_stripe.

The system must then be rebooted, but a normal reboot fails; a hardware reset (or power cycle) is required, after which the RAID needs to rebuild for a while.  I'm not sure whether the rebuild was required when only reads were being done.

So I can reproduce this at will over the course of a few hours, or overnight at worst.  What can I do to help debug this?  The current state makes the latest kernels unusable for enterprise systems or serious servers.  Unfortunately, the bleeding edge nature of my hardware does not allow me to go back to a 2.6 kernel.

What size and number of disks do you have, and what is the configuration (RAID mode)?

Comment 16 Lonni J Friedman 2012-05-30 16:22:44 UTC
I'm using RAID5 with three 1TB disks.

Comment 17 Dave Jones 2012-10-23 15:38:22 UTC
# Mass update to all open bugs.

Kernel 3.6.2-1.fc16 has just been pushed to updates.
This update is a significant rebase from the previous version.

Please retest with this kernel, and let us know if your problem has been fixed.

In the event that you have upgraded to a newer release and the bug you reported
is still present, please change the version field to the newest release you have
encountered the issue with.  Before doing so, please ensure you are testing the
latest kernel update in that release and attach any new and relevant information
you may have gathered.

If you are not the original bug reporter and you still experience this bug,
please file a new report, as it is possible that you may be seeing a
different problem. 
(Please don't clone this bug, a fresh bug referencing this bug in the comment is sufficient).

Comment 18 Justin M. Forbes 2012-11-14 16:32:43 UTC
With no response, we are closing this bug under the assumption that it is no longer an issue. If you still experience this bug, please feel free to reopen the bug report.

Comment 19 Lonni J Friedman 2012-11-14 20:02:06 UTC
Since only the bug reporter can re-open bugs, and Redhat loves to close duplicates and apparently no one at Redhat pays attention when other people report the same problem in the same bug, this one has been closed, with no means of re-opening it by anyone other than the person who filed it.

This isn't fixed.  Not remotely so.

