Bug 602457 - mdadm resync freeze at random
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 13
Hardware: x86_64 Linux
Priority: low
Severity: high
Assigned To: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
Keywords: Reopened
Depends On:
Blocks: 617280
Reported: 2010-06-09 17:21 EDT by David Edwards
Modified: 2011-07-03 09:30 EDT (History)
CC: 5 users

See Also:
Fixed In Version: mdadm-3.1.3-0.git20100804.2.fc13
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 617280 718496 759648
Environment:
Last Closed: 2010-12-07 15:13:01 EST


Attachments: None
Description David Edwards 2010-06-09 17:21:46 EDT
I just installed Fedora 13 on my Dell Optiplex 755 with an Intel RAID controller.

00:1f.2 RAID bus controller: Intel Corporation 82801 SATA RAID Controller (rev 02)

The system is resyncing the RAID array. After about 15 minutes of use (on average), the system becomes unresponsive and I have to perform a hard reboot.


Version-Release number of selected component (if applicable):


How reproducible: Every time


Steps to Reproduce:
1. Boot System
2. Wait
3.
  
Actual results: System Freeze


Expected results: 


Additional info:

Linux HOSTNAME 2.6.33.5-112.fc13.x86_64 #1 SMP Thu May 27 02:28:31 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux


/dev/md127:
      Container : /dev/md0, member 0
     Raid Level : raid1
     Array Size : 244137984 (232.83 GiB 250.00 GB)
  Used Dev Size : 244138116 (232.83 GiB 250.00 GB)
   Raid Devices : 2
  Total Devices : 2

          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 1% complete


           UUID : 72a86dc9:f6cd6593:7e418bd6:48d0d442
    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      active sync   /dev/sdb

Personalities : [raid1] 
md127 : active raid1 sda[1] sdb[0]
      244137984 blocks super external:/md0/0 [2/2] [UU]
      [>....................]  resync =  1.1% (2713152/244138116) finish=74.5min speed=53938K/sec
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>
Comment 1 David Edwards 2010-06-09 18:39:10 EDT
Also, I can reproduce this if I kick off a large data transfer to or from the array. After about five minutes the system will freeze.
Comment 2 David Edwards 2010-06-10 11:32:18 EDT
I restarted last night before I left the office. When I returned today, the RAID had completed the resync. I have had no issues with the system this morning. I would be willing to bet that if I broke the array, I would have these issues again.

Is there any information that I need to provide? How should I proceed?
Comment 3 Doug Ledford 2010-06-10 12:31:47 EDT
The issue you are seeing is not a hard lock; it is total unresponsiveness, but it only lasts until the resync completes.  In other words, had you not rebooted the machine, it would have completed the resync each time.  This is a known issue we are working on.
Comment 4 Doug Ledford 2010-06-10 12:35:18 EDT

*** This bug has been marked as a duplicate of bug 586299 ***
Comment 5 Doug Ledford 2010-06-10 12:36:41 EDT
Doh!  This bug is against Fedora, while the bug I marked this as a duplicate of is against Red Hat Enterprise Linux 6.  Sorry about that.  Reopening this bug.
Comment 6 Doug Ledford 2010-06-10 12:38:13 EDT
Dan, this bug in Fedora looks to be the same as bug 586299 and indicates we probably do need to try and track this down.  Is the problem reproducible for you?
Comment 7 Dan Williams 2010-06-10 12:41:57 EDT
Not so far...  let me try F13 and see if I can get a hit.
Comment 8 Dan Williams 2010-06-11 01:43:33 EDT
I think I see the bug, and why it triggers only in the resync case.

When we get stuck the thread issuing the write is stuck in md_write_start() waiting for:

wait_event(mddev->sb_wait,
           !test_bit(MD_CHANGE_CLEAN, &mddev->flags) &&
           !test_bit(MD_CHANGE_PENDING, &mddev->flags));

MD_CHANGE_CLEAN is cleared by mdmon writing "active" to .../md/array_state.  I believe mdmon is properly doing this and waking up mddev->sb_wait.  However, while we are waiting for this wakeup it is possible that the resync thread hits a checkpoint event, which causes MD_CHANGE_CLEAN to be set again.  The result is that md_write_start() wakes up, sees the bit is still set, and goes back to sleep, and mdmon stays asleep until the resync-completed event.

The fix is to have mdmon subscribe to sync_completed events so that it wakes up and rehandles MD_CHANGE_CLEAN.  I have actually already implemented this for rebuild checkpointing [1].  Neil had a couple of review comments, so I'll fix those up and resubmit.

[1]: http://git.kernel.org/?p=linux/kernel/git/djbw/mdadm.git;a=commitdiff;h=484240d8
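
For illustration only, here is a minimal userspace sketch of the idea described above: wake up whenever the kernel signals a sync_completed checkpoint (via select() on the sysfs attribute) and re-write "active" to array_state so a writer blocked in md_write_start() is not left waiting on MD_CHANGE_CLEAN.  This is not the actual mdmon code from commit 484240d8; the device path, the bare loop, and the omission of metadata handling are assumptions made for the sketch.

/*
 * Hypothetical sketch (not mdmon itself): sysfs attributes that call
 * sysfs_notify() become ready on select()'s exception set after they
 * have been read, so read sync_completed to arm the fd, wait for the
 * next checkpoint, then mark the array "active" again so that
 * MD_CHANGE_CLEAN is cleared and blocked writers can proceed.
 */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/select.h>

int main(void)
{
    /* paths are illustrative; a real monitor discovers them per array */
    int sync_fd  = open("/sys/block/md127/md/sync_completed", O_RDONLY);
    int state_fd = open("/sys/block/md127/md/array_state", O_WRONLY);
    char buf[64];

    if (sync_fd < 0 || state_fd < 0) {
        perror("open");
        return 1;
    }

    for (;;) {
        fd_set exceptfds;

        /* consume the current value so the next change re-arms the fd */
        lseek(sync_fd, 0, SEEK_SET);
        read(sync_fd, buf, sizeof(buf));

        FD_ZERO(&exceptfds);
        FD_SET(sync_fd, &exceptfds);
        if (select(sync_fd + 1, NULL, NULL, &exceptfds, NULL) < 0) {
            perror("select");
            break;
        }

        /* a checkpoint fired: a real monitor would record the resync
         * position in the external metadata here, then mark the array
         * active again so the dirty bit is cleared and writers wake up */
        write(state_fd, "active", strlen("active"));
    }

    close(sync_fd);
    close(state_fd);
    return 0;
}

In the real fix the same wakeup is also what drives checkpointing of the recovery position into the external (IMSM) metadata, as discussed in the commits referenced below.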
Comment 9 David Edwards 2010-06-11 11:22:37 EDT
Is there anything I can do to help?
Comment 10 Doug Ledford 2010-07-09 13:22:47 EDT
Dan, is the fix to this in Neil's current git master branch queued for the upcoming 3.1.3 mdadm release?
Comment 11 Dan Williams 2010-07-09 13:32:32 EDT
(In reply to comment #10)
> Dan, is the fix to this in Neil's current git master branch queued for the
> upcoming 3.1.3 mdadm release?    

Yes, I flagged commit 484240d8 "mdmon: periodically checkpoint recovery" as urgent to Neil, so I expect it to be a part of 3.1.3.

Neil asked for commit 4f0a7acc "mdmon: record sync_completed directly to the metadata" as a cleanup, but it is not necessary for resolving the hang.  Waking up on sync_completed events is the critical piece.
Comment 12 Doug Ledford 2010-07-09 13:38:38 EDT
OK, I'll get a git build pulled together soon and check for that commit.  Thanks Dan.
Comment 13 Eric Work 2010-07-19 02:42:11 EDT
I took the latest F13 package of mdadm (mdadm-3.1.2-10.fc13.x86_64) and applied the following git patches.

484240d8 "mdmon: periodically checkpoint recovery"
4f0a7acc "mdmon: record sync_completed directly to the metadata"

I've run a complete check twice and haven't had a single problem with the system freezing.  The only thing I've noticed is minor delays when saving files, which is expected while a check is running.  I've re-enabled the weekly RAID check, so I'll report back if there are any problems with that, but I expect not.  Prior to these patches the system would slowly die until the check reached around 30-40%, then it would become unresponsive until it completed the check or I rebooted.  Good work finding this bug; it's been causing major problems for me over the last 6 months and forced me to boot into Windows to complete the check and mark the array as clean.
Comment 14 Doug Ledford 2010-07-20 19:07:45 EDT
This should be fixed in the latest f-13 build of mdadm.
Comment 15 Fedora Update System 2010-07-22 11:36:36 EDT
mdadm-3.1.3-0.git20100722.1.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.1.fc13
Comment 16 Fedora Update System 2010-07-22 22:39:14 EDT
mdadm-3.1.3-0.git20100722.2.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update mdadm'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.2.fc13
Comment 17 Fedora Update System 2010-08-05 10:25:36 EDT
mdadm-3.1.3-0.git20100804.2.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13
Comment 18 Fedora Update System 2010-08-05 10:26:14 EDT
mdadm-3.1.3-0.git20100804.2.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12
Comment 19 Fedora Update System 2010-08-05 10:26:48 EDT
mdadm-3.1.3-0.git20100804.2.fc14 has been submitted as an update for Fedora 14.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14
Comment 20 Fedora Update System 2010-08-05 19:29:46 EDT
mdadm-3.1.3-0.git20100804.2.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update mdadm'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12
Comment 21 Fedora Update System 2010-08-05 19:53:19 EDT
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update mdadm'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13
Comment 22 Fedora Update System 2010-08-09 21:30:23 EDT
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.
 If you want to test the update, you can install it with 
 su -c 'yum --enablerepo=updates-testing update mdadm'.  You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14
Comment 23 David Edwards 2010-11-11 09:56:16 EST
I am testing the latest release on F13.
Comment 24 David Edwards 2010-11-11 10:02:43 EST
Rebuild speed has gone down to about 10000K/sec on average, versus about 80000K/sec on average before the patch.  Previously it would get up to about 120000K/sec and then the system would start to die.  So far, so good...

I am testing mdadm-3.1.3-0.git20100804.2.fc13.
Comment 25 Fedora Update System 2010-12-07 15:12:20 EST
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 26 Fedora Update System 2010-12-07 15:14:14 EST
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 27 scott-brown 2011-01-16 13:41:58 EST
Seeing what I presume to be the same behaviour on mdadm-3.1.3-0.git20100804.2.fc14.x86_64.

The system runs for days with 4x Seagate Barracuda 7200.12 (Model: ST31000528AS) drives in an LVM group; however, when they are added to a RAID5 array, the system freezes randomly.

Building the RAID array manually or through GNOME Disk Utility (using all defaults) doesn't seem to make a difference.  EXT4 or XFS doesn't make a difference.  Mounted or not, it doesn't make a difference.

General behaviour... 

1. array starts, begins resync
2. eventually all disk activity stops
3. system is unresponsive to all input

A hard reset of the system is required.  Obviously this is something I would prefer not to do.  The net result is that I cannot use RAID on this system.

The system is homegrown; the motherboard is a Gigabyte GA-H55M-SV2 and the BIOS is set to AHCI.

Linux version 2.6.35.10-74.fc14.x86_64 (mockbuild@x86-11.phx2.fedoraproject.org) (gcc version 4.5.1 20100924 (Red Hat 4.5.1-4) (GCC) ) #1 SMP Thu Dec 23 16:04:50 UTC 2010

/dev/md0:
        Version : 1.2
  Creation Time : Sat Jan 15 21:27:10 2011
     Raid Level : raid5
     Array Size : 2930276352 (2794.53 GiB 3000.60 GB)
  Used Dev Size : 976758784 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Jan 16 13:36:18 2011
          State : active, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 94% complete

           Name : :TerraStore
           UUID : 8151c129:646ea005:34e60918:bc4a0ed6
         Events : 3737

    Number   Major   Minor   RaidDevice State
       0       8       81        0      active sync   /dev/sdf1
       1       8       49        1      active sync   /dev/sdd1
       2       8       33        2      active sync   /dev/sdc1
       4       8       17        3      spare rebuilding   /dev/sdb1

Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sdf1[0] sdb1[4] sdc1[2] sdd1[1]
      2930276352 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [===================>.]  recovery = 95.1% (929804532/976758784) finish=18.8min speed=41551K/sec
      bitmap: 3/8 pages [12KB], 65536KB chunk

unused devices: <none>
Comment 28 trancegenetic 2011-07-03 09:30:13 EDT
I still seem to be experiencing this bug.

The system is RHEL 6.0 with mdadm-3.1.3-1.el6.x86_64.

Symptoms:
The system freezes/hangs, screen output stops, and no input is possible from the keyboard or mouse.

This always happens during an mdadm resync or even a rebuild.  Copying files through the running Samba server seems to trigger the freeze/hang much faster.

System details:
/dev/md0:
        Version : 1.0
  Creation Time : Wed Jun  8 16:45:24 2011
     Raid Level : raid1
     Array Size : 488385400 (465.76 GiB 500.11 GB)
  Used Dev Size : 488385400 (465.76 GiB 500.11 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Jul  3 15:25:19 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : neptune-fix:0
           UUID : 7d13420f:1996e153:6706c024:8e22715f
         Events : 3047

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       65        1      active sync   /dev/sde1


/dev/md1:
        Version : 1.1
  Creation Time : Wed Jun  8 16:47:55 2011
     Raid Level : raid5
     Array Size : 3907023872 (3726.03 GiB 4000.79 GB)
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Jul  3 15:20:37 2011
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : neptune-fix:1
           UUID : 8a48b835:5f02582b:68f0c66f:f1d85639
         Events : 17507

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       3       8       49        2      active sync   /dev/sdd1
