Description of problem:
I just installed Fedora 13 on my Dell Optiplex 755 with an Intel RAID controller:

00:1f.2 RAID bus controller: Intel Corporation 82801 SATA RAID Controller (rev 02)

The system is resyncing the RAID array. After about 15 minutes of use (on average), the system becomes unresponsive and I have to perform a hard reboot.

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:
1. Boot system
2. Wait
3.

Actual results:
System freeze

Expected results:

Additional info:
Linux HOSTNAME 2.6.33.5-112.fc13.x86_64 #1 SMP Thu May 27 02:28:31 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

/dev/md127:
      Container : /dev/md0, member 0
     Raid Level : raid1
     Array Size : 244137984 (232.83 GiB 250.00 GB)
  Used Dev Size : 244138116 (232.83 GiB 250.00 GB)
   Raid Devices : 2
  Total Devices : 2

          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 1% complete

           UUID : 72a86dc9:f6cd6593:7e418bd6:48d0d442

    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      active sync   /dev/sdb

Personalities : [raid1]
md127 : active raid1 sda[1] sdb[0]
      244137984 blocks super external:/md0/0 [2/2] [UU]
      [>....................]  resync =  1.1% (2713152/244138116) finish=74.5min speed=53938K/sec

md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm

unused devices: <none>
Also, I can reproduce this if I kick off a large data transfer to or from the array. After about five minutes the system will freeze.
I restarted last night before I left the office. When I returned today, the RAID had completed the resync, and I have had no issues with the system this morning. I would be willing to bet that if I broke the array, I would see these issues again. Is there any information that I need to provide? How should I proceed?
The issue you are seeing is not a hard lock; it is total unresponsiveness, but it only lasts until the resync completes. In other words, had you not rebooted the machine, it would have completed the resync each time. This is a known issue we are working on.
*** This bug has been marked as a duplicate of bug 586299 ***
Doh! This bug is against Fedora while the bug I marked this a dup of is against Red Hat Enterprise Linux 6. Sorry about that. Reopening this bug.
Dan, this bug in Fedora looks to be the same as bug 586299 and indicates we probably do need to try and track this down. Is the problem reproducible for you?
Not so far... let me try F13 and see if I can get a hit.
I think I see the bug, and why it triggers only in the resync case. When we get stuck, the thread issuing the write is stuck in md_write_start() waiting for:

    wait_event(mddev->sb_wait,
               !test_bit(MD_CHANGE_CLEAN, &mddev->flags) &&
               !test_bit(MD_CHANGE_PENDING, &mddev->flags));

MD_CHANGE_CLEAN is cleared by mdmon writing "active" to .../md/array_state. I believe mdmon is properly doing this, which wakes up mddev->sb_wait. However, while we are waiting for this wakeup, it is possible that the resync thread hits a checkpoint event, which sets MD_CHANGE_CLEAN again. The result is that md_write_start() wakes up, sees the bit is still set, and goes back to sleep, while mdmon stays asleep until the resync-completed event.

The fix is to have mdmon subscribe to sync_completed events so that it wakes up to re-handle MD_CHANGE_CLEAN. I have actually already implemented this for rebuild checkpointing [1]. Neil had a couple of review comments, so I'll fix those up and resubmit.

[1]: http://git.kernel.org/?p=linux/kernel/git/djbw/mdadm.git;a=commitdiff;h=484240d8
Is there anything I can do to help?
Dan, is the fix to this in Neil's current git master branch queued for the upcoming 3.1.3 mdadm release?
(In reply to comment #10) > Dan, is the fix to this in Neil's current git master branch queued for the > upcoming 3.1.3 mdadm release? Yes, I flagged commit 484240d8 "mdmon: periodically checkpoint recovery" as urgent to Neil, so I expect it to be a part of 3.1.3. Neil asked for commit 4f0a7acc "mdmon: record sync_completed directly to the metadata" as a cleanup, but it is not necessary for resolving the hang. Waking up on sync_completed events is the critical piece.
OK, I'll get a git build pulled together soon and check for that commit. Thanks Dan.
I took the latest F13 package of mdadm (mdadm-3.1.2-10.fc13.x86_64) and applied the following git patches:

484240d8 "mdmon: periodically checkpoint recovery"
4f0a7acc "mdmon: record sync_completed directly to the metadata"

I've run a complete check twice and haven't had a single problem with the system freezing. The only thing I've noticed is minor delays saving files, which is expected while a check is running. I've re-enabled the weekly RAID check, so I'll report back if there are any problems with that, but I expect not.

Prior to these patches, the system would slowly degrade until the check reached around 30-40%, then become unresponsive until it completed or I rebooted.

Good work finding this bug; it's been causing major problems for me over the last six months and forced me to boot into Windows to complete the check and mark the array as clean.
This should be fixed in the latest F13 build of mdadm.
mdadm-3.1.3-0.git20100722.1.fc13 has been submitted as an update for Fedora 13. http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.1.fc13
mdadm-3.1.3-0.git20100722.2.fc13 has been pushed to the Fedora 13 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.2.fc13
mdadm-3.1.3-0.git20100804.2.fc13 has been submitted as an update for Fedora 13. http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13
mdadm-3.1.3-0.git20100804.2.fc12 has been submitted as an update for Fedora 12. http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12
mdadm-3.1.3-0.git20100804.2.fc14 has been submitted as an update for Fedora 14. http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14
mdadm-3.1.3-0.git20100804.2.fc12 has been pushed to the Fedora 12 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14
I am testing the latest release on F13.
So rebuild speed has gone down to about 10000K/sec (average) versus 80000K/sec (average) before the patch. It would get up to about 120000K/sec and then the system would start to die. So far, so good. I am testing mdadm-3.1.3-0.git20100804.2.fc13.
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 stable repository. If problems still persist, please make note of it in this bug report.
Seeing what I presume to be the same behaviour on mdadm-3.1.3-0.git20100804.2.fc14.x86_64.

The system runs for days with 4x Seagate Barracuda 7200.12 (Model: ST31000528AS) drives in an LVM group; however, when they are added to a RAID5 array, the system freezes randomly. Building the RAID array manually or through GNOME Disk Utility (using all defaults) doesn't seem to make a difference. EXT4 or XFS doesn't make a difference. Mounted or not doesn't make a difference.

General behaviour:
1. Array starts, begins resync
2. Eventually all disk activity stops
3. System is unresponsive to all input

A hard reset of the system is required. Obviously this is something I would prefer not to do. The net result is that I cannot use RAID on this system.

The system is homegrown; the mobo is a Gigabyte GA-H55M-SV2, with the BIOS set to AHCI.

Linux version 2.6.35.10-74.fc14.x86_64 (mockbuild.fedoraproject.org) (gcc version 4.5.1 20100924 (Red Hat 4.5.1-4) (GCC) ) #1 SMP Thu Dec 23 16:04:50 UTC 2010

/dev/md0:
        Version : 1.2
  Creation Time : Sat Jan 15 21:27:10 2011
     Raid Level : raid5
     Array Size : 2930276352 (2794.53 GiB 3000.60 GB)
  Used Dev Size : 976758784 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Jan 16 13:36:18 2011
          State : active, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 94% complete

           Name : :TerraStore
           UUID : 8151c129:646ea005:34e60918:bc4a0ed6
         Events : 3737

    Number   Major   Minor   RaidDevice State
       0       8       81        0      active sync   /dev/sdf1
       1       8       49        1      active sync   /dev/sdd1
       2       8       33        2      active sync   /dev/sdc1
       4       8       17        3      spare rebuilding   /dev/sdb1

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdf1[0] sdb1[4] sdc1[2] sdd1[1]
      2930276352 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [===================>.]  recovery = 95.1% (929804532/976758784) finish=18.8min speed=41551K/sec
      bitmap: 3/8 pages [12KB], 65536KB chunk

unused devices: <none>
I still seem to experience this bug. The system is RHEL 6.0 with mdadm-3.1.3-1.el6.x86_64.

Symptoms: the system freezes/hangs, screen output hangs, and no input is possible from keyboard or mouse. This always happens during an mdadm resync or even a rebuild. Copying files through the running Samba server seems to trigger the freeze/hang much faster.

System details:

/dev/md0:
        Version : 1.0
  Creation Time : Wed Jun  8 16:45:24 2011
     Raid Level : raid1
     Array Size : 488385400 (465.76 GiB 500.11 GB)
  Used Dev Size : 488385400 (465.76 GiB 500.11 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Jul  3 15:25:19 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : neptune-fix:0
           UUID : 7d13420f:1996e153:6706c024:8e22715f
         Events : 3047

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       65        1      active sync   /dev/sde1

/dev/md1:
        Version : 1.1
  Creation Time : Wed Jun  8 16:47:55 2011
     Raid Level : raid5
     Array Size : 3907023872 (3726.03 GiB 4000.79 GB)
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sun Jul  3 15:20:37 2011
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 512K

           Name : neptune-fix:1
           UUID : 8a48b835:5f02582b:68f0c66f:f1d85639
         Events : 17507

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       3       8       49        2      active sync   /dev/sdd1