+++ This bug was initially created as a clone of Bug #602457 +++

I just installed Fedora 13 on my Dell Optiplex 755 with an Intel RAID controller.

00:1f.2 RAID bus controller: Intel Corporation 82801 SATA RAID Controller (rev 02)

The system is resyncing the RAID array. After about 15 minutes of use (on average), the system becomes unresponsive and I have to perform a hard reboot.

Version-Release number of selected component (if applicable):

How reproducible:
Every time

Steps to Reproduce:
1. Boot system
2. Wait
3.

Actual results:
System freeze

Expected results:

Additional info:

Linux HOSTNAME 2.6.33.5-112.fc13.x86_64 #1 SMP Thu May 27 02:28:31 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

/dev/md127:
      Container : /dev/md0, member 0
     Raid Level : raid1
     Array Size : 244137984 (232.83 GiB 250.00 GB)
  Used Dev Size : 244138116 (232.83 GiB 250.00 GB)
   Raid Devices : 2
  Total Devices : 2
          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
 Rebuild Status : 1% complete
           UUID : 72a86dc9:f6cd6593:7e418bd6:48d0d442
    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      active sync   /dev/sdb

Personalities : [raid1]
md127 : active raid1 sda[1] sdb[0]
      244137984 blocks super external:/md0/0 [2/2] [UU]
      [>....................]  resync =  1.1% (2713152/244138116) finish=74.5min speed=53938K/sec

md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm

unused devices: <none>

--- Additional comment from arkiados on 2010-06-09 18:39:10 EDT ---

Also, I can reproduce this if I kick off a large data transfer to or from the array. After about five minutes the system will freeze.

--- Additional comment from arkiados on 2010-06-10 11:32:18 EDT ---

I restarted last night before I left the office. When I returned today, the RAID had completed the resync. I have had no issues with the system this morning. I would be willing to bet that if I broke the array, I would have these issues again. Is there any information that I need to provide? How should I proceed?

--- Additional comment from dledford on 2010-06-10 12:31:47 EDT ---

The issue you are seeing is not a hard lock; it is total unresponsiveness, but it only lasts until the resync completes. In other words, had you not rebooted the machine, it would have completed the resync each time. This is a known issue we are working on.

--- Additional comment from dledford on 2010-06-10 12:35:18 EDT ---

*** This bug has been marked as a duplicate of bug 586299 ***

--- Additional comment from dledford on 2010-06-10 12:36:41 EDT ---

Doh! This bug is against Fedora while the bug I marked this a duplicate of is against Red Hat Enterprise Linux 6. Sorry about that. Reopening this bug.

--- Additional comment from dledford on 2010-06-10 12:38:13 EDT ---

Dan, this bug in Fedora looks to be the same as bug 586299 and indicates we probably do need to try and track this down. Is the problem reproducible for you?

--- Additional comment from dan.j.williams on 2010-06-10 12:41:57 EDT ---

Not so far... let me try F13 and see if I can get a hit.

--- Additional comment from dan.j.williams on 2010-06-11 01:43:33 EDT ---

I think I see the bug, and why it triggers only in the resync case. When we get stuck, the thread issuing the write is stuck in md_write_start() waiting for:

    wait_event(mddev->sb_wait,
               !test_bit(MD_CHANGE_CLEAN, &mddev->flags) &&
               !test_bit(MD_CHANGE_PENDING, &mddev->flags));

MD_CHANGE_CLEAN is cleared by mdmon writing "active" to .../md/array_state.
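For illustration, that user-space step is nothing more than a write(2) to the sysfs attribute; a minimal sketch follows, where the helper name and the "md127" example device are assumptions rather than mdmon's actual code:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Sketch: mark an externally managed array "active" so the kernel
     * clears MD_CHANGE_CLEAN and wakes writers sleeping on sb_wait.
     * Usage would be e.g. mark_array_active("md127"). */
    int mark_array_active(const char *mdname)
    {
        char path[64];
        int fd;

        snprintf(path, sizeof(path), "/sys/block/%s/md/array_state", mdname);
        fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        /* The kernel parses the token, clears the flag, and wakes sb_wait. */
        if (write(fd, "active", strlen("active")) < 0) {
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }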
I believe mdmon is properly doing this and wakes up mddev->sb_wait. However, while we are waiting for this wakeup, it is possible that the resync thread hits a checkpoint event. This causes MD_CHANGE_CLEAN to be set again. The result is that md_write_start() wakes up, sees the bit is still set, and goes back to sleep, while mdmon stays asleep until the resync-completed event.

The fix is to get mdmon to subscribe to sync_completed events so that mdmon wakes up to rehandle MD_CHANGE_CLEAN. I have actually already implemented this for rebuild checkpointing [1]. Neil had a couple of review comments, so I'll fix those up and resubmit.

[1]: http://git.kernel.org/?p=linux/kernel/git/djbw/mdadm.git;a=commitdiff;h=484240d8

--- Additional comment from arkiados on 2010-06-11 11:22:37 EDT ---

Is there anything I can do to help?

--- Additional comment from dledford on 2010-07-09 13:22:47 EDT ---

Dan, is the fix to this in Neil's current git master branch queued for the upcoming 3.1.3 mdadm release?

--- Additional comment from dan.j.williams on 2010-07-09 13:32:32 EDT ---

(In reply to comment #10)
> Dan, is the fix to this in Neil's current git master branch queued for the
> upcoming 3.1.3 mdadm release?

Yes, I flagged commit 484240d8 "mdmon: periodically checkpoint recovery" as urgent to Neil, so I expect it to be a part of 3.1.3. Neil asked for commit 4f0a7acc "mdmon: record sync_completed directly to the metadata" as a cleanup, but it is not necessary for resolving the hang. Waking up on sync_completed events is the critical piece.

--- Additional comment from dledford on 2010-07-09 13:38:38 EDT ---

OK, I'll get a git build pulled together soon and check for that commit. Thanks, Dan.

--- Additional comment from work.eric on 2010-07-19 02:42:11 EDT ---

I took the latest F13 package of mdadm (mdadm-3.1.2-10.fc13.x86_64) and applied the following git patches:

484240d8 "mdmon: periodically checkpoint recovery"
4f0a7acc "mdmon: record sync_completed directly to the metadata"

I've run a complete check twice and haven't had a single problem with the system freezing. The only thing I've noticed is minor delays saving files, which is expected while a check is running. I've re-enabled the weekly RAID check, so I'll report back if there are any problems with that, but I expect not. Before these patches, the system would slowly die until the check reached around 30-40%, then it would become unresponsive until it completed the check or I rebooted. Good work finding this bug; it's been causing major problems for me over the last six months and forced me to boot into Windows to complete the check and mark the array as clean.

--- Additional comment from dledford on 2010-07-20 19:07:45 EDT ---

This should be fixed in the latest f-13 build of mdadm.

--- Additional comment from updates on 2010-07-22 11:36:36 EDT ---

mdadm-3.1.3-0.git20100722.1.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.1.fc13

--- Additional comment from updates on 2010-07-22 22:39:14 EDT ---

mdadm-3.1.3-0.git20100722.2.fc13 has been pushed to the Fedora 13 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update mdadm'.
You can provide feedback for this update here:
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.2.fc13
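As background to the fix Dan describes above: sysfs attributes such as sync_completed notify user space of changes through select(2)/poll(2), where an update surfaces as an "exceptional" event once the current value has been read. A minimal sketch of such a subscription loop (the device path is an example, and this is not mdmon's actual implementation):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    int main(void)
    {
        const char *path = "/sys/block/md127/md/sync_completed"; /* example */
        char buf[64];
        fd_set efds;
        ssize_t n;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return 1;
        for (;;) {
            /* sysfs only signals pollers that have consumed the current
             * value, so re-read the attribute before each wait. */
            lseek(fd, 0, SEEK_SET);
            n = read(fd, buf, sizeof(buf) - 1);
            if (n < 0)
                break;
            buf[n] = '\0';
            printf("sync_completed: %s", buf);

            FD_ZERO(&efds);
            FD_SET(fd, &efds);
            /* Sleep until the kernel calls sysfs_notify() on the file;
             * a monitor would rehandle MD_CHANGE_CLEAN on each wakeup. */
            if (select(fd + 1, NULL, NULL, &efds, NULL) < 0)
                break;
        }
        close(fd);
        return 0;
    }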
--- Additional comment from updates on 2010-08-05 10:25:36 EDT ---

mdadm-3.1.3-0.git20100804.2.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13

--- Additional comment from updates on 2010-08-05 10:26:14 EDT ---

mdadm-3.1.3-0.git20100804.2.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12

--- Additional comment from updates on 2010-08-05 10:26:48 EDT ---

mdadm-3.1.3-0.git20100804.2.fc14 has been submitted as an update for Fedora 14.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14

--- Additional comment from updates on 2010-08-05 19:29:46 EDT ---

mdadm-3.1.3-0.git20100804.2.fc12 has been pushed to the Fedora 12 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update mdadm'.
You can provide feedback for this update here:
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12

--- Additional comment from updates on 2010-08-05 19:53:19 EDT ---

mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update mdadm'.
You can provide feedback for this update here:
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13

--- Additional comment from updates on 2010-08-09 21:30:23 EDT ---

mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with
su -c 'yum --enablerepo=updates-testing update mdadm'.
You can provide feedback for this update here:
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14

--- Additional comment from arkiados on 2010-11-11 09:56:16 EST ---

I am testing the latest release on F13.

--- Additional comment from arkiados on 2010-11-11 10:02:43 EST ---

Rebuild speed has gone down to about 10000Kbps (average) versus 80000Kbps (average) before the patch. It used to get up to about 120000Kbps, and then the system would start to die. So far, so good... I am testing mdadm-3.1.3-0.git20100804.2.fc13.

--- Additional comment from updates on 2010-12-07 15:12:20 EST ---

mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.

--- Additional comment from updates on 2010-12-07 15:14:14 EST ---

mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 stable repository. If problems still persist, please make note of it in this bug report.

--- Additional comment from scott-brown on 2011-01-16 13:41:58 EST ---

Seeing what I presume to be the same behaviour on mdadm-3.1.3-0.git20100804.2.fc14.x86_64. The system runs for days with 4x Seagate Barracuda 7200.12 (Model: ST31000528AS) drives in an LVM group; however, when they are added to a RAID5 array, the system freezes randomly. Building the RAID array manually or through GNOME Disk Utility (using all defaults) doesn't seem to make a difference. EXT4 or XFS doesn't make a difference. Mounted or not doesn't make a difference.

General behaviour:
1. array starts, begins resync
2. eventually all disk activity stops
3. system is unresponsive to all input

A hard reset of the system is required. Obviously this is something I would prefer not to do. The net result is that I cannot use RAID on this system. The system is homegrown; the motherboard is a Gigabyte GA-H55M-SV2 with the BIOS set to AHCI.

Linux version 2.6.35.10-74.fc14.x86_64 (mockbuild.fedoraproject.org) (gcc version 4.5.1 20100924 (Red Hat 4.5.1-4) (GCC) ) #1 SMP Thu Dec 23 16:04:50 UTC 2010

/dev/md0:
        Version : 1.2
  Creation Time : Sat Jan 15 21:27:10 2011
     Raid Level : raid5
     Array Size : 2930276352 (2794.53 GiB 3000.60 GB)
  Used Dev Size : 976758784 (931.51 GiB 1000.20 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Sun Jan 16 13:36:18 2011
          State : active, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1
         Layout : left-symmetric
     Chunk Size : 512K
 Rebuild Status : 94% complete
           Name : :TerraStore
           UUID : 8151c129:646ea005:34e60918:bc4a0ed6
         Events : 3737
    Number   Major   Minor   RaidDevice State
       0       8       81        0      active sync   /dev/sdf1
       1       8       49        1      active sync   /dev/sdd1
       2       8       33        2      active sync   /dev/sdc1
       4       8       17        3      spare rebuilding   /dev/sdb1

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdf1[0] sdb1[4] sdc1[2] sdd1[1]
      2930276352 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [===================>.]  recovery = 95.1% (929804532/976758784) finish=18.8min speed=41551K/sec
      bitmap: 3/8 pages [12KB], 65536KB chunk

unused devices: <none>

--- Additional comment from trancegenetic on 2011-07-03 09:30:13 EDT ---

I still seem to experience this bug. The system is RHEL 6.0 with mdadm-3.1.3-1.el6.x86_64.

Symptoms: the system freezes/hangs, screen output hangs, and no input is possible from keyboard or mouse. This always happens during an mdadm resync or even a rebuild. Copying files through the running Samba server seems to trigger the freeze/hang much faster.

System details:

/dev/md0:
        Version : 1.0
  Creation Time : Wed Jun 8 16:45:24 2011
     Raid Level : raid1
     Array Size : 488385400 (465.76 GiB 500.11 GB)
  Used Dev Size : 488385400 (465.76 GiB 500.11 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Sun Jul 3 15:25:19 2011
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
           Name : neptune-fix:0
           UUID : 7d13420f:1996e153:6706c024:8e22715f
         Events : 3047
    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       65        1      active sync   /dev/sde1

/dev/md1:
        Version : 1.1
  Creation Time : Wed Jun 8 16:47:55 2011
     Raid Level : raid5
     Array Size : 3907023872 (3726.03 GiB 4000.79 GB)
  Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
   Raid Devices : 3
  Total Devices : 3
    Persistence : Superblock is persistent
  Intent Bitmap : Internal
    Update Time : Sun Jul 3 15:20:37 2011
          State : active
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 512K
           Name : neptune-fix:1
           UUID : 8a48b835:5f02582b:68f0c66f:f1d85639
         Events : 17507
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       3       8       49        2      active sync   /dev/sdd1
This is guaranteed not to be the same as the bug you duplicated. That bug was fixed, and it only affected Intel software RAID arrays, while you are using a Linux MD software RAID array (yes, they are different, and the code in question from the original bug is never even used on your array). My first guess from your description is that this actually sounds like a hardware bug of some sort. I would suggest running a memory test on the machine to see if it finds any issues.
OK, thanks for your reply. I already ran a memory test; no issues found. I also updated the BIOS of my motherboard (ASUS P5Q SE2 with an Intel Core 2 Duo E8500). This only happens during an mdadm resync; other heavy I/O activity does not trigger this behaviour. The system freezes/hangs, screen output hangs, and no input is possible from keyboard or mouse. It is not even possible to initiate a kernel panic with the SysRq keys. Absolutely nothing is logged in the logs.
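A side note on the SysRq attempt: the magic keys are honoured only if the sysrq sysctl was enabled before the hang (many distributions default it to 0 or to a restricted mask). A minimal sketch of enabling all SysRq functions, equivalent to running echo 1 > /proc/sys/kernel/sysrq as root (the helper name is made up):

    #include <fcntl.h>
    #include <unistd.h>

    /* Sketch: enable all SysRq functions; returns 0 on success. */
    int enable_sysrq(void)
    {
        int fd = open("/proc/sys/kernel/sysrq", O_WRONLY);

        if (fd < 0)
            return -1;
        if (write(fd, "1", 1) != 1) {
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }

If SysRq is enabled and the keys still do nothing, that usually points at a hard lockup with interrupts disabled rather than a wedged I/O path.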
Since the RHEL 6.2 External Beta has begun and this bug remains unresolved, it has been rejected, as it is not proposed as an exception or a blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
Does this problem still happen with RHEL 6.2?

Thanks,
Jes
We have faced this issue at least twice (with the 6.1 kernel and now with 6.2 as well). Both times the system froze during the weekly RAID array check, but I can't reliably reproduce it.

00:1f.2 RAID bus controller: Intel Corporation 82801 SATA Controller [RAID mode]

mdadm --detail /dev/md127:
      Container : /dev/md0, member 0
     Raid Level : raid1
     Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
  Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
   Raid Devices : 2
  Total Devices : 2
          State : active, checking
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
   Check Status : 8% complete
           UUID : a72f8913:3693ec59:a88612ff:3072c153
    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      active sync   /dev/sdb

mdadm --examine /dev/md0:
/dev/md0:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : de8f6928
         Family : 023c041b
     Generation : 0012d34c
     Attributes : All supported
           UUID : c4cf95d9:3c32945b:10a9a67c:5fa8fc07
       Checksum : 1f770fe2 correct
    MPB Sectors : 1
          Disks : 2
   RAID Devices : 1

  Disk01 Serial : MN1220F31NYX1D
          State : active
             Id : 00030000
    Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)

[Volume0]:
           UUID : a72f8913:3693ec59:a88612ff:3072c153
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : 1
      This Slot : 1
     Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
   Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
  Sector Offset : 0
    Num Stripes : 15261808
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : dirty

  Disk00 Serial : MN1220F32B1Z3D
          State : active
             Id : 00020000
    Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
Guybrush,

The most likely source of this is mdmon getting killed. Does it happen with the latest 6.3 updates as well? Do you do suspend/resume on the system that sees this?

Jes
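One quick way to test the mdmon theory on an affected machine is to check whether a monitor process is still alive for the container. A sketch that scans /proc for a process named mdmon (illustrative only; running "pgrep mdmon" from a shell gives the same answer):

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;
        int found = 0;

        if (!proc)
            return 1;
        while ((de = readdir(proc)) != NULL) {
            char path[300], line[256];
            FILE *f;

            if (de->d_name[0] < '0' || de->d_name[0] > '9')
                continue; /* skip non-PID entries */
            snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
            f = fopen(path, "r");
            if (!f)
                continue;
            /* Field 2 of /proc/<pid>/stat is the command name in parens. */
            if (fgets(line, sizeof(line), f) && strstr(line, "(mdmon)")) {
                printf("mdmon running as pid %s\n", de->d_name);
                found = 1;
            }
            fclose(f);
        }
        closedir(proc);
        return found ? 0 : 2;
    }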
No, suspend/resume was not used. Unfortunately, I no longer have access to that system, so I don't know whether the update fixes the issue, and I would not be able to help with further investigation.
OK, without more data I unfortunately have no way of reproducing the problem. The two reports here are for different configs: one for IMSM BIOS RAID and one for regular MD RAID. Since there haven't been any updates on the original bug since July, I am going to assume it is no longer an issue. If these problems reappear, please open a new BZ.

Jes