User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100216 Fedora/3.5.8-1.fc12 Firefox/3.5.8

I'm using Intel BIOS RAID 1. Every couple of weeks the RAID goes out of sync for some reason. The disk activity light comes on and two processes show up (in the "top" listing): md127_raid1 and md127_resync. Over a period of minutes the whole system gradually comes to a halt. After a while, I cannot log in even at a text (non-GUI) terminal. In an hour or so the system reboots and then all is fine for the next couple of weeks.

Until this can be fixed, is there a way to reduce the priority of ("renice") the md127_* processes so they don't lock up the machine while they do their thing?

Reproducible: Always

Steps to Reproduce:
1. Wait a few weeks for the RAID mirror to break (why it breaks is a mystery).
2. Watch things gradually stop working over a period of ~10 minutes.
3. Wait an hour or so for an automatic reboot.

If I knew what to capture or look for, I could probably get some useful information during that 10 minute window. What should I do to get logs or dumps that would help in debugging?
With some hints from the Forum, I have worked through the anacron configuration and shell scripts that run the md* code. Following is a file that I generated during one of these lock-up episodes. Note that after the scan gets to 25% or so, the script itself even locks up (run by root with "nice -5") and the machine sits there for over an hour until the script resumes.

I've disabled the weekly scan (now that I know how to do that), because it is unacceptable for my machine to "go away" for an hour with no way to get it back short of a power-cycle reset. Even mouse tracking ceased to function. CTL-ALT-F2 to a text terminal wouldn't respond to a login request.

I would like to reenable the weekly scan as soon as possible so the integrity of my RAID array is verified on a regular basis. Otherwise it is one more manual chore I am sure to forget! :-)

Two corrections to my original report. Apparently this is a routine scan, not a broken RAID 1 mirror. And the "automatic reboot" appears to have been queued during my attempts to get the machine back, it does not normally occur.

Here is the script (run as root using "nice -5"):

|| #! /bin/bash
|| # Capture what's going on while mdadm is hogging machine
|| #
|| echo "Loop writing stats to '~/bug.log' every 30 seconds"
|| while true; do
||     echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>"
||     iostat -t
||     cat /proc/mdstat
||     sleep 30
|| done >>~/bug.log &

And here is the heavily edited output (Much of middle part deleted):

>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain)  03/31/2010  _i686_  (2 CPU)

03/31/2010 01:27:36 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.48    0.00    5.53   23.12    0.00   67.87

Device:        tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda         392.17     52352.48       116.50     5815837       12942
sdb         210.85        51.89     48558.09        5764     5394318
md127       824.47      3922.42       116.21      435742       12910

Personalities : [raid1]
md127 : active raid1 sda[1] sdb[0]
      312568832 blocks super external:/md_d-1/0 [2/2] [UU]
      [>....................] resync = 0.8% (2690688/312568832) finish=106.7min speed=48356K/sec

md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm

unused devices: <none>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain)  03/31/2010  _i686_  (2 CPU)

03/31/2010 01:28:06 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.00    0.00    4.81   19.02    0.00   73.17

((((((((((((((((... Iterations deleted...))))))))))))))))

>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain)  03/31/2010  _i686_  (2 CPU)

03/31/2010 01:47:07 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.53    0.00    2.22    3.42    0.00   92.83

Device:        tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda         883.86    122028.45        80.92   156390437      103704
sdb         523.29         4.50    121407.61        5764   155594776
md127        86.09       699.75        80.69      896790      103414

Personalities : [raid1]
md127 : active raid1 sda[1] sdb[0]
      312568832 blocks super external:/md_d-1/0 [2/2] [UU]
      [====>................] resync = 24.8% (77745536/312568832) finish=59.3min speed=65922K/sec

md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm

unused devices: <none>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain)  03/31/2010  _i686_  (2 CPU)

03/31/2010 01:47:37 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.51    0.00    2.21    3.35    0.00   92.93

Device:        tps   Blk_read/s   Blk_wrtn/s    Blk_read    Blk_wrtn
sda         887.95    122414.67        79.09   160557861      103738
sdb         525.04         4.39    121808.16        5764   159762362
md127        84.12       683.74        78.87      896790      103446

Personalities : [raid1]
md127 : active raid1 sda[1] sdb[0]
      312568832 blocks super external:/md_d-1/0 [2/2] [UU]
      [=====>...............] resync = 25.5% (79829440/312568832) finish=55.0min speed=70408K/sec

md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm

unused devices: <none>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain)  03/31/2010  _i686_  (2 CPU)

03/31/2010 02:55:08 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.90    0.00    1.88   67.59    0.00   29.63

[... Rest of log deleted]
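For anyone who wants a more compact trace than the full /proc/mdstat dump in the log above, here is a rough sketch of a helper that prints only the array name and the resync percentage. The function name resync_progress is made up for illustration, and the parsing assumes the mdstat layout shown in the log (array line first, progress line containing "resync =" and a field ending in "%"). It has not been checked against other kernel versions.

```shell
#!/bin/bash
# Sketch only: print "<array> <percent>" for each md array with a
# resync, recovery, or check in progress.
# Reads mdstat-formatted text from the file in $1 (default: /proc/mdstat).
resync_progress() {
    awk '
        # Remember the array name from lines like "md127 : active raid1 ..."
        /^md[0-9_a-z]* :/ { dev = $1 }
        # On the progress line, pick out the field that ends in "%"
        /(resync|recovery|check) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) { sub(/%$/, "", $i); print dev, $i }
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling resync_progress with no argument inside the logging loop, in place of "cat /proc/mdstat", would give one short line per sample, which makes it easier to see where progress stalls.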
Just thought it worth sharing that I recently had a similar issue. For me, setting '/proc/sys/dev/raid/speed_limit_max' to a lower value seems to have corrected it. Hope this helps :) *Your mileage may vary: at the time of writing I am only assuming it worked, because the problem would typically have recurred by now.
I'm wondering if this bug has been fixed. I just cut the power to my PC and restarted it. An md127_resync process is running. The data speed is staying between 55K/sec and 70K/sec. I haven't noticed any degradation in responsiveness. A few weeks ago, a kernel update included a fix for a RAID 5 issue. (See bug #575402.) Maybe that update helped this problem as well. I'm now running F13. I experienced similar symptoms last week when using the kernel from the F13 DVD.
No, it hasn't been fixed. The UI became unresponsive when the disk was about 60% resynched. The data speed was about 83K/sec. I'll try setting /proc/sys/dev/raid/speed_limit_max to see if that helps.
The following command didn't help. The UI still became unresponsive.
    echo "50000" > /proc/sys/dev/raid/speed_limit_max
The data speed did stay around 50K/sec.
I noticed the following when the computer was unresponsive:
= the mouse cursor still moves OK
= windows are no longer updated
= I can go to a new virtual console and log in. (It takes a minute or so.)
Sometimes a virtual console becomes unresponsive. In that case, I am still able to go to another virtual console with C-A-Fn and log in.
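Capping the rate did not cure the unresponsiveness in the test above, but if someone else wants to experiment with different limits, a sketch like the following lowers speed_limit_max only for the duration of one command and then puts the previous value back. The function name with_resync_cap and the LIMIT_FILE variable are invented for this example. The path is the one used in this report, and writing to it needs root.

```shell
#!/bin/bash
# Sketch only: temporarily cap the md resync rate, run a command,
# then restore the previous cap.
# LIMIT_FILE is an assumption for illustration; it defaults to the
# tunable mentioned in this bug report.
LIMIT_FILE="${LIMIT_FILE:-/proc/sys/dev/raid/speed_limit_max}"

with_resync_cap() {
    local cap="$1"; shift
    local old status
    old=$(cat "$LIMIT_FILE") || return 1      # remember the current limit
    echo "$cap" > "$LIMIT_FILE" || return 1   # apply the temporary limit
    "$@"                                      # run the wrapped command
    status=$?
    echo "$old" > "$LIMIT_FILE"               # restore the original limit
    return "$status"
}
```

For example, "with_resync_cap 10000 sleep 600" (as root) would hold the cap at 10000 for ten minutes while you watch how the desktop behaves, then restore whatever value was set before.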
*** Bug 542546 has been marked as a duplicate of this bug. ***
This is specifically a problem with imsm arrays. If you wait for the resync to complete, it returns to normal. The problem has been fixed in mdadm-3.1.3-0.git20100722.1 or later.
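To check whether an installed mdadm is at or past the fixed build named above, something along these lines may help. It is a rough sketch: version_at_least is a made-up helper, and GNU "sort -V" only approximates RPM's own version comparison, so treat the answer as a hint rather than a guarantee.

```shell
#!/bin/bash
# Sketch only: compare a version-release string against the build
# that this bug report says contains the fix.
FIXED="3.1.3-0.git20100722.1"

# Succeeds when $1 sorts at or after $2 under GNU version ordering.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Usage on an RPM-based system (dist tag such as ".fc12" stripped first):
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' mdadm)
#   version_at_least "${installed%.fc*}" "$FIXED" && echo "fix should be present"
```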
mdadm-3.1.3-0.git20100722.1.fc12 has been submitted as an update for Fedora 12. http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.1.fc12
mdadm-3.1.3-0.git20100722.2.fc12 has been pushed to the Fedora 12 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.2.fc12
mdadm-3.1.3-0.git20100804.2.fc13 has been submitted as an update for Fedora 13. http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13
mdadm-3.1.3-0.git20100804.2.fc12 has been submitted as an update for Fedora 12. http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12
mdadm-3.1.3-0.git20100804.2.fc14 has been submitted as an update for Fedora 14. http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14
mdadm-3.1.3-0.git20100804.2.fc12 has been pushed to the Fedora 12 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 testing repository. If problems still persist, please make note of it in this bug report. If you want to test the update, you can install it with su -c 'yum --enablerepo=updates-testing update mdadm'. You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14
This message is a reminder that Fedora 12 is nearing its end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 12. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '12'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 12's end of life. Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 12 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug to the applicable version. If you are unable to change the version, please add a comment here and someone will do it for you. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 stable repository. If problems still persist, please make note of it in this bug report.
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 stable repository. If problems still persist, please make note of it in this bug report.