Bug 576749 - Intel bios RAID 1 - md127_resync activity chokes system to death
Summary: Intel bios RAID 1 - md127_resync activity chokes system to death
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 12
Hardware: i686
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 542546
Depends On:
Blocks:
 
Reported: 2010-03-25 02:41 UTC by Bruce Fowler
Modified: 2010-12-07 20:14 UTC
CC List: 10 users

Fixed In Version: mdadm-3.1.3-0.git20100804.2.fc13
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-12-03 16:47:41 UTC
Type: ---
Embargoed:



Description Bruce Fowler 2010-03-25 02:41:52 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100216 Fedora/3.5.8-1.fc12 Firefox/3.5.8

I'm using Intel BIOS RAID 1. Every couple of weeks the RAID goes out of sync for some reason. The disk activity light comes on and two processes show up in the "top" listing: md127_raid1 and md127_resync. Over a period of minutes the whole system gradually grinds to a halt. After a while, even a text (non-GUI) terminal cannot log in. In an hour or so the system reboots, and then all is fine for the next couple of weeks. Until this can be fixed, is there a way to reduce the priority ("renice") of the md127_* processes so they don't lock up the machine while they do their thing?
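
(For readers with the same question: md127_raid1 and md127_resync are kernel threads, so "renice" affects only their CPU scheduling, not the I/O that causes the stall; the usual knobs are the md speed-limit sysctls. A minimal sketch with illustrative values; note that comment 5 below found that lowering the cap alone did not cure the unresponsiveness:)

# Resync throttle, in KB/s per device; defaults are typically
# min=1000 and max=200000. The value below is illustrative only.
echo 10000 > /proc/sys/dev/raid/speed_limit_max   # lower the ceiling
cat /proc/sys/dev/raid/speed_limit_min            # check the floor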


Reproducible: Always

Steps to Reproduce:
1. Wait a few weeks for the RAID mirror to break (why it breaks is a mystery)
2. Watch things gradually stop working over a period of ~10 minutes
3. Wait an hour or so for an automatic reboot.



If I knew what to capture or look for, I could probably get some useful information during that 10-minute window.  What should I do to get logs or dumps that would help in debugging?

Comment 1 Bruce Fowler 2010-03-31 23:46:58 UTC
With some hints from the Forum, I have worked through the anacron configuration and shell scripts that run the md* code.  Below is a log that I generated during one of these lock-up episodes.  Note that after the scan gets to 25% or so, the script itself locks up (even though it is run by root with "nice -5"), and the machine sits there for over an hour until the script resumes.

I've disabled the weekly scan (now that I know how to do that), because it is unacceptable for my machine to "go away" for an hour with no way to get it back short of a power-cycle reset.  Even mouse tracking ceased to function, and Ctrl-Alt-F2 to a text terminal wouldn't respond to a login request.

I would like to re-enable the weekly scan as soon as possible so the integrity of my RAID array is verified on a regular basis.  Otherwise it is one more manual chore I am sure to forget!  :-)
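
(For reference: on Fedora the weekly scan is driven by the mdadm package's cron job, /etc/cron.weekly/99-raid-check, configured through /etc/sysconfig/raid-check — treat those paths as my assumption from the package layout of that era. The job itself just pokes the sysfs check interface, which can also be driven by hand:)

echo check > /sys/block/md127/md/sync_action   # start a consistency check
cat /proc/mdstat                               # watch progress
echo idle > /sys/block/md127/md/sync_action    # abort a check that is choking the box
# Note: with external (imsm) metadata the array is managed by mdmon,
# so behavior here may differ from native md arrays.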

Two corrections to my original report.  First, this is apparently a routine scan, not a broken RAID 1 mirror.  Second, the "automatic reboot" appears to have been queued during my attempts to get the machine back; it does not normally occur.

Here is the script (run as root using "nice -5"):

#!/bin/bash
# Capture what's going on while mdadm is hogging the machine
#
echo "Loop writing stats to '~/bug.log' every 30 seconds"
while true; do
	echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>"
	iostat -t          # per-device throughput, with timestamp
	cat /proc/mdstat   # resync position and speed
	sleep 30
done >>~/bug.log &

And here is the heavily edited output (much of the middle deleted):

>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain) 	03/31/2010 	_i686_	(2 CPU)

03/31/2010 01:27:36 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.48    0.00    5.53   23.12    0.00   67.87

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             392.17     52352.48       116.50    5815837      12942
sdb             210.85        51.89     48558.09       5764    5394318
md127           824.47      3922.42       116.21     435742      12910

Personalities : [raid1] 
md127 : active raid1 sda[1] sdb[0]
      312568832 blocks super external:/md_d-1/0 [2/2] [UU]
      [>....................]  resync =  0.8% (2690688/312568832) finish=106.7min speed=48356K/sec
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain) 	03/31/2010 	_i686_	(2 CPU)

03/31/2010 01:28:06 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           3.00    0.00    4.81   19.02    0.00   73.17

((((((((((((((((... Iterations deleted...))))))))))))))))

>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain) 	03/31/2010 	_i686_	(2 CPU)

03/31/2010 01:47:07 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.53    0.00    2.22    3.42    0.00   92.83

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             883.86    122028.45        80.92  156390437     103704
sdb             523.29         4.50    121407.61       5764  155594776
md127            86.09       699.75        80.69     896790     103414

Personalities : [raid1] 
md127 : active raid1 sda[1] sdb[0]
      312568832 blocks super external:/md_d-1/0 [2/2] [UU]
      [====>................]  resync = 24.8% (77745536/312568832) finish=59.3min speed=65922K/sec
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain) 	03/31/2010 	_i686_	(2 CPU)

03/31/2010 01:47:37 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.51    0.00    2.21    3.35    0.00   92.93

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             887.95    122414.67        79.09  160557861     103738
sdb             525.04         4.39    121808.16       5764  159762362
md127            84.12       683.74        78.87     896790     103446

Personalities : [raid1] 
md127 : active raid1 sda[1] sdb[0]
      312568832 blocks super external:/md_d-1/0 [2/2] [UU]
      [=====>...............]  resync = 25.5% (79829440/312568832) finish=55.0min speed=70408K/sec
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>
>>>>>>>>>>>>>>>>>>>>>>>>>>>
Linux 2.6.32.9-70.fc12.i686.PAE (grimm.localdomain) 	03/31/2010 	_i686_	(2 CPU)

03/31/2010 02:55:08 PM
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.90    0.00    1.88   67.59    0.00   29.63

[... Rest of log deleted]

Comment 2 K Anderson 2010-06-01 20:10:02 UTC
Just thought it worth sharing that I recently had a similar issue.  For me, setting '/proc/sys/dev/raid/speed_limit_max' to a lower value seems to have corrected it.

Hope this might help :)

*Your mileage may vary; at the time of writing I only assume it worked, because the problem would typically have recurred by now.
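
(For anyone wanting to try the same thing: the knob also has a sysctl name, dev.raid.speed_limit_max, so the setting can be made persistent across reboots; a sketch, value illustrative:)

sysctl -w dev.raid.speed_limit_max=50000                      # one-off, same as the echo form
echo 'dev.raid.speed_limit_max = 50000' >> /etc/sysctl.conf   # persist across reboots
sysctl -p                                                     # reload settings now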

Comment 3 Aram Agajanian 2010-06-01 22:33:03 UTC
I'm wondering if this bug has been fixed.  I just cut the power to my PC and restarted it.  An md127_resync process is running.  The data speed is staying between 55K/sec and 70K/sec.  I haven't noticed any degradation in responsiveness.

A few weeks ago, a kernel update included a fix for a RAID 5 issue.  (See bug #575402.)  Maybe that update helped this problem as well.

I'm now running F13.  I experienced similar symptoms last week when using the kernel from the F13 DVD.

Comment 4 Aram Agajanian 2010-06-01 23:00:48 UTC
No, it hasn't been fixed.  The UI became unresponsive when the disk was about 60% resynched.  The data speed was about 83K/sec.

I'll try setting /proc/sys/dev/raid/speed_limit_max to see if that helps.

Comment 5 Aram Agajanian 2010-06-01 23:29:50 UTC
The following command didn't help.  The UI still became unresponsive.

echo "50000" > /proc/sys/dev/raid/speed_limit_max

The data speed did stay around 50K/sec.

I noticed the following when the computer was unresponsive:

= the mouse cursor still moves OK

= windows are no longer updated

= I can go to a new virtual console and log in.  (It takes a minute or so.)  Sometimes a virtual console becomes unresponsive; in that case, I am still able to go to another virtual console with Ctrl-Alt-Fn and log in.  (See the sketch after this list.)
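
(Since a virtual console keeps working, state can be captured from there during the hang; a minimal sketch along the lines of the script in comment 1, log path illustrative:)

# Run from a working virtual console while the UI is frozen:
top -b -n 1 | head -n 20 >> /root/hang.log   # one batch sample of top consumers
iostat -t >> /root/hang.log                  # per-device throughput, with timestamp
cat /proc/mdstat >> /root/hang.log           # resync position and speed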

Comment 6 Aram Agajanian 2010-06-05 03:21:07 UTC
*** Bug 542546 has been marked as a duplicate of this bug. ***

Comment 7 Doug Ledford 2010-07-22 15:02:24 UTC
This is specifically a problem with imsm arrays.  If you wait for the resync to complete, the system returns to normal.  The problem has been fixed in mdadm-3.1.3-0.git20100722.1 or later.
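
(To check whether a given box already carries the fix, compare the installed package against that version; a quick sketch:)

rpm -q mdadm       # installed package, e.g. mdadm-3.1.3-0.git20100722.1.fc12
mdadm --version    # same information from the tool itself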

Comment 8 Fedora Update System 2010-07-22 15:38:06 UTC
mdadm-3.1.3-0.git20100722.1.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.1.fc12

Comment 9 Fedora Update System 2010-07-23 02:38:00 UTC
mdadm-3.1.3-0.git20100722.2.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.  If you want to test the update, you can install it with

su -c 'yum --enablerepo=updates-testing update mdadm'

You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.2.fc12

Comment 10 Fedora Update System 2010-08-05 14:25:09 UTC
mdadm-3.1.3-0.git20100804.2.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13

Comment 11 Fedora Update System 2010-08-05 14:25:46 UTC
mdadm-3.1.3-0.git20100804.2.fc12 has been submitted as an update for Fedora 12.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12

Comment 12 Fedora Update System 2010-08-05 14:26:24 UTC
mdadm-3.1.3-0.git20100804.2.fc14 has been submitted as an update for Fedora 14.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14

Comment 13 Fedora Update System 2010-08-05 23:29:24 UTC
mdadm-3.1.3-0.git20100804.2.fc12 has been pushed to the Fedora 12 testing repository.  If problems still persist, please make note of it in this bug report.  If you want to test the update, you can install it with

su -c 'yum --enablerepo=updates-testing update mdadm'

You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc12

Comment 14 Fedora Update System 2010-08-05 23:52:55 UTC
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 testing repository.  If problems still persist, please make note of it in this bug report.  If you want to test the update, you can install it with

su -c 'yum --enablerepo=updates-testing update mdadm'

You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc13

Comment 15 Fedora Update System 2010-08-10 01:30:00 UTC
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 testing repository.  If problems still persist, please make note of it in this bug report.  If you want to test the update, you can install it with

su -c 'yum --enablerepo=updates-testing update mdadm'

You can provide feedback for this update here: http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.2.fc14

Comment 16 Bug Zapper 2010-11-03 18:40:38 UTC
This message is a reminder that Fedora 12 is nearing its end of life.
Approximately 30 (thirty) days from now Fedora will stop maintaining
and issuing updates for Fedora 12.  It is Fedora's policy to close all
bug reports from releases that are no longer maintained.  At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '12'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 12's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 12 is end of life.  If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora please change the 'version' of this 
bug to the applicable version.  If you are unable to change the version, 
please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events.  Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 17 Bug Zapper 2010-12-03 16:47:41 UTC
Fedora 12 changed to end-of-life (EOL) status on 2010-12-02. Fedora 12 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 18 Fedora Update System 2010-12-07 20:11:50 UTC
mdadm-3.1.3-0.git20100804.2.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 19 Fedora Update System 2010-12-07 20:13:51 UTC
mdadm-3.1.3-0.git20100804.2.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.

