Bug 617280 - mdadm resync freeze at random
Summary: mdadm resync freeze at random
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: mdadm
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Doug Ledford
QA Contact: Yulia Kopkova
URL:
Whiteboard:
Duplicates: 586299 607090 (view as bug list)
Depends On: 602457
Blocks:
 
Reported: 2010-07-22 16:27 UTC by Doug Ledford
Modified: 2010-11-15 14:32 UTC
CC List: 8 users

Fixed In Version: mdadm-3.1.3-0.git20100722.1
Doc Type: Bug Fix
Doc Text:
Clone Of: 602457
Environment:
Last Closed: 2010-11-15 14:32:59 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Doug Ledford 2010-07-22 16:27:55 UTC
+++ This bug was initially created as a clone of Bug #602457 +++

I just installed Fedora 13 on my Dell Optiplex 755 with an Intel RAID controller.

00:1f.2 RAID bus controller: Intel Corporation 82801 SATA RAID Controller (rev 02)

The system is resyncing the RAID array. After about 15 minutes of use (on average), the system becomes unresponsive and I have to perform a hard reboot.


Version-Release number of selected component (if applicable):


How reproducible: Every time


Steps to Reproduce:
1. Boot System
2. Wait
3.
  
Actual results: System Freeze


Expected results: 


Additional info:

Linux HOSTNAME 2.6.33.5-112.fc13.x86_64 #1 SMP Thu May 27 02:28:31 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux


/dev/md127:
      Container : /dev/md0, member 0
     Raid Level : raid1
     Array Size : 244137984 (232.83 GiB 250.00 GB)
  Used Dev Size : 244138116 (232.83 GiB 250.00 GB)
   Raid Devices : 2
  Total Devices : 2

          State : active, resyncing
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 1% complete


           UUID : 72a86dc9:f6cd6593:7e418bd6:48d0d442
    Number   Major   Minor   RaidDevice State
       1       8        0        0      active sync   /dev/sda
       0       8       16        1      active sync   /dev/sdb

Personalities : [raid1] 
md127 : active raid1 sda[1] sdb[0]
      244137984 blocks super external:/md0/0 [2/2] [UU]
      [>....................]  resync =  1.1% (2713152/244138116) finish=74.5min speed=53938K/sec
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>

--- Additional comment from arkiados on 2010-06-09 18:39:10 EDT ---

Also, I can reproduce this if I kick off a large data transfer to or from the array. After about five minutes the system will freeze.

--- Additional comment from arkiados on 2010-06-10 11:32:18 EDT ---

I restarted last night before I left the office. When I returned today, the raid had completed the resync. I have had no issues with the system this morning. I would be willing to bet that if I broke the array, I would have these issues again.

Is there any information that I need to provide? How should I proceed?

--- Additional comment from dledford on 2010-06-10 12:31:47 EDT ---

The issue you are seeing is not a hard lock; it is total unresponsiveness, but it only lasts until the resync completes. In other words, had you not rebooted the machine, it would have completed the resync each time. This is a known issue we are working on.

--- Additional comment from dledford on 2010-06-10 12:35:18 EDT ---



*** This bug has been marked as a duplicate of bug 586299 ***

--- Additional comment from dledford on 2010-06-10 12:36:41 EDT ---

Doh! This bug is against Fedora, while the bug I marked it as a duplicate of is against Red Hat Enterprise Linux 6. Sorry about that. Reopening this bug.

--- Additional comment from dledford on 2010-06-10 12:38:13 EDT ---

Dan, this bug in Fedora looks to be the same as bug 586299 and indicates we probably do need to try and track this down.  Is the problem reproducible for you?

--- Additional comment from dan.j.williams on 2010-06-10 12:41:57 EDT ---

Not so far...  let me try F13 and see if I can get a hit.

--- Additional comment from dan.j.williams on 2010-06-11 01:43:33 EDT ---

I think I see the bug, and why it triggers only in the resync case.

When we get stuck, the thread issuing the write is blocked in md_write_start() waiting for:

wait_event(mddev->sb_wait,
           !test_bit(MD_CHANGE_CLEAN, &mddev->flags) &&
           !test_bit(MD_CHANGE_PENDING, &mddev->flags));

MD_CHANGE_CLEAN is cleared by mdmon writing "active" to .../md/array_state. I believe mdmon is properly doing this and waking up mddev->sb_wait. However, while we are waiting for this wakeup, it is possible that the resync thread hits a checkpoint event, which causes MD_CHANGE_CLEAN to be set again. The result is that md_write_start() wakes up, sees the bit is still set, and goes back to sleep, while mdmon stays asleep until the resync-completed event.

The fix is to have mdmon subscribe to sync_completed events so that it wakes up and rehandles MD_CHANGE_CLEAN. I have actually already implemented this for rebuild checkpointing [1]. Neil had a couple of review comments, so I'll fix those up and resubmit.

[1]: http://git.kernel.org/?p=linux/kernel/git/djbw/mdadm.git;a=commitdiff;h=484240d8
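
For readers unfamiliar with how a user-space monitor "subscribes" to md sysfs events: the kernel announces changes to attributes such as sync_completed via sysfs_notify(), and user space picks them up by reading the attribute file and then poll()ing it for POLLPRI/POLLERR. The C sketch below is only a minimal illustration of that mechanism, assuming the md127 device shown in this report; it is not mdmon's actual code.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Assumed path, based on the md127 array in this report. */
    const char *path = "/sys/block/md127/md/sync_completed";
    char buf[64];
    ssize_t n;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    for (;;) {
        /* Read the current value; this also arms the next notification. */
        lseek(fd, 0, SEEK_SET);
        n = read(fd, buf, sizeof(buf) - 1);
        if (n < 0) {
            perror("read");
            break;
        }
        buf[n] = '\0';
        printf("sync_completed: %s", buf);

        /* Block until the kernel calls sysfs_notify() on this attribute,
         * e.g. at a resync checkpoint.  A monitor woken here can then
         * rehandle MD_CHANGE_CLEAN instead of sleeping until the resync
         * finishes. */
        struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };
        if (poll(&pfd, 1, -1) < 0) {
            perror("poll");
            break;
        }
    }

    close(fd);
    return 0;
}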

--- Additional comment from arkiados on 2010-06-11 11:22:37 EDT ---

Is there anything I can do to help?

--- Additional comment from dledford on 2010-07-09 13:22:47 EDT ---

Dan, is the fix to this in Neil's current git master branch queued for the upcoming 3.1.3 mdadm release?

--- Additional comment from dan.j.williams on 2010-07-09 13:32:32 EDT ---

(In reply to comment #10)
> Dan, is the fix to this in Neil's current git master branch queued for the
> upcoming 3.1.3 mdadm release?    

Yes, I flagged commit 484240d8 "mdmon: periodically checkpoint recovery" as urgent to Neil, so I expect it to be a part of 3.1.3.

Neil asked for commit 4f0a7acc "mdmon: record sync_completed directly to the metadata" as a cleanup, but it is not necessary for resolving the hang.  Waking up on sync_completed events is the critical piece.

--- Additional comment from dledford on 2010-07-09 13:38:38 EDT ---

OK, I'll get a git build pulled together soon and check for that commit.  Thanks Dan.

--- Additional comment from work.eric on 2010-07-19 02:42:11 EDT ---

I took the latest F13 package of mdadm (mdadm-3.1.2-10.fc13.x86_64) and applied the following git patches.

484240d8 "mdmon: periodically checkpoint recovery"
4f0a7acc "mdmon: record sync_completed directly to the metadata"

I've run a complete check twice and haven't had a single problem with the system freezing. The only thing I've noticed is minor delays saving files, which is expected when a check is running. I've re-enabled the weekly RAID check, so I'll report back if there are any problems with that, but I expect not. Before these patches, the system would slowly degrade until the check reached around 30-40%, and then it would become unresponsive until it completed the check or I rebooted. Good work finding this bug; it's been causing major problems for me over the last six months and forced me to boot into Windows to complete the check and mark the array as clean.

--- Additional comment from dledford on 2010-07-20 19:07:45 EDT ---

This should be fixed in the latest f-13 build of mdadm.

--- Additional comment from updates on 2010-07-22 11:36:36 EDT ---

mdadm-3.1.3-0.git20100722.1.fc13 has been submitted as an update for Fedora 13.
http://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100722.1.fc13

Comment 1 Doug Ledford 2010-07-22 17:07:13 UTC
*** Bug 586299 has been marked as a duplicate of this bug. ***

Comment 2 Doug Ledford 2010-07-27 16:39:04 UTC
*** Bug 607090 has been marked as a duplicate of this bug. ***

Comment 4 David Edwards 2010-11-11 14:55:49 UTC
I am testing the latest release.

Comment 5 releng-rhel@redhat.com 2010-11-15 14:32:59 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.

