Bug 718496
| Summary: | mdadm resync freeze at random | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | trancegenetic |
| Component: | mdadm | Assignee: | Jes Sorensen <Jes.Sorensen> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | qe-baseos-daemons |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.0 | CC: | arkiados, dan.j.williams, dledford, rfv781, scott-brown, trancegenetic, vgriit, work.eric |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 602457 | Environment: | |
| Last Closed: | 2013-02-08 18:59:10 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
trancegenetic
2011-07-03 15:37:14 UTC
I still seem to experience this bug on RHEL 6.0.
The system is RHEL 6.0 with mdadm-3.1.3-1.el6.x86_64.
Symptoms:
The system freezes/hangs, screen output stops, and no input is possible from the keyboard or mouse.
This always happens during an mdadm resync or even a rebuild. Copying files through the running Samba server at the same time seems to trigger the freeze/hang much faster.
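One way to try to reproduce this on demand, instead of waiting for the next scheduled resync, is to start a read-only check pass by hand and watch it run. This is only a sketch, assuming md1 is the array whose resync triggers the hang (adjust the device name as needed):

cat /proc/mdstat                               # confirm nothing is already resyncing
echo check > /sys/block/md1/md/sync_action     # start a check pass on md1
watch -n 5 cat /proc/mdstat                    # follow progress until it completes or the box freezes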
System details:
/dev/md0:
Version : 1.0
Creation Time : Wed Jun 8 16:45:24 2011
Raid Level : raid1
Array Size : 488385400 (465.76 GiB 500.11 GB)
Used Dev Size : 488385400 (465.76 GiB 500.11 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Jul 3 15:25:19 2011
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : neptune-fix:0
UUID : 7d13420f:1996e153:6706c024:8e22715f
Events : 3047
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 8 65 1 active sync /dev/sde1
/dev/md1:
Version : 1.1
Creation Time : Wed Jun 8 16:47:55 2011
Raid Level : raid5
Array Size : 3907023872 (3726.03 GiB 4000.79 GB)
Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Jul 3 15:20:37 2011
State : active
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : neptune-fix:1
UUID : 8a48b835:5f02582b:68f0c66f:f1d85639
Events : 17507
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
3 8 49 2 active sync /dev/sdd1
This is guaranteed not to be the same as the bug you duplicated. That bug was fixed, and it only affected Intel software RAID arrays, while you are using a Linux MD software RAID array (yes, they are different, and the code in question from the original bug is never even used on your array). My first guess from your description is that this actually sounds like a hardware bug of some sort. I would suggest running a memory test on the machine to see if it finds any issues.

Ok, thanks for your reply. I already ran a memory test; no issues. I updated the BIOS of my motherboard (ASUS P5Q SE2 with an Intel Core 2 Duo E8500). This only happens during an mdadm resync; other heavy I/O activity does not trigger this behaviour. The system freezes/hangs, screen output stops, and no input is possible from the keyboard or mouse. It is not even possible to initiate a kernel panic with the SysRq keys. Absolutely nothing is logged in the logs.

Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

Does this problem still happen with RHEL 6.2? Thanks, Jes

We have faced this issue at least two times (with the 6.1 kernel and now with 6.2 too). Both times the system froze during the weekly RAID array check, but I can't reliably reproduce it.
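Since the freeze leaves nothing in the local logs and even SysRq does not respond, one way to capture evidence of the next hang is a remote console. This is only a sketch, not something attempted in this report; the IP addresses, interface name, MAC address, and ports below are placeholders that must be replaced with real values:

sysctl -w kernel.sysrq=1          # make sure all SysRq functions are enabled
dmesg -n 8                        # send all kernel messages to the console
# Stream kernel messages to a second machine over UDP (placeholders: local
# 192.168.1.5 on eth0, remote 192.168.1.100 listening on port 6666):
modprobe netconsole netconsole=6665@192.168.1.5/eth0,6666@192.168.1.100/00:11:22:33:44:55
# On the receiving machine, run a UDP listener on port 6666 (for example
# netcat; exact flags vary between netcat flavours) and keep its output.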
00:1f.2 RAID bus controller: Intel Corporation 82801 SATA Controller [RAID mode]
mdadm --detail /dev/md127
Container : /dev/md0, member 0
Raid Level : raid1
Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
Raid Devices : 2
Total Devices : 2
State : active, checking
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Check Status : 8% complete
UUID : a72f8913:3693ec59:a88612ff:3072c153
Number Major Minor RaidDevice State
1 8 0 0 active sync /dev/sda
0 8 16 1 active sync /dev/sdb
mdadm --examine /dev/md0
/dev/md0:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.1.00
Orig Family : de8f6928
Family : 023c041b
Generation : 0012d34c
Attributes : All supported
UUID : c4cf95d9:3c32945b:10a9a67c:5fa8fc07
Checksum : 1f770fe2 correct
MPB Sectors : 1
Disks : 2
RAID Devices : 1
Disk01 Serial : MN1220F31NYX1D
State : active
Id : 00030000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
[Volume0]:
UUID : a72f8913:3693ec59:a88612ff:3072c153
RAID Level : 1
Members : 2
Slots : [UU]
Failed disk : 1
This Slot : 1
Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
Sector Offset : 0
Num Stripes : 15261808
Chunk Size : 64 KiB
Reserved : 0
Migrate State : idle
Map State : normal
Dirty State : dirty
Disk00 Serial : MN1220F32B1Z3D
State : active
Id : 00020000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
Guybrush

The most likely source of this is mdmon getting killed. Does it happen with the latest 6.3 updates as well? Do you do suspend/resume on the system that sees this? Jes

No, suspend/resume was not used. Unfortunately, I no longer have access to that system, so I don't know whether the update fixes the issue, and I would not be able to help with further investigation.

Ok, without more data I unfortunately have no way of reproducing the problem. The two reports here are for different configs, one for IMSM BIOS RAID and one for regular RAID. Since there haven't been any updates on the original bug since July, I am going to assume it is no longer an issue. If these problems reappear, please open a new BZ. Jes
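As a footnote to the mdmon suggestion above: for an IMSM (BIOS RAID) container, the external-metadata manager mdmon has to stay running, otherwise the array can stall once a metadata update is needed. A minimal sketch of how one could check this on an affected box, using the device names from this report (container /dev/md0, member volume /dev/md127):

ps -ef | grep mdmon            # an mdmon process managing md0 should be listed
cat /proc/mdstat               # container and member array state
mdadm --detail /dev/md127      # member volume; check the State line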