Bug 554774 - Random system hangs with md as last entry in /var/log/messages
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 12
Hardware: x86_64
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-01-12 15:56 UTC by Don Harden
Modified: 2010-02-19 23:31 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2010-02-19 23:31:18 UTC
Type: ---
Embargoed:



Description Don Harden 2010-01-12 15:56:57 UTC
Description of problem:
An up-to-date Fedora 12 x86_64 system randomly hangs after being idle for several hours.  The last log entries are similar to:

Jan 11 20:21:06 zebulon dhclient: Discarding packet with bogus hlen.
Jan 12 03:02:01 zebulon kernel: md: data-check of RAID array md
Jan 12 03:02:01 zebulon kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Jan 12 03:02:01 zebulon kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-
Jan 12 03:02:01 zebulon kernel: md: using 128k window, over a total of 244195840 blocks.
Jan 12 09:45:11 zebulon kernel: imklog 4.4.2, log source = /proc/kmsg started.

No corresponding entries in /var/cache/abrt

/ is on a standard ext4 partition 
/boot is on a standard ext3 partition 

The RAID 5 partition is used for user data and backup.

Version-Release number of selected component (if applicable):

Fedora 12 x86_64 2.6.31.9-174.fc12.x86_64 
mdadm.x86_64  3.0.3-2.fc12
filesystem.x86_64  2.4.30-2.fc12 

How reproducible:
Random;  three times in about a month.

Steps to Reproduce:
1.  Boot - do work 
2.  Go home
3.  Next morning system is unresponsive - locked up
  
Actual results:
Random system hangs


Expected results:
No random system hangs

Additional info:
# mount 
/dev/sdb6 on / type ext4 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/sdb1 on /boot type ext3 (rw)
/dev/sda3 on /eSATA/F10-BCK type ext3 (rw)
/dev/sda5 on /eSATA/F12-BCK type ext3 (rw)
/dev/sda2 on /eSATA/Z-BCK type ext3 (rw)
/dev/md0 on /Z type ext3 (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)


 # mdadm -Q --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Wed Jun 11 13:04:38 2008
     Raid Level : raid5
     Array Size : 488391680 (465.77 GiB 500.11 GB)
  Used Dev Size : 244195840 (232.88 GiB 250.06 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Jan 12 10:48:34 2010
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 256K

           UUID : b929c87c:dfc45da3:b94ed757:96c8e19e
         Events : 0.525443

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1

Comment 1 Doug Ledford 2010-02-19 20:18:19 UTC
How reproducible:
Random;  three times in about a month.


This is not random.  The raid check script is run only once per week, so three times in about a month is pretty much every time the script is run.  We know it's the script because of this in the messages:

Jan 12 03:02:01 zebulon kernel: md: data-check of RAID array md


So, your machine is locking up almost every time the data check is run.  Generally speaking this is going to be one of two possible issues: a kernel bug or a hardware error leading to a hardlock.
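
One way to double-check that correlation is to list, for every syslog restart in the log, the entry that came immediately before it. The sketch below is illustrative and not part of the original report: it assumes each start-up is announced by an "imklog ... started" line, as in the excerpt quoted in this bug, and the function name last_entry_before_restart is made up for the example.

```shell
# Print the log entry that immediately precedes each syslog start-up line.
# Assumption: a start-up is marked by "imklog ... started", as in the
# excerpt from this report; other syslog daemons word it differently.
last_entry_before_restart() {
    awk '/imklog .* started/ && prev != "" { print prev } { prev = $0 }' "$1"
}

# Example: last_entry_before_restart /var/log/messages
```

After a hang, the preceding entry is typically ordinary activity such as the md data-check lines; after a clean reboot it would normally be a shutdown message instead.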

To troubleshoot this, all of the following questions are helpful to answer:

Is this still happening?  Do you have the most up to date F12 kernel?  When it happens are there any suspicious things about the computer, such as hard drive LED stuck on solid?  Do things like the keyboard capslock LED change when you hit capslock while the machine is hung?  Have you run any memory or CPU test programs to check and see if there are any possible weaknesses in the reliability of your hardware?  Have you tried a more up to date kernel from rawhide to see if this is possibly a bug in the md stack of the kernel you are running?

Comment 2 Don Harden 2010-02-19 22:38:17 UTC
Hi Doug,

I have kept this F12 box up to date.  Current kernel is 2.6.31.12-174.2.3.fc12.x86_64.   I have not tried a rawhide kernel.

I ran Prime95 stress tests overnight a couple of times.  No problems, hangs or crashes.   Seems not to be CPU or RAM.

I decreased the speed_limit_max parameter from 200000 (dev.raid.speed_limit_max in /etc/sysctl.conf).   During a manual run of /etc/cron.weekly/99-raid-check my box still hung at 150000 KB/s, but not at 125000 KB/s.  

I probably should have posted an update, but I was waiting for more time to see if the box would hang again.   So far a speed_limit_max parameter of 125000 KB/s seems to be working. 
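
For anyone repeating this experiment, the cap can also be changed at runtime without editing /etc/sysctl.conf. The following is a rough sketch rather than the exact commands used here: /proc/sys/dev/raid/speed_limit_max is the kernel's procfs view of dev.raid.speed_limit_max, while the set_raid_speed_max helper and the RAID_MAX_FILE override are hypothetical names introduced for illustration.

```shell
# File behind the dev.raid.speed_limit_max sysctl (KB/s per disk).
# RAID_MAX_FILE is a hypothetical override so the helper can be tried
# against a scratch file instead of the live kernel setting.
RAID_MAX_FILE="${RAID_MAX_FILE:-/proc/sys/dev/raid/speed_limit_max}"

# Write a new cap; needs root when pointed at the real /proc file.
set_raid_speed_max() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: set_raid_speed_max KB_PER_SEC" >&2
            return 1
            ;;
    esac
    echo "$1" > "$RAID_MAX_FILE"
}

# Example (as root): set_raid_speed_max 125000
```

A value set this way lasts only until the next reboot; the dev.raid.speed_limit_max line in /etc/sysctl.conf is what makes it persistent.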

Other info ...

When the box did hang (before lowering speed_limit_max) it would lock up to the point that even Magic SysRq keys would not work.  I did not notice whether the HD LED or the Caps Lock LED was stuck on.

I still have no clue if it is a software or hardware issue.

The three disks that make up the RAID 5 array are all Seagate Barracuda ST3250410AS 7200.10 SATA 3.0Gb/s 250-GB drives.

The motherboard is a Gigabyte X48T-DQ6 with a 82801I (ICH9 Family) 2 port SATA IDE Controller and a 82801IR/IO/IH (ICH9R/DO/DH) 4 port SATA IDE Controller.  The three RAID disks are on the 4 port controller.

I have not overclocked the CPU, mucked with the RAM timings, etc.  Just pretty basic BIOS settings.

Let me know if there is other info I can provide.

Thanks for looking into this.
Don

Comment 3 Doug Ledford 2010-02-19 23:31:18 UTC
Hi Don, the fact that slowing down the speed limit fixed the problem, combined with how hard it locked up (Alt-SysRq non-functional), almost certainly means it is hardware.  In this case, I would usually suspect RAM first, CPU second, motherboard third, and in rare cases the power supply.  I'm going to close this out as NOTABUG, but if hardware changes don't resolve the issue, feel free to reopen the bug.  You can also see my work web page for a shell script that will exercise the memory in your machine harder than any dedicated memory tester, to see if that reveals any problems.  The address is http://people.redhat.com/dledford; look for the memory test script in the left-hand link column.

