Description of problem:
Up to date Fedora 12 x86_64 randomly hangs after being idle for several hours.

Last log entries similar to:

    71 Jan 11 20:21:06 zebulon dhclient: Discarding packet with bogus hlen.
    72 Jan 12 03:02:01 zebulon kernel: md: data-check of RAID array md
    73 Jan 12 03:02:01 zebulon kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
    74 Jan 12 03:02:01 zebulon kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-
    75 Jan 12 03:02:01 zebulon kernel: md: using 128k window, over a total of 244195840 blocks.
    76 Jan 12 09:45:11 zebulon kernel: imklog 4.4.2, log source = /proc/kmsg started.

No corresponding entries in /var/cache/abrt
/ is on a standard ext4 partition
/boot is on a standard ext3 partition
The RAID 5 partition is used for user data and backup.

Version-Release number of selected component (if applicable):
Fedora 12 x86_64
2.6.31.9-174.fc12.x86_64
mdadm.x86_64 3.0.3-2.fc12
filesystem.x86_64 2.4.30-2.fc12

How reproducible:
Random; three times in about a month.

Steps to Reproduce:
1. Boot - do work
2. Go home
3. Next morning system is unresponsive - locked up

Actual results:
Random system hangs

Expected results:
No random system hangs

Additional info:

    # mount
    /dev/sdb6 on / type ext4 (rw)
    proc on /proc type proc (rw)
    sysfs on /sys type sysfs (rw)
    devpts on /dev/pts type devpts (rw,gid=5,mode=620)
    tmpfs on /dev/shm type tmpfs (rw)
    /dev/sdb1 on /boot type ext3 (rw)
    /dev/sda3 on /eSATA/F10-BCK type ext3 (rw)
    /dev/sda5 on /eSATA/F12-BCK type ext3 (rw)
    /dev/sda2 on /eSATA/Z-BCK type ext3 (rw)
    /dev/md0 on /Z type ext3 (rw)
    none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
    sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

    # mdadm -Q --detail /dev/md0
    /dev/md0:
            Version : 0.90
      Creation Time : Wed Jun 11 13:04:38 2008
         Raid Level : raid5
         Array Size : 488391680 (465.77 GiB 500.11 GB)
      Used Dev Size : 244195840 (232.88 GiB 250.06 GB)
       Raid Devices : 3
      Total Devices : 3
    Preferred Minor : 0
        Persistence : Superblock is persistent

        Update Time : Tue Jan 12 10:48:34 2010
              State : clean
     Active Devices : 3
    Working Devices : 3
     Failed Devices : 0
      Spare Devices : 0

             Layout : left-symmetric
         Chunk Size : 256K

               UUID : b929c87c:dfc45da3:b94ed757:96c8e19e
             Events : 0.525443

        Number   Major   Minor   RaidDevice State
           0       8       33        0      active sync   /dev/sdc1
           1       8       49        1      active sync   /dev/sdd1
           2       8       65        2      active sync   /dev/sde1
> How reproducible:
> Random; three times in about a month.

This is not random. The raid-check script runs only once per week, so three hangs in about a month is pretty much every time the script runs. We know the script is involved because of this entry in the messages log:

    72 Jan 12 03:02:01 zebulon kernel: md: data-check of RAID array md

So your machine is locking up almost every time the data check runs. Generally speaking, this is going to be one of two possible issues: a kernel bug, or a hardware error leading to a hard lock. To troubleshoot this, it would help to answer all of the following questions:

Is this still happening?
Do you have the most up-to-date F12 kernel?
When it happens, is there anything suspicious about the computer, such as the hard drive LED stuck on solid?
Does the keyboard Caps Lock LED still change when you hit Caps Lock while the machine is hung?
Have you run any memory or CPU test programs to check for possible weaknesses in the reliability of your hardware?
Have you tried a more up-to-date kernel from rawhide, to see whether this is a bug in the md stack of the kernel you are running?
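The correlation described above (a data-check start with no completion before the next boot) can be checked across the whole log rather than by eye. The following is only a sketch under stated assumptions: `find_check_then_restart` is a hypothetical helper name, the log is assumed to be in the usual syslog format without the leading line numbers shown in the report, the imklog "started" line is used as a stand-in for "the machine was rebooted", and the completion message is assumed to look like `md: md0: data-check done.`.

```shell
#!/bin/sh
# Hypothetical helper: report every md data-check that was followed by a
# logger restart before any "data-check done" message appeared.
# $1: path to a syslog-format file, e.g. /var/log/messages
find_check_then_restart() {
    awk '
        # Remember the timestamp (month day time) of the latest check start.
        /md: data-check of RAID array/ { pending = $1 " " $2 " " $3; next }

        # Assumed completion message; a finished check clears the marker.
        /md: md[0-9]+: data-check done/ { pending = ""; next }

        # imklog announces itself when the logger starts, typically at boot.
        /imklog .* started/ {
            if (pending != "") {
                print "check started " pending ", logger restarted " $1 " " $2 " " $3
            }
            pending = ""
        }
    ' "$1"
}
```

Running `find_check_then_restart /var/log/messages` as a user who can read the log would print one line per suspicious check. A restart of the logger for unrelated reasons would also be reported, so each hit still needs a look at the surrounding entries.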
Hi Doug,

I have kept this F12 box up to date. The current kernel is 2.6.31.12-174.2.3.fc12.x86_64. I have not tried a rawhide kernel.

I ran Prime95 stress tests overnight a couple of times. No problems, hangs or crashes, so it does not seem to be CPU or RAM.

I decreased the speed_limit_max parameter from 200000 (dev.raid.speed_limit_max in /etc/sysctl.conf). During a manual run of /etc/cron.weekly/99-raid-check my box still hung at 150000, but not at 125000 KB/s. I probably should have posted an update, but I was waiting for more time to see if the box would hang again. So far a speed_limit_max of 125000 KB/s seems to be working.

Other info ...

When the box did hang (before lowering speed_limit_max), it would lock up to the point that even the Magic SysRq keys would not work. I did not notice whether the HD LED or the Caps Lock LED was stuck on. I still have no clue if it is a software or hardware issue.

The three disks that make up the RAID 5 array are all Seagate Barracuda ST3250410AS 7200.10 SATA 3.0Gb/s 250-GB drives. The motherboard is a Gigabyte X48T-DQ6 with a 82801I (ICH9 Family) 2 port SATA IDE Controller and a 82801IR/IO/IH (ICH9R/DO/DH) 4 port SATA IDE Controller. The three RAID disks are on the 4 port controller. I have not overclocked the CPU, mucked with the RAM timings, etc. Just pretty basic BIOS settings.

Let me know if there is other info I can provide. Thanks for looking into this.

Don
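For reference, the speed cap discussed above can be sketched as follows. This is a minimal illustration, not the exact change made on this machine: `sysctl_line` and `check_minutes` are hypothetical helper names, and the time estimate assumes the check runs at the cap for its whole duration, which makes it a lower bound. The block count (244195840 1 KiB blocks per device) and the speeds are taken from the log and the comment above.

```shell
#!/bin/sh
# Print the /etc/sysctl.conf line for a given cap in KB/sec.
sysctl_line() {
    printf 'dev.raid.speed_limit_max = %s\n' "$1"
}

# Lower bound, in whole minutes, on how long a data-check takes.
# $1: per-device size in 1 KiB blocks, $2: speed cap in KB/sec
check_minutes() {
    echo $(( $1 / $2 / 60 ))
}

sysctl_line 125000

check_minutes 244195840 200000   # original cap
check_minutes 244195840 125000   # cap at which the box stopped hanging

# To apply the cap immediately and start a check by hand (both need root,
# so they are left commented out here):
#   sysctl -w dev.raid.speed_limit_max=125000
#   echo check > /sys/block/md0/md/sync_action
```

At these sizes the lower cap only stretches the check from roughly 20 minutes to roughly half an hour, so the cost of the workaround is small.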
Hi Don,

The fact that slowing down the speed limit fixed the problem, combined with how hard it locked (Alt-SysRq nonfunctional), almost certainly means it is hardware. In this case, I would usually suspect RAM first, CPU second, motherboard third, and in rare cases it can be the power supply.

I'm going to close this out as NOTABUG, but if hardware changes don't resolve the issue, feel free to reopen the bug.

You can also see my work web page for a shell script that will exercise the memory in your machine harder than any dedicated memory tester, to see if that turns up any problems. The address is http://people.redhat.com/dledford; look for the memory test script in the left-hand link column.