From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.4.6-xfs i686)

Description of problem:
I've got a system with software RAID10 set up on it. The pertinent parts of
/etc/raidtab are below. The system has roswell1 installed on it, with all the
up2date patches as of late last week.

This filesystem consistently hangs after some period of use (bonnie, tiobench,
etc). No messages are printed to the screen or to any log file. /proc/mdstat
looks fine. The system comes up fine after reboot. But once in this state, any
access to this filesystem hangs that command.

I have five other raid sets defined using these same disks. None of the other
filesystems give me any trouble.

# ls /raid10 &
# ps -fl
F S UID        PID  PPID  C PRI  NI ADDR    SZ WCHAN  STIME TTY          TIME CMD
100 S root    2532  2531  0  76   0    -   623 wait4  16:49 pts/1    00:00:00 -bash
000 D root    2599  2532  0  69   0    -   427 down   16:50 pts/1    00:00:00 ls --color=tty /raid10
000 R root    2631  2532  0  79   0    -   771 -      16:52 pts/1    00:00:00 ps -fl

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md7 : active raid0 md6[1] md5[0]
      10249088 blocks 64k chunks
md5 : active raid1 sdb7[1] sda7[0]
      5124608 blocks [2/2] [UU]
md6 : active raid1 sdd7[1] sdc7[0]
      5124608 blocks [2/2] [UU]

#
# first half of /raid10
#
raiddev                 /dev/md5
raid-level              1
nr-raid-disks           2
chunk-size              64k
persistent-superblock   1
nr-spare-disks          0
device                  /dev/sda7
raid-disk               0
device                  /dev/sdb7
raid-disk               1

#
# second half of /raid10
#
raiddev                 /dev/md6
raid-level              1
nr-raid-disks           2
chunk-size              64k
persistent-superblock   1
nr-spare-disks          0
device                  /dev/sdc7
raid-disk               0
device                  /dev/sdd7
raid-disk               1

#
# /raid10
#
raiddev                 /dev/md7
raid-level              0
nr-raid-disks           2
chunk-size              64k
persistent-superblock   1
nr-spare-disks          0
device                  /dev/md5
raid-disk               0
device                  /dev/md6
raid-disk               1

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. boot
2. use the raid10 filesystem
3. wait for filesystem to lock up

Actual Results:  locks up

Expected Results:  keeps working

Additional info:

This may be ext3 related, though that could be a red herring.
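For anyone trying to reproduce this, the setup is roughly the following (a
sketch from memory, not an exact transcript; the raidtools commands and
mke2fs/bonnie flags are what I believe I ran, so treat them as assumptions):

# build the two mirrors, then the stripe over them, per /etc/raidtab above
mkraid /dev/md5
mkraid /dev/md6
mkraid /dev/md7

# ext3 filesystem on the stripe (-j creates the journal)
mke2fs -j /dev/md7
mkdir -p /raid10
mount -t ext3 /dev/md7 /raid10

# drive it until it wedges
bonnie -s 1024 -d /raid10
./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T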
Before I reboot to run more tests...

# ps -fle | grep ' D '
040 D root       8     1  0  69   0    -     0 end    12:36 ?        00:00:41 [bdflush]
040 D root       9     1  0  69   0    -     0 end    12:36 ?        00:00:11 [kupdated]
040 D root     213     1  0  69   0    -     0 end    12:36 ?        00:00:16 [kjournald]
040 D root    1786     1  0  69   0    -  1410 down   13:28 ?        00:00:03 ./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T
040 D root    1787     1  0  69   0    -  1410 down   13:28 ?        00:00:03 ./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T
000 D root    2599  2532  0  69   0    -   427 down   16:50 pts/1    00:00:00 ls --color=tty /raid10
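(In case it's useful to whoever picks this up: the same list can be pulled out
more precisely by matching on the state column instead of grepping, so the
header line and false matches drop out. Nothing here beyond standard ps/awk:

# show only processes in uninterruptible sleep, column 2 of ps -fle
ps -fle | awk '$2 == "D"'

Everything stuck is sitting in "down" or "end" per the WCHAN column, which is
why I suspect they're all waiting on the same semaphore.)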
Changed to component "kernel" per request of Stephen Tweedie. I will also attach the output of alt-sysrq-t and the output of ps -efal.
Created attachment 28779 [details] /var/log/messages content from pressing alt-sysrq-t
Created attachment 28780 [details] ps -efal output from roughly the same time as the alt-sysrq-t output
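For the record, the task dump can also be triggered without being at the
console keyboard, on kernels built with CONFIG_MAGIC_SYSRQ that expose
/proc/sysrq-trigger (a sketch; whether this particular roswell1 kernel has the
proc trigger is an assumption on my part):

# enable the magic sysrq key
echo 1 > /proc/sys/kernel/sysrq

# same effect as alt-sysrq-t: dump all task states to the kernel log
echo t > /proc/sysrq-trigger

# the dump lands in the ring buffer and /var/log/messages
dmesg | tail -100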
We (Red Hat) should try to fix this before the next release.
Could you do a quick test using ext2fs? I suspect it's ext3fs that might be the problem here. I've got a test system running with your exact RAID setup (but using ext2fs), and it doesn't hang after hours of tests.
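Something like this is what I'd suggest for the A/B test (a sketch; mke2fs -j
for the ext3 case assumes a recent enough e2fsprogs, and the mount point and
tiotest invocation are taken from your report):

# ext2 run: no journal
umount /raid10
mke2fs /dev/md7
mount -t ext2 /dev/md7 /raid10
./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T

# ext3 run: same filesystem plus a journal
umount /raid10
mke2fs -j /dev/md7
mount -t ext3 /dev/md7 /raid10
./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T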
The hardware has been bundled down the street to Linux World. If I recall correctly, ext3 would always lock up; ext2 didn't always, or took longer to do so.
I was grokking bugzilla before entering a few more items and came across this again in my searching. At the moment, I'm working on a new system with software raid10, and it is performing like a champ. However, this is with the SGI XFS 1.0.2 release, which is based on RH72, and I'm using an XFS filesystem rather than ext3. If it was an md bug or something else in the kernel, then all seems to be well now. If it was/is an ext3 bug, it could still be there; I prefer to avoid ext3.
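For completeness, the new box is set up essentially like this (a sketch; the
md layering is the same as in the original report, mkfs.xfs comes from the
xfsprogs shipped with the SGI XFS 1.0.2 release, and the mount point is
carried over from the old setup):

# same md5/md6/md7 nesting as before, but XFS on the stripe
mkfs.xfs /dev/md7
mount -t xfs /dev/md7 /raid10

# the same tiotest runs that used to wedge ext3 run clean here
./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T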