52138 – software raid10 consistently hangs

Bug 52138 - software raid10 consistently hangs

Summary: software raid10 consistently hangs

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Public Beta
Classification:	Retired
Component:	kernel
Sub Component:
Version:	roswell
Hardware:	i386
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Ingo Molnar
QA Contact:	David Lawrence
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2001-08-21 00:10 UTC by Jim Wright
Modified:	2007-04-18 16:36 UTC (History)
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2001-12-07 23:30:20 UTC
Embargoed:

Attachments	(Terms of Use)
/var/log/messages content from pressing alt-sysrq-t (121.54 KB, text/plain) 2001-08-21 19:33 UTC, Jim Wright	no flags	Details
ps -efal output from roughly the same time as the alt-sysrq-t output (11.88 KB, text/plain) 2001-08-21 19:35 UTC, Jim Wright	no flags	Details
View All

Description Jim Wright 2001-08-21 00:10:07 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.4.6-xfs i686)

Description of problem:
I've got a system with software RAID10 set up on it.  The pertinent
parts of /etc/raidtab are below.  The system has roswell1 installed on
it, with all the up2date patches as of late last week.  This filesystem
consistently hangs after some period of use (bonnie, tiobench, etc).
No messages are printed to the screen or to any log file.  /proc/mdstat
looks fine.  The system comes up fine after reboot.  But once in this
state, any access to this filesystem hangs that command.

I have five other raid sets defined using these same disks.  None of
the other filesystems give me any trouble.


# ls /raid10 &
# ps -fl
  F S UID        PID  PPID  C PRI  NI ADDR    SZ WCHAN  STIME TTY        
TIME CMD
100 S root      2532  2531  0  76   0    -   623 wait4  16:49 pts/1  
00:00:00 -bash
000 D root      2599  2532  0  69   0    -   427 down   16:50 pts/1  
00:00:00 ls --color=tty /raid10
000 R root      2631  2532  0  79   0    -   771 -      16:52 pts/1  
00:00:00 ps -fl

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md7 : active raid0 md6[1] md5[0]
      10249088 blocks 64k chunks

md5 : active raid1 sdb7[1] sda7[0]
      5124608 blocks [2/2] [UU]

md6 : active raid1 sdd7[1] sdc7[0]
      5124608 blocks [2/2] [UU]





#
# first half of /raid10
#
raiddev             /dev/md5
raid-level                  1
nr-raid-disks               2
chunk-size                  64k
persistent-superblock       1
nr-spare-disks              0
    device          /dev/sda7
    raid-disk     0
    device          /dev/sdb7
    raid-disk     1
#
# second half of /raid10
#
raiddev             /dev/md6
raid-level                  1
nr-raid-disks               2
chunk-size                  64k
persistent-superblock       1
nr-spare-disks              0
    device          /dev/sdc7
    raid-disk     0
    device          /dev/sdd7
    raid-disk     1
#
# /raid10
#
raiddev             /dev/md7
raid-level                  0
nr-raid-disks               2
chunk-size                  64k
persistent-superblock       1
nr-spare-disks              0
    device          /dev/md5
    raid-disk     0
    device          /dev/md6
    raid-disk     1


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. boot
2. use the raid10 filesystem
3. wait for filesystem to lock up
	

Actual Results:  locks up

Expected Results:  keeps working

Additional info:

This may be ext3 related, though that could be a red herring.

Comment 1 Jim Wright 2001-08-21 00:27:12 UTC

before I reboot to run more tests...

# ps -fle | grep ' D '
040 D root         8     1  0  69   0    -     0 end    12:36 ?        00:00:41
[bdflush]
040 D root         9     1  0  69   0    -     0 end    12:36 ?        00:00:11
[kupdated]
040 D root       213     1  0  69   0    -     0 end    12:36 ?        00:00:16
[kjournald]
040 D root      1786     1  0  69   0    -  1410 down   13:28 ?        00:00:03
./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T
040 D root      1787     1  0  69   0    -  1410 down   13:28 ?        00:00:03
./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T
000 D root      2599  2532  0  69   0    -   427 down   16:50 pts/1    00:00:00
ls --color=tty /raid10

Comment 2 Jim Wright 2001-08-21 19:30:42 UTC

changed to component "kernel" per request of Stephen Tweedie

I will also attach the output of alt-sysrq-t and the output of ps -efal.

Comment 3 Jim Wright 2001-08-21 19:33:40 UTC

Created attachment 28779 [details]
/var/log/messages content from pressing alt-sysrq-t

Comment 4 Jim Wright 2001-08-21 19:35:43 UTC

Created attachment 28780 [details]
ps -efal output from roughly the same time as the alt-sysrq-t output

Comment 5 Glen Foster 2001-08-21 20:28:58 UTC

We (Red Hat) should try to fix this before next release.

Comment 6 Ingo Molnar 2001-08-27 07:56:03 UTC

Could you do a quick test using ext2fs? I suspect it's ext3fs that might be 
the problem here. I've got a testsystem running with your exact RAID setup 
(but using ext2fs), and it doesnt hang after hours of tests.

Comment 7 Jim Wright 2001-08-31 16:43:10 UTC

The hardware has been bundled down the street to Linux World.

If I recall correctly, ext3 would always lock up.  ext2 didn't always, or took
longer

Comment 8 Jim Wright 2001-12-07 23:30:15 UTC

I was grokking bugzilla before entering a few more items and came
across this again in my searching.  At the moment, I'm working on
a new system with software raid10 and it is performing like
a champ.  However, this is with the SGI XFS 1.0.2 release which
is based on RH72 and I'm using an XFS filesystem and not ext3.  If
it was an md bug or something else in the kernel, then all seems to
be well now.  If it was/is an ext3 bug, could still be there - I
prefer to avoid ext3.

Note You need to log in before you can comment on or make changes to this bug.