Bug 52138 - software raid10 consistently hangs
Status: CLOSED CURRENTRELEASE
Product: Red Hat Public Beta
Classification: Retired
Component: kernel
Version: roswell
Hardware: i386
OS: Linux
Priority: medium
Severity: medium
Assigned To: Ingo Molnar
QA Contact: David Lawrence
Reported: 2001-08-20 20:10 EDT by Jim Wright
Modified: 2007-04-18 12:36 EDT
Doc Type: Bug Fix
Last Closed: 2001-12-07 18:30:20 EST

Attachments
/var/log/messages content from pressing alt-sysrq-t (121.54 KB, text/plain)
    2001-08-21 15:33 EDT, Jim Wright
ps -efal output from roughly the same time as the alt-sysrq-t output (11.88 KB, text/plain)
    2001-08-21 15:35 EDT, Jim Wright

Description Jim Wright 2001-08-20 20:10:07 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.77 [en] (X11; U; Linux 2.4.6-xfs i686)

Description of problem:
I've got a system with software RAID10 set up on it.  The pertinent
parts of /etc/raidtab are below.  The system has roswell1 installed on
it, with all the up2date patches as of late last week.  This filesystem
consistently hangs after some period of use (bonnie, tiobench, etc.).
No messages are printed to the screen or to any log file.  /proc/mdstat
looks fine.  The system comes up fine after reboot.  But once in this
state, any access to this filesystem hangs that command.

I have five other raid sets defined using these same disks.  None of
the other filesystems give me any trouble.


# ls /raid10 &
# ps -fl
  F S UID        PID  PPID  C PRI  NI ADDR    SZ WCHAN  STIME TTY          TIME CMD
100 S root      2532  2531  0  76   0    -   623 wait4  16:49 pts/1    00:00:00 -bash
000 D root      2599  2532  0  69   0    -   427 down   16:50 pts/1    00:00:00 ls --color=tty /raid10
000 R root      2631  2532  0  79   0    -   771 -      16:52 pts/1    00:00:00 ps -fl

# cat /proc/mdstat
Personalities : [raid0] [raid1] [raid5]
read_ahead 1024 sectors
md7 : active raid0 md6[1] md5[0]
      10249088 blocks 64k chunks

md5 : active raid1 sdb7[1] sda7[0]
      5124608 blocks [2/2] [UU]

md6 : active raid1 sdd7[1] sdc7[0]
      5124608 blocks [2/2] [UU]





#
# first half of /raid10
#
raiddev             /dev/md5
raid-level                  1
nr-raid-disks               2
chunk-size                  64k
persistent-superblock       1
nr-spare-disks              0
    device          /dev/sda7
    raid-disk     0
    device          /dev/sdb7
    raid-disk     1
#
# second half of /raid10
#
raiddev             /dev/md6
raid-level                  1
nr-raid-disks               2
chunk-size                  64k
persistent-superblock       1
nr-spare-disks              0
    device          /dev/sdc7
    raid-disk     0
    device          /dev/sdd7
    raid-disk     1
#
# /raid10
#
raiddev             /dev/md7
raid-level                  0
nr-raid-disks               2
chunk-size                  64k
persistent-superblock       1
nr-spare-disks              0
    device          /dev/md5
    raid-disk     0
    device          /dev/md6
    raid-disk     1
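
(For reference only: a rough equivalent of this nested RAID1+0 layout using the later
mdadm tool, rather than the raidtools/mkraid workflow the raidtab above implies.
Device names and chunk size are taken from the excerpt; this is a sketch, not the
reporter's actual procedure.)

# two RAID1 mirrors, then a RAID0 stripe across them (same shape as md5/md6/md7 above)
mdadm --create /dev/md5 --level=1 --raid-devices=2 /dev/sda7 /dev/sdb7
mdadm --create /dev/md6 --level=1 --raid-devices=2 /dev/sdc7 /dev/sdd7
mdadm --create /dev/md7 --level=0 --chunk=64 --raid-devices=2 /dev/md5 /dev/md6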


Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. boot
2. use the raid10 filesystem (e.g. the stress loop sketched after this list)
3. wait for filesystem to lock up
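
(A hedged example of step 2, built from the tiotest invocation the reporter shows in
comment 1; the loop itself is illustrative and assumes tiotest sits in the current
directory.)

# hammer the RAID10 filesystem until it wedges
while true; do ./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T; done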
	

Actual Results:  locks up

Expected Results:  keeps working

Additional info:

This may be ext3-related, though that could be a red herring.
Comment 1 Jim Wright 2001-08-20 20:27:12 EDT
Before I reboot to run more tests...

# ps -fle | grep ' D '
040 D root         8     1  0  69   0    -     0 end    12:36 ?        00:00:41 [bdflush]
040 D root         9     1  0  69   0    -     0 end    12:36 ?        00:00:11 [kupdated]
040 D root       213     1  0  69   0    -     0 end    12:36 ?        00:00:16 [kjournald]
040 D root      1786     1  0  69   0    -  1410 down   13:28 ?        00:00:03 ./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T
040 D root      1787     1  0  69   0    -  1410 down   13:28 ?        00:00:03 ./tiotest -t 2 -f 512 -r 2000 -b 4096 -d /raid10 -T
000 D root      2599  2532  0  69   0    -   427 down   16:50 pts/1    00:00:00 ls --color=tty /raid10
Comment 2 Jim Wright 2001-08-21 15:30:42 EDT
Changed component to "kernel" per request of Stephen Tweedie.

I will also attach the output of alt-sysrq-t and the output of ps -efal.
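
(For anyone reproducing this: the Alt-SysRq-T task dump only works if the magic SysRq
key is enabled. A minimal sketch, assuming a kernel built with CONFIG_MAGIC_SYSRQ;
the /proc/sysrq-trigger shortcut applies only on kernels that expose that file.)

echo 1 > /proc/sys/kernel/sysrq     # enable the magic SysRq key
# then press Alt-SysRq-T on the console; the task-state dump lands in the kernel log
# (dmesg, /var/log/messages).  On kernels that expose /proc/sysrq-trigger,
# "echo t > /proc/sysrq-trigger" produces the same dump without console access.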
Comment 3 Jim Wright 2001-08-21 15:33:40 EDT
Created attachment 28779 [details]
/var/log/messages content from pressing alt-sysrq-t
Comment 4 Jim Wright 2001-08-21 15:35:43 EDT
Created attachment 28780 [details]
ps -efal output from roughly the same time as the alt-sysrq-t output
Comment 5 Glen Foster 2001-08-21 16:28:58 EDT
We (Red Hat) should try to fix this before the next release.
Comment 6 Ingo Molnar 2001-08-27 03:56:03 EDT
Could you do a quick test using ext2fs? I suspect it's ext3fs that might be
the problem here. I've got a test system running with your exact RAID setup
(but using ext2fs), and it doesn't hang after hours of tests.
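
(A minimal way to run that test, assuming the array can be wiped; the device and mount
point come from the report, and the mke2fs/mount invocations are illustrative, not the
commands anyone actually ran.)

umount /raid10
mke2fs /dev/md7                  # recreate the filesystem as ext2; destroys existing data
mount -t ext2 /dev/md7 /raid10   # rerun the same bonnie/tiotest workload against it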
Comment 7 Jim Wright 2001-08-31 12:43:10 EDT
The hardware has been bundled down the street to Linux World.

If I recall correctly, ext3 would always lock up.  ext2 didn't always lock up,
or took longer to do so.
Comment 8 Jim Wright 2001-12-07 18:30:15 EST
I was grokking bugzilla before entering a few more items and came
across this bug again in my searching.  At the moment I'm working on
a new system with software raid10 and it is performing like
a champ.  However, this is with the SGI XFS 1.0.2 release, which
is based on RH 7.2, and I'm using an XFS filesystem rather than ext3.  If
it was an md bug or something else in the kernel, then all seems to
be well now.  If it was/is an ext3 bug, it could still be there; I
prefer to avoid ext3.
