Bug 169443

Summary: lsraid and /proc/mdstat report different statuses on arrays
Product: Red Hat Enterprise Linux 3
Reporter: Dan Fruehauf <danfr>
Component: raidtools
Assignee: Doug Ledford <dledford>
Status: CLOSED NOTABUG
QA Contact: David Lawrence <dkl>
Severity: high
Priority: medium
Version: 3.0
Hardware: All
OS: Linux
Last Closed: 2005-10-26 22:25:32 UTC

Description Dan Fruehauf 2005-09-28 13:18:57 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050623 Fedora/1.0.4-5 Firefox/1.0.4

Description of problem:
lsraid and /proc/mdstat show different things about an array.
I'm not sure how reproducible this is, but it's the third time I've encountered it...

mdstat:
[root@Linux201 root]# cat /proc/mdstat
Personalities : [raid1]
read_ahead 1024 sectors
Event: 7
md6 : active raid1 sdd3[1] sdb3[0]
      19486720 blocks [2/2] [UU]

md5 : active raid1 sdd2[1] sdb2[0]
      11727360 blocks [2/2] [UU]

md4 : active raid1 sdd1[1] sdb1[0]
      39069952 blocks [2/2] [UU]

md3 : active raid1 sdc4[2] sda4[0]
      3911744 blocks [2/1] [U_]

md2 : active raid1 sdc3[2] sda3[0]
      3911744 blocks [2/1] [U_]

md1 : active raid1 sdc2[2] sda2[0]
      62508800 blocks [2/1] [U_]

md0 : active raid1 sdc1[1] sda1[0]
      24000 blocks [2/2] [UU]

unused devices: <none>

[root@Linux201 root]# lsraid -a /dev/md2
[dev   9,   2] /dev/md2         4B509E8E.1ED7E78D.40F6EE70.6B0D0F7A online
[dev   8,   3] /dev/sda3        4B509E8E.1ED7E78D.40F6EE70.6B0D0F7A good
[dev   ?,   ?] (unknown)        00000000.00000000.00000000.00000000 missing

And the question is: why doesn't /dev/md2 show the spare device?
md1, on the other hand, shows this:
[root@Linux201 root]# lsraid -a /dev/md1
[dev   9,   1] /dev/md1         096AE65D.AD8A29EE.C7BDADA3.3A2784B9 online
[dev   8,   2] /dev/sda2        096AE65D.AD8A29EE.C7BDADA3.3A2784B9 good
[dev   ?,   ?] (unknown)        00000000.00000000.00000000.00000000 missing
[dev   8,  34] /dev/sdc2        096AE65D.AD8A29EE.C7BDADA3.3A2784B9 spare

Which is fine.

So if I want to monitor an md device, what should I trust? /proc/mdstat? lsraid? Both? Or neither?
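
For what it's worth, the kind of check I'd like to be able to rely on looks roughly like this (a sketch only, keyed to the [U_]/[UU] status field in the /proc/mdstat output above):

#!/bin/sh
# Sketch: flag any md array whose member-status field has a hole,
# e.g. [U_] instead of [UU]. Assumes the /proc/mdstat layout shown above.
if grep -q '\[[U_]*_' /proc/mdstat; then
    echo "degraded md array detected:" >&2
    grep -B1 '\[[U_]*_' /proc/mdstat >&2
    exit 1
fi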

Another thing I'd like to note: in the current state, running raidstop /dev/md2 will crash the system (kernel panic).

Version-Release number of selected component (if applicable):
raidtools-1.00.3-7

How reproducible:
Sometimes

Steps to Reproduce:
1. With the array in this state, run cat /proc/mdstat and lsraid -a on the device
2. Observe that lsraid does not report the spare device, although it should
3. Try to stop the 'problematic' md device with raidstop and receive a panic
  

Actual Results:  /proc/mdstat and lsraid show different output instead of reporting the same state.

Expected Results:  This is what I wanted to see from lsraid:

[root@Linux201 root]# lsraid -a /dev/md2
[dev   9,   2] /dev/md2         4B509E8E.1ED7E78D.40F6EE70.6B0D0F7A online
[dev   8,   3] /dev/sda3        4B509E8E.1ED7E78D.40F6EE70.6B0D0F7A good
[dev   ?,   ?] (unknown)        00000000.00000000.00000000.00000000 missing
[dev   8,  34] /dev/sdc3        4B509E8E.1ED7E78D.40F6EE70.6B0D0F7A spare

Needless to say, I didn't want to receive that panic either...

Additional info:

I'm currently running RHEL3U2, but this problem has also happened to me twice on RHEL3U5.

Comment 1 Doug Ledford 2005-10-26 22:25:32 UTC
First, lsraid is part of the deprecated raidtools package and likely will not be
updated regardless of whether there is a bug there or not.  The preferred tool is
mdadm, and mdadm -E --brief will probably give you what you are looking for.
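
For example (a sketch; the device names are taken from your report, and output details vary by mdadm version):

mdadm --detail /dev/md2             # the kernel's current view of the array
mdadm --examine --brief /dev/sda3   # one-line summary from the on-disk superblock
mdadm --detail --scan               # one ARRAY line per running array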

Now, the issue of lsraid not seeing the spare is probably because the drive
hasn't started the resync process yet.  Since md1 is already resyncing, the
process of adding sdc3 to md2 would be delayed and therefore the disk would not
be registered as an active spare yet.

As to which to trust: both are correct.  The mdstat file shows the current
kernel state.  The kernel knows about the new spare disks, but isn't doing
anything with them yet.  When lsraid queries the disk array, it doesn't see the
spare because the spare isn't live yet.

The oops on shutting down the md2 array is another matter entirely, and I believe
I already know what that problem is.  You'll likely find that the oops has
already been reported in bz #134736.

Since I don't see a bug in this report, other than the oops that is already
reported elsewhere, I'm closing this bug report out.  Thank you for the report,
and please add yourself to the above-mentioned bugzilla if you would like to be
kept abreast of when the "shutting down a raid1 array while a rebuild is taking
place" bug gets fixed.