Bug 230860 - DegradedArray event for no good reason
Summary: DegradedArray event for no good reason
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 6
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Doug Ledford
QA Contact:
URL:
Whiteboard: bzcl34nup
Depends On: 236666
Blocks:
 
Reported: 2007-03-03 21:02 UTC by Jack Tanner
Modified: 2008-05-06 19:18 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-05-06 19:18:14 UTC
Type: ---
Embargoed:


Attachments
hdparm -I for /dev/hdb (1.85 KB, text/plain), 2007-04-06 22:50 UTC, Jack Tanner
hdparm -I for /dev/hdc (1.82 KB, text/plain), 2007-04-06 22:51 UTC, Jack Tanner
dmesg when hdc is omitted (81.20 KB, text/plain), 2007-04-16 14:23 UTC, Jack Tanner

Description Jack Tanner 2007-03-03 21:02:53 UTC
Description of problem:

When I reboot, I get this e-mail:

========================================

This is an automatically generated mail message from mdadm
running on foo

A DegradedArray event had been detected on md device /dev/md0.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid1]
md0 : active raid1 hdb[0]
     156290816 blocks [2/1] [U_]

unused devices: <none>

========================================

This is really bizarre, because I should have 2 working devices!

# mdadm /dev/hdc --examine
/dev/hdc:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 32ea6e7a:2ce2493e:f594b64f:1a51337b
  Creation Time : Sun Mar 26 14:18:38 2006
     Raid Level : raid1
    Device Size : 156290816 (149.05 GiB 160.04 GB)
     Array Size : 156290816 (149.05 GiB 160.04 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0

    Update Time : Sat Mar  3 15:48:48 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
       Checksum : ac4752fb - correct
         Events : 0.255274


      Number   Major   Minor   RaidDevice State
this     1      22        0        1      active sync   /dev/hdc

   0     0       3       64        0      active sync   /dev/hdb
   1     1      22        0        1      active sync   /dev/hdc


On the other hand, maybe I have only one working device!

# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Sun Mar 26 14:18:38 2006
     Raid Level : raid1
     Array Size : 156290816 (149.05 GiB 160.04 GB)
    Device Size : 156290816 (149.05 GiB 160.04 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sat Mar  3 15:51:04 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 32ea6e7a:2ce2493e:f594b64f:1a51337b
         Events : 0.255284

    Number   Major   Minor   RaidDevice State
       0       3       64        0      active sync   /dev/hdb
       1       0        0        1      removed

So, which is it, one or two?

# mdadm /dev/md0 --add /dev/hdc
mdadm: re-added /dev/hdc
# cat /proc/mdstat 
Personalities : [raid1] 
md0 : active raid1 hdc[2] hdb[0]
      156290816 blocks [2/1] [U_]
      [>....................]  recovery =  0.1% (178304/156290816) finish=43.7min speed=59434K/sec
      
unused devices: <none>

I know from previous experience that when this finishes, I'll have two working
devices, but /dev/hdc will be (incorrectly) marked as a spare in the output of
mdadm --examine /dev/hdc, though not in the output of mdadm --detail /dev/md0.

When I reboot (after a clean shutdown), the cycle will repeat with another
e-mail about a DegradedArray and /dev/hdc marked as removed for no reason. 
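
As an aside, one way to narrow down which superblock the kernel considers stale
after a reboot is to compare the event counters and update times recorded on
each member device. A minimal sketch (run as root; device names match the ones
above):

  # The member whose superblock shows the lower Events count is the one
  # the kernel will treat as out of date and drop from the array.
  for d in /dev/hdb /dev/hdc; do
      echo "== $d =="
      mdadm --examine "$d" | grep -E 'Update Time|State|Events'
  done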

Version-Release number of selected component (if applicable):

# rpm -q mdadm
mdadm-2.5.4-2.fc6

# uname -a
Linux herbie 2.6.19-1.2911.6.4.fc6 #1 SMP Sat Feb 24 14:39:04 EST 2007 i686 i686
i386 GNU/Linux

# cat /etc/mdadm.conf 
DEVICE partitions
MAILADDR root
# DEVICE /dev/md0 /dev/hdb /dev/hdc
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=32ea6e7a:2ce2493e:f594b64f:1a51337b

# cat /proc/partitions 
major minor  #blocks  name

   3     0  156290904 hda
   3     1     104422 hda1
   3     2    2152678 hda2
   3     3  154031220 hda3
   3    64  156290904 hdb
  22     0  156290904 hdc
 253     0   61440000 dm-0
 253     1   10223616 dm-1
 253     2     917504 dm-2
   9     0  156290816 md0
 253     3  122978304 dm-3

Comment 1 Doug Ledford 2007-03-31 14:06:15 UTC
This sounds like a kernel problem, not an mdadm problem. Specifically, it sounds
like the kernel is failing to update the superblock on hdc correctly; mdadm
simply tells the kernel to use hdc on a hot-add operation, and the kernel is the
component that actually sets the superblock state and writes it to disk.  Have
recent kernel updates in FC6 fixed this problem for you?

Comment 2 Jack Tanner 2007-03-31 16:09:33 UTC
No help from recent kernel updates. I tested: while running 2.6.20-1.2933.fc6, I
shut down (/sbin/shutdown -r now), and when the system came up the array was
degraded again, and I had to add /dev/hdc back in.

I watched the shutdown procedure, and the last line before the system restarts
says that /dev/md0 is still in use. Could that have something to do with it?
Could it have something to do with having swap on /dev/VolGroup00/LogVol01?

Comment 3 Doug Ledford 2007-03-31 17:34:31 UTC
Yes, the array still being in use is likely to have an impact on things.  This
usually happens when you have / on an lvm device.  The reason the raid device
can't be shut down is that / is never unmounted (although it can be mounted
read-only).  Since / is never unmounted, the lvm device can't be completely shut
down, and since it can't be completely shut down, neither can the raid array. 
Normally, this is handled by the / filesystem going read only, then the lvm
device goes read only (and writes out a clean lvm superblock), then the raid
device goes read only (and also writes a clean superblock).  Also,
theoretically, the IDE subsystem should do a cache flush on the drive during
shutdown, but after all the above items have gone read only.  That would flush
the clean superblock to the device and make startup happen normally.

The problem here may be that one of your drives uses write-through caching while
the other uses write-back (write-behind) caching, and the IDE subsystem isn't
flushing the cache on that drive. That could be a kernel problem if it simply
isn't flushing the drive cache, or an initscripts problem if the scripts aren't
setting everything read-only prior to the kernel's cache flush.

Try using hdparm to detect the cache settings on hdb and hdc and see if there is
a difference.  If there is, try setting hdc to match hdb and see if that
corrects your problem.  If it doesn't, try to watch the exact ordering of things
getting shut down and post that here so I can reassign this either to the kernel
or to initscripts depending on which one looks like it is the culprit.
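
For reference, a minimal hdparm sketch along those lines (the exact feature
strings vary by drive model, so treat this as an example rather than a recipe):

  # Show whether the on-drive write cache is currently enabled
  hdparm -I /dev/hdb | grep -i 'write cache'
  hdparm -I /dev/hdc | grep -i 'write cache'

  # If they differ, toggle hdc to match hdb (0 disables, 1 enables the cache)
  hdparm -W 0 /dev/hdc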

Comment 4 Jack Tanner 2007-04-06 22:50:41 UTC
Created attachment 151901 [details]
hdparm -I for /dev/hdb

First off, Doug, thank you for being so helpful; I really appreciate it.

Second, I'm not sure how to tell whether one drive has write-through and the
other has write-behind caching. I'm attaching hdparm -I output for hdb and hdc.

Third, independent of the unmounting bug, could there still be a bug in mdadm,
given that mdadm --examine /dev/hdc and mdadm --detail /dev/md0 produce
conflicting output?

Fourth, w.r.t. the unmounting bug, I can't figure out what's not getting killed
properly. As an experiment, I put the system into runlevel 1, checked that no
services were running, waited a minute, and then did a shutdown -t 5 -r now.
Everything stopped "OK", and md0 was still in use even after all file systems
were unmounted.
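
In case it is useful, a sketch of how to see what is still holding the array at
that point, assuming sysfs and the device-mapper tools are available:

  # Anything listed under holders/ still has md0 open (e.g. a dm device)
  ls /sys/block/md0/holders/

  # Map device-mapper devices back to their LVM names and open counts
  dmsetup info -c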

Comment 5 Jack Tanner 2007-04-06 22:51:17 UTC
Created attachment 151902 [details]
hdparm -I for /dev/hdc

Comment 6 Doug Ledford 2007-04-07 13:39:41 UTC
Do you have this raid device being used by the lvm subsystem?

Comment 7 Jack Tanner 2007-04-15 14:07:05 UTC
Yup.

Comment 8 Jack Tanner 2007-04-15 14:07:51 UTC
Yes, this raid device is being used by LVM.
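
For completeness, the layering can be confirmed with the LVM2 tools (a sketch;
the volume group name is whatever pvs reports):

  # /dev/md0 should show up here as a physical volume
  pvs

  # More detail on the PV that sits on the array
  pvdisplay /dev/md0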

Comment 9 Doug Ledford 2007-04-16 13:01:08 UTC
OK, I need to see the dmesg output from this machine starting at bootup until
you've added the missing device back into the raid array.

Comment 10 Jack Tanner 2007-04-16 14:23:46 UTC
Created attachment 152687 [details]
dmesg when hdc is omitted

Here's the entire dmesg. The drive is added back into the mirror around line
540.

Comment 11 Doug Ledford 2007-04-16 14:51:30 UTC
Next I need the output of:

ls /sys/block/dm-*/slaves


Comment 12 Jack Tanner 2007-04-17 04:13:27 UTC
This is once I've re-added hdc... Don't know if that makes a difference.

$ ls /sys/block/dm-*/slaves
/sys/block/dm-0/slaves:
hda3

/sys/block/dm-1/slaves:
hda3

/sys/block/dm-2/slaves:
hda3

/sys/block/dm-3/slaves:
md0


Comment 13 Doug Ledford 2007-04-17 04:31:24 UTC
There is a patch in the bug that this bug depends on.  Can you try out that
patch and get me both the dmesg and the console output during bootup?

Comment 14 Jack Tanner 2007-04-17 14:39:46 UTC
You mean the patch in bug 213586?

I'll be happy to try it, but I'm no mkinitrd genius, so please (a) give
instructions on what to do to make sure I'm trying it right, and (b) tell me how
to back out the changes if my machine doesn't feel like booting.
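
A generic pattern for rebuilding an initrd while keeping a way back out looks
roughly like this (only a sketch, and it assumes the patch ends up in the
initrd; adjust the image name to whatever uname -r reports):

  # Keep a copy of the current image so it can be restored from the GRUB
  # prompt or a rescue environment if the new one fails to boot
  cp /boot/initrd-$(uname -r).img /boot/initrd-$(uname -r).img.bak

  # Rebuild the initrd for the running kernel; -f overwrites the existing image
  mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)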

Comment 15 Bug Zapper 2008-04-04 06:26:07 UTC
Fedora apologizes that these issues have not been resolved yet. We're
sorry it's taken so long for your bug to be properly triaged and acted
on. We appreciate the time you took to report this issue and want to
make sure no important bugs slip through the cracks.

If you're currently running a version of Fedora Core between 1 and 6,
please note that Fedora no longer maintains these releases. We strongly
encourage you to upgrade to a current Fedora release. In order to
refocus our efforts as a project we are flagging all of the open bugs
for releases which are no longer maintained and closing them.
http://fedoraproject.org/wiki/LifeCycle/EOL

If this bug is still open against Fedora Core 1 through 6 thirty days from now,
it will be closed as WONTFIX. If you can reproduce this bug in the latest Fedora
version, please change the version field to the respective version. If you are
unable to do this, please add a comment to this bug requesting the change.

Thanks for your help, and we apologize again that we haven't handled
these issues to this point.

The process we are following is outlined here:
http://fedoraproject.org/wiki/BugZappers/F9CleanUp

We will be following the process here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping to ensure this
doesn't happen again.

And if you'd like to join the bug triage team to help make things
better, check out http://fedoraproject.org/wiki/BugZappers

Comment 16 Bug Zapper 2008-05-06 19:18:12 UTC
This bug is open for a Fedora version that is no longer maintained and
will not be fixed by Fedora. Therefore we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora,
please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

