Bug 172711 - mdadm RAID 5 bug(s) when failing scsi device HDDs
Summary: mdadm RAID 5 bug(s) when failing scsi device HDDs
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: mdadm
Version: 3.0
Hardware: i386
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Doug Ledford
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2005-11-08 16:40 UTC by Petros Koutoupis
Modified: 2007-11-30 22:07 UTC
CC List: 0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-10-19 18:51:40 UTC
Target Upstream Version:
Embargoed:



Description Petros Koutoupis 2005-11-08 16:40:07 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4.3) Gecko/20040924

Description of problem:
This problem occurs only on the 2.4 kernel (AS 3); the same configuration has also been tested on the 2.6 kernel (AS 4) with no problems.

NOTE - RAID 0 and RAID 1 arrays work just fine, with no problems.

Essentially, when you create a RAID 5 array using the mdadm utility, a number of bugs appear. First, let me explain my setup. I have a RAID-head populated with 12 SATA drives connected directly (direct connect) to the host through a QLogic HBA (qla2340). Using the RAID-head utilities I create 12 NRAID arrays (or a single RAID 5 array partitioned into 4+ separate partitions) and map all the Logical Drives/LUNs created from the arrays to the HBA on the host. When I modprobe to re-initialize the driver, the OS picks up all the /dev/sd(x) devices with no problems. I then use the sfdisk utility to partition at least 4 of the LUNs as Linux RAID partitions. No problem. Using the mdadm utility I run the following:

mdadm --create /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

The array created through mdadm begins building (with 3 devices and one hot spare) and can be monitored through /proc/mdstat.
Whether you wait for it to initialize or not, almost the same results are obtained. md0 is formatted and mounted. I then proceed to run 8 processes of IO against the mounted device. As routine practice I fail a drive (by physically pulling it out of the RAID-head enclosure or by removing the LUN mapping) in order to verify the array's basic failover functionality.
If the array is NOT 100% initialized, all 8 IO processes get killed and the hot spare does not take over. The array has failed and is not being rebuilt.
If the array IS 100% initialized, all 8 IO processes are still active but /proc/mdstat does not update properly:

[root@rochester root]# cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid5] [multipath]
read_ahead 1024 sectors
Event: 41
md0 : active raid5 sdc1[2] sdd1[3] sdb1[1] sda1[0]
      76196096 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]
       
unused devices: <none>
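
In this report the drive is failed physically (pulled from the enclosure or unmapped from the LUN). As a rough software-level approximation for comparison, a member can also be marked faulty through mdadm itself; this is only a sketch of that alternative, not the procedure actually used here:

mdadm --manage /dev/md0 --fail /dev/sda1      # mark one member faulty
mdadm --manage /dev/md0 --remove /dev/sda1    # remove it from the array
cat /proc/mdstat                              # the spare should appear and a rebuild should start
mdadm --detail /dev/md0                       # device states should reflect recovery onto the spare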

Also, when checking the details through the mdadm utility, I noticed:

[root@rochester root]# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.00
  Creation Time : Tue Nov  8 08:41:33 2005
     Raid Level : raid5
     Array Size : 76196096 (72.67 GiB 78.02 GB)
    Device Size : 38098048 (36.33 GiB 39.01 GB)
   Raid Devices : 3
  Total Devices : 5
Preferred Minor : 0
    Persistence : Superblock is persistent
 
    Update Time : Tue Nov  8 09:45:05 2005
          State : dirty, no-errors
 Active Devices : 3
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 1
 
         Layout : left-symmetric
     Chunk Size : 64K
 
    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       8       49        3      spare   /dev/sdd1
           UUID : ddd00a22:b39a6436:16400bc5:8be85b74
         Events : 0.2

These are unexpected results: even though I created the array with 4 devices (raid-devices=3 and spare-devices=1), the mdadm details show a pseudo device bringing the total to 5 and automatically mark that 5th device as failed.
There is still no indication of the known failed device and no hot spare takeover; the OS is not picking the failure up.
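
One way to narrow down where the bogus 5th device comes from would be to compare the on-disk superblock of each member against the --detail output above; a minimal sketch, assuming the same device names as in this report:

for dev in /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1; do
    # print the device counts and state recorded in each 0.90 superblock
    mdadm --examine $dev | grep -E 'Raid Devices|Total Devices|Spare Devices|State'
done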

Version-Release number of selected component (if applicable):
2.4.21-27

How reproducible:
Always

Steps to Reproduce:
1. Create 4+ NRAID arrays (or a single RAID 5 array partitioned into 4+ separate partitions), map the LUNs to the host's HBA, and modprobe the driver to pick up the device changes.
2. Partition the LUNs, build the RAID array with the mdadm utility, then format and mount it (see the command sketch after these steps).
3. Fail a drive.
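
As a condensed sketch of steps 2 and 3 (the filesystem, mount point, and IO workload below are illustrative assumptions, not taken verbatim from this report):

# partition each LUN with a single Linux raid autodetect (type fd) partition
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
    echo ',,fd' | sfdisk $dev
done

# create the RAID 5 array with 3 active members and 1 hot spare
mdadm --create /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

# format, mount, and start 8 background IO processes against the array
mkfs.ext3 /dev/md0
mkdir -p /mnt/md0
mount /dev/md0 /mnt/md0
for i in 1 2 3 4 5 6 7 8; do
    dd if=/dev/zero of=/mnt/md0/io_test_$i bs=1024k count=1024 &
done

# now fail a drive (physically pull it or remove the LUN mapping)
# and watch whether the hot spare takes over:
cat /proc/mdstat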
  

Actual Results:  See the description above.

Expected Results:  The hot spare should have taken over and the RAID 5 array should have started rebuilding.
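
For comparison, the expected behaviour could be checked with something along these lines (a sketch, using md0 as in the report):

cat /proc/mdstat          # should show the failed member and a rebuild in progress
mdadm --detail /dev/md0   # Failed/Spare device counts and the rebuild status should reflect the takeover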

Additional info:

Comment 1 Petros Koutoupis 2005-11-14 20:46:26 UTC
I forgot to add that this is an issue with mdadm version v1.5.0 (22 Jan 2004); versions v1.0.1 (20 May 2002) and v1.6.0 (4 June 2004) work just fine.
I have also noticed that in some instances it flags two spare devices when I only specified (for example) 3 raid devices and 1 spare. It creates the RAID 5 array with just 2 disks and marks the other two as spares (obviously this is physically and logically incorrect).
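
For anyone trying to reproduce this, the mdadm version in use can be confirmed with, for example:

mdadm --version     # prints the mdadm release (e.g. v1.5.0)
rpm -q mdadm        # shows the packaged version on RHEL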

Comment 2 RHEL Program Management 2007-10-19 18:51:40 UTC
This bug is filed against RHEL 3, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products. Since
this bug does not meet those criteria, it is now being closed.
 
For more information on the RHEL errata support policy, please visit:
http://www.redhat.com/security/updates/errata/
 
If you feel this bug is indeed mission critical, please contact your
support representative. You may be asked to provide detailed
information on how this bug is affecting you.

