Bug 132980

Summary: raid1 mirrors that lose a drive go into endless redirecting loop
Product: [Fedora] Fedora
Component: kernel
Version: 2
Hardware: All
OS: Linux
Status: CLOSED NEXTRELEASE
Severity: high
Priority: medium
Reporter: Hrunting Johnson <hrunting>
Assignee: Dave Jones <davej>
QA Contact: Brian Brock <bbrock>
CC: pfrields, wtogami
Doc Type: Bug Fix
Last Closed: 2005-04-16 04:36:54 UTC

Description Hrunting Johnson 2004-09-20 16:24:56 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3)
Gecko/20040913 Firefox/0.10

Description of problem:
When a drive fails out of a RAID1 mirror, the RAID goes into an
endless "redirecting sector" loop that only a reboot stops.
While this is happening, the RAID is unusable and anything that tries
to access it hangs.

Version-Release number of selected component (if applicable):
kernel-2.6.8-1.521

How reproducible:
Always

Steps to Reproduce:
1. create a raid1 mirror (/dev/mdX)
2. mdadm /dev/mdX --set-faulty /dev/sdX
3. tail -f /var/log/messages
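
The steps above can be sketched as a small shell helper. This is only a
sketch under assumptions, not the exact procedure used for this report: the
names run, reproduce and DRY_RUN are made up for illustration, and the array
and member devices are whatever the caller passes in. By default nothing is
executed; the mdadm commands are only printed.

```shell
# Hypothetical helper: DRY_RUN=1 (the default) prints commands instead of
# running them, so the sketch is safe to try on a machine with real data.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reproduce ARRAY MEMBER_A MEMBER_B
reproduce() {
    md=$1
    a=$2
    b=$3
    # Step 1: create a two-device RAID1 mirror.
    run mdadm --create "$md" --level=1 --raid-devices=2 "$a" "$b"
    # Step 2: mark one member as faulty, as in the report.
    run mdadm "$md" --set-faulty "$a"
    # Show the array state afterwards; step 3 (tail -f /var/log/messages)
    # is left to a second terminal.
    run mdadm --detail "$md"
}
```

Calling reproduce with an unused array name and two scratch devices prints
the three commands; running them for real needs root and DRY_RUN=0.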
    

Actual Results:  RAID goes into endless "redirecting sector" loop
(messages sent to klogd/console).  RAID is unusable.
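
One rough way to tell this loop apart from a single redirected read is to
count the messages in the log. The sketch below matches only the
"redirecting sector" text quoted in this report; the function names and the
default threshold of 100 are arbitrary choices for illustration.

```shell
# Print how many lines of the given log file mention "redirecting sector".
count_redirects() {
    grep -c 'redirecting sector' "$1"
}

# Succeed if the count reaches the threshold (second argument, default 100).
# The threshold is an arbitrary illustration, not a value from the kernel.
looks_like_loop() {
    [ "$(count_redirects "$1")" -ge "${2:-100}" ]
}
```

For example, looks_like_loop /var/log/messages could be run a few seconds
apart; a count that keeps climbing suggests the array is stuck.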

Expected Results:  Drive fails out cleanly.  RAID operation continues
on one drive.

Additional info:

A patch has been posted to the linux-raid mailing list that fixes this
problem:  http://marc.theaimsgroup.com/?l=linux-raid&m=109527014728404&w=2

Comment 1 Konstantin Olchanski 2004-10-18 22:30:35 UTC
Me too, but on a production system, with a twist and a real hardware fault:
1) raid1 mirror of /dev/hda2 and /dev/hdc2
2) disk hda develops a hardware fault (unreadable sectors)
3) the system freezes after the first "i/o error sector blah...", followed
by "raid1: disk failure hda2, disabling" and "raid1: hda2 rescheduling
blah...". SYSTEM FROZEN SOLID.
4) after a reboot, hdc2 is no longer in the raid volume (out of sync, disabled
pending mirror rebuild?)
5) /dev/md0 uses the faulty disk hda until it hits the unreadable
sectors, then goes into the infinite loop described in this bug: "try
to read hda2, get i/o error", "raid1: hda2: reschedule blah...".
K.O.
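
After a reboot like the one in step 4, one way to see which arrays are
running with a member missing is to read /proc/mdstat. The sketch below
assumes the usual layout, where the status line after each "mdN :" line
carries a [configured/active] pair such as [2/1]; the function name is made
up for illustration.

```shell
# Print the name of every md array whose active device count is lower than
# its configured device count. Reads /proc/mdstat unless a file is given.
degraded_arrays() {
    awk '
        # Remember the array name from lines such as "md0 : active raid1 ...".
        /^md[0-9]+ :/ { name = $1 }
        # The following status line holds "[configured/active]", e.g. [2/1].
        name != "" && match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}
```

An array that shows up here but lists the known-bad disk as its only
remaining member is in the situation described in step 5.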


Comment 2 Hrunting Johnson 2004-10-20 13:36:36 UTC
Neil Brown, maintainer of the md module, released a series of patches
today that fix this problem and a couple of others.  It would be great
if these could be merged into the next kernel release:

http://marc.theaimsgroup.com/?l=linux-raid&m=109824318228668&w=2
http://marc.theaimsgroup.com/?l=linux-raid&m=109824318202358&w=2
http://marc.theaimsgroup.com/?l=linux-raid&m=109824318216933&w=2
http://marc.theaimsgroup.com/?l=linux-raid&m=109824318110429&w=2
http://marc.theaimsgroup.com/?l=linux-raid&m=109824321013934&w=2
http://marc.theaimsgroup.com/?l=linux-raid&m=109824307111239&w=2

I'm also bumping the severity of this to high, since I've personally
lost data to the resulting fs inconsistencies when these machines
crash this way.

Comment 3 Dave Jones 2005-04-16 04:36:54 UTC
Fedora Core 2 has now reached end of life, and no further updates will be
provided by Red Hat.  The Fedora legacy project will be producing further kernel
updates for security problems only.

If this bug has not been fixed in the latest Fedora Core 2 update kernel, please
try to reproduce it under Fedora Core 3, and reopen if necessary, changing the
product version accordingly.

Thank you.