Bug 465781

Summary: MD RAID1 error handler deadlock (raid1d / make_request)
Product: Red Hat Enterprise Linux 5
Reporter: Bryn M. Reeves <bmr>
Component: kernel
Assignee: Doug Ledford <dledford>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Priority: high
Version: 5.2
CC: cward, dzickus, kzhang, mgahagan, peterm, qcai, tao
Keywords: OtherQA
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:21:04 UTC
Bug Blocks: 483701, 485920
Attachments: backport of upstream fix for RHEL5 (flags: none)

Description Bryn M. Reeves 2008-10-06 12:28:33 UTC

Description of problem:
This problem was reported on the linux-raid mailing list:

http://marc.info/?l=linux-raid&m=120036652032432&w=2

When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.

This is done in the raid1d (raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queued and will stay there during a freeze).

However write requests need attention from raid1d as bitmap updates might be
required. This can cause a deadlock as raid1 is waiting for requests to
finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to "K.Tanaka" <k-tanaka.nec.com> for finding and reporting this
problem.
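
The wait condition and the fix described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: the class `Raid1Model`, the `max_polls` bound, and the `flush_while_waiting` switch are hypothetical names invented for illustration, and the freeze condition is simplified to "every outstanding request is a parked failed request" (the real driver's condition differs in detail). Only the counter names `nr_pending` and `nr_queued` and the function names `freeze_array` and `flush_pending_writes` come from the description.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Raid1Model:
    """Toy stand-in for the per-array state (hypothetical, not the kernel struct)."""
    nr_pending: int = 0  # requests submitted and not yet completed
    nr_queued: int = 0   # failed requests parked on the retry queue
    pending_writes: deque = field(default_factory=deque)  # writes waiting for raid1d

    def fail_read(self):
        # A failed read stays outstanding and is also counted as queued for retry.
        self.nr_pending += 1
        self.nr_queued += 1

    def submit_write(self, name):
        # The write is counted as outstanding but parked until raid1d attends to it
        # (for example because a bitmap update is needed first).
        self.nr_pending += 1
        self.pending_writes.append(name)

    def flush_pending_writes(self):
        # Give the parked writes the attention they need. In this model they
        # complete immediately, which decrements nr_pending.
        while self.pending_writes:
            self.pending_writes.popleft()
            self.nr_pending -= 1

    def freeze_array(self, flush_while_waiting, max_polls=3):
        # Simplified wait loop: succeed once only parked failures remain.
        # Returning False after max_polls stands in for "waits forever".
        for _ in range(max_polls):
            if self.nr_pending == self.nr_queued:
                return True
            if flush_while_waiting:
                self.flush_pending_writes()
        return False


def scenario():
    # One failed read, then a write that arrives before the array is frozen.
    model = Raid1Model()
    model.fail_read()
    model.submit_write("w1")
    return model


stuck = scenario().freeze_array(flush_while_waiting=False)
fixed = scenario().freeze_array(flush_while_waiting=True)
print("freeze completes without flush:", stuck)
print("freeze completes with flush:", fixed)
```

In the model, the only difference between the two runs is whether the wait loop itself services the parked writes, which mirrors the idea of calling flush_pending_writes from freeze_array so that raid1d is never waiting on work that only raid1d can perform.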

Version-Release number of selected component (if applicable):
2.6.18-*.el5

How reproducible:
Difficult to reproduce without a SCSI fault injection framework or similar tooling.

Steps to Reproduce:
1. Arrange for a write to be submitted to an MD RAID1 device immediately after a SCSI error on one of the member devices, but before raid1d is woken up to freeze the array and handle the error. See the deadlock timeline in the linked linux-raid posting for the exact sequence of events.

Actual results:
raid1d deadlocks in freeze_array() waiting for nr_pending to be decremented. This will never happen because the writer is blocked waiting for raid1d to call raid1_end_write_request() to perform the decrement.

Expected results:
No deadlock.

Additional info:
Fixed upstream in git commit ID a35e63efa1fb18c6f20f38e3ddf3f8ffbcf0f6e7:

http://marc.info/?l=git-commits-head&m=120467912402266&w=2

Comment 2 Bryn M. Reeves 2008-10-06 12:39:22 UTC

Created attachment 319547 [details]
backport of upstream fix for RHEL5

Backported patch for 2.6.18-92.el5

Comment 4 RHEL Program Management 2009-02-16 15:10:07 UTC

Updating PM score.

Comment 5 Don Zickus 2009-05-12 17:40:13 UTC

The patch is included in kernel-2.6.18-146.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 7 Issue Tracker 2009-05-25 09:08:25 UTC

Hi Tanaka-san,

Thank you for verification!
I'll forward your result to engineering.

----
> in kernel-2.6.18-146.el5
> You can download this test kernel from http://people.redhat.com/dzickus/el5

I have tested the kernel 2.6.18-149.el5 with the attached fault injection tool.
The patch is merged in the kernel and the problem is fixed.
----

Regards,
Masaki Furuta


This event sent from IssueTracker by mfuruta (issue 195150)

Comment 8 Chris Ward 2009-07-03 18:10:24 UTC

~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 9 Zhang Kexin 2009-07-28 08:07:01 UTC

Hi Chris, 

According to comment #7, the bug has been verified by our customer/partner on kernel 2.6.18-149.el5. Will you set this bug to VERIFIED?

Comment 10 Zhang Kexin 2009-07-28 08:09:20 UTC

The patch is in kernel-2.6.18-159.el5; adding SanityOnly.

Comment 13 errata-xmlrpc 2009-09-02 08:21:04 UTC

An advisory has been issued which should address the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html