Bug 465781 - MD RAID1 error handler deadlock (raid1d / make_request)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.2
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Doug Ledford
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks: 483701 485920
 
Reported: 2008-10-06 12:28 UTC by Bryn M. Reeves
Modified: 2018-10-20 03:13 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-09-02 08:21:04 UTC
Target Upstream Version:
Embargoed:


Attachments
backport of upstream fix for RHEL5 (2.97 KB, patch)
2008-10-06 12:39 UTC, Bryn M. Reeves


Links
System: Red Hat Product Errata
ID: RHSA-2009:1243
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update
Last Updated: 2009-09-01 08:53:34 UTC

Description Bryn M. Reeves 2008-10-06 12:28:33 UTC
Description of problem:
This problem was reported on LKML:

    http://marc.info/?l=linux-raid&m=120036652032432&w=2

    When handling a read error, we freeze the array to stop any other IO while
    attempting to over-write with correct data.
    
    This is done in the raid1d(raid10d) thread and must wait for all submitted IO
    to complete (except for requests that failed and are sitting in the retry
    queue - these are counted in ->nr_queue and will stay there during a freeze).
    
    However write requests need attention from raid1d as bitmap updates might be
    required.  This can cause a deadlock as raid1 is waiting for requests to
    finish that themselves need attention from raid1d.
    
    So we create a new function 'flush_pending_writes' to give that attention, and
    call it in freeze_array to be sure that we aren't waiting on raid1d.
    
    Thanks to "K.Tanaka" <k-tanaka.nec.com> for finding and reporting this
    problem.

Version-Release number of selected component (if applicable):
2.6.18-*.el5

How reproducible:
Difficult to reproduce without a SCSI fault injection framework or similar tooling.

Steps to Reproduce:
1. Arrange for a write to be submitted to an MD RAID1 device immediately following a SCSI error on one of the member devices but before raid1d is woken up to freeze the array and deal with the error (see the timeline of the deadlock in the linked LKML posting for further details on the necessary sequence of events).
  
Actual results:
raid1d deadlocks in freeze_array() waiting for nr_pending to be decremented. This will never happen because the writer is blocked waiting for raid1d to call raid1_end_write_request() to perform the decrement.

Expected results:
No deadlock.

Additional info:
Fixed upstream in git commit ID a35e63efa1fb18c6f20f38e3ddf3f8ffbcf0f6e7:

http://marc.info/?l=git-commits-head&m=120467912402266&w=2

Comment 2 Bryn M. Reeves 2008-10-06 12:39:22 UTC
Created attachment 319547
backport of upstream fix for RHEL5

Backported patch for 2.6.18-92.el5

Comment 4 RHEL Program Management 2009-02-16 15:10:07 UTC
Updating PM score.

Comment 5 Don Zickus 2009-05-12 17:40:13 UTC
in kernel-2.6.18-146.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 7 Issue Tracker 2009-05-25 09:08:25 UTC
Hi Tanaka-san,

Thank you for verification!
I'll forward your result to engineering.

----
> in kernel-2.6.18-146.el5
> You can download this test kernel from
> http://people.redhat.com/dzickus/el5

I have tested the kernel 2.6.18-149.el5 with the attached fault injection
tool.
The patch is merged in the kernel and the problem is fixed.
----

Regards,
Masaki Furuta


This event sent from IssueTracker by mfuruta 
 issue 195150

Comment 8 Chris Ward 2009-07-03 18:10:24 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 9 Zhang Kexin 2009-07-28 08:07:01 UTC
Hi Chris, 

According to comment #7, the bug was verified by our customer/partner on kernel 2.6.18-149.el5. Will you set this bug as verified?

Comment 10 Zhang Kexin 2009-07-28 08:09:20 UTC
The patch is in kernel-2.6.18-159.el5; adding SanityOnly.

Comment 13 errata-xmlrpc 2009-09-02 08:21:04 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

