Red Hat Bugzilla – Bug 465781
MD RAID1 error handler deadlock (raid1d / make_request)
Last modified: 2010-10-23 00:58:47 EDT
Description of problem:
This problem was reported on LKML:
When handling a read error, we freeze the array to stop any other IO while
attempting to over-write with correct data.
This is done in the raid1d(raid10d) thread and must wait for all submitted IO
to complete (except for requests that failed and are sitting in the retry
queue - these are counted in ->nr_queue and will stay there during a freeze).
However write requests need attention from raid1d as bitmap updates might be
required. This can cause a deadlock as raid1 is waiting for requests to
finish that themselves need attention from raid1d.
So we create a new function 'flush_pending_writes' to give that attention, and
call it in freeze_array to be sure that we aren't waiting on raid1d.
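
The dependency described above can be illustrated with a small user-space model. This is only a sketch, not the kernel code: struct model_conf, array_is_quiet() and the max_spins bound are hypothetical names introduced for illustration, nr_queued stands in for the counter the description calls ->nr_queue, and the wait condition (nr_pending equal to nr_queued plus the one request being handled) is an assumption about how the freeze is expressed. Writes parked for the daemon are assumed to complete as soon as they are flushed.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the per-array state; not the kernel's struct. */
struct model_conf {
    int nr_pending;      /* requests submitted and not yet completed */
    int nr_queued;       /* failed requests parked on the retry queue */
    int pending_writes;  /* writes waiting for the daemon to submit them */
};

/* Submit every parked write; in this model each one completes at once. */
int flush_pending_writes(struct model_conf *conf)
{
    int flushed = conf->pending_writes;

    assert(conf->nr_pending >= flushed);
    conf->nr_pending -= flushed;
    conf->pending_writes = 0;
    return flushed;
}

/* Quiet means: only queued retries plus the request being handled remain. */
bool array_is_quiet(const struct model_conf *conf)
{
    return conf->nr_pending == conf->nr_queued + 1;
}

/*
 * Poll for quiescence for at most max_spins iterations.
 * With flush == false this models the old behaviour: nothing submits the
 * parked writes, so the condition can never become true.
 * Returns true if the array went quiet, false if the wait would hang.
 */
bool freeze_array(struct model_conf *conf, bool flush, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (array_is_quiet(conf))
            return true;
        if (flush)
            flush_pending_writes(conf);
    }
    return array_is_quiet(conf);
}
```

Starting from one failed read being handled plus one parked write (nr_pending = 2, nr_queued = 0, pending_writes = 1), freeze_array(&conf, false, 1000) never sees the array go quiet, while freeze_array(&conf, true, 1000) succeeds once the parked write has been flushed.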
Thanks to "K.Tanaka" <firstname.lastname@example.org> for finding and reporting this.
Version-Release number of selected component (if applicable):
How reproducible:
Difficult to reproduce without a SCSI fault injection framework or similar tooling.
Steps to Reproduce:
1. Arrange for a write to be submitted to an MD RAID1 device immediately following a SCSI error on one of the member devices but before raid1d is woken up to freeze the array and deal with the error (see the timeline of the deadlock in the linked LKML posting for further details on the necessary sequence of events).
Actual results:
raid1d deadlocks in freeze_array() waiting for nr_pending to be decremented. This will never happen because the writer is blocked waiting for raid1d to call raid1_end_write_request() to perform the decrement.
Additional info:
Fixed upstream in git commit ID a35e63efa1fb18c6f20f38e3ddf3f8ffbcf0f6e7:
Created attachment 319547
backport of upstream fix for RHEL5
Backported patch for 2.6.18-92.el5
Updating PM score.
You can download this test kernel from http://people.redhat.com/dzickus/el5
Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so. However feel free
to provide a comment indicating that this fix has been verified.
Thank you for verification!
I'll forward your result to engineering.
> in kernel-2.6.18-146.el5
> You can download this test kernel from
I have tested the kernel 2.6.18-149.el5 with the attached fault injection
The patch is merged in the kernel and the problem is fixed.
~~ Attention - RHEL 5.4 Beta Released! ~~
RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!
If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.
Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.
Questions can be posted to this bug or your customer or partner representative.
According to comment #7, the bug was verified by our customer/partner on kernel 2.6.18-149.el5. Will you set this bug as verified?
The patch is in kernel-2.6.18-159.el5; adding SanityOnly.
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.