Description of problem:

This problem was reported on LKML:

http://marc.info/?l=linux-raid&m=120036652032432&w=2

When handling a read error, we freeze the array to stop any other IO while attempting to over-write with correct data.

This is done in the raid1d (raid10d) thread and must wait for all submitted IO to complete (except for requests that failed and are sitting in the retry queue - these are counted in ->nr_queue and will stay there during a freeze).

However write requests need attention from raid1d as bitmap updates might be required. This can cause a deadlock as raid1 is waiting for requests to finish that themselves need attention from raid1d.

So we create a new function 'flush_pending_writes' to give that attention, and call it in freeze_array to be sure that we aren't waiting on raid1d.

Thanks to "K.Tanaka" <k-tanaka.nec.com> for finding and reporting this problem.

Version-Release number of selected component (if applicable):
2.6.18-*.el5

How reproducible:
Difficult without using SCSI fault injection framework etc.

Steps to Reproduce:
1. Arrange for a write to be submitted to an MD RAID1 device immediately following a SCSI error on one of the member devices but before raid1d is woken up to freeze the array and deal with the error (see the timeline of the deadlock in the linked LKML posting for further details on the necessary sequence of events).

Actual results:
raid1d deadlocks in freeze_array() waiting for nr_pending to be decremented. This will never happen because the writer is blocked waiting for raid1d to call raid1_end_write_request() to perform the decrement.

Expected results:
No deadlock.

Additional info:
Fixed upstream in git commit ID a35e63efa1fb18c6f20f38e3ddf3f8ffbcf0f6e7:

http://marc.info/?l=git-commits-head&m=120467912402266&w=2
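The dependency cycle described in this report can be illustrated with a small single-threaded toy model. This is only a sketch under stated assumptions, not the kernel code: the class ToyRaid1Conf, its methods fail_read and submit_write, and the helper scenario are hypothetical names invented for the illustration. The model assumes a write completes as soon as it gets attention, and it ignores the request raid1d itself is currently handling. Only the names nr_pending, flush_pending_writes and freeze_array come from the report.

```python
from collections import deque


class ToyRaid1Conf:
    """Toy model of the counters named in the report (not the kernel structure)."""

    def __init__(self):
        self.nr_pending = 0            # IO submitted and not yet completed
        self.nr_queued = 0             # failed requests parked on the retry queue
        self.pending_writes = deque()  # writes that need raid1d before they can finish

    def fail_read(self):
        # A failed read stays counted as pending and is parked for retry.
        self.nr_pending += 1
        self.nr_queued += 1

    def submit_write(self, name):
        # The write is pending, but cannot complete until raid1d attends to it.
        self.nr_pending += 1
        self.pending_writes.append(name)

    def flush_pending_writes(self):
        # Give every parked write the attention it needs; in this model that
        # lets it complete at once. Returns True if any progress was made.
        progressed = bool(self.pending_writes)
        while self.pending_writes:
            self.pending_writes.popleft()
            self.nr_pending -= 1
        return progressed

    def freeze_array(self, flush):
        # Model of the wait in freeze_array(): done once only the parked
        # failures remain pending. Returns False where the real thread would
        # sleep forever, because nothing else can make progress.
        while self.nr_pending != self.nr_queued:
            if not (flush and self.flush_pending_writes()):
                return False
        return True


def scenario(flush):
    """Replay the sequence from 'Steps to Reproduce' against the toy model."""
    conf = ToyRaid1Conf()
    conf.fail_read()         # SCSI error on a member device
    conf.submit_write("w0")  # write arrives before raid1d freezes the array
    return conf.freeze_array(flush)
```

Under these assumptions, scenario(False) returns False, which corresponds to the hang in freeze_array(), and scenario(True) returns True, which corresponds to the behaviour once freeze_array calls flush_pending_writes while it waits.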
Created attachment 319547
backport of upstream fix for RHEL5

Backported patch for 2.6.18-92.el5
Updating PM score.
in kernel-2.6.18-146.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However feel free to provide a comment indicating that this fix has been verified.
Hi Tanaka-san,

Thank you for verification! I'll forward your result to engineering.

----
> in kernel-2.6.18-146.el5
> You can download this test kernel from http://people.redhat.com/dzickus/el5

I have tested the kernel 2.6.18-149.el5 with the attached fault injection tool. The patch is merged in the kernel and the problem is fixed.
----

Regards,
Masaki Furuta

This event sent from IssueTracker by mfuruta
issue 195150
~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.
Hi Chris,

According to comment #7, the bug has been verified by our customer/partner on kernel 2.6.18-149.el5. Will you set this bug as verified?
Patch is in kernel-2.6.18-159.el5; adding SanityOnly.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html