Bug 165989 - The msync(MS_SYNC) call should fail after cable pulled from scsi disk
Summary: The msync(MS_SYNC) call should fail after cable pulled from scsi disk
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Peter Staubach
QA Contact: Brian Brock
URL:
Whiteboard:
Depends On:
Blocks: 168424
Reported: 2005-08-15 15:10 UTC by Wendy Cheng
Modified: 2007-11-30 22:07 UTC (History)
CC List: 7 users

Fixed In Version: RHSA-2006-0144
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2006-03-15 16:24:40 UTC
Target Upstream Version:
Embargoed:


Attachments
Test program to demonstrate the problem. (1.48 KB, text/plain), 2005-08-15 15:10 UTC, Wendy Cheng
Proposed patch (544 bytes, patch), 2005-08-22 18:19 UTC, Peter Staubach


Links
System: Red Hat Product Errata
ID: RHSA-2006:0144
Private: 0
Priority: qe-ready
Status: SHIPPED_LIVE
Summary: Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 7
Last Updated: 2006-03-15 05:00:00 UTC

Description Wendy Cheng 2005-08-15 15:10:16 UTC
Description of problem:

The customer expected the fix added for bugzilla 116900 to solve this issue as
well, and has tried 2.4.21-34.EL without success. It looks like the msync(MS_SYNC)
code path needs to be re-examined for the pull-cable-from-the-scsi-disk test.

The problem can be demonstrated by running the uploaded test program, which calls
msync in a loop, and then pulling the cables (or disabling the SAN ports). The
-34 kernel remounts the file system read-only and the msync calls fail with a
read-only file system error. So far so good, but as the for loop continues,
subsequent msync calls succeed. Shouldn't they all fail with read-only file
system errors?

Version-Release number of selected component (if applicable):
2.4.21-34.EL kernel

Steps to Reproduce:
Use the uploaded test program (run as "mmapwrite <filename> <iterations>").
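
The attached test program is not reproduced on this page. As a rough idea of
what such a reproducer might look like, here is a minimal sketch, assuming a
single-page shared mapping that is dirtied and then flushed with
msync(MS_SYNC) on every iteration. The function name msync_loop, the MAP_LEN
constant, and the delay_sec parameter are hypothetical and are not taken from
the attachment.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN 4096  /* assumed mapping size: one typical page */

/*
 * Hypothetical helper, not the attached program.
 * Dirties a shared mapping of `path` and calls msync(MS_SYNC) on it
 * `iterations` times, sleeping `delay_sec` seconds between passes so
 * that there is time to pull the cable. Returns the number of msync
 * calls that failed, or -1 if the file could not be opened or mapped.
 */
int msync_loop(const char *path, int iterations, unsigned int delay_sec)
{
    assert(path != NULL && iterations >= 0);

    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    if (ftruncate(fd, MAP_LEN) < 0) {
        perror("ftruncate");
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return -1;
    }

    int failures = 0;
    for (int i = 0; i < iterations; i++) {
        memset(map, 'a' + (i % 26), MAP_LEN);      /* dirty the page */
        if (msync(map, MAP_LEN, MS_SYNC) < 0) {
            failures++;
            fprintf(stderr, "msync %d failed: %s\n", i, strerror(errno));
        } else {
            printf("msync %d succeeded\n", i);
        }
        sleep(delay_sec);
    }

    munmap(map, MAP_LEN);
    close(fd);
    return failures;
}
```

A main() that parses <filename> and <iterations> and calls msync_loop() would
complete the program. With the behavior described above, one would expect a
failure report on the first pass after the file system goes read-only,
followed by "succeeded" lines on later passes.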

Comment 1 Wendy Cheng 2005-08-15 15:10:17 UTC
Created attachment 117753 [details]
Test program to demonstrate the problem.

Comment 3 Peter Staubach 2005-08-17 18:06:20 UTC
I've been looking at the test program.  What is the typical number of iterations
passed through the second command line argument?

Comment 5 Peter Staubach 2005-08-18 13:36:39 UTC
1) Thanks for the information!
2) Makes sense to me.  Once we figure out what needs to be done, then we can
   figure out where to put it.

Comment 8 Peter Staubach 2005-08-22 15:01:49 UTC
Yes, Stephen's comment had to do with ways of reproducing problems associated
with issues in the storage subsystem, _without_ having to physically change the
hardware.

I have prototyped a solution and am currently discussing it with some of the
engineers to get some feedback on it.

Comment 9 Peter Staubach 2005-08-22 18:19:17 UTC
Created attachment 117974 [details]
Proposed patch

Comment 11 Peter Staubach 2005-08-26 14:34:22 UTC
The msync(2) code in RHEL-3 works by walking a list of dirty pages and
arranging to have them written out to storage.  When the i/o is done on
a page, it is moved to the clean_pages list.  However, when ext3
finds that the file system is readonly, it sets the PG_dirty bit in
the page struct again.  Thus, the page is marked as dirty, i.e. needs to
be written to storage, but is on the clean_pages list.  Having the
dirty bit already set prevents the page from moving from the clean_pages
list to the dirty_pages list, thus preventing msync from finding the page
again.  Since ext3 is not called again, the readonly file system
error is not returned again.

The solution is to place the page on the dirty_pages list instead of
the clean_pages list if PG_dirty is set in the page struct.  Thus,
msync will continue to find the page and attempt to write it out.  This
will call into ext3 and ext3 can then return the readonly file system
error.

Please note that RHEL-4 did not suffer from this issue.

Comment 17 Ernie Petrides 2005-09-13 03:27:50 UTC
A fix for this problem has just been committed to the RHEL3 U7
patch pool this evening (in kernel version 2.4.21-37.1.EL).


Comment 30 Red Hat Bugzilla 2006-03-15 16:24:40 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html


