Red Hat Bugzilla – Bug 198749
Data corruption after IO error on swap
Last modified: 2018-10-19 15:13:52 EDT
IT# 97208

When an IO error occurs while writing a page to swap, the IO completion function marks the page with SetPageError() but fails to re-mark the page with SetPageDirty(). Since the writeback was unsuccessful, leaving the page clean is wrong: the page may subsequently be discarded from memory, resulting in incorrect data being read when the page is later faulted back in.

In the read case, we need to check PageUptodate to ensure the IO completed without error and return VM_FAULT_SIGBUS if it did not. In the write case, an additional call to SetPageDirty() is placed immediately after the call to SetPageError().

This is fixed upstream by this changeset: http://linux.bkbits.net:8080/linux-2.6/cset@1.3031

The customer reported the issue in the above IT and provided a patch based on this changeset.
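For readers following along without the kernel source, the failure mode described above can be sketched as a small user-space model. This is not the kernel code or the attached patch: `struct page_model`, `start_swap_write()`, `end_swap_write()`, `can_discard()`, `end_swap_read()` and `finish_swap_fault()` are hypothetical names that only mirror the page flags and return code named in the report, and the `apply_fix` switch exists purely to compare the two behaviours.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the kernel's per-page flags. */
struct page_model {
    bool dirty;      /* models PG_dirty */
    bool error;      /* models PG_error */
    bool uptodate;   /* models PG_uptodate */
    bool writeback;  /* models PG_writeback */
};

/* Models the fault return codes; only the SIGBUS case comes from the report. */
enum fault_result { FAULT_OK, FAULT_SIGBUS };

/* Start writing a dirty page to swap: the dirty bit is cleared up front. */
void start_swap_write(struct page_model *p)
{
    assert(p != NULL);
    p->dirty = false;
    p->writeback = true;
}

/* Write completion. With apply_fix == false this models the buggy handler. */
void end_swap_write(struct page_model *p, bool io_ok, bool apply_fix)
{
    assert(p != NULL);
    if (!io_ok) {
        p->error = true;        /* corresponds to SetPageError() */
        if (apply_fix)
            p->dirty = true;    /* corresponds to SetPageDirty(): data exists only in RAM */
    }
    p->writeback = false;
}

/* Reclaim may drop a page only if it is clean and not under writeback. */
bool can_discard(const struct page_model *p)
{
    assert(p != NULL);
    return !p->dirty && !p->writeback;
}

/* Read completion: the page becomes up to date only if the IO succeeded. */
void end_swap_read(struct page_model *p, bool io_ok)
{
    assert(p != NULL);
    p->uptodate = io_ok;
    if (!io_ok)
        p->error = true;
}

/* Fault path after swap-in: refuse to map a page whose read failed. */
enum fault_result finish_swap_fault(const struct page_model *p)
{
    assert(p != NULL);
    return p->uptodate ? FAULT_OK : FAULT_SIGBUS;
}
```

In this model, a failed write without the fix leaves the page both in error and discardable, which is the data-loss window; with the fix the page stays dirty and cannot be dropped.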
Created attachment 132357 [details] Patch to correct swap IO error handling
Created attachment 133147 [details] Upstream patch to solve the issue there. RHEL4 also needs the patch from the BK link.
I'll take the upstream part, but the addition is not quite nice. It really needs a big fat warning printed to user-space. Also, the way it keeps the page dirty is not quite OK. The new patch is against upstream; that is, RHEL4 also needs the patch from the BK link.
Created attachment 133307 [details] Update the warning and add it on the read side.
On request I also added the bio->bi_sector field to the warning, and added an equivalent message to the read side of things.
Could we get devel_ack here? Peter created a fix patch and Fujitsu has already verified it.
QE ack for 4.5.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Committed in stream U5, build 42.20. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/
I can confirm that the patch is in; however, I don't really have any way to test this. Do we have any test results from partners who might have equipment to create disk errors, etc.?
We can test this using device-mapper. There is infrastructure for injecting faults in a block device. Let me know if this is needed and I'll put something together.
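As a rough illustration of the idea, a device-mapper table is a list of sector ranges, each handled by a target, and the `error` target fails every IO mapped to it. The sketch below is a user-space model of that lookup only, not a dmsetup invocation or kernel code; `struct segment`, `enum seg_kind` and `submit_io()` are hypothetical names, and the sector numbers used with it are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one table segment: a sector range plus a policy. */
enum seg_kind { SEG_LINEAR, SEG_ERROR };

struct segment {
    uint64_t start;      /* first sector covered by this segment */
    uint64_t len;        /* number of sectors covered */
    enum seg_kind kind;  /* pass IO through, or fail it */
};

/*
 * Return true if IO to `sector` succeeds. Returns false when the sector
 * falls in an error segment (the injected fault) or in no segment at all.
 */
bool submit_io(const struct segment *table, size_t nsegs, uint64_t sector)
{
    assert(table != NULL || nsegs == 0);
    for (size_t i = 0; i < nsegs; i++) {
        const struct segment *s = &table[i];
        if (sector >= s->start && sector - s->start < s->len)
            return s->kind == SEG_LINEAR;
    }
    return false;
}
```

Placing a swap area on such a device and putting an error segment in the middle of it would, under these assumptions, make writes to a known sector range fail while the rest of the device keeps working, which is the condition this bug needs to reproduce.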
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0304.html