Bug 185782 - [RHEL4 U3] device-mapper mirror: Data corruption if the default mirror fails during recovery.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: All Linux
Priority: medium
Severity: medium
Assigned To: Alasdair Kergon
Blocks: 181409 186476
Reported: 2006-03-17 18:15 EST by Kiyoshi Ueda
Modified: 2013-04-02 19:51 EDT
Fixed In Version: RHSA-2006-0575
Doc Type: Bug Fix
Last Closed: 2006-08-10 18:46:54 EDT

Attachments
Don't switch default mirror if mirror set is out-of-sync (863 bytes, patch)
2006-03-29 11:27 EST, Jun'ichi NOMURA
Read retry even if default isn't ok (1.35 KB, patch)
2006-03-29 11:36 EST, Jun'ichi NOMURA

Description Kiyoshi Ueda 2006-03-17 18:15:12 EST
Description of problem:
Silent data corruption occurs if the default mirror fails during recovery.
It happens as follows:
  - During recovery, all mirrors except the default mirror can be
    out-of-sync.
    (For an RH_RECOVERING region, writes go only to the default mirror.)
  - If the default mirror fails, another mirror is chosen as the new default.
  - At this point, writes made to the original default mirror are lost.
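
The sequence above can be illustrated with a toy model. The following is a
minimal sketch in POSIX shell, not dm-raid1 code: the variables leg0..leg2
and default, and the functions write_recovering and read_default, are
hypothetical names that stand in for one RH_RECOVERING region and its
three mirrors.

```shell
# Toy model only (hypothetical names, not kernel code).
# Each "leg" holds the contents of one region on one mirror.
leg0=old
leg1=old
leg2=old
default=0                 # index of the current default mirror

# A write to a recovering region goes to the default mirror only.
write_recovering() {
    eval "leg$default=\$1"
}

# Reads are served from the default mirror.
read_default() {
    eval "echo \"\$leg$default\""
}

write_recovering new      # only leg0 now holds "new"
default=1                 # leg0 fails; an out-of-sync leg becomes the default
read_default              # prints "old": the earlier write is silently lost
```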


Version-Release number of selected component:
kernel-2.6.9-34.EL


How reproducible:
Always


Steps to Reproduce:
 1. Prepare four or more PVs and create a VG from them.
    Example)
      - /dev/sda, /dev/sdb, /dev/sdc, /dev/sdd as PVs
      - vg0 contains these 4 PVs
 2. Create a mirror LV and activate it.
      # lvcreate -L 200M -n lv0 -m 2 vg0
 3. Make a filesystem on the mirror LV.
      # mke2fs -j /dev/mapper/vg0-lv0
 4. Disconnect the default mirror PV of the mirror LV.
    Example) If /dev/sda is used for the default mirror of vg0-lv0:
      # echo offline > /sys/block/sda/device/state
    This step must be completed before the recovery has finished.
 5. Wait until the recovery has finished.
 6. Remove the failed PV if it isn't automatically removed.
      # vgreduce --removemissing vg0
 7. Check if the filesystem is fine.
      # e2fsck -f /dev/mapper/vg0-lv0
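
Steps 4 and 5 depend on knowing whether recovery is still running. A possible
helper for that is sketched below. It assumes that the mirror target's line in
`dmsetup status` output contains a <synced>/<total> region counter as a
whitespace-separated field; please check this against the output on your own
kernel. The function names and the sample line in the comment are
hypothetical.

```shell
# Succeeds if the argument is a non-empty string of decimal digits.
is_number() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Print the first field of the form <digits>/<digits> found in a status line,
# e.g. "150/400" from "0 409600 mirror 3 253:1 253:2 253:3 150/400"
# (illustrative line, not captured output). Fails if no such field exists.
mirror_sync_ratio() {
    for field in $1; do
        case "$field" in
            */*)
                if is_number "${field%%/*}" && is_number "${field#*/}"; then
                    echo "$field"
                    return 0
                fi
                ;;
        esac
    done
    return 1
}

# Exit status 0 if synced == total, 1 if recovery is still running,
# 2 if the line has no region counter.
mirror_in_sync() {
    ratio=$(mirror_sync_ratio "$1") || return 2
    [ "${ratio%%/*}" -eq "${ratio#*/}" ]
}
```

For the LV in this report, the helper could be fed like this:
      # mirror_sync_ratio "$(dmsetup status vg0-lv0)"
Step 4 has to happen while the two numbers still differ.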


Actual results:
e2fsck reports file system errors.
In general, an out-of-sync device is used as the new default mirror
and silently corrupts data.


Expected results:
e2fsck should not detect any errors.
In general, data loss should be avoided whenever possible.
If that is not possible, the mirror should report an error and stop operating.


Additional info:
Comment 1 Jun'ichi NOMURA 2006-03-29 11:27:12 EST
Created attachment 127000 [details]
Don't switch default mirror if mirror set is out-of-sync

At a minimum, we can avoid data corruption by not switching
the default mirror when the set is out-of-sync.

To complete this fix, we need to fix error handling
during recovery (BZ#185785).

Additionally, to recover from failure of the default mirror,
we need support for multiple master mirrors.
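
The rule "do not switch when the set is out-of-sync" can be written as a small
decision function. This is only a sketch in shell to show the intent, not the
attached patch: choose_default and its arguments are hypothetical.

```shell
# Hypothetical policy sketch, not the attached patch.
# Arguments: current default index, candidate index, in_sync flag (1 or 0).
# Prints the index that should be the default after the current default fails.
choose_default() {
    current=$1
    candidate=$2
    in_sync=$3
    if [ "$in_sync" -eq 1 ]; then
        echo "$candidate"   # all legs hold the same data: switching is safe
    else
        echo "$current"     # do not promote a possibly stale leg
    fi
}
```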
Comment 2 Jun'ichi NOMURA 2006-03-29 11:36:15 EST
Created attachment 127001 [details]
Read retry even if default isn't ok

When we don't allow switching the default mirror in the out-of-sync case,
there can be a situation where "the default isn't ok but the region
is in-sync, so we can read from another mirror".

If the default fails in an out-of-sync mirror, we lose data anyway.
But reading as much data as possible may help with salvaging.

This patch will do that.
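
The read-retry condition described in this comment can be summarized as a
decision function. Again this is a sketch in shell under my own assumptions,
not the attached patch: read_source and its three 0/1 flags are hypothetical.

```shell
# Hypothetical sketch of the read decision, not the attached patch.
# Arguments (each 1 or 0): default_ok, region_in_sync, other_ok.
# Prints the leg a read should use: "default", "other", or "error".
read_source() {
    default_ok=$1
    region_in_sync=$2
    other_ok=$3
    if [ "$default_ok" -eq 1 ]; then
        echo default
    elif [ "$region_in_sync" -eq 1 ] && [ "$other_ok" -eq 1 ]; then
        echo other          # region is in-sync, so another leg has valid data
    else
        echo error          # out-of-sync region and failed default: data is lost
        return 1
    fi
}
```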
Comment 5 Jason Baron 2006-05-09 13:11:28 EDT
Committed in stream U4 build 34.26. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 8 Red Hat Bugzilla 2006-08-10 18:46:56 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0575.html
