Bug 502927 - dm-raid1 can return write request as finished and later revert the data
Summary: dm-raid1 can return write request as finished and later revert the data
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 5.5
Assignee: LVM and device-mapper development team
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Duplicates: 517117
Depends On: 543298
Blocks: 537251 557597
 
Reported: 2009-05-27 19:51 UTC by Mikuláš Patočka
Modified: 2010-03-30 07:36 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-03-30 07:36:26 UTC
Target Upstream Version:
Embargoed:


Attachments
The patch for RHEL 5.5 (9.13 KB, patch)
2009-12-02 04:39 UTC, Mikuláš Patočka


Links
System ID: Red Hat Product Errata RHSA-2010:0178
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update
Last Updated: 2010-03-29 12:18:21 UTC

Description Mikuláš Patočka 2009-05-27 19:51:40 UTC
If one of the dm-raid1 legs fails, the dmeventd daemon removes the failed leg from the on-disk metadata. It then reloads the table, and the kernel stops using the failed disk.

If the computer crashes after a primary raid disk fails but before dmeventd handles the failure, and the failed disk comes back online on the next reboot, a write that was already reported as successful may be reverted and the data lost.

The scenario:
* primary dm-raid1 disk fails
* a write request is submitted to dm-raid1 device, the request is written on the secondary disk but not on the primary disk
* because the request was successfully written to at least one leg, it is signaled as completed to the upper layers
* a system crash occurs (before dmeventd could handle the situation)
* on the next reboot, the failed disk is back online (except that it doesn't contain data from the failed write requests)
* dm-raid1 sees that the dirty bit for the appropriate chunk is set and copies the data from the primary disk to the secondary disk. This reverts the write that was already signaled as successful.

For discussion, see the thread at https://www.redhat.com/archives/dm-devel/2009-April/msg00178.html
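
For illustration, this is the kind of write the scenario acknowledges and then reverts. The LV path and the marker file are placeholder names, not anything from this report; dd's oflag=direct,sync makes dd's exit status reflect dm-raid1's own completion:

  # Save a known 4 KiB block so it can be compared after the reboot.
  dd if=/dev/urandom of=/tmp/marker bs=4096 count=1

  # Write it to the (hypothetical) mirror LV, bypassing the page cache
  # (oflag=direct) and requiring on-device completion (oflag=sync).
  dd if=/tmp/marker of=/dev/vg_test/mirrorlv bs=4096 count=1 oflag=direct,sync

  # Exit status 0 here is the promise the bug breaks: after the crash
  # and resync described above, this block can silently revert.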

Comment 1 Mikuláš Patočka 2009-12-02 04:39:18 UTC
Created attachment 375319 [details]
The patch for RHEL 5.5

This is the patch for RHEL 5.5. When attempting to test it, I found that dmeventd doesn't work at all and that lvconvert --repair can't remove the failed leg either. So I'm holding the patch until the userspace bugs are examined and fixed.

Comment 3 RHEL Program Management 2009-12-11 23:01:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Mikuláš Patočka 2009-12-12 05:23:12 UTC
That userspace problem was resolved, it was my configuration error. The patch is already posted.

Comment 6 Mikuláš Patočka 2009-12-14 18:57:22 UTC
How to reproduce the bug (a consolidated shell sketch follows these steps):

- you need some disks that will fail when you want them to --- you can either use a hot-pluggable disk and unplug it, or put your PVs on another device-mapper device and reload that device with the "error" target.

- set up a VG on these failable devices, create a mirror in the VG. You don't need to create a filesystem in the mirror.

- stop dmeventd with killall -STOP dmeventd

- use "dmsetup table" to see were the mirror legs are placed and unplug the disk with the primary mirror leg (it has _mimage_0 suffix).

- write to the mirror using some method that bypasses the write cache --- O_DIRECT, O_SYNC, or a normal write followed by fsync() --- the write succeeds immediately.

- reset the computer (with reset button)

- after the reboot, the written data is not there.
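
A consolidated sketch of the steps above, using the "error"-target variant; the partitions (/dev/sdb1, /dev/sdc1, /dev/sdd1) and the VG/LV names are placeholders, and the third PV is only there to hold the on-disk mirror log:

  # Wrap a real partition in a linear dm device so it can later be
  # "unplugged" by reloading it with the error target.
  SIZE=$(blockdev --getsz /dev/sdb1)
  echo "0 $SIZE linear /dev/sdb1 0" | dmsetup create failme

  pvcreate /dev/mapper/failme /dev/sdc1 /dev/sdd1
  vgcreate vg_test /dev/mapper/failme /dev/sdc1 /dev/sdd1
  lvcreate -m1 -L 100M -n mirrorlv vg_test

  killall -STOP dmeventd              # keep dmeventd from repairing

  dmsetup table | grep mimage         # check that _mimage_0 sits on failme

  # Fail the primary leg: from now on every I/O to it returns an error.
  dmsetup suspend failme
  echo "0 $SIZE error" | dmsetup load failme
  dmsetup resume failme

  # The durable write (the marker block from the earlier sketch)
  # succeeds immediately, because the secondary leg took it.
  dd if=/tmp/marker of=/dev/vg_test/mirrorlv bs=4096 count=1 oflag=direct,sync

  # Now press the reset button.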

After the patch is applied:

- the write is held until dmeventd does its job (but since dmeventd is stopped, the write is held indefinitely).

- you can resume dmeventd with -CONT and see that the failed mirror leg is removed and the write completes successfully (see the verification sketch after this list). If you reset the computer after that, the written data will be there.

- or you can reset the computer without resuming dmeventd; in this case the write is lost, but that is OK, because the application was never informed that the write finished.
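
A short verification sketch for the patched behavior, continuing the placeholder names from the sketch above:

  # On a patched kernel the dd above blocks instead of returning.
  # Resuming dmeventd lets the repair run, and the write then completes.
  killall -CONT dmeventd
  lvs -a -o +devices vg_test          # the failed leg should be gone

  # A reset from this point on keeps the data, because the write only
  # completed after the failed leg was removed from the metadata.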

Comment 7 Mikuláš Patočka 2009-12-14 19:01:08 UTC
... btw ... when you boot the computer after the reset, you need to plug the failed disk back in.

The data will then be incorrectly synchronized from the previously failed disk to the valid disk, and this causes the data loss.
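
Continuing the marker sketch from above, a direct read-back after the reboot shows whether the acknowledged write survived:

  # Bypass the cache on the read as well and compare against the block
  # that dd reported as durably written before the reset.
  dd if=/dev/vg_test/mirrorlv bs=4096 count=1 iflag=direct 2>/dev/null \
      | cmp -s - /tmp/marker && echo "data intact" || echo "write reverted"

  # On an unpatched kernel this prints "write reverted": the resync ran
  # from the stale primary over the good secondary.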

Comment 8 Don Zickus 2009-12-16 19:01:25 UTC
in kernel-2.6.18-182.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 11 Mikuláš Patočka 2010-01-22 03:31:49 UTC
Chris Ward: bug #517117 is actually the same bug as this one, except that the patch with superblocks linked in #517117 is very complicated and won't be included in the kernel. You can resolve bug #517117 as a duplicate of this bug.

Comment 12 Larry Troan 2010-02-06 01:18:02 UTC
*** Bug 517117 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2010-03-30 07:36:26 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html

