If one of the dm-raid1 legs fails, the dmeventd daemon removes the failed leg from the on-disk metadata. It then reloads the table and the kernel stops using the failed disk. If the computer crashes after a primary raid disk fails but before dmeventd performs this conversion, and the failed disk comes back online on the next reboot, a write that was already reported as successful may be reverted and the data lost.

The scenario:
* the primary dm-raid1 disk fails
* a write request is submitted to the dm-raid1 device; the request is written to the secondary disk but not to the primary disk
* because the request was successfully written to at least one leg, it is signalled as completed to the upper layers
* a system crash occurs (before dmeventd could handle the situation)
* on the next reboot, the failed disk is back online (except that it doesn't contain the data from the failed write requests)
* dm-raid1 sees that the dirty bit for the appropriate chunk is set and copies data from the primary disk to the secondary disk, reverting the write that was already signalled as successful

For discussion, see the thread at https://www.redhat.com/archives/dm-devel/2009-April/msg00178.html
Created attachment 375319 [details]
The patch for RHEL 5.5

This is the patch for RHEL 5.5. When attempting to test it, I found that dmeventd doesn't work at all and that lvconvert --repair can't remove the failed leg either. So I'm holding the patch until the userspace bugs are examined and fixed.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
That userspace problem was resolved; it was my configuration error. The patch has already been posted.
How to reproduce the bug:
- You need disks that can be made to fail on demand: either use a hot-pluggable disk and unplug it, or put your PVs on another device-mapper device and reload that device with the "error" target.
- Set up a VG on these failable devices and create a mirror in the VG. You don't need to create a filesystem in the mirror.
- Stop dmeventd with "killall -STOP dmeventd".
- Use "dmsetup table" to see where the mirror legs are placed, and unplug the disk holding the primary mirror leg (the one with the _mimage_0 suffix).
- Write to the mirror with some method that bypasses the write cache (O_DIRECT, O_SYNC, or a normal write followed by fsync()). The write succeeds immediately.
- Reset the computer (with the reset button).
- After the reboot, the written data is not there.

After the patch is applied:
- The write is held until dmeventd does its job (but since dmeventd is stopped, the write is held indefinitely).
- You can resume dmeventd with -CONT and see that the failed mirror leg is removed and the write completes successfully. If you reset the computer after that, the written data will be there.
- Or you can reset the computer without resuming dmeventd; in this case the write is lost, but that is OK, because the application was never informed that the write finished.
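The failure-injection and write steps above can be sketched as follows. This is a minimal sketch, not the exact commands from the report: the dm device name "pv1" and the test-file path are hypothetical, and the dmsetup failure-injection sequence needs root and a real mapped device, so it is shown as comments.

```shell
#!/bin/sh
# Failure injection (hypothetical device "pv1"; requires root):
# swap the PV's table for an "error" target so every I/O to it fails,
# simulating an unplugged primary-leg disk.
#
#   SECTORS=$(dmsetup table pv1 | awk '{print $2}')
#   dmsetup suspend pv1
#   echo "0 $SECTORS error" | dmsetup load pv1
#   dmsetup resume pv1
#
# The write that must survive a crash: a normal write followed by fsync
# (conv=fsync), so "success" means the data reached stable storage.
dd if=/dev/zero of=./testfile bs=4096 count=1 conv=fsync 2>/dev/null \
  && echo "sync write completed"
```

With the patch, this dd would block until dmeventd removes the failed leg; without it, dd returns success even though the data may later be reverted.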
... btw ... when you boot the computer after the reset, you need to plug the failed disk back in. The data is then incorrectly synchronized from the previously failed disk to the valid disk, and this is what causes the data-loss problem.
in kernel-2.6.18-182.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
Chris Ward: bug #517117 is actually the same bug as this one. However, the superblock-based patch linked in #517117 is very complicated and won't be included in the kernel. You can resolve bug #517117 as a duplicate of this bug.
*** Bug 517117 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html