Bug 613829

Summary: device failure on a mirror containing snapshot volumes doesn't work
Product: Red Hat Enterprise Linux 6
Reporter: Corey Marthaler <cmarthal>
Component: lvm2
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA
QA Contact: Corey Marthaler <cmarthal>
Severity: high
Docs Contact:
Priority: high
Version: 6.0
CC: agk, coughlan, ddumas, dwysocha, heinzm, jbrassow, joe.thornber, mbroz, prajnoha, prockai, syeghiay, zkabelac
Target Milestone: rc
Keywords: FutureFeature
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: lvm2-2.02.82-1.el6
Doc Type: Enhancement
Doc Text:
LVM Snapshots of Mirrors
The LVM snapshot feature provides the ability to create backup images of a logical volume at a particular instant without causing a service interruption. When a change is made to the original device (the origin) after a snapshot is taken, the snapshot feature makes a copy of the changed data area as it was prior to the change so that it can reconstruct the state of the device. Red Hat Enterprise Linux 6 introduces the ability to take a snapshot of a mirrored logical volume.

A known issue exists with this Technology Preview. I/O might hang if a device failure in the mirror is encountered. Note that this issue is related to a failure of the mirror log device, and that no workaround is currently known.
Last Closed: 2011-05-19 14:26:15 UTC
Bug Blocks: 655920    
Attachments:
  Patch to add monitoring of mirrors under snapshots (flags: none)
  Patch to disallow scanning of snapshot-origin devices (flags: none)
  log from taft-01 (flags: none)

Description Corey Marthaler 2010-07-12 21:54:27 UTC
Description of problem:
This seems very similar to (and may be a dup of) bug 596453. In this case, however, it only takes one device failing to cause this lock-up.

When a mirror is an origin containing snapshot volumes, basic device failure handling does not work. In this case the mirror log device was failed.

Scenario: Kill log

********* Mirror info for this scenario *********
* mirrors:            origin
* leg devices:        /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
* log devices:        /dev/sdh1
* failpv(s):          /dev/sdh1
* failnode(s):        taft-01
* leg fault policy:   remove
* log fault policy:   remove
*************************************************

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----

<start name="taft-01_origin" pid="21777" time="Mon Jul 12 16:29:36 2010" type="cmd" />

Sleeping 10 seconds to get some outsanding EXT I/O locks before the failure
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----

checkit starting with:
VERIFY
Verify XIOR Stream: /tmp/checkit_origin
Working dir:        /mnt/origin/checkit

Disabling device sdh on taft-01

Attempting I/O to cause mirror down conversion(s) on taft-01
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.184992 s, 227 MB/s



[root@taft-01 ~]# lvs -a -o +devices
  /dev/dm-2: read failed after 0 of 4096 at 4128768: Input/output error
  [...]
  /dev/sdh1: read failed after 0 of 512 at 4096: Input/output error
  Couldn't find device with uuid z0avO9-HL8V-3mA9-qt92-rxa8-DYiJ-oxiJME.
  LV                VG        Attr   LSize   Origin Snap%  Move Log         Copy%  Convert Devices
  origin            taft      owi-ao   2.00g                    origin_mlog 100.00         origin_mimage_0(0),origin_mimage_1(0),origin_mimage_2(0),origin_mimage_3(0)
  [origin_mimage_0] taft      iwi-ao   2.00g                                               /dev/sdb1(0)
  [origin_mimage_1] taft      iwi-ao   2.00g                                               /dev/sdc1(0)
  [origin_mimage_2] taft      iwi-ao   2.00g                                               /dev/sdd1(0)
  [origin_mimage_3] taft      iwi-ao   2.00g                                               /dev/sde1(0)
  [origin_mlog]     taft      lwi-ao   4.00m                                               unknown device(0)
  snap1             taft      swi-a- 200.00m origin  80.35                                 /dev/sdg1(50)
  snap2             taft      swi-a- 200.00m origin  80.35                                 /dev/sdg1(0)

Version-Release number of selected component (if applicable):
lvm2-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-libs-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
lvm2-cluster-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
udev-147-2.18.el6    BUILT: Fri Jun 11 07:47:21 CDT 2010
device-mapper-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
device-mapper-event-libs-1.02.51-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010
cmirror-2.02.69-2.el6    BUILT: Fri Jul  2 07:26:01 CDT 2010

Comment 1 Corey Marthaler 2010-07-12 21:59:18 UTC
Unlike bug 596453, the I/O to cause a down-convert worked; it was the sync command that appeared to hang. Also, no repair was ever attempted.

Jul 12 21:30:25 taft-01 qarshd[2097]: Running cmdline: echo offline > /sys/block/sdh/device/state &
Jul 12 21:30:25 taft-01 xinetd[1855]: EXIT: qarsh status=0 pid=2097 duration=0(sec)
Jul 12 21:30:29 taft-01 kernel: sd 2:0:0:7: rejecting I/O to offline device
Jul 12 21:30:29 taft-01 kernel: sd 2:0:0:7: rejecting I/O to offline device
Jul 12 21:30:29 taft-01 kernel: sd 2:0:0:7: rejecting I/O to offline device
Jul 12 21:30:29 taft-01 kernel: sd 2:0:0:7: rejecting I/O to offline device
Jul 12 21:30:30 taft-01 xinetd[1855]: START: qarsh pid=2100 from=::ffff:10.15.80.47
Jul 12 21:30:30 taft-01 qarshd[2100]: Talking to peer 10.15.80.47:34834
Jul 12 21:30:30 taft-01 qarshd[2100]: Running cmdline: dd if=/dev/zero of=/mnt/origin/ddfile count=10 bs=4M
Jul 12 21:30:30 taft-01 xinetd[1855]: EXIT: qarsh status=0 pid=2100 duration=0(sec)
Jul 12 21:30:30 taft-01 xinetd[1855]: START: qarsh pid=2102 from=::ffff:10.15.80.47
Jul 12 21:30:30 taft-01 qarshd[2102]: Talking to peer 10.15.80.47:34835
Jul 12 21:30:30 taft-01 qarshd[2102]: Running cmdline: sync
Jul 12 21:30:30 taft-01 kernel: sd 2:0:0:7: rejecting I/O to offline device
Jul 12 21:30:30 taft-01 kernel: sd 2:0:0:7: rejecting I/O to offline device

Comment 2 Jonathan Earl Brassow 2010-08-13 18:33:08 UTC
The problem here is that the mirror device is not being monitored.

The code that needs updating begins in: LVM2/lib/activate/activate.c:monitor_dev_for_events

You can see there that because the snapshot is over the mirror, the mirror underneath is ignored.
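
The gap described above can be pictured with a toy model. This is only a sketch under stated assumptions: `toy_lv` and `toy_monitor_stack` are hypothetical names, not LVM2 types or functions, and the real logic in monitor_dev_for_events is more involved. The point it illustrates is that a monitoring walk has to continue past the snapshot layer so that a mirror underneath still gets registered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical LV node for illustration; not the LVM2 logical volume struct. */
struct toy_lv {
    bool is_mirror;         /* true if this layer is a mirror */
    bool monitored;         /* true once the layer is registered for events */
    struct toy_lv *below;   /* device this layer is stacked on, or NULL */
};

/*
 * Walk from the top of the stack to the bottom and mark every mirror
 * as monitored. Returns the number of mirrors newly registered.
 * A walk that stopped at the first (snapshot) layer would return 0
 * for a snapshot-over-mirror stack, which is the behaviour reported here.
 */
static int toy_monitor_stack(struct toy_lv *top)
{
    int newly_monitored = 0;

    assert(top != NULL);

    for (struct toy_lv *lv = top; lv != NULL; lv = lv->below) {
        if (lv->is_mirror && !lv->monitored) {
            lv->monitored = true;
            newly_monitored++;
        }
    }
    return newly_monitored;
}
```

Calling `toy_monitor_stack` on a snapshot layer whose `below` pointer refers to a mirror registers that mirror once; a second call finds nothing left to register.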

Comment 3 Jonathan Earl Brassow 2010-08-13 20:07:48 UTC
Created attachment 438750 [details]
Patch to add monitoring of mirrors under snapshots

Fixes initial problem where mirror is not monitored.  However, failing a mirror leg while there is I/O running to the origin seems to cause an I/O hang - no I/O can flow through the mirror.

Perhaps there is an issue with the way the suspend is being performed on the mirror (is it using noflush?)

Comment 4 Alasdair Kergon 2010-08-17 22:47:52 UTC
The fundamental problem was that (for historical reasons) in the internal LVM data model used during activation the same LV refers both to the mirror and to any snapshots of it.  When snapshots are stacked on top of a mirror the code is unable to distinguish between activation requests referring to the whole device stack (snapshots + mirror) and activation requests referring only to the mirror.

When the mirror recovery code runs, it needs to manipulate *just* the mirror - and ignore the snapshots.  If it tries to manipulate the whole tree, when it suspends the snapshots it has to wait for outstanding I/O to return from the already-suspended mirror and this leads to deadlock.

The fix is to add an 'origin_only' parameter to the activation code which instructs it to ignore any attached snapshots and only manipulate the origin.  The specific repair code which hits this deadlock now specifies this new internal parameter.

Patches have been committed upstream and so far have passed our tests.
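
The reasoning above can be condensed into a small predicate. This is a minimal sketch, not the committed patch: `toy_stack` and `toy_repair_would_deadlock` are hypothetical names, and the two fields are an assumed simplification of the state described in this comment (a mirror that is already suspended, and snapshot I/O still waiting on it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot-over-mirror stack; field names are illustrative. */
struct toy_stack {
    bool mirror_suspended;          /* mirror is already suspended for repair */
    unsigned snapshot_inflight_io;  /* snapshot I/O still waiting on the mirror */
};

/*
 * Returns true if a repair-time suspend would wait forever.
 * With origin_only set, the snapshots are never touched, so nothing
 * waits on the suspended mirror. Without it, suspending the snapshots
 * must wait for their outstanding I/O, which cannot complete while the
 * mirror underneath stays suspended.
 */
static bool toy_repair_would_deadlock(const struct toy_stack *s, bool origin_only)
{
    assert(s != NULL);

    if (origin_only)
        return false;
    return s->mirror_suspended && s->snapshot_inflight_io > 0;
}
```

In this model the whole-tree request deadlocks only when both conditions hold at once, which fits the report that the hang needs I/O running at the time of the failure.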

Comment 6 Corey Marthaler 2010-08-19 18:17:04 UTC
[...]
Aug 19 13:09:25 taft-02 kernel: sd 3:0:0:5: rejecting I/O to offline device
Aug 19 13:09:25 taft-02 kernel: sd 3:0:0:5: rejecting I/O to offline device
Aug 19 13:09:28 taft-02 lvm[1271]: device-mapper: waitevent ioctl failed: Interrupted system call
Aug 19 13:09:28 taft-02 lvm[1271]: Another thread is handling an event. Waiting...
Aug 19 13:11:09 taft-02 kernel: sd 3:0:0:5: rejecting I/O to offline device
Aug 19 13:12:12 taft-02 kernel: INFO: task dmeventd:3830 blocked for more than 120 seconds.
Aug 19 13:12:12 taft-02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 19 13:12:12 taft-02 kernel: dmeventd      D 0000000000000002     0  3830      1 0x00000080
Aug 19 13:12:12 taft-02 kernel: ffff880213dd3ac8 0000000000000082 0000000000000000 0000000000000000
Aug 19 13:12:12 taft-02 kernel: ffffea0007098940 0000000000000003 0000000000008450 0000000000000282
Aug 19 13:12:12 taft-02 kernel: ffff88021163fa58 ffff880213dd3fd8 0000000000010518 ffff88021163fa58
Aug 19 13:12:12 taft-02 kernel: Call Trace:
Aug 19 13:12:12 taft-02 kernel: [<ffffffff8109bf79>] ? ktime_get_ts+0xa9/0xe0
Aug 19 13:12:12 taft-02 kernel: [<ffffffff8110cdc0>] ? sync_page+0x0/0x50
Aug 19 13:12:12 taft-02 kernel: [<ffffffff814c4253>] io_schedule+0x73/0xc0
Aug 19 13:12:12 taft-02 kernel: [<ffffffff8110cdfd>] sync_page+0x3d/0x50
[...]

Still doesn't work with the latest rpms. 

2.6.32-59.1.el6.x86_64

lvm2-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
lvm2-libs-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
lvm2-cluster-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
udev-147-2.22.el6    BUILT: Fri Jul 23 07:21:33 CDT 2010
device-mapper-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-libs-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-event-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
device-mapper-event-libs-1.02.53-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010
cmirror-2.02.72-8.el6    BUILT: Wed Aug 18 10:41:52 CDT 2010

Comment 10 Jonathan Earl Brassow 2010-08-23 14:20:12 UTC
I need more information from Corey.  My tests work with and without dmeventd, killing legs or the log, and while running I/O.

I'm using the same rpms as Corey except the following:
udev-147-2.26.el6.x86_64
libudev-147-2.26.el6.x86_64
... but I wouldn't think that has anything to do with his problems.

I can run the QE tests, but please specify which one you are running.  Please also report whether the problem you are seeing happens all the time, once in a while, or what.

Comment 11 Jonathan Earl Brassow 2010-08-23 16:59:53 UTC
I can't hit the bug by killing a mirror leg.
I /can/ hit the bug by killing a mirror log - happens ~25% of the times I try.

The patch seems to work, but there is a new issue - label scans.  A label scan is taking place on the mirror while it is blocking I/O.  The sub-devices are successfully skipped, but the mirror is not avoided in this case and it should be.  Here is the backtrace:

#0  0x0000003a6d2d41a0 in __read_nocancel () from /lib64/libc.so.6
#1  0x000000000043e1ae in read (where=0x7fff59ab4db0, buffer=<value optimized out>,
    should_write=0) at /usr/include/bits/unistd.h:45
#2  _io (where=0x7fff59ab4db0, buffer=<value optimized out>, should_write=0)
    at device/dev-io.c:94
#3  0x000000000043e42b in _aligned_io (where=<value optimized out>, buffer=0x7fff59ab4e4c,
    should_write=0) at device/dev-io.c:200
#4  0x000000000043e758 in dev_read (dev=<value optimized out>,
    offset=<value optimized out>, len=<value optimized out>, buffer=<value optimized out>)
    at device/dev-io.c:609
#5  0x0000000000480acc in _dev_has_md_magic (dev=0x24e85d8, sb=0x0) at device/dev-md.c:36
#6  dev_is_md (dev=0x24e85d8, sb=0x0) at device/dev-md.c:112
#7  0x0000000000443bf2 in _ignore_md (f=<value optimized out>, dev=0x24e85d8)
    at filters/filter-md.c:30
#8  0x00000000004420f4 in _and_p (f=<value optimized out>, dev=0x24e85d8)
    at filters/filter-composite.c:26
#9  0x000000000043c492 in dev_iter_get (iter=0x25180c0) at device/dev-cache.c:839
#10 0x0000000000434425 in lvmcache_label_scan (cmd=0x24d3000, full_scan=2)
    at cache/lvmcache.c:583
#11 0x0000000000435bf5 in device_from_pvid (cmd=0x24d3000, pvid=0x253f770,
    scan_done_once=0x7fff59ab50cc) at cache/lvmcache.c:765
#12 0x000000000044dee5 in _read_pv (fid=0x25191f0, mem=0x24fd830, vg=0x253ed80,
    pvn=0x25346e0, vgn=<value optimized out>, pv_hash=0x250cdf0, lv_hash=0x0,
    scan_done_once=0x7fff59ab50cc, report_missing_devices=1)
    at format_text/import_vsn1.c:196
#13 0x000000000044d466 in _read_sections (fid=0x25191f0, section=<value optimized out>,
    fn=0x44de10 <_read_pv>, mem=0x24fd830, vg=0x253ed80, vgn=0x2532e00, pv_hash=0x250cdf0,
    lv_hash=0x0, optional=0, scan_done_once=0x7fff59ab50cc)
    at format_text/import_vsn1.c:642
#14 0x000000000044ee6c in _read_vg (fid=0x25191f0, cft=<value optimized out>,
    use_cached_pvs=<value optimized out>) at format_text/import_vsn1.c:763
#15 0x000000000044d0cf in text_vg_import_fd (fid=0x25191f0, file=<value optimized out>,
    dev=0x2503778, offset=<value optimized out>, size=3330, offset2=4608, size2=0,
    checksum_fn=0x46c170 <calc_crc>, checksum=3030878087, when=0x7fff59ab51b8,
    desc=0x7fff59ab51b0) at format_text/import.c:114
#16 0x0000000000449974 in _vg_read_raw_area (fid=0x25191f0, vgname=0x25191e8 "vg",
    area=0x25192e0, precommitted=1) at format_text/format-text.c:515
#17 0x0000000000449b4d in _vg_read_precommit_raw (fid=0x25191f0, vgname=0x25191e8 "vg",
    mda=<value optimized out>) at format_text/format-text.c:562
#18 0x000000000045fc99 in _vg_read (cmd=0x24d3000, vgname=0x25191e8 "vg",
    vgid=0x7fff59ab6530 "25ugZLTQHg0p9oIWmsM3lVHSqIlmMdGyMV2f82UBt8xt3vAnyxNcNc6eotu8mIs1", consistent=0x7fff59ab534c, precommitted=1) at metadata/metadata.c:2985
#19 0x0000000000460caf in _vg_read_by_vgid (cmd=0x24d3000,
    lvid_s=0x7fff59ab6530 "25ugZLTQHg0p9oIWmsM3lVHSqIlmMdGyMV2f82UBt8xt3vAnyxNcNc6eotu8mIs1", precommitted=1) at metadata/metadata.c:3349
#20 lv_from_lvid (cmd=0x24d3000,
    lvid_s=0x7fff59ab6530 "25ugZLTQHg0p9oIWmsM3lVHSqIlmMdGyMV2f82UBt8xt3vAnyxNcNc6eotu8mIs1", precommitted=1) at metadata/metadata.c:3411
#21 0x00000000004334d1 in _lv_suspend (cmd=0x24d3000,
    lvid_s=0x7fff59ab6530 "25ugZLTQHg0p9oIWmsM3lVHSqIlmMdGyMV2f82UBt8xt3vAnyxNcNc6eotu8mIs1", origin_only=1) at activate/activate.c:886
#22 lv_suspend_if_active (cmd=0x24d3000,
    lvid_s=0x7fff59ab6530 "25ugZLTQHg0p9oIWmsM3lVHSqIlmMdGyMV2f82UBt8xt3vAnyxNcNc6eotu8mIs1", origin_only=1) at activate/activate.c:955
#23 0x00000000004839ab in _file_lock_resource (cmd=0x24d3000,
    resource=0x7fff59ab6530 "25ugZLTQHg0p9oIWmsM3lVHSqIlmMdGyMV2f82UBt8xt3vAnyxNcNc6eotu8mIs1", flags=<value optimized out>) at locking/file_locking.c:301
#24 0x0000000000451698 in _lock_vol (cmd=0x24d3000,
    resource=0x7fff59ab6530 "25ugZLTQHg0p9oIWmsM3lVHSqIlmMdGyMV2f82UBt8xt3vAnyxNcNc6eotu8mIs1", flags=572, lv_op=LV_SUSPEND) at locking/locking.c:377
#25 0x00000000004521b2 in lock_vol (cmd=0x24d3000,
    vol=0x25147b0 "25ugZLTQHg0p9oIWmsM3lVHSqIlmMdGyMV2f82UBt8xt3vAnyxNcNc6eotu8mIs1",
    flags=572) at locking/locking.c:449
#26 0x0000000000467baa in _remove_mirror_images (lv=<value optimized out>,
    num_removed=<value optimized out>, is_removable=0x466dc0 <is_mirror_image_removable>,
    removable_baton=<value optimized out>, remove_log=1, collapse=0,
    removed=0x7fff59ab673c) at metadata/mirror.c:961
#27 0x00000000004687ac in remove_mirror_images (lv=0x25147b0,
    num_mirrors=<value optimized out>, is_removable=0x466dc0 <is_mirror_image_removable>,
    removable_baton=0x2515500, remove_log=1) at metadata/mirror.c:1055
#28 0x0000000000468bd6 in remove_mirror_log (cmd=0x24d3000, lv=0x25147b0,
    removable_pvs=0x2515500) at metadata/mirror.c:1639
#29 0x00000000004129d3 in _lv_update_log_type (cmd=0x24d3000, lp=0x7fff59ab6950,
    lv=0x25147b0, operable_pvs=0x2515500, log_count=0) at lvconvert.c:746
#30 0x00000000004140f8 in _lvconvert_mirrors_repair (cmd=0x24d3000, lv=0x25147b0,
    lp=0x7fff59ab6950) at lvconvert.c:1266
#31 _lvconvert_mirrors (cmd=0x24d3000, lv=0x25147b0, lp=0x7fff59ab6950) at lvconvert.c:1346
#32 0x0000000000414962 in _lvconvert_single (cmd=0x24d3000, lv=0x25147b0,
    handle=0x7fff59ab6950) at lvconvert.c:1581
#33 0x00000000004159c2 in lvconvert_single (cmd=0x24d3000, argc=<value optimized out>,
    argv=<value optimized out>) at lvconvert.c:1663
#34 lvconvert (cmd=0x24d3000, argc=<value optimized out>, argv=<value optimized out>)
    at lvconvert.c:1744
#35 0x0000000000419623 in lvm_run_command (cmd=0x24d3000, argc=1, argv=0x7fff59ab6d28)
    at lvmcmdline.c:1082
#36 0x000000000041be42 in lvm2_main (argc=5, argv=0x7fff59ab6d08) at lvmcmdline.c:1439
#37 0x0000003a6d21ec5d in __libc_start_main () from /lib64/libc.so.6
#38 0x000000000040fd89 in _start ()

Comment 12 Jonathan Earl Brassow 2010-08-23 17:14:31 UTC
From lib/activate/dev_manager.c:device_is_usable()

/* FIXME Also check for mirror block_on_error and mpath no paths */
/* For now, we exclude all mirrors */
	
do {
	next = dm_get_next_target(dmt, next, &start, &length,
				  &target_type, &params);
	/* Skip if target type doesn't match */
	if (target_type && !strcmp(target_type, "mirror") && 
		ignore_suspended_devices()) {
		log_debug("%s: Mirror device %s not usable.",
			dev_name(dev), name);
		goto out;
	}
} while (next);


The code here doesn't properly take into account that vg-lv-real may be a mirror, so the calling code thinks it's ok to access the vg-lv device, which causes the mirror to be updated while I/O is being blocked in the kernel.

Comment 13 Jonathan Earl Brassow 2010-08-23 18:34:41 UTC
Created attachment 440448 [details]
Patch to disallow scanning of snapshot-origin devices

This is a heavy-handed fix for the problem.  Disallow scanning of snapshot devices (if ignore_suspended_devices()) - just like we do for mirror devices.

While this fixes the problem, a more gentle approach would be to check if there is a mirror under the snapshot origin and make the further check if the mirror is currently blocking I/O.
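
The two approaches can be compared side by side in a toy form. This is a sketch under assumptions: `toy_dev`, `toy_usable_strict` and `toy_usable_gentle` are hypothetical names, the `mirror_underneath` and `mirror_blocking_io` fields stand in for checks that the real code would have to perform against device-mapper, and the `ignore_suspended` parameter mirrors the role of ignore_suspended_devices() in the excerpt from comment 12.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical summary of one device-mapper device; not an LVM2 struct. */
struct toy_dev {
    const char *target_type;   /* e.g. "linear", "mirror", "snapshot-origin" */
    bool mirror_underneath;    /* assumed: a mirror sits below this device */
    bool mirror_blocking_io;   /* assumed: that mirror is blocking I/O now */
};

/* Heavy-handed policy: skip every mirror and every snapshot-origin. */
static bool toy_usable_strict(const struct toy_dev *d, bool ignore_suspended)
{
    assert(d != NULL && d->target_type != NULL);

    if (!ignore_suspended)
        return true;
    return strcmp(d->target_type, "mirror") != 0 &&
           strcmp(d->target_type, "snapshot-origin") != 0;
}

/* Gentler policy: skip a snapshot-origin only while a mirror under it blocks I/O. */
static bool toy_usable_gentle(const struct toy_dev *d, bool ignore_suspended)
{
    assert(d != NULL && d->target_type != NULL);

    if (!ignore_suspended)
        return true;
    if (strcmp(d->target_type, "mirror") == 0)
        return false;
    if (strcmp(d->target_type, "snapshot-origin") == 0)
        return !(d->mirror_underneath && d->mirror_blocking_io);
    return true;
}
```

The strict version needs no knowledge of what lies under the origin, which is why it is the simpler fix; the gentle version keeps healthy snapshot origins scannable at the cost of an extra lookup.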

Comment 16 Jonathan Earl Brassow 2010-08-27 19:29:22 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
There is a known issue regarding LVM snapshots of LVM mirrors that can cause I/O to hang in the event of a device failure in the mirror.  The issue is tied in particular to a failure of the mirror log device.

Comment 18 Tom Coughlan 2010-08-30 19:55:23 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1,15 @@
-There is a known issue regarding LVM snapshots of LVM mirrors that can cause I/O to hang in the event of a device failure in the mirror.  The issue is tied in particular to a failure of the mirror log device.
+Ryan,
+
+We are going to move the new feature "LVM Snapshot of a Mirror" to Tech Preview status.  This is in part as a result of this bug, and in part because we believe more testing is required.
+
+This means that you need to move the section: 
+
+"4.3.1.1. Snapshots of Mirrors"
+
+in the "Storage" chapter of the Release Notes to the "Technology Preview" chapter in the Technical Notes manual. 
+
+In that same section, you can add the following:
+
+There is a known issue regarding LVM snapshots of LVM mirrors that can cause I/O to hang in the event of a device failure in the mirror.  The issue is tied in particular to a failure of the mirror log device. There is no work-around for this at the current time. 
+
+Tom

Comment 21 Ryan Lerch 2010-10-11 05:51:44 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1,15 +1,4 @@
-Ryan,
+LVM Snapshots of Mirrors
+The LVM snapshot feature provides the ability to create backup images of a logical volume at a particular instant without causing a service interruption. When a change is made to the original device (the origin) after a snapshot is taken, the snapshot feature makes a copy of the changed data area as it was prior to the change so that it can reconstruct the state of the device. Red Hat Enterprise Linux 6 introduces the ability to take a snapshot of a mirrored logical volume.
 
-We are going to move the new feature "LVM Snapshot of a Mirror" to Tech Preview status.  This is in part as a result of this bug, and in part because we believe more testing is required.
+A known issue exists with this Technology Preview. I/O might hang if a device failure in the mirror is encountered. Note, that this issue is related to a failure of the mirror log device, and that no work around is currently known.
-
-This means that you need to move the section: 
-
-"4.3.1.1. Snapshots of Mirrors"
-
-in the "Storage" chapter of the Release Notes to the "Technology Preview" chapter in the Technical Notes manual. 
-
-In that same section, you can add the following:
-
-There is a known issue regarding LVM snapshots of LVM mirrors that can cause I/O to hang in the event of a device failure in the mirror.  The issue is tied in particular to a failure of the mirror log device. There is no work-around for this at the current time. 
-
-Tom

Comment 23 Jonathan Earl Brassow 2010-11-29 21:54:37 UTC
Tech notes are good - clearing needinfo flag.

Comment 27 Corey Marthaler 2011-02-09 23:53:07 UTC
This bug still appears in the latest rpms. As mentioned in comment #11, this appears to happen only when killing the log device.

Scenario: Kill disk log of synced 2 leg mirror(s)

********* Mirror hash info for this scenario *********
* names:              syncd_log_2legs_1
* sync:               1
* leg devices:        /dev/sdd1 /dev/sdg1
* log devices:        /dev/sdf1
* no MDA devices:     
* failpv(s):          /dev/sdf1
* failnode(s):        taft-01
* additional snap:    /dev/sdd1
* leg fault policy:   remove
* log fault policy:   allocate
******************************************************

Creating mirror(s) on taft-01...
taft-01: lvcreate -m 1 -n syncd_log_2legs_1 -L 600M helter_skelter /dev/sdd1:0-1000 /dev/sdg1:0-1000 /dev/sdf1:0-150
Creating a snapshot volume of each of the mirrors

Waiting until all mirrors become fully syncd...
   0/1 mirror(s) are fully synced: ( 96.83% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on taft-01...
mke2fs 1.41.12 (17-May-2010)
Mounting mirrored ext filesystems on taft-01...

Writing verification files (checkit) to mirror(s) on...
        ---- taft-01 ----

Sleeping 10 seconds to get some outsanding EXT I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- taft-01 ----

Disabling device sdf on taft-01
[DEADLOCK]

[root@taft-01 ~]# lvs -a -o +devices
[DEADLOCK]


Looks like dmeventd died during this:
Feb  9 16:56:58 taft-01 abrt[6704]: saved core dump of pid 1222 (/sbin/dmeventd) to /var/spool/abrt/ccpp-1297292218-1222.new/coredump (32448512 bytes)
Feb  9 16:56:58 taft-01 abrtd: Directory 'ccpp-1297292218-1222' creation detected
Feb  9 16:56:58 taft-01 abrtd: Registered Database plugin 'SQLite3'
Feb  9 16:56:58 taft-01 abrtd: New crash /var/spool/abrt/ccpp-1297292218-1222, processing

2.6.32-94.el6.x86_64

lvm2-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-libs-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
lvm2-cluster-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
udev-147-2.31.el6    BUILT: Wed Jan 26 05:39:15 CST 2011
device-mapper-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
device-mapper-event-libs-1.02.62-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011
cmirror-2.02.83-2.el6    BUILT: Tue Feb  8 10:10:57 CST 2011

Comment 28 Corey Marthaler 2011-02-09 23:54:25 UTC
Created attachment 477928 [details]
log from taft-01

Comment 30 Jonathan Earl Brassow 2011-02-17 23:08:51 UTC
Is it hung, or does it simply take a while?

I didn't get any core dumps when I tried this, but I did get an LVM hang for a while.  I'm not sure if the I/O got interrupted (delayed) or not, but it did complete and everything converted fine.  [Again, the lvm commands were hung for a while - I think waiting for dmeventd to repair the mirror.]

Comment 31 Corey Marthaler 2011-02-18 18:00:56 UTC
Comment #27 must be a different (and difficult to reproduce) bug. I currently can't reproduce any issues wrt single machine mirrors containing snapshots. There are however still issues with exclusively locked cmirrors, but separate bugs for those should exist. 

I'll mark this bug verified and open a new bug for comment #27.

Comment 32 errata-xmlrpc 2011-05-19 14:26:15 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0772.html